“Đánh giá một thẩm phán LLM”: Khung đánh giá hai lớp để cải tiến liên tục việc đánh giá đơn xin cấp bằng LLM | Daniel Khoa Le

[ad_1]

🔹 Việc triển khai khuôn khổ này tập trung vào khái niệm cấp cao mà không đi sâu vào chi tiết.

Không có thư viện hoặc nền tảng đánh giá LLM nào (ví dụ: LangChain, LangSmith, LangFuse, v.v.). Việc triển khai mã có mức độ trừu tượng thấp, cho phép người đọc dễ dàng theo dõi quy trình làm việc mà không bị lạc vào các chi tiết phức tạp của quá trình thiết lập.

Kể từ khi tham khảo Thẩm phán LLM và Thẩm phán tối cao LLM có thể khó theo dõi, chúng ta hãy chỉ định các vai trò danh nghĩa cho các thành phần trong thiết lập đánh giá:

LLM Software ➡️The Scholar
LLM Choose ➡️ The Instructor
Supreme LLM Choose ➡️ The Reviewer

💥 Mã đầy đủ có thể được tìm thấy trong kho lưu trữ này.

# LLM-Software
def trigger_llm_app(context: str, query: str):
fmt_context_and_question = f"""Context: {context}nQuestion: {query}"""
messages = (
llm_app_prompt,{"position": "consumer",
"content material": fmt_context_and_question}
)
response = openai_client.chat.completions.create(messages=messages,
mannequin="gpt-3.5-turbo")
return response.decisions(0).message.content material
# LLM-Choose
def eval_llm_app(context: str, query: str, predicted_answer: str):
fmt_input = f"""Context: {context}nQuestion: {
query}nStudent's Reply: {predicted_answer}"""
messages = (
llm_judge_prompt,
{"position": "consumer",
"content material": fmt_input}
)
response = openai_client.chat.completions.create(messages=messages,
mannequin="gpt-3.5-turbo")
return response.decisions(0).message.content material
# Superior LLM-Choose
def eval_llm_judge(context: str, query: str, student_answer: str, teacher_grading: str):
fmt_input = f"""Context: {context}nQuestion: {query}nStudent's Reply: {
student_answer}nTeacher's Grading: {teacher_grading}"""
messages = (
supreme_llm_judge_prompt,
{"position": "consumer",
"content material": fmt_input}
)
response = openai_client.chat.completions.create(messages=messages,
mannequin="gpt-4")
return response.decisions(0).message.content material

Một quyết định tinh tế nhưng quan trọng trong thiết kế của thí nghiệm là sử dụng GPT-4 làm Supreme LLM Choose, trong khi LLM Software và LLM Choose sử dụng GPT-3.5-turbo. Điều này đảm bảo rằng các đánh giá của Supreme LLM Choose mạnh mẽ và đáng tin cậy hơn (đọc thêm về so sánh đây).

Các lời nhắc cho từng thành phần trong thí nghiệm này như sau. Bạn có thể thấy rằng tôi đã sử dụng kỹ thuật nhắc nhở ít lần để cải thiện tính nhất quán của kết quả đánh giá.


llm_app_prompt = {"position": "system",
"content material": """You're a useful assistant. Please use step-by-step reasoning to deal with questions based mostly on the precise context supplied.."""}llm_judge_prompt = {
"position": "system",
"content material": """You're a math trainer tasked with grading a pupil's reply.
Consider the coed's response contemplating the context of the query, the correctness of the reply, and the reasoning supplied.
Conclude with a grade: assign '0' if the reply and the reasoning is inaccurate and '1' whether it is appropriate.
Your grading output must be strictly on this format (no different phrases allowed): 'Grade: 0' or 'Grade: 1'.
Beneath are examples in your reference:
Instance:
Query: How lengthy does it take to drive 100 kilometers at 50 kilometers per hour?
Scholar's Reply: To search out the time, divide the gap by the velocity: 100 km / 50 km/h = 2 hours.
Grade: 1
Instance:
Query: Calculate the world of a sq. with a facet size of 5 meters.
Scholar's Reply: On condition that the facet size of the sq. is 5 meters, the reply is: 5*4=20 sq. meters.
Grade: 0
Instance:
Query: What number of seconds are in an hour?
Scholar's Reply: 3600 seconds
Grade: 1
Instance:
Query: Given two units, Set A containing the weather 1, 2, and three, and Set B containing the weather 3, 4, and 5, what's the intersection of Set A and Set B?
Scholar's Reply: The factor that's widespread to each units is 1.
Grade: 0
"""
}
supreme_llm_judge_prompt = {
"position": "system",
"content material": """You're an examination reviewer tasked with evaluating lecturers' grading. Your job is to overview the grade given by the trainer to a pupil's reply and assess its correctness.
Essential: Your overview is of the trainer's grading, not the coed's reply.
Output Format: Your overview output must be strictly on this format (no different phrases allowed): 'Correctness: 0' or 'Correctness: 1'.
Beneath are examples in your reference:
Instance:
Query: How lengthy does it take to drive 100 kilometers at 50 kilometers per hour?
Scholar's Reply: To search out the time, divide the gap by the velocity: 100 km / 50 km/h = 2 hours.
Grade: 1
Correctness: 1
Instance:
Query: Calculate the world of a sq. with a facet size of 5 meters.
Scholar's Reply: On condition that the facet size of the sq. is 5 meters, the reply is: 5*4=20 sq. meters.
Grade: 0
Correctness: 1
Instance:
Query: What number of seconds are in an hour?
Scholar's Reply: 3600 seconds
Grade: 0
Correctness: 0
Instance:
Query: Given two units, Set A containing the weather 1, 2, and three, and Set B containing the weather 3, 4, and 5, what's the intersection of Set A and Set B?
Scholar's Reply: The factor that's widespread to each units is 1.
Grade: 1
Correctness: 0
"""
}

🔹 Câu hỏi chúng tôi hỏi ứng viên LLM:

Trong một nhóm 30 người có thể nói tiếng Anh hoặc tiếng Đức, 10 người có thể nói cả hai thứ tiếng và 25 người có thể nói tiếng Đức.
Có bao nhiêu người chỉ nói tiếng Anh?

Đơn xin cấp bằng LLM không chỉ phải cung cấp câu trả lời đúng mà còn phải giải thích lý do. Sau đó, Thẩm phán LLM sẽ đánh giá kết quả này (cả câu trả lời cuối cùng và lý do). Cuối cùng, Thẩm phán LLM tối cao sẽ đánh giá đánh giá do Thẩm phán LLM đưa ra.

Bạn có thể nhận thấy rằng tôi đã để lại thông tin dư thừa trong bối cảnh của câu hỏi này để phản biện đơn xin cấp bằng LLM.

🔹 Tôi đã chạy chu trình đánh giá này 100 lần (sử dụng cùng một câu hỏi) để kiểm tra độ chính xác của các giám khảo.

if __name__ == "__main__":
context = "In a gaggle of 30 individuals who can communicate both English or German, 10 can communicate each, and 25 can communicate German."
user_question = "What number of communicate solely English?"list_results = ()
for i in vary(100):
print(f"===> Iteration {i+1}")
list_results.append(consider(context, user_question))

[ad_2]

Source link

Giá cả của Styler, Ưu điểm Nhược điểm, Tính năng, Các lựa chọn thay thế

Phục vụ nhiều bộ điều hợp LoRA với vLLM | của Benjamin Marie | Tháng 8 năm 2024

Những cân nhắc thiết yếu để triển khai học máy | của Conal Henderson | Tháng 7 năm 2024

Sự khác biệt giữa ANN, CNN và RNN

Quy trình mua hàng để thanh toán & cách tối ưu hóa chu trình P2P

AI và Nguồn nhân lực: Chuyển đổi Tương lai của Quản lý Lực lượng lao động

Giá InVideo, Ưu điểm Nhược điểm, Tính năng, Các lựa chọn thay thế

Đi sâu vào AutoGen và Multi-Agent Frameworks | của Matthew Gunton | Tháng 6, 2024

Most Popular

Sự khác biệt giữa ANN, CNN và RNN

Quy trình mua hàng để thanh toán & cách tối ưu hóa chu trình P2P

AI và Nguồn nhân lực: Chuyển đổi Tương lai của Quản lý Lực lượng lao động

Our Picks

Google cuối cùng cũng hành động để hạn chế deepfake không có sự đồng thuận

Nghiên cứu đồng hành của Cognizant & Oxford Economics với Báo cáo “Công việc mới, Thế giới mới” cho thấy sự lạc quan thận trọng trong các doanh nghiệp áp dụng AI

Làm thế nào để truy cập mô hình GitHub trong vài bước?

“Đánh giá một thẩm phán LLM”: Khung đánh giá hai lớp để cải tiến liên tục việc đánh giá đơn xin cấp bằng LLM | Daniel Khoa Le | Tháng 7 năm 2024

Related Posts