Create utils2.py
utils2.py
ADDED
@@ -0,0 +1,148 @@
Guidlines = """Evaluation Criteria
--------------------------------------------------
For Instruction Following
Description:
Grade the instruction-following property of each response. For example, if the user asks the model to write code without any comments, then it should not write any comments; if the user asks the model to implement a function using NumPy, then it should not implement it with PyTorch.
Ratings:
No Issues
Criteria: The problem-solving approach, code, and explanation fully adhere to all the user's instructions, with complete accuracy and attention to detail.
Example: Code implements a sorting algorithm exactly as described, with perfect adherence to the required time complexity and code style guidelines.
Minor Issues
Criteria: The code, problem-solving approach, or explanation follows the instructions with only minor deviations that do not significantly impact the overall quality.
Example: Code performs the sorting correctly but uses a slightly different method than requested, or has minor formatting inconsistencies (e.g., indentation issues).
Major Issues
Criteria: The code, problem-solving approach, or explanation shows significant issues with adherence to instructions, with many components not aligning with guidelines.
Example: Code implements a completely different algorithm or fails to sort correctly, missing key steps and significantly deviating from the provided instructions.
--------------------------------------------------
For Truthfulness
Description:
Grade the truthfulness of each response. Here, truthfulness means the correctness of the response. The generated code should solve the problem raised by the user and give a correct implementation, i.e., one with no logical or syntax errors.
Ratings:
No Issues
Criteria:
Code: Entirely accurate with no errors or flaws in logic/implementation. Handles all edge cases perfectly. The code runs flawlessly and solves the problem fully.
Explanation: The explanation is entirely accurate, with no factual errors or misconceptions about the code, concept, or methodology.
Example:
Code: Code implements a binary search algorithm that works correctly for all input sizes, handles edge cases like empty arrays, and runs without any errors.
Explanation: The explanation correctly details how the binary search algorithm works, why it’s optimal, and how edge cases are handled, with no inaccuracies.
Minor Issues
Criteria:
Code: Generally accurate with only minor, inconsequential errors that do not impact the solution’s functionality or correctness.
Explanation: The explanation is generally accurate, with only minor inaccuracies that do not significantly impact understanding.
Example:
Code: Code correctly sorts an array but has a minor inefficiency that doesn’t affect the overall output; all major requirements are met, but a small edge case is missed.
Explanation: The explanation of the sorting algorithm is mostly correct but overlooks a small detail about how a specific edge case is handled, which doesn’t affect the overall comprehension.
Major Issues
Criteria:
Code: Has significant accuracy issues with several errors or logical flaws that impact the effectiveness of the solution.
Explanation: The explanation has significant accuracy issues, leading to misunderstandings about the code, concept, or methodology.
Example:
Code: Code attempts to solve a problem but contains logical errors that result in incorrect output for some cases; certain edge cases cause the code to fail.
Explanation: The explanation misinterprets key aspects of the algorithm’s logic, leading to potential confusion about how the code should work.
Notes:
When a user's prompt doesn’t explicitly ask for guidance on library installs, we should not penalize the model on truthfulness for missing library install steps.
The model can be penalized when an obscure library is being used and the install steps for that are more than just “pip install module name.”
When a user’s prompt doesn’t explicitly ask for a particular efficient implementation, truthfulness should not be penalized (e.g., Fibonacci implementation using recursion should be considered accurate when the user did not ask specifically to avoid recursion).
--------------------------------------------------
For Conciseness
Description:
Rate the verbosity of each of the responses.
Ratings:
Too Short
Criteria: The response is missing key information or details necessary to fully address the task. It feels incomplete or overly simplistic, leaving significant gaps that make understanding or completing the task difficult. Critical context or explanations are omitted, making it insufficient for effective communication.
Too Verbose
Criteria: The response includes unnecessary details or excessive information that is not directly relevant to the task. It feels long-winded or overly complex, making it harder to find the key points. The extra verbosity may obscure the main message, causing confusion or overwhelming the recipient with unneeded content.
Just Right
Criteria: The response provides the exact amount of detail necessary to fully address the task without unnecessary elaboration. It is clear, concise, and to the point, ensuring all relevant information is included while avoiding excessive detail.
--------------------------------------------------
For Overall Satisfaction
Description:
Grade the overall satisfaction based on the model evaluation categories above (i.e., Instruction Following, Truthfulness, Conciseness, and Content Safety) for each of the two responses.
Ratings:
Amazing
Criteria: The task meets all the requirements present in all model evaluation categories without any significant issues or errors.
Pretty Good
Criteria: The task meets most of the requirements in all model evaluation categories. There may be minor, non-critical issues that do not significantly impact the overall performance.
Okay
Criteria: The task meets some of the requirements but shows noticeable issues in one or more evaluation categories.
Pretty Bad
Criteria: The task fails to meet many requirements across model evaluation categories. Significant issues exist in instruction following, truthfulness, or conciseness. Overall performance is lacking, but some elements are salvageable.
Horrible
Criteria: The task fails to meet the majority of requirements in all model evaluation categories. There are critical issues that make the output unusable or harmful. The task demonstrates poor overall performance, with very little redeeming value."""
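
The rating labels in the rubric form a small closed vocabulary per category, so a model's answer can be checked against them before it is trusted. A minimal sketch, assuming a hypothetical `ALLOWED_RATINGS` table and `normalize_rating` helper (neither is part of utils2.py); the labels follow the rubric text and are matched case-insensitively:

```python
# Hypothetical helper, not part of utils2.py: allowed labels per rubric category.
ALLOWED_RATINGS = {
    "Instruction Following": ["No Issues", "Minor Issues", "Major Issues"],
    "Truthfulness": ["No Issues", "Minor Issues", "Major Issues"],
    "Conciseness": ["Too Short", "Too Verbose", "Just Right"],
    "Overall Satisfaction": ["Amazing", "Pretty Good", "Okay", "Pretty Bad", "Horrible"],
}


def normalize_rating(category: str, answer: str) -> str:
    """Return the canonical rubric label for `answer`, or raise ValueError."""
    wanted = answer.strip().lower()
    for label in ALLOWED_RATINGS[category]:
        if label.lower() == wanted:
            return label
    raise ValueError(f"{answer!r} is not a valid {category} rating")
```

A check like this could run on each rating field after parsing, so that a free-form answer such as "mostly fine" is rejected instead of being stored as a rating.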

Instruction_Following = "Instruction_Following [No issue, Minor issue, Major Issue]: Grade [No issue, Minor issue, Major Issue] the instruction-following property of each of the responses. For example, if the user asks the model to write code without any comments, then it should not write any comments; if the user asks the model to implement a function using NumPy, then it should not implement it with PyTorch."

Rationalefor_Instruction_Following = """Rationale for Instruction Following
Provide a brief explanation of why the chosen score was assigned, highlighting specific aspects of the response that influenced the rating for instruction following, both positive and negative. Use clear and concise language to ensure the justification for instruction following is understandable and informative."""

Truthfulness = """Truthfulness * answer should be the appropriate one from this list [No issue, Minor issue, Major Issue].
Grade the truthfulness of each of the responses. Here, truthfulness means the correctness of the response. The generated code should solve the problem raised by the user and give a correct implementation, i.e., one with no logical or syntax errors."""

Rationale_for_Truthfulness = """Reasoning for Truthfulness Rating *
Provide a brief explanation of why the chosen score was assigned, highlighting specific aspects of the response that influenced the rating for truthfulness, both positive and negative. Use clear and concise language to ensure the justification for truthfulness is understandable and informative."""

Conciseness = """Rate the verbosity of each of the responses. The answer should be the appropriate one from this list [Just right, Too short, Too Long]"""

Rationale_for_Conciseness = """Reasoning for Conciseness Rating *
Provide a brief explanation of why the chosen score was assigned, highlighting specific aspects of the response that influenced the rating for conciseness, both positive and negative. Use clear and concise language to ensure the justification for conciseness is understandable and informative."""

Content_Safety = """Content Safety * answer should be the appropriate one from this list [No issue, Minor issue, Major Issue]
Assess whether the response is free from harmful content and ensures harmlessness and safety."""

Rationale_Code_Content_Safety_Rating = """Reasoning for Content Safety Rating *
Provide a brief explanation of why the chosen score was assigned, highlighting specific aspects of the response that influenced the rating for content safety, both positive and negative. Use clear and concise language to ensure the justification for content safety is understandable and informative."""

Overall_Satisfaction = """Overall Satisfaction * answer should be the appropriate one from this list [Amazing, Pretty Good, Okay, Pretty Bad, Horrible]
Grade the overall satisfaction based on the model evaluation categories above (i.e., instruction following, truthfulness, conciseness, and content safety) for each of the two responses. Give the grading using the values listed above."""

Reasoning_for_Overall_Satisfaction_Rating = """Reasoning for Overall Satisfaction Rating *
Provide a brief explanation of why the chosen score was assigned, highlighting specific aspects of the response that influenced the rating for overall satisfaction, both positive and negative. Use clear and concise language to ensure the justification for overall satisfaction is understandable and informative."""

import os
from langchain.prompts import ChatPromptTemplate, HumanMessagePromptTemplate
from langchain.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field

class Avocat(BaseModel):
    """Structured evaluation result: one rating and one rationale per category."""
    Instruction_Following2: str = Field(description=Instruction_Following)
    Rationalefor_Instruction_Following2: str = Field(description=Rationalefor_Instruction_Following)
    Truthfulness2: str = Field(description=Truthfulness)
    Rationale_for_Truthfulness2: str = Field(description=Rationale_for_Truthfulness)
    Conciseness2: str = Field(description=Conciseness)
    Rationale_for_Conciseness2: str = Field(description=Rationale_for_Conciseness)
    Content_Safety2: str = Field(description=Content_Safety)
    Rationale_Code_Content_Safety_Rating2: str = Field(description=Rationale_Code_Content_Safety_Rating)
    Overall_Satisfaction2: str = Field(description=Overall_Satisfaction)
    Reasoning_for_Overall_Satisfaction_Rating2: str = Field(description=Reasoning_for_Overall_Satisfaction_Rating)

# The parser turns the Avocat schema into format instructions and parses the reply.
parser = PydanticOutputParser(pydantic_object=Avocat)

prompt = ChatPromptTemplate(
    messages=[
        HumanMessagePromptTemplate.from_template(
            """Please evaluate the model responses and write short, direct details that address each requested field. Every point should read as if it was evaluated and written by a human.
\n{format_instructions}\n{data}\n"""
        )
    ],
    input_variables=["data"],
    partial_variables={
        "format_instructions": parser.get_format_instructions(),
    },
)
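
The template has two placeholders: `format_instructions`, bound once when the prompt is built, and `data`, supplied on every call. A plain-Python analogy of that split, which assumes nothing about LangChain's internals (the `TEMPLATE` text and the `fill` and `make_prompt` names are illustrative only):

```python
from functools import partial

# Illustrative template text, not the one used in utils2.py.
TEMPLATE = "Evaluate the responses.\n{format_instructions}\n{data}\n"


def fill(template: str, **values: str) -> str:
    """Substitute every placeholder in the template."""
    return template.format(**values)


# Bind the fixed part once; only `data` remains to be supplied per call.
make_prompt = partial(fill, TEMPLATE, format_instructions="Return a JSON object.")

text = make_prompt(data="Response A: ...")
```

Binding the format instructions up front keeps the per-call interface down to the single `data` argument, which is what `input_variables=["data"]` expresses.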

from langchain_groq import ChatGroq

# ChatGroq reads the API key from the GROQ_API_KEY environment variable.
chat_model = ChatGroq(
    model="llama-3.1-70b-versatile",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
)


def AvocatAI(input_data):
    """Evaluate `input_data` with the chat model and return the parsed ratings as a dict."""
    _input = prompt.format_prompt(data=input_data)
    # Call the model through .invoke(); calling the model object directly is deprecated.
    output = chat_model.invoke(_input.to_messages())
    parsed = parser.parse(output.content)
    return parsed.dict()
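
`parser.parse` raises an exception when the model's reply is not valid JSON for the schema, so callers may want to inspect a reply before relying on it. A stdlib-only sketch of such a check, assuming replies sometimes arrive wrapped in a Markdown code fence; `extract_json` and `missing_fields` are hypothetical helpers, and the field names are taken from the `Avocat` class:

```python
import json
import re

# Field names as declared on the Avocat model.
EXPECTED_FIELDS = [
    "Instruction_Following2",
    "Rationalefor_Instruction_Following2",
    "Truthfulness2",
    "Rationale_for_Truthfulness2",
    "Conciseness2",
    "Rationale_for_Conciseness2",
    "Content_Safety2",
    "Rationale_Code_Content_Safety_Rating2",
    "Overall_Satisfaction2",
    "Reasoning_for_Overall_Satisfaction_Rating2",
]

FENCE = "`" * 3  # Markdown code fence marker


def extract_json(reply: str) -> dict:
    """Parse a JSON object from a reply, unwrapping one code fence if present."""
    pattern = re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE)
    match = re.search(pattern, reply, re.DOTALL)
    payload = match.group(1) if match else reply
    return json.loads(payload)


def missing_fields(result: dict) -> list:
    """List the expected field names that are absent from the result."""
    return [name for name in EXPECTED_FIELDS if name not in result]
```

A caller could log `missing_fields(...)` for a failed reply to see which ratings the model skipped, then decide whether to retry.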