Commit c351234 (parent: ee30b9f)
Update README.md

README.md CHANGED
@@ -109,16 +109,16 @@ This fine-tuning approach allowed us to significantly reduce memory usage and co
 ## Evaluation results
 
 To evaluate the performance of our model, we translated [70 questions](https://github.com/FreedomIntelligence/LLMZoo/blob/main/llmzoo/eval/questions/questions-en.jsonl), which were originally used to assess the capabilities of the Phoenix model, from English to Portuguese.
-We then conducted their [automatic evaluation](https://github.com/FreedomIntelligence/LLMZoo) using GTP-3.5 as
+We then conducted their [automatic evaluation](https://github.com/FreedomIntelligence/LLMZoo) using GPT-3.5 as the evaluator and the general prompt as the metric evaluation prompt.
 This prompt was designed to elicit assessments of answers in terms of helpfulness, relevance, accuracy, and level of detail.
 [Additional prompts](https://github.com/FreedomIntelligence/LLMZoo/blob/main/llmzoo/eval/prompts/order/prompt_all.json) are provided for assessing overall performance on different perspectives.
 
-Follows the results against GPT-3.5 and
+Below are the results against GPT-3.5 and Falcon, one of the highest-performing open-source models at the moment:
 
 |                        | **Lose** | **Tie** | **Win** |
 |------------------------|----------|---------|---------|
 | QUOKKA vs. **GPT-3.5** | 63.8%    | 10.1%   | 26.1%   |
-| QUOKKA vs. **
+| QUOKKA vs. **Falcon**  | 17.4%    | 1.4%    | 81.2%   |
 
 ## Environmental impact
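The new table reports only aggregate Lose/Tie/Win percentages. As an illustrative sketch of how per-question GPT-3.5 judgments could be rolled up into such a table (the `winrate_table` helper and the sample verdicts below are hypothetical, not part of the LLMZoo tooling):

```python
from collections import Counter

def winrate_table(verdicts):
    """Aggregate per-question judge verdicts ('win' | 'tie' | 'lose',
    taken from QUOKKA's perspective) into percentage shares."""
    counts = Counter(verdicts)
    total = len(verdicts)
    # One percentage per outcome, rounded to one decimal as in the README table.
    return {k: round(100 * counts[k] / total, 1) for k in ("lose", "tie", "win")}

# Hypothetical verdicts over the 70 translated questions:
sample = ["lose"] * 45 + ["tie"] * 7 + ["win"] * 18
print(winrate_table(sample))
```

Rounding each share independently means the three percentages may not sum to exactly 100, which likely explains small discrepancies in reported tables of this kind.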