Update README.md
README.md
CHANGED
@@ -153,13 +153,11 @@ The SFT model was trained using [Llama-Factory](https://github.com/hiyouga/LLaMA
 | capybara-claude-15k-ita | [Link](https://huggingface.co/datasets/efederici/capybara-claude-15k-ita) | 0 | 0 | 15,000 |
 | Wildchat | [Link](https://huggingface.co/datasets/allenai/WildChat-1M) | 0 | 0 | 5,000 |
 | GPT4_INST | [Link](https://huggingface.co/datasets/DeepMount00/GPT-4o-ITA-INSTRUCT) | 0 | 0 | 10,000 |
-| Safety Italian |
-| Italian Conversations |
+| Safety Italian | - | 0 | 0 | 21,426 |
+| Italian Conversations | - | 0 | 0 | 4,843 |

 For more details, please check [our tech report](https://nlp.uniroma1.it/minerva/blog#from-a-base-model-to-an-instruct-model).

-* Please contact navigli@diag.uniroma1.it for the "Safety Italian" and "Italian Conversations" datasets.
-
 ### Online DPO Training

 This model card is for our DPO model. Direct Preference Optimization (DPO) is a method that refines models based on user feedback, similar to Reinforcement Learning from Human Feedback (RLHF), but without the complexity of reinforcement learning. Online DPO further improves this by allowing real-time adaptation during training, continuously refining the model with new feedback. For training this model, we used the [Hugging Face TRL](https://github.com/huggingface/trl) library and Online DPO, with the [Skywork/Skywork-Reward-Llama-3.1-8B-v0.2](https://huggingface.co/Skywork/Skywork-Reward-Llama-3.1-8B-v0.2) model as the judge to evaluate and guide optimization. For this stage we used just the prompts from HuggingFaceH4/ultrafeedback_binarized (English), efederici/evol-dpo-ita (Italian) and Babelscape/ALERT translated to Italian, with additional manually curated data for safety.
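For readers of this diff who want a concrete picture of the "Online DPO" paragraph, here is a minimal, dependency-free sketch of the idea: rank freshly sampled completions with a judge, then apply the standard DPO loss to the resulting (chosen, rejected) pair. This is not the authors' training code and not TRL's API. The names `dpo_loss` and `build_online_pair`, the `beta` default, and the length-based judge in the usage note are all hypothetical; in the real setup the judge would be the Skywork reward model and the log-probabilities would come from the policy and a frozen reference model.

```python
import math
from typing import Callable, Sequence, Tuple


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; the source does not state one
) -> float:
    """Standard DPO loss for one preference pair: -log(sigmoid(margin))."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def build_online_pair(
    prompt: str,
    completions: Sequence[str],
    judge: Callable[[str, str], float],
) -> Tuple[str, str]:
    """Rank sampled completions with the judge; return (chosen, rejected)."""
    ranked = sorted(completions, key=lambda c: judge(prompt, c), reverse=True)
    return ranked[0], ranked[-1]
```

A possible usage, with a toy judge that simply prefers longer answers: `build_online_pair("Ciao?", ["Sì.", "Sì, certo!"], lambda p, c: float(len(c)))` returns the longer string as chosen and the shorter one as rejected. The "online" part is that the completions are sampled from the current policy at each training step rather than read from a fixed preference dataset, which is why only prompts are needed for this stage.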