edobobo commited on
Commit
b7bffce
1 Parent(s): 3a0c660

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +2 -4
README.md CHANGED
@@ -153,13 +153,11 @@ The SFT model was trained using [Llama-Factory](https://github.com/hiyouga/LLaMA
153
  | capybara-claude-15k-ita | [Link](https://huggingface.co/datasets/efederici/capybara-claude-15k-ita) | 0 | 0 | 15,000 |
154
  | Wildchat | [Link](https://huggingface.co/datasets/allenai/WildChat-1M) | 0 | 0 | 5,000 |
155
  | GPT4_INST | [Link](https://huggingface.co/datasets/DeepMount00/GPT-4o-ITA-INSTRUCT) | 0 | 0 | 10,000 |
156
- | Safety Italian | _Upon Request_* | 0 | 0 | 21,426 |
157
- | Italian Conversations | _Upon Request_* | 0 | 0 | 4,843 |
158
 
159
  For more details, please check [our tech report](https://nlp.uniroma1.it/minerva/blog#from-a-base-model-to-an-instruct-model).
160
 
161
- * Please contact navigli@diag.uniroma1.it for the "Safety Italian" and "Italian Conversations" datasets.
162
-
163
  ### Online DPO Training
164
 
165
  This model card is for our DPO model. Direct Preference Optimization (DPO) is a method that refines models based on user feedback, similar to Reinforcement Learning from Human Feedback (RLHF), but without the complexity of reinforcement learning. Online DPO further improves this by allowing real-time adaptation during training, continuously refining the model with new feedback. For training this model, we used the [Hugging Face TRL](https://github.com/huggingface/trl) library and Online DPO, with the [Skywork/Skywork-Reward-Llama-3.1-8B-v0.2](https://huggingface.co/Skywork/Skywork-Reward-Llama-3.1-8B-v0.2) model as the judge to evaluate and guide optimization. For this stage we used just the prompts from HuggingFaceH4/ultrafeedback_binarized (English), efederici/evol-dpo-ita (Italian) and Babelscape/ALERT translated to Italian, with additional manually curated data for safety.
 
153
  | capybara-claude-15k-ita | [Link](https://huggingface.co/datasets/efederici/capybara-claude-15k-ita) | 0 | 0 | 15,000 |
154
  | Wildchat | [Link](https://huggingface.co/datasets/allenai/WildChat-1M) | 0 | 0 | 5,000 |
155
  | GPT4_INST | [Link](https://huggingface.co/datasets/DeepMount00/GPT-4o-ITA-INSTRUCT) | 0 | 0 | 10,000 |
156
+ | Safety Italian | - | 0 | 0 | 21,426 |
157
+ | Italian Conversations | - | 0 | 0 | 4,843 |
158
 
159
  For more details, please check [our tech report](https://nlp.uniroma1.it/minerva/blog#from-a-base-model-to-an-instruct-model).
160
 
 
 
161
  ### Online DPO Training
162
 
163
  This model card is for our DPO model. Direct Preference Optimization (DPO) is a method that refines models based on user feedback, similar to Reinforcement Learning from Human Feedback (RLHF), but without the complexity of reinforcement learning. Online DPO further improves this by allowing real-time adaptation during training, continuously refining the model with new feedback. For training this model, we used the [Hugging Face TRL](https://github.com/huggingface/trl) library and Online DPO, with the [Skywork/Skywork-Reward-Llama-3.1-8B-v0.2](https://huggingface.co/Skywork/Skywork-Reward-Llama-3.1-8B-v0.2) model as the judge to evaluate and guide optimization. For this stage we used just the prompts from HuggingFaceH4/ultrafeedback_binarized (English), efederici/evol-dpo-ita (Italian) and Babelscape/ALERT translated to Italian, with additional manually curated data for safety.