Training details

#1
by anakin87 - opened

Hello and thanks for the good model!

If I understood correctly, after DPO on an English dataset, the model was further trained on Italian data.
Can you share more details about this step? I can't find the related script on GitHub...

SWAP Research Group@UNIBA org

Hi, you can find the DPO script here: https://github.com/marcopoli/LLaMAntino-3-ANITA/blob/main/model_adaptation/dpo_llama3.py
and the SFT script here: https://github.com/marcopoli/LLaMAntino-3-ANITA/blob/main/model_adaptation/finetune_llama3.py
Just change "model_name" and "dataset" accordingly. For the adaptation to the Italian language, just use the SFT script on a small portion of an Italian dataset (e.g., gsarti/clean_mc4_it), formatted as plain text without a chat template, i.e. <|begin_of_text|> {text} <|eot_id|><|end_of_text|>
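
For illustration, something like this (a rough sketch, not the exact training script; the "tiny" config and the streaming slice are just examples):

```python
# Rough sketch: format a small portion of gsarti/clean_mc4_it as plain text
# with the Llama-3 special tokens described above. Not the exact script.
from datasets import load_dataset

BOS = "<|begin_of_text|>"
EOS = "<|eot_id|><|end_of_text|>"

# Stream the dataset so only a small portion is downloaded;
# the "tiny" config is just an example size.
dataset = load_dataset("gsarti/clean_mc4_it", "tiny",
                       split="train", streaming=True)

def to_plain_text(example):
    # No chat template: plain BOS + raw text + EOS.
    return {"text": f"{BOS} {example['text']} {EOS}"}

dataset = dataset.map(to_plain_text)
# `dataset` can then be passed to the SFT script in place of
# the chat-formatted data.
```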

Thanks.
Very informative!

Hi @m-polignano-uniba ,

Was the fine-tuning on the Italian language performed with QLoRA/LoRA, or without?

SWAP Research Group@UNIBA org

Yes, we used QLoRA through Unsloth:

  • load_in_4bit=True, r=64, lora_alpha=16, target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"] (see the sketch below)
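
For reference, a minimal sketch of such a setup via Unsloth with the hyperparameters above (the model name, max_seq_length, and lora_dropout are assumptions, not values confirmed here):

```python
# Minimal sketch of a QLoRA setup via Unsloth using the hyperparameters above.
# The model name, max_seq_length, and lora_dropout are placeholders.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder base model
    max_seq_length=8192,   # assumption: pick what fits your GPU
    load_in_4bit=True,     # 4-bit quantized base weights (QLoRA)
)

model = FastLanguageModel.get_peft_model(
    model,
    r=64,                  # LoRA rank, as listed above
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0,        # assumption: not stated above
    bias="none",
    use_gradient_checkpointing=True,
)
```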

Can you share a rough idea of the peak GPU VRAM usage during the language adaptation phase?

In the paper, I read that you used an NVIDIA H100 64GB GPU, but further details would be much appreciated.
