I am happy to release two new language models for the Italian language!
Gemma 2 9B Neogenesis ITA (anakin87/gemma-2-9b-neogenesis-ita): building on the impressive work by VAGO Solutions, I applied Direct Preference Optimization (DPO) with a mix of Italian and English data. Using Spectrum, I trained only 20% of the model's layers.
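To make the preference-tuning step concrete, here is a minimal DPO sketch using TRL's DPOTrainer. The base checkpoint, dataset name, and hyperparameters below are placeholders, not the exact configuration used for Neogenesis ITA, and TRL versions differ slightly in the trainer's keyword arguments.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Placeholder starting point: swap in the actual checkpoint you want to improve
base_model = "google/gemma-2-9b-it"
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Hypothetical preference dataset with "prompt", "chosen", "rejected" columns,
# mixing Italian and English examples
dataset = load_dataset("my-org/ita-en-preferences", split="train")

args = DPOConfig(
    output_dir="gemma-2-9b-dpo-ita",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=5e-7,
    beta=0.1,                # strength of the preference constraint
    num_train_epochs=1,
    bf16=True,
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    processing_class=tokenizer,  # `tokenizer=` on older TRL releases
)
trainer.train()
```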
Evaluated on the Open ITA LLM leaderboard (mii-llm/open_ita_llm_leaderboard), this model achieves strong performance. To beat it on this benchmark, you'd need a 27B model.
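If you want to try it, here is a quick generation sketch with the transformers text-generation pipeline (recent transformers versions accept chat-style messages directly; the prompt is just an example):

```python
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="anakin87/gemma-2-9b-neogenesis-ita",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Example Italian prompt: "Explain in two sentences what the Renaissance is."
messages = [{"role": "user", "content": "Spiegami in due frasi cos'è il Rinascimento."}]
outputs = pipe(messages, max_new_tokens=200)
print(outputs[0]["generated_text"][-1]["content"])
```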
Gemma 2 2B Neogenesis ITA (anakin87/gemma-2-2b-neogenesis-ita): this smaller variant is fine-tuned from the original Gemma 2 2B it (instruction-tuned) model by Google. Through a combination of Supervised Fine-Tuning and Direct Preference Optimization, I trained 25% of the layers using Spectrum.
Compared to the original model, it shows improved Italian proficiency, which is good for its small size.
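For the supervised fine-tuning stage, the training code follows the same general shape as the DPO sketch above, built on TRL's SFTTrainer. Again, the dataset name and every hyperparameter below are illustrative assumptions, not the exact setup used for this model.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

base_model = "google/gemma-2-2b-it"
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Hypothetical Italian instruction dataset in chat ("messages") format
dataset = load_dataset("my-org/italian-instructions", split="train")

args = SFTConfig(
    output_dir="gemma-2-2b-sft-ita",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    num_train_epochs=1,
    bf16=True,
)

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    processing_class=tokenizer,  # `tokenizer=` on older TRL releases
)
trainer.train()
```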
Hey, it has been a while... I was busy participating in the Gemma Competition!
Here's the idea: Gemma open models have a large vocabulary size (256K), so improving them for a specific language or cultural context should be pretty affordable - no need for continued pre-training.
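You can check the vocabulary size yourself with a couple of lines (assuming you have accepted the Gemma license on the Hub):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")
print(len(tokenizer))  # ~256K entries, so Italian subwords are already well covered
```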
In this notebook, I show how I improve the performance of Gemma 2 2B on Italian via Post-Training. I believe this method is adaptable to other languages and model sizes.
Key steps:
- Choose reference metrics
- Data curation for Instruction Fine Tuning: identify existing datasets + generate synthetic data
- Efficient Instruction Fine Tuning with Spectrum
- Data curation for Preference Tuning: identify existing datasets + generate synthetic data
- Efficient Direct Preference Optimization with Spectrum
- Evaluation
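To give an idea of what "training only a fraction of the layers with Spectrum" means in practice: Spectrum ranks the model's weight matrices by signal-to-noise ratio and unfreezes only the top ones, while everything else stays frozen during SFT/DPO. The sketch below mimics that freezing step with hand-picked, purely illustrative parameter-name patterns; the real patterns come from Spectrum's scan of the model.

```python
import re
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b-it", torch_dtype=torch.bfloat16)

# Illustrative patterns only: Spectrum would emit a list of the
# highest signal-to-noise modules for this model.
unfrozen_patterns = [
    r"model\.layers\.(1[2-9]|2[0-5])\.mlp\.down_proj",
    r"model\.layers\.(1[2-9]|2[0-5])\.self_attn\.o_proj",
]

# Freeze everything, then re-enable gradients only for the selected modules
for name, param in model.named_parameters():
    param.requires_grad = any(re.search(p, name) for p in unfrozen_patterns)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable / total:.1%} of the model")
```

The resulting model object can then be handed to the SFT or DPO trainer as usual, so only the unfrozen layers are updated.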
I'm also planning a Gemma Giveaway (on LinkedIn - https://www.linkedin.com/in/stefano-fiorucci) in the next few days, sharing techniques, datasets, and models I used for my project... so stay tuned!