The LLaMA 4 release highlights the importance of political and social bias. According to Meta's own evaluation described in the release blog post:
- Refusals on contentious prompts dropped from 7% (#LLaMA 3.3) to under 2%
- Unequal response refusals are now under 1%
- Political lean bias is said to be halved compared to #LLaMA 3.3 and comparable to Grok
In the chart below, we compare multiple leading models based on ratings of their responses to a range of prompts designed to expose ideological leanings.
Despite Meta's stated neutrality goals, LLaMA 4 ranks at the very top for total ratings aligned with a clear ideological bias. The models were tested on their ability to respond even-handedly to politically sensitive prompts, and LLaMA 4 scored even higher than models known for strong alignment policies, like GPT-4o.
LLMs may be refusing less, but they still show bias through content framing. This suggests that refusal rates alone are not a sufficient measure of ideological bias. Relying solely on internal evaluations from AI labs also raises concerns about transparency and objectivity.
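To make that concrete, here is a minimal sketch of how per-prompt ratings could be aggregated into refusal and lean scores; the names and the rating scale are illustrative, not the actual evaluation code:

```python
# Hypothetical sketch of aggregating per-prompt ratings; the actual
# evaluation pipeline is not published here, so all names are illustrative.
from dataclasses import dataclass
from statistics import mean

@dataclass
class Rating:
    prompt_id: str
    refused: bool
    lean: float  # -1.0 (left) .. +1.0 (right), 0.0 = even-handed

def summarize(ratings: list[Rating]) -> dict:
    answered = [r for r in ratings if not r.refused]
    return {
        "refusal_rate": 1 - len(answered) / len(ratings),
        "mean_lean": mean(r.lean for r in answered) if answered else 0.0,
        # Mean absolute lean separates "consistently biased" from
        # "balanced across prompts", which refusal rates alone cannot.
        "mean_abs_lean": mean(abs(r.lean) for r in answered) if answered else 0.0,
    }
```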
At this very moment, as shown in the screenshot, mii-llm/maestrale-chat-v0.4-beta is ranked 8th, right between ChatGPT-4.5 and ChatGPT-4o.
It's likely that, for several months, the best Italian-speaking LLM has been an open-source 7B model created by open-source contributors, and hardly anyone knew it.
At @mii-llm, with @efederici, @mferraretto, @FinancialSupport, and @DeepMount00, we just released #Propaganda, a framework designed to evaluate and train LLMs on political opinions and bias. We aim to analyze both open-source and closed-source LLMs to understand the political positions and biases expressed in their outputs. We also provide a set of recipes for enforcing political positions in models, by creating ad hoc curated datasets and applying fine-tuning techniques. By releasing our work in the open, we hope to foster contributions: https://github.com/mii-llm/propaganda
This framework offers opportunities for expansion in various directions and could become the standard reference for evaluating LLMs on political topics, particularly those that influence public opinion.
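As a rough illustration of the evaluation side, here is a hypothetical LLM-as-judge loop for scoring political lean. The prompts, model ids, and judging scale are placeholders, not the framework's actual API (see the repo for that):

```python
# Hypothetical sketch: generate answers to politically sensitive prompts
# and have a judge model rate each answer's lean. Illustrative only.
from openai import OpenAI

client = OpenAI()
PROMPTS = ["Should the state subsidize public broadcasting?"]  # illustrative

def judge_lean(answer: str) -> float:
    """Ask a judge model for a lean score from -1 (left) to +1 (right)."""
    rsp = client.chat.completions.create(
        model="gpt-4o",  # judge model, an arbitrary choice here
        messages=[{
            "role": "user",
            "content": "Rate the political lean of this answer from -1 (left) "
                       f"to +1 (right). Reply with a number only.\n\n{answer}",
        }],
    )
    return float(rsp.choices[0].message.content.strip())

scores = []
for prompt in PROMPTS:
    rsp = client.chat.completions.create(
        model="model-under-test",  # placeholder id for the evaluated model
        messages=[{"role": "user", "content": prompt}],
    )
    scores.append(judge_lean(rsp.choices[0].message.content))
print("mean lean:", sum(scores) / len(scores))
```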
@FinancialSupport and I just released a new version of the Italian LLMs leaderboard https://huggingface.co/spaces/FinancialSupport/open_ita_llm_leaderboard using the super useful demo-leaderboard template from @clefourrier. We've evaluated over 50 models (base, merged, fine-tuned, etc.) from:
- Major companies like Meta, Mistral, Google ...
- University groups such as sapienzanlp or swap-uniba
- Italian companies like MoxoffSpA, FairMind, or raicrits
- Various communities and individuals

All models were tested on Italian benchmarks #mmlu #arc-c #hellaswag, which we contributed to EleutherAI's open-source lm-evaluation-harness library. Plus, you can now submit your model for automatic evaluation, thanks to computation sponsored by seeweb. Curious about the top Italian models? Check out the leaderboard and submit your model!
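For reference, a minimal sketch of running the Italian tasks through the lm-evaluation-harness Python API. I am assuming the registered task ids are arc_it, hellaswag_it, and m_mmlu_it; check `lm-eval --tasks list` for the exact names in your installed version:

```python
# Minimal sketch: evaluate a Hugging Face model on the Italian benchmarks.
# Task ids are assumptions; verify them against the installed harness version.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=mii-llm/maestrale-chat-v0.4-beta",  # any HF model id
    tasks=["arc_it", "hellaswag_it", "m_mmlu_it"],
    num_fewshot=5,
)
print(results["results"])
```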
I created a Capybara-inspired Italian dataset by translating the initial instructions and running them through a pipeline to generate conversations. I used Claude Sonnet for translation and instruction generation, and Opus for generating the answers.
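A hypothetical sketch of that pipeline with the Anthropic API; the model ids and prompts are illustrative, not the exact ones used:

```python
# Hypothetical sketch of the translate-then-answer pipeline described above.
from anthropic import Anthropic

client = Anthropic()

def ask(model: str, prompt: str) -> str:
    rsp = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return rsp.content[0].text

def build_turn(english_instruction: str) -> dict:
    # Sonnet translates the seed instruction into Italian ...
    instruction_it = ask("claude-3-sonnet-20240229",
                         f"Translate into Italian:\n\n{english_instruction}")
    # ... and Opus generates the Italian answer.
    answer_it = ask("claude-3-opus-20240229", instruction_it)
    return {"instruction": instruction_it, "answer": answer_it}
```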
I hope this dataset proves useful for people working on 🇮🇹 language models.
@mik3ml just released ReDiX/wikipediaQA-ita, an interesting synthetic dataset generated from Wikipedia using a version of Mistral-7B fine-tuned for the Italian language 🇮🇹.
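Loading it follows the standard datasets API (the split name is an assumption):

```python
from datasets import load_dataset

# Inspect one synthetic question-answer pair from the dataset.
ds = load_dataset("ReDiX/wikipediaQA-ita", split="train")
print(ds[0])
```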
While evaluating fine-tuned 7B Italian open-source LLMs, I have collected many data points and created a very simple exploratory analysis (a small sketch of it follows the list below). My hypotheses, based on the data, are:
- mmlu is hard to improve when fine-tuning a base model on a different language
- fine-tuning, even on a single GPU, can improve the base model by 5% to 10% on common tasks, and by much more on specific cases, given the right training time and data
- fine-tuning can specialize well, but at the cost of losing some foundational knowledge
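The sketch below shows the kind of aggregation behind these hypotheses; the CSV file and its columns are illustrative placeholders for my collected data points:

```python
import pandas as pd

# Placeholder data: one row per (fine-tuned model, task), with the score
# delta over the corresponding base model already computed.
df = pd.read_csv("ita_7b_results.csv")  # columns: model, task, delta_vs_base

# Distribution of improvement over the base model, per benchmark task.
print(df.groupby("task")["delta_vs_base"].describe())
```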
Based on the work of @mrinaldi and @ruggsea, we just released the biggest ready-for-training conversational dataset based on Usenet data in the Italian language 🇮🇹🇮🇹🇮🇹. It contains about 9 million conversations written by real humans.
The leaderboard is based on lm-evaluation-harness and, at the moment, covers mainly 7-billion-parameter models. In the next weeks we will add more models. If you have suggestions or need explanations, join our community Discord: https://discord.gg/a26cRkBCNH
The dataset contributes to the mii-community project, aimed at advancing the creation of Italian open-source Language Models (LLMs). 🇮🇹 🤗 At about 10-20 billion tokens, it is probably the best conversational open-source dataset in the Italian language. 🇮🇹🇮🇹🇮🇹
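A sketch of loading the Usenet dataset with the datasets library; the repo id below is a guess, so check the mii-community organization on the Hub for the exact name:

```python
from datasets import load_dataset

# Streaming avoids downloading the full 10-20B-token dataset up front.
ds = load_dataset("mii-community/UsenetArchiveIT-conversations",
                  split="train", streaming=True)
print(next(iter(ds)))  # peek at one conversation
```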
Introducing Universal NER for Italian Language, a Named Entity Recognition (NER) model evolved from the GLiNER architecture and meticulously tailored for Italian. Built on a bidirectional transformer encoder, it is engineered to recognize any entity type within the rich nuances of the language, and it stands out as an ideal solution for resource-limited environments or as an efficient alternative to cumbersome Large Language Models (LLMs). Runs fast also on CPU!
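A minimal usage sketch with the gliner library; the model id is an assumption, so check the Hub for the exact repository name:

```python
from gliner import GLiNER

# Assumed model id; GLiNER matches spans against arbitrary label strings.
model = GLiNER.from_pretrained("DeepMount00/universal_ner_ita")

text = "Leonardo da Vinci nacque a Vinci nel 1452."
labels = ["persona", "luogo", "data"]  # any entity types, here in Italian
for ent in model.predict_entities(text, labels):
    print(ent["text"], "->", ent["label"])
```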