clefourrier (Clémentine Fourrier)

posted an update 23 days ago

Post

2043

Gemma3 family is out! Reading the tech report, and this section was really interesting to me from a methods/scientific fairness pov.

Instead of doing over-hyped comparisons, they clearly state that **results are reported in a setup which is advantageous to their models**.
(Which everybody does, but people usually don't say)

For a tech report, it makes a lot of sense to report model performance when used optimally!
On leaderboards on the other hand, comparison will be apples to apples, but in a potentially unoptimal way for a given model family (like some user interact sub-optimally with models)

Also contains a cool section (6) on training data memorization rate too! Important to see if your model will output the training data it has seen as such: always an issue for privacy/copyright/... but also very much for evaluation!

Because if your model knows its evals by heart, you're not testing for generalization.

reacted to m-ric's post with 🔥 about 2 months ago

Post

3143

Now you can launch a code agent directly from your terminal!
✨ 𝚜𝚖𝚘𝚕𝚊𝚐𝚎𝚗𝚝 "𝚈𝚘𝚞𝚛 𝚝𝚊𝚜𝚔" directly launches a CodeAgent
▶️ This also works with web agents (replace 𝚜𝚖𝚘𝚕𝚊𝚐𝚎𝚗𝚝 with 𝚠𝚎𝚋𝚊𝚐𝚎𝚗𝚝) thanks to @merve !

💾 Another treat from smolagents release 1.7.0:
Now agents have a memory mechanism, enabling many possibilities like replaying the last run with 𝚊𝚐𝚎𝚗𝚝.𝚛𝚎𝚙𝚕𝚊𝚢(), thank you @clefourrier !

Check the release notes here 👉 https://github.com/huggingface/smolagents/releases/tag/v1.7.0

reacted to BrigitteTousi's post with ❤️ 3 months ago

Post

1327

Community fine-tuned models are more carbon efficient than the models they are derived from! 🥳🌿

@alozowski @clefourrier @SaylorTwift @albertvillanova evaluated CO₂ emissions associated with model inference for over 3000 models on the Open LLM Leaderboard. Interesting trends and new insights emerged...👀

Blog Post: https://huggingface.co/blog/leaderboard-emissions-analysis

Leaderboard: open-llm-leaderboard/open_llm_leaderboard

reacted to fdaudens's post with ❤️ 4 months ago

Post

1806

Keeping up with open-source AI in 2024 = overwhelming.

Here's help: We're launching our Year in Review on what actually matters, starting today!

Fresh content dropping daily until year end. Come along for the ride - first piece out now with @clem 's predictions for 2025.

Think of it as your end-of-year AI chocolate calendar.

Kudos to @BrigitteTousi @clefourrier @Wauplin @thomwolf for making it happen. We teamed up with aiworld.eu for awesome visualizations to make this digestible—it's a charm to work with their team.

Check it out: huggingface/open-source-ai-year-in-review-2024

reacted to thomwolf's post with 🧠🔥 4 months ago

Post

1784

Interesting long read from @evanmiller-anthropic on having a better founded statistical approach to Language Model Evaluations:
https://www.anthropic.com/research/statistical-approach-to-model-evals

Worth a read if you're into LLM evaluations!

Cc @clefourrier

1 reply

·

reacted to malhajar's post with 🔥 4 months ago

Post

4884

🇫🇷 Lancement officiel de l'OpenLLM French Leaderboard : initiative open-source pour référencer l’évaluation des LLMs francophones

Après beaucoup d’efforts et de sueurs avec Alexandre Lavallee, nous sommes ravis d’annoncer que le OpenLLMFrenchLeaderboard est en ligne sur Hugging Face (space url: le-leadboard/OpenLLMFrenchLeaderboard) la toute première plateforme dédiée à l’évaluation des grands modèles de langage (LLM) en français. 🇫🇷✨

Ce projet de longue haleine est avant tout une œuvre de passion mais surtout une nécessité absolue. Il devient urgent et vital d'oeuvrer à plus de transparence dans ce domaine stratégique des LLM dits multilingues. La première pièce à l'édifice est donc la mise en place d'une évaluation systématique et systémique des modèles actuels et futurs.

Votre modèle IA français est-il prêt à se démarquer ? Soumettez le dans notre espace, et voyez comment vous vous comparez par rapport aux autres modèles.

❓ Comment ça marche :
Soumettez votre LLM français pour évaluation, et nous le testerons sur des benchmarks de référence spécifiquement adaptés pour la langue française — notre suite de benchmarks comprend :

- BBH-fr : Raisonnement complexe
- IFEval-fr : Suivi d'instructions
- GPQA-fr : Connaissances avancées
- MUSR-fr : Raisonnement narratif
- MATH_LVL5-fr : Capacités mathématiques
- MMMLU-fr : Compréhension multitâche

Le processus est encore manuel, mais nous travaillons sur son automatisation, avec le soutien de la communauté Hugging Face.

@clem , on se prépare pour une mise à niveau de l’espace ? 😏👀

Ce n'est pas qu'une question de chiffres—il s'agit de créer une IA qui reflète vraiment notre langue, notre culture et nos valeurs. OpenLLMFrenchLeaderboard est notre contribution personnelle pour façonner l'avenir des LLM en France.

1 reply

·

reacted to fdaudens's post with ❤️ 9 months ago

Post

1892

Look at that 👀

Actual benchmarks have become too easy for recent models, much like grading high school students on middle school problems makes little sense. So the team worked on a new version of the Open LLM Leaderboard with new benchmarks.

Stellar work from @clefourrier @SaylorTwift and the team!

👉 Read the blog post: open-llm-leaderboard/blog
👉 Explore the leaderboard: open-llm-leaderboard/open_llm_leaderboard

1 reply

·

reacted to alvdansen's post with 👍 10 months ago

Post

6897

I had a backlog of LoRA model weights for SDXL that I decided to prioritize this weekend and publish. I know many are using SD3 right now, however if you have the time to try them, I hope you enjoy them.

I intend to start writing more fully on the thought process behind my approach to curating and training style and subject finetuning, beginning this next week.

Thank you for reading this post! You can find the models on my page and I'll drop a few previews here.

4 replies

·

reacted to fffiloni's post with ❤️ 10 months ago

Post

19541

🇫🇷
Quel impact de l’IA sur les filières du cinéma, de l’audiovisuel et du jeu vidéo?
Etude prospective à destination des professionnels
— CNC & BearingPoint | 09/04/2024

Si l’Intelligence Artificielle (IA) est utilisée de longue date dans les secteurs du cinéma, de l’audiovisuel et du jeu vidéo, les nouvelles applications de l’IA générative bousculent notre vision de ce dont est capable une machine et possèdent un potentiel de transformation inédit. Elles impressionnent par la qualité de leurs productions et suscitent par conséquent de nombreux débats, entre attentes et appréhensions.

Le CNC a donc décider de lancer un nouvel Observatoire de l’IA Afin de mieux comprendre les usages de l’IA et ses impacts réels sur la filière de l’image. Dans le cadre de cet Observatoire, le CNC a souhaité dresser un premier état des lieux à travers la cartographie des usages actuels ou potentiels de l’IA à chaque étape du processus de création et de diffusion d’une œuvre, en identifiant les opportunités et risques associés, notamment en termes de métiers et d’emploi. Cette étude CNC / Bearing Point en a présenté les principaux enseignements le 6 mars, lors de la journée CNC « Créer, produire, diffuser à l’heure de l’intelligence artificielle ».

Le CNC publie la version augmentée de la cartographie des usages de l’IA dans les filières du cinéma, de l’audiovisuel et du jeu vidéo.

Lien vers la cartographie complète: https://www.cnc.fr/documents/36995/2097582/Cartographie+des+usages+IA_rapport+complet.pdf/96532829-747e-b85e-c74b-af313072cab7?t=1712309387891

4 replies

·

reacted to alielfilali01's post with 🔥 11 months ago

Post

1478

Just passed the 25 models milestone on the OALL/Open-Arabic-LLM-Leaderboard 🥳

And now meta-llama/Meta-Llama-3-70B-Instruct is the new hero of the leaderboard beating CohereForAI/c4ai-command-r-v01 by 5.43 points 🔥

Almost another 80 models are still PENDING ! So this might change very fast in the upcoming days

reacted to giux78's post with ❤️ 11 months ago

Post

1504

@FinancialSupport and I just released a new version of the Italian LLMs leaderboard https://huggingface.co/spaces/FinancialSupport/open_ita_llm_leaderboard
using the super useful

demo-leaderboard template from @clefourrier .
We’ve evaluated over 50 models (base, merged, fine-tuned, etc.) from:
- Major companies like Meta, Mistral, Google ...
- University groups such as

sapienzanlp or

swap-uniba
- Italian Companies like MoxoffSpA ,

FairMind or

raicrits
- Various communities and individuals
All models were tested on #Italian benchmarks #mmlu #arc-c #hellaswag, which we contributed to the opensource lm-evaluation-harness library from

EleutherAI .
Plus, you can now submit your model for automatic evaluation, thanks to to

seeweb sponsored computation.
Curious about the top Italian models? Check out the leaderboard and submit your model!

https://huggingface.co/spaces/FinancialSupport/open_ita_llm_leaderboard

reacted to singhsidhukuldeep's post with ❤️ 11 months ago

Post

1335

🎉 A new LLM is launched! 🚀
After checking if it's open-source or not, 🤔
you rush to see the benchmarks... 🏃‍♂️💨

Which benchmark does everyone check first? 🔍

MMLU (Massive Multitask Language Understanding)? 📚

Benchmarks like MMLU reaching saturation... most of the time the performance does not translate to real-world use cases! 🌐❗

Meet MMLU-Pro, released by TIGER-Lab on @huggingface ! 🐯🌍

🧪 12,217 questions across biology, business, chemistry, computer science, economics, engineering, health, history, law, mathematics, philosophy, physics, and psychology carefully validated by humans 🧑‍🔬

🔟 Goes to 10 options per question instead of 4, this increase in options will make the evaluation more realistic and reduce random guessing 🎯

📊 56% of questions come from MMLU, 34% from STEM websites, and the rest from TheoremQA and SciBench 📈

🤖 LLMs with weak chain-of-thought reasoning tend to perform lower, indicating it is more challenging and representative of real-world expectations 🧠💡

Any guess who tops it and who bombs it? 🤔📉📈

GPT-4o drops by 17% (from 0.887 to 0.7149) 📉
Llama-3-70B drops by 27% (from 0.820 to 0.5541) 📉

🔗 TIGER-Lab/MMLU-Pro

2 replies

·

reacted to georgewritescode's post with ❤️🔥 11 months ago

Post

2348

Excited to bring our benchmarking leaderboard of >100 LLM API endpoints to HF!

Speed and price are often just as important as quality when building applications with LLMs. We bring together all the data you need to consider all three when you need to pick a model and API provider.

Coverage:
‣ Quality (Index of evals, MMLU, Chatbot Arena, HumanEval, MT-Bench)
‣ Throughput (tokens/s: median, P5, P25, P75, P95)
‣ Latency (TTFT: median, P5, P25, P75, P95)
‣ Context window
‣ OpenAI library compatibility

Link to Space: ArtificialAnalysis/LLM-Performance-Leaderboard

Blog post: https://huggingface.co/blog/leaderboard-artificial-analysis

reacted to osanseviero's post with 🔥 12 months ago

Post

12628

Diaries of Open Source. Part 15 🤗

🕵️‍♀️Idefics 2 is out, a multimodal open-source model with very nice capabilities
Models, demo, and datasets: HuggingFaceM4/idefics2-661d1971b7c50831dd3ce0fe
Blog: https://hf.co/blog/idefics2

💾Snowflake released snowflake-arctic-embed, a family of powerful small embedding models
Model: Snowflake/snowflake-arctic-embed-m
Blog: https://www.snowflake.com/blog/introducing-snowflake-arctic-embed-snowflakes-state-of-the-art-text-embedding-family-of-models/

✨Pile-T5, EleutherAI's T5 model trained on 2T tokens
Blog: https://blog.eleuther.ai/pile-t5/
Models: EleutherAI/pile-t5-65a76a0d0022dd270b385a66
GitHub: https://github.com/EleutherAI/improved-t5

🤖CodeQwen1.5-7B base and chat models. Models trained on 3T tokens strong benchmark results for code generation, editing and SQL
Blog post: https://qwenlm.github.io/blog/codeqwen1.5/
Demo: Qwen/CodeQwen1.5-7b-Chat-demo
Models: Qwen/CodeQwen1.5-7B and Qwen/CodeQwen1.5-7B-Chat

Misc
🦉 DocOwl1.5: Unified Stucture Learning for OCR-free Document Understanding mPLUG/DocOwl
👀Cerule - a tiny Vision LM model Tensoic/Cerule-v0.1
ChemLLM - a LLM for chemistry and molecule science ⚗️https://hf.co/AI4Chem/ChemLLM-7B-Chat-1.5-DPO
Distil Whisper Large
📝New pdf/OCR datasets with 19 samples pixparse/pdf-document-ocr-datasets-660701430b0346f97c4bc628
🔥Gretel AI high quality text-to-sql synthetic dataset gretelai/synthetic_text_to_sql

5 replies

·

reacted to isidentical's post with ❤️ 12 months ago

Post

2117

Happy to announce https://imgsys.org -- a sister project to Chatbot Arena by lmsys -- for comparing different text guided image generation models models. Try it natively on HuggingFace: https://huggingface.co/spaces/fal-ai/imgsys

1 reply

·

reacted to abidlabs's post with ❤️ 12 months ago

Post

3648

Open Models vs. Closed APIs for Software Engineers
-----------------------------------------------------------------------

If you're an ML researcher / scientist, you probably don't need much convincing to use open models instead of closed APIs -- open models give you reproducibility and let you deeply investigate the model's behavior.

But what if you are a software engineer building products on top of LLMs? I'd argue that open models are a much better option even if you are using them as APIs. For at least 3 reasons:

1) The most obvious reason is reliability of your product. Relying on a closed API means that your product has a single point-of-failure. On the other hand, there are at least 7 different API providers that offer Llama3 70B already. As well as libraries that abstract on top of these API providers so that you can make a single request that goes to different API providers depending on availability / latency.

2) Another benefit is eventual consistency going local. If your product takes off, it will be more economical and lower latency to have a dedicated inference endpoint running on your VPC than to call external APIs. If you've started with an open-source model, you can always deploy the same model locally. You don't need to modify prompts or change any surrounding logic to get consistent behavior. Minimize your technical debt from the beginning.

3) Finally, open models give you much more flexibility. Even if you keep using APIs, you might want to tradeoff latency vs. cost, or use APIs that support batches of inputs, etc. Because different API providers have different infrastructure, you can use the API provider that makes the most sense for your product -- or you can even use multiple API providers for different users (free vs. paid) or different parts of your product (priority features vs. nice-to-haves)

reacted to alozowski's post with 🔥❤️ 12 months ago

Post

2935

Do I need to make it a tradition to post here every Friday? Well, here we are again!

This week, I'm happy to share that we have two official Mistral models on the Leaderboard! 🔥 You can check them out: mistralai/Mixtral-8x22B-Instruct-v0.1 and mistralai/Mixtral-8x22B-v0.1

The most exciting thing here? mistralai/Mixtral-8x22B-Instruct-v0.1 model got a first place among pretrained models with an impressive average score of 79.15!🥇 Not far behind is the Mixtral-8x22B-v0.1, achieving second place with an average score of 74.47! Well done, Mistral AI! 👏

Check out my screenshot here or explore it yourself at the https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard

The second news is that CohereForAI/c4ai-command-r-plus model in 4-bit quantization got a great average score of 70.08. Cool stuff, Cohere! 😎 (and I also have the screenshot for this, don't miss it)

The last news, which might seem small but is still significant, the Leaderboard frontpage now supports Python 3.12.1. This means we're on our way to speed up the Leaderboard's performance! 🚀

If you have any comments or suggestions, feel free to also tag me on X (Twitter), I'll try to help – [at]ailozovskaya

Have a nice weekend! ✨

2 replies

·

Clémentine Fourrier

AI & ML interests

Organizations

clefourrier's activity