Aymeric Roucher

m-ric

AI & ML interests

MLE at Hugging Face 🤗 LLMs, Agents, RAG, Multimodal.

m-ric's activity

posted an update 10 days ago
💰❌ Research for the very GPU poor - Scaling laws replication

🎆 Good news: you can do cutting-edge research with a calculator and Microsoft Paint 2006!

The Chinchilla experiments (by Google DeepMind) ran hundreds of pre-trainings with models >1B parameters (I do not want to imagine how much that cost) to find the optimal ratio of model size vs. training tokens. Why is this question so important?
Well, you only ever have access to a fixed compute budget, counted in FLOPs (floating-point operations). So if your model is bigger, you will have less compute left to train it on many tokens; and if you want to train on more tokens, your model will have to be smaller. When a training run costs millions, you absolutely need to get this trade-off right.

The new paper "Chinchilla Scaling: A replication attempt" by Epoch AI sets out on the ambitious goal of reproducing this.

But since the authors do not have infinite money, they decided to work directly from DeepMind's own published experiments! They took the figure from the last experiment (cf. slide below), measured the point positions, picked up the color codes, and ended up reconstructing the underlying data.

💥 They then just fit the scaling laws proposed by the Chinchilla authors, but arrived at wildly different results! They find that, as a rough rule of thumb, you should use 20 training tokens for each parameter in your model, instead of the 70 obtained in the original paper. They also point out inconsistencies in the paper, and unrealistically narrow confidence intervals.

➡️ This only contradicts the results from the last (out of 3) experiments in the Chinchilla paper. And the model trained at the end of the Chinchilla paper still seems properly scaled.

✅ But it does show that a tiny bit more theoretical work can go a long way, especially given the huge financial costs that such an error can have!
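To make the compute trade-off concrete, here is a quick back-of-the-envelope sketch. It assumes the common approximation that training compute is C ≈ 6·N·D FLOPs (N parameters, D training tokens), which is standard in the scaling-laws literature but not stated in this post; the 20 and 70 tokens-per-parameter ratios are the ones discussed above.

```python
# Back-of-the-envelope sketch (assumes the common C ~ 6*N*D FLOPs approximation,
# which is not stated in the post): for a fixed compute budget, derive the model
# size N and token count D implied by a given tokens-per-parameter ratio.
def compute_optimal_split(flops_budget: float, tokens_per_param: float):
    # C = 6 * N * D and D = ratio * N  =>  N = sqrt(C / (6 * ratio))
    n_params = (flops_budget / (6 * tokens_per_param)) ** 0.5
    return n_params, tokens_per_param * n_params

budget = 1e24  # an arbitrary example budget, in FLOPs
for ratio in (20, 70):  # replication's rule of thumb vs. the original fit
    n, d = compute_optimal_split(budget, ratio)
    print(f"ratio {ratio}: ~{n / 1e9:.0f}B params trained on ~{d / 1e12:.1f}T tokens")
```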
posted an update 24 days ago
Paper Review: Rho-1 - Do not use all tokens equally in your training! ⚖️⛔️

A new paper topping Daily papers questions a hidden assumption in LLM training:

🤔 Should we really use all tokens equally in our LLM's training?

Some tokens are more relevant than others, and some are mostly noise (just look up the history of SolidGoldMagikarp).

So this paper introduces Selective Language Modeling, which is actually really simple:
➡️ A specific metric measures the relevance of each token. Then, during training, only the top k% of tokens by this relevance metric count in the loss calculation.
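Concretely, the loss ends up looking something like the sketch below. This is a minimal PyTorch-style illustration, not the paper's code: it assumes the relevance score is the excess loss over a frozen reference model, and the 60% keep ratio is a placeholder.

```python
# Minimal sketch of Selective Language Modeling for one batch (illustrative only).
# Relevance score assumed here = excess loss of the training model over a frozen
# reference model; only the top k% most relevant tokens contribute to the loss.
import torch
import torch.nn.functional as F

def per_token_loss(logits, labels):
    # Cross-entropy per token, flattened to shape (batch * seq_len,)
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1), reduction="none"
    )

def selective_lm_loss(train_logits, ref_logits, labels, keep_ratio=0.6):
    train_loss = per_token_loss(train_logits, labels)
    with torch.no_grad():
        relevance = train_loss - per_token_loss(ref_logits, labels)

    k = max(1, int(keep_ratio * relevance.numel()))
    kept = torch.topk(relevance, k).indices   # top k% tokens by relevance
    return train_loss[kept].mean()            # every other token is ignored
```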

The authors test this method by training models on the difficult MATH dataset (competition mathematics problems only).

➡️ Their technique seems like a new must-do in LLM training: training is much faster and reaches impressive performance!

Results:
◆ ⏱️ Training is 5x to 10x faster to reach equivalent performance compared to standard language modeling.
◆ 💪 Their 1B model achieves performance close to GPT-4 Chain-of-Thought on MATH!
◆ 🚀 Their 7B model matches the performance of the state-of-the-art DeepSeek model of the same size, while being trained on only 3% of the tokens.

Additional insights 💡
◆ Datasets used for pre-training, even after pre-filtering, still contain a large proportion of noisy tokens 😖
◆ The authors show that when you reduce the loss on noisy tokens, you actually reduce accuracy (Figure 7). So Selective Language Modeling seems fundamental! ✅

Find great reads in @akhaliq 's Daily Papers 👉 https://huggingface.co/papers
Paper added to my collection 👉 m-ric/spinning-up-in-llms-659e698f9dd5a71bd3f579a7
posted an update 30 days ago
New Space: AI Travel Planner 🗺️🏕️ Plan your next vacation in a few minutes!

I wanted to find out whether a powerful LLM like Mixtral-8x7B has geographical reasoning capabilities.
So I built a small Space that prompts the LLM to provide a JSON list of places based on user input.
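That kind of prompting can be sketched as follows; this is just an illustration with huggingface_hub's InferenceClient, with a made-up prompt and function, not the Space's actual code.

```python
# Illustrative sketch (not the Space's code): ask Mixtral for a JSON list of places.
import json
from huggingface_hub import InferenceClient

client = InferenceClient("mistralai/Mixtral-8x7B-Instruct-v0.1")

def suggest_places(trip_description: str) -> list:
    prompt = (
        "[INST] Based on this trip description, return a JSON list of places to "
        'visit, each as {"name": ..., "country": ..., "reason": ...}. '
        f"Trip: {trip_description} [/INST]"
    )
    answer = client.text_generation(prompt, max_new_tokens=800)
    return json.loads(answer)  # in practice: validate and retry on parse errors

print(suggest_places("One week of hiking and small medieval towns in southern Europe"))
```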

And the result was impressive! 🤯

⇒ It seems like Mixtral has a grasp of geographical concepts like North - South, or spatial alignment. 🧭 Not just describing these concepts, but really applying them in practice, for instance to successfully answer "give me 4 European cities that are aligned on the map". This is a nice example of an emergent capability, since nothing in the LLM's training data should prepare it for this specific task.

Anyway, I added API calls and a nice visualization on top of the LLM, streaming output, caching for the answers and locations... and ta-da! ✨ I got the AI Travel Planner.

You can describe your trip to it, and it will come up with nice and convenient locations!

Try it here 👉 m-ric/ai-travel-planner

Thank you @freddyaboulton for the gradio_folium component, and @clem , @pngwn , @abidlabs for your ideas and support!
posted an update about 1 month ago
[New Paper] All tokens should not require the same effort to compute! ⇒ Mixture of Depths 🫧🐠

Google Researchers were unhappy with the way current decoding generally works: all tokens go through the same layers, thus requiring exactly the same effort to compute.

Whereas in reality, completing the answer to a difficult math problem, for instance, should be more computationally intense than completing the text of the Declaration of Independence: not all tokens are created equal!

➡️ They had this genius idea: 💡 having a token go through a block should be optional. The token can go through the block (thus undergoing expensive self-attention computation) or avoid it through a skip connection.
The routing decision is taken at the block level: each block selects, from the total sequence, the top-k tokens that will go through it, and the other tokens skip it. This allows choosing the exact capacity of a block, i.e. the proportion of tokens that go through it, which directly influences the computational intensity of the forward pass.
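A simplified sketch of that routing for a single block is shown below (illustrative only, not DeepMind's implementation; among other things, the real method also weights the block's outputs by the router scores so that routing stays trainable).

```python
# Simplified Mixture-of-Depths routing for one block (illustrative only).
# A linear router scores every token; only the top-k tokens (the block's "capacity")
# go through the expensive block, the rest pass through unchanged via the residual.
import torch
import torch.nn as nn

class MoDBlock(nn.Module):
    def __init__(self, block: nn.Module, d_model: int, capacity: float = 0.125):
        super().__init__()
        self.block = block                       # a regular transformer block
        self.router = nn.Linear(d_model, 1)      # one relevance score per token
        self.capacity = capacity

    def forward(self, x):                        # x: (batch, seq_len, d_model)
        scores = self.router(x).squeeze(-1)      # (batch, seq_len)
        k = max(1, int(self.capacity * x.size(1)))
        top_idx = scores.topk(k, dim=1).indices  # tokens that get the full compute

        out = x.clone()                          # skipped tokens keep their value
        batch_idx = torch.arange(x.size(0)).unsqueeze(-1)
        out[batch_idx, top_idx] = self.block(x[batch_idx, top_idx])
        return out
```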

This yields Mixture-of-Depths (MoD), with spectacular results.

✨ Results:
🎚️ Capacity can be tuned all the way down to 12.5% for every second block: thus 87.5% of tokens just skip the block!
🚀 For the same training time and performance, >60% inference speedup!
🤝 Can be combined with Mixture-of-Experts for further improvements.

📄 Paper here 👉 Mixture-of-Depths: Dynamically allocating compute in transformer-based language models (2404.02258)
📚 I added it to my paper collection 👉 m-ric/spinning-up-in-llms-659e698f9dd5a71bd3f579a7
posted an update about 1 month ago
2024, the year of agent workflows 🔧🦾🤖

I've just watched Andrew Ng's talk at Sequoia last week.
If you're interested in Agents, you should really watch it!

Why use agent workflows?
The current LLM task-solving workflow is not very intuitive:
We ask it to "write an essay all in one shot, without ever using backspace."

Why not allow the LLM a process more similar to what we would do?
- "Write an essay outline"
- "Do you need web research?"
- "Write a first draft"
- "Consider improvements"
…

This is called an agentic workflow. Existing ones already bring a huge performance boost. On HumanEval: GPT-4 zero-shot gets a 67% score; an agentic workflow with either tool use or reflection goes over 90%, and the combination of the two scores even higher!

Agentic reasoning design patterns
On the following two points, the tech is robust:

⚙️ Reflexion: for instance, add a critic step after the writing step (see the sketch after this list)
🛠️ Tool use: extends the capabilities of the LLM by allowing it to call tools, like a search engine or a calculator
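Here is what the Reflexion pattern can look like in its most minimal form: draft, critique, revise. This is an illustrative sketch only; the model, prompts, and helper function are placeholders, not code from the talk.

```python
# Minimal reflexion loop (illustrative): draft an answer, have the same model
# critique it, then revise the draft using the critique.
from huggingface_hub import InferenceClient

client = InferenceClient("mistralai/Mixtral-8x7B-Instruct-v0.1")  # any instruct model

def ask(prompt: str) -> str:
    return client.text_generation(f"[INST] {prompt} [/INST]", max_new_tokens=600)

task = "Write a short essay on why scaling laws matter."
draft = ask(task)
critique = ask(f"Here is a draft for the task '{task}':\n{draft}\n\nList its main weaknesses.")
final = ask(f"Task: {task}\nDraft:\n{draft}\nCritique:\n{critique}\n\nWrite an improved version.")
print(final)
```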

The next two will be needed to go further, but the tech for them is still emerging and not yet reliable:
🗺️ Planning forward to decompose a task into subtasks. This allows great behaviours like an AI agent re-routing after a failure
🐝 Multi-agent collaboration: program a flock of agents, each with its own tasks.
Improving the two above points will unlock huge performance boosts!

Andrew Ng says research agents are already part of his workflow!

Closing thoughts
Andrew speculates that with agentic workflows, generating many tokens fast from a small LLM might give better results than a slower throughput from a more powerful LLM like GPT-5.

🎬 Watch the talk here 👉 https://www.youtube.com/watch?v=sal78ACtGTc
📚 I've added his recommended reads to m-ric/agents-65ba776fbd9e29f771c07d4e
posted an update about 1 month ago
The return of the RNNs ⚔ New Mamba-based architecture "Jamba"

Since the release of BERT by Google in 2018, the Transformer architecture has taken over machine learning thanks to its attention mechanism, which gives models the ability to focus on the important parts of the input. But attention computation is quadratic in the input length.

💫 The Mamba paper, published in December 2023, announced the return of the RNNs: it has no attention, but integrates a selection mechanism, which should be able to reproduce the "focus" ability of attention, in an architecture whose compute requirements grow only linearly in input length!
🤔 Would this work? So far, we had yet to see a large Mamba model matching the performance of attention-based Transformers.

💥 But now it's done! A (Mamba + Transformers) hybrid just beat Transformers!

The AI21 Labs team just released Jamba.
They insert a few Transformer layers to inject some attention in a big pile of Mamba layers, thus getting the best of both worlds.

TL;DR:
🏗️ New MoE architecture: 4 Jamba blocks, each of these being 7 Mamba layers for 1 Transformer layer (see the sketch below).
🏋️ 52B parameters, 12B active at inference: this reduction is enabled by Mixture of Experts, similar to Mixtral (47B parameters, 13B active).
🏎️ Speed: 3x throughput. Jamba is much faster than similar-sized Transformer models on long contexts.
📏 Context length: 140K tokens fit on a single 80GB A100!
💪 Performance: state-of-the-art for this size. The small injection of attention seems sufficient, since Jamba beats the open-source reference Mixtral-8x7B on many benchmarks!
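To picture the interleaving from the TL;DR, here is a toy layer schedule derived from the numbers above (illustrative only, not AI21's implementation).

```python
# Toy layer schedule implied by the description above: 4 Jamba blocks,
# each made of 7 Mamba layers for 1 attention layer (not AI21's code).
N_JAMBA_BLOCKS = 4
MAMBA_PER_ATTENTION = 7

schedule = (["mamba"] * MAMBA_PER_ATTENTION + ["attention"]) * N_JAMBA_BLOCKS
print(len(schedule), "layers:", schedule[:8], "...")  # 32 layers, mostly Mamba
```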

Try it here 👉 ai21labs/Jamba-v0.1
posted an update about 1 month ago
How does beam search decoding work? ➡️ New visualization tool! 👀

In decoder-type LLMs like GPT-4 or Mistral-Large, the output is generated one token (= word part) at a time. That's why they're nicknamed "stochastic parrots": the "thinking" process only happens one step at a time, so it can seem really myopic.

So how is the next token selected?

📊 Given an input sentence like "What is the 7th Fibonacci number? The 7th Fibonacci number", the decoder LLM generates, for each token in its vocabulary, a score that represents this token's probability of coming next.
For instance: "is" gets a score of 0.56, and "can" gets a score of 0.35.

🤑 Greedy decoding is the naive option where you simply take the most probable next token at each step. But this creates paths that maximize very short-term rewards, and may overlook better paths in the long term (like that time you played FIFA all evening and arrived unprepared for your school exam the next day).
In our example, the highest-scoring next token might be "is", but this strongly biases the LLM towards giving a hasty response. On the contrary, starting with "can" could have been completed with "be obtained from computing previous Fibonacci numbers first", which steers the LLM towards correct reasoning!

🗺️ Beam search improves on greedy decoding by generating, at each step, several paths - called beams - instead of one. This lets the generation explore a much larger space, and thus find better completions. In our example, both the "is" and the "can" completions could be tested. ✅
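If you want to compare the two strategies yourself, the transformers library exposes both through generate; "gpt2" below is just a small stand-in model for the example.

```python
# Greedy decoding vs. beam search with the transformers library.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("What is the 7th Fibonacci number? The 7th Fibonacci number", return_tensors="pt")

greedy = model.generate(**inputs, max_new_tokens=20, do_sample=False)              # 1 path
beams = model.generate(**inputs, max_new_tokens=20, num_beams=4, do_sample=False)  # 4 beams

print(tokenizer.decode(greedy[0], skip_special_tokens=True))
print(tokenizer.decode(beams[0], skip_special_tokens=True))
```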

👉 I've created a tool to let you visualize it. Thank you @joaogante for your great help!
Try it here: m-ric/beam_search_visualizer
posted an update about 2 months ago
Using LLM-as-a-judge 🧑‍⚖️ for an automated and versatile evaluation

Evaluating LLM outputs is often hard, since many tasks require open-ended answers for which no deterministic metrics work: for instance, when asking a model to summarize a text, there could be hundreds of correct ways to do it. The most versatile way to grade these outputs is then human evaluation, but it is very time-consuming, thus costly.

🤔 Then why not ask another LLM to do the evaluation, by providing it with relevant rating criteria? 👉 This is the idea behind LLM-as-a-judge.
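In its most bare-bones form, the pattern looks like the sketch below; the prompt, the 1-4 scale, and the judge model are placeholders, not the cookbook's exact recipe.

```python
# Bare-bones LLM-as-a-judge sketch (illustrative; prompt and scale are made up).
from huggingface_hub import InferenceClient

judge = InferenceClient("mistralai/Mixtral-8x7B-Instruct-v0.1")

JUDGE_PROMPT = """[INST] You are grading an answer to a question.
Question: {question}
Answer: {answer}
Rate the answer from 1 (very bad) to 4 (excellent) for correctness and completeness.
Give a short rationale, then end with "Score: <number>". [/INST]"""

def grade(question: str, answer: str) -> str:
    return judge.text_generation(
        JUDGE_PROMPT.format(question=question, answer=answer), max_new_tokens=300
    )

print(grade("What causes tides?", "Mostly the Moon's gravity, with a smaller effect from the Sun."))
```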

⚙️ To implement an LLM judge correctly, you need a few tricks.
✅ So I've just published a new notebook showing how to implement it properly in our Hugging Face Cookbook! (You can run it instantly in Google Colab.)
➡️ LLM-as-a-judge cookbook: https://huggingface.co/learn/cookbook/llm_judge

The Cookbook is a great collection of notebooks demonstrating recipes (hence the name "cookbook") for common LLM use cases. I recommend you go take a look!
➡️ All cookbooks: https://huggingface.co/learn/cookbook/index

Thank you @MariaK for your support!
posted an update about 2 months ago
Interesting paper: GaLore: train 7B models on consumer-grade GPUs 💪
It's now possible to fully pre-train a 7B model on a consumer-grade GPU with 24GB of RAM, without any performance loss!

The memory usage of training models has always been an acute issue. For instance, a full pre-training of a 7B model used to eat ~50GB of RAM!

The common workarounds to reduce the memory load are:
- split the model across multiple GPUs ("sharding")
- quantize the model: encode the weights on fewer bits

Another technique is to project the weight matrix to lower-rank spaces (since sometimes the weights do not really vary along all dimensions): this can save a lot of space!
This low-rank projection can be done on adapters to preserve the original weights (go check out LoRA), but it still generally hurts performance too much for pre-training.

➡️ Enter the authors of GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection. They gather (and prove) interesting insights:
⛔ The weight matrix does not reliably converge to lower ranks during training.
✅ But the gradient matrix does!

Based on these insights, they build GaLore, which projects the gradient to lower ranks.
🗺️ Great idea: to leave the optimization free to explore more of the space, they periodically re-build the low-rank projection throughout training (there is a nice illustration in the paper).
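Schematically, for a single weight matrix, the idea looks like the sketch below. This is my own simplification (plain SGD instead of a full Adam state, arbitrary rank and refresh interval), not the authors' implementation.

```python
# Schematic GaLore-style step for one weight matrix (illustrative only).
# The gradient is projected onto a low-rank subspace, the update happens there,
# and the projector is re-built every `update_proj_gap` steps.
import torch

def galore_like_step(weight, grad, state, rank=128, update_proj_gap=200, lr=1e-3):
    if state["step"] % update_proj_gap == 0:
        U, _, _ = torch.linalg.svd(grad, full_matrices=False)
        state["P"] = U[:, :rank]              # (m, r) projector from leading directions

    P = state["P"]
    low_rank_grad = P.T @ grad                # (r, n): optimizer state lives in this small space
    low_rank_update = -lr * low_rank_grad     # plain SGD here; GaLore uses Adam-style moments

    with torch.no_grad():
        weight.add_(P @ low_rank_update)      # project the update back to full size (m, n)
    state["step"] += 1

# usage sketch: state = {"step": 0}; call galore_like_step(W, W.grad, state) each step
```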

🤝 This method can even be combined with previous ones such as 8-bit Adam (quantizing the optimizer states to 8 bits).

➡️ Results:
📉 Of course, a huge reduction in memory footprint, allowing training on a consumer-grade GPU (cf. figure).
💪 No reduction in performance: this scales well up to 7B parameters (and has been independently confirmed since) ⇒ this is essential, as it confirms that the method is viable!

Read the full paper here: GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection (2403.03507)
posted an update 3 months ago
📚🔎 If you're building RAG applications, you should check this out:

⚙️ I've built a new Space to let you visualize the chunks you get with different text splitting methods!
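For reference, here is the kind of splitting such a tool lets you compare, illustrated with LangChain's text splitters (my choice of library for the example, not necessarily what the Space uses).

```python
# Illustrative comparison of two common splitting strategies (not the Space's code).
from langchain_text_splitters import CharacterTextSplitter, RecursiveCharacterTextSplitter

text = open("my_document.txt").read()  # hypothetical input document

splitters = {
    "character": CharacterTextSplitter(separator="\n", chunk_size=500, chunk_overlap=50),
    "recursive": RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50),
}

for name, splitter in splitters.items():
    chunks = splitter.split_text(text)
    print(f"{name}: {len(chunks)} chunks, first one starts with: {chunks[0][:80]!r}")
```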

➡️ Visualize your chunks here:
m-ric/chunk_visualizer