m-ric posted an update 24 days ago
๐๐š๐ฉ๐ž๐ซ ๐‘๐ž๐ฏ๐ข๐ž๐ฐ: ๐‘๐ก๐จ-๐Ÿ - ๐ƒ๐จ ๐ง๐จ๐ญ ๐ฎ๐ฌ๐ž ๐š๐ฅ๐ฅ ๐ญ๐จ๐ค๐ž๐ง๐ฌ ๐ž๐ช๐ฎ๐š๐ฅ๐ฅ๐ฒ ๐ข๐ง ๐ฒ๐จ๐ฎ๐ซ ๐ญ๐ซ๐š๐ข๐ง๐ข๐ง๐ ! โš–๏ธโ›”๏ธ

A new paper at the top of Daily Papers questions a hidden assumption in LLM training:

๐Ÿค” ๐™Ž๐™๐™ค๐™ช๐™ก๐™™ ๐™ฌ๐™š ๐™ง๐™š๐™–๐™ก๐™ก๐™ฎ ๐™ช๐™จ๐™š ๐™–๐™ก๐™ก ๐™ฉ๐™ค๐™ ๐™š๐™ฃ๐™จ ๐™š๐™ฆ๐™ช๐™–๐™ก๐™ก๐™ฎ ๐™ž๐™ฃ ๐™ค๐™ช๐™ง ๐™‡๐™‡๐™ˆ'๐™จ ๐™ฉ๐™ง๐™–๐™ž๐™ฃ๐™ž๐™ฃ๐™œ ?

Some tokens are more relevant than others, and some are mostly noise (just look up the history of SolidGoldMagikarp).

So this paper introduces Selective Language Modeling, which is actually really simple:
➡️ A specific metric measures the relevance of each token; then, during training, only the top k% of tokens by this relevance metric count in the loss calculation (see the sketch below).

The authors test this method by training models on math data and evaluating on the difficult MATH benchmark (competition mathematics problems).

โžก๏ธ Their technique seems like a new must-do in LLM training: Training is much faster and reaches an impressive performance!

๐‘๐ž๐ฌ๐ฎ๐ฅ๐ญ๐ฌ:
โ—† โฑ๏ธ Training is x5 to x10 faster to reach equivalent performance compared to standard language modeling.
โ—† ๐Ÿ’ช Their 1B model achieves close to GPT4 Chain-of-Thought performance on MATH!
โ—† ๐Ÿš€ Their 7B model match performance of the state-of-the-art DeepSeek for the same size, while trained on only 3% of tokens

๐€๐๐๐ข๐ญ๐ข๐จ๐ง๐š๐ฅ ๐ข๐ง๐ฌ๐ข๐ ๐ก๐ญ๐ฌ ๐Ÿ’ก
◆ Datasets used for pre-training, even after pre-filtering, still contain a large proportion of noisy tokens 😖
◆ The authors show that reducing the loss on noisy tokens actually reduces accuracy (Figure 7). So Selective Language Modeling seems fundamental! ✅

Find great reads in @akhaliq's Daily Papers 👉 https://huggingface.co/papers
Paper added to my collection 👉 m-ric/spinning-up-in-llms-659e698f9dd5a71bd3f579a7