Aymeric Roucher

m-ric

AI & ML interests

MLE at Hugging Face ๐Ÿค— LLMs, Agents, RAG, Multimodal.

Articles

Organizations

Posts 20

view post
Post
839
๐—”๐—ด๐—ฒ๐—ป๐˜๐—ถ๐—ฐ ๐——๐—ฎ๐˜๐—ฎ ๐—ฎ๐—ป๐—ฎ๐—น๐˜†๐˜€๐˜: ๐—ฑ๐—ฟ๐—ผ๐—ฝ ๐˜†๐—ผ๐˜‚๐—ฟ ๐—ฑ๐—ฎ๐˜๐—ฎ ๐—ณ๐—ถ๐—น๐—ฒ, ๐—น๐—ฒ๐˜ ๐˜๐—ต๐—ฒ ๐—Ÿ๐—Ÿ๐—  ๐—ฑ๐—ผ ๐˜๐—ต๐—ฒ ๐—ฎ๐—ป๐—ฎ๐—น๐˜†๐˜€๐—ถ๐˜€ ๐Ÿ“Šโš™๏ธ

Need to make quick exploratory data analysis? โžก๏ธ Get help from an agent.

I was impressed by Llama-3.1's capacity to derive insights from data. Given a csv file, it makes quick work of exploratory data analysis and can derive interesting insights.

On the data from the Kaggle titanic challenge, that records which passengers survived the Titanic wreckage, it was able by itself to derive interesting trends like "passengers that paid higher fares were more likely to survive" or "survival rate was much higher for women than men".

The cookbook even lets the agent built its own submission to the challenge, and it ranks under 3,000 out of 17,000 submissions: ๐Ÿ‘ not bad at all!

Try it for yourself in this Space demo ๐Ÿ‘‰ m-ric/agent-data-analyst
view post
Post
3078
๐๐ž๐ฐ ๐๐ž๐œ๐จ๐๐ข๐ง๐  ๐ญ๐ž๐œ๐ก๐ง๐ข๐ช๐ฎ๐ž ๐ข๐ง ๐ญ๐ซ๐š๐ง๐ฌ๐Ÿ๐จ๐ซ๐ฆ๐ž๐ซ๐ฌ ๐ฌ๐ข๐ ๐ง๐ข๐Ÿ๐ข๐œ๐š๐ง๐ญ๐ฅ๐ฒ ๐ซ๐ž๐๐ฎ๐œ๐ž๐ฌ ๐ก๐š๐ฅ๐ฅ๐ฎ๐œ๐ข๐ง๐š๐ญ๐ข๐จ๐ง๐ฌ ๐Ÿ‘

DoLa decoding, which made a conference paper at ICLR '24, has just been merged in Transformers by @joaogante and Yung-Sung Chuang.
This new decoding method is simple yet extremely impressive!

Reminder: Decoder LLMs (the GPT kind of LLM, the most common one) generate their outputs one token at a time: at each step, given a current text, they compute a logit for each token in their vocabulary that should represent the probability of this token coming next.

Then they either pick the highest logit token (greedy decoding) or sample one with a probability defined by the logits (sampling).

The authors of DoLa wanted to improve that simple method.

They knew this established fact that transformer LMs encode low-level info (like base syntax) in early layers and more high-level info like knowledge in the later layers.

๐Ÿ’ก This gave them their key idea: During decoding, rather than picking the token with the highest logit, ๐˜„๐—ต๐˜† ๐—ป๐—ผ๐˜ ๐—ฝ๐—ถ๐—ฐ๐—ธ ๐˜๐—ต๐—ฒ ๐˜๐—ผ๐—ธ๐—ฒ๐—ป ๐˜„๐—ถ๐˜๐—ต ๐˜๐—ต๐—ฒ ๐—บ๐—ผ๐˜€๐˜ ๐—ถ๐—บ๐—ฝ๐—ฟ๐—ฒ๐˜€๐˜€๐—ถ๐˜ƒ๐—ฒ ๐—ถ๐—ป๐—ฐ๐—ฟ๐—ฒ๐—ฎ๐˜€๐—ฒ ๐—ถ๐—ป ๐—น๐—ผ๐—ด๐—ถ๐˜ ๐—ฎ๐—ฐ๐—ฟ๐—ผ๐˜€๐˜€ ๐—น๐—ฎ๐˜†๐—ฒ๐—ฟ๐˜€?

This gives impressive results:
๐Ÿš€ ๐Ÿฑ% - ๐Ÿฎ๐Ÿฌ% ๐—ฏ๐—ฎ๐˜€๐—ฒ ๐—ฝ๐—ผ๐—ถ๐—ป๐˜๐˜€ ๐—ถ๐—ป๐—ฐ๐—ฟ๐—ฒ๐—ฎ๐˜€๐—ฒ ๐—ฎ๐—ฐ๐—ฟ๐—ผ๐˜€๐˜€ ๐˜๐—ต๐—ฒ ๐—ฏ๐—ฒ๐—ป๐—ฐ๐—ต๐—บ๐—ฎ๐—ฟ๐—ธ๐˜€
๐Ÿš€ For instance on TruthfulQA / Open-ended, across all model sizes the increase in truthfulness is 14 base points, which is ๐—ฎ๐—ฟ๐—ผ๐˜‚๐—ป๐—ฑ ๐Ÿฐ๐Ÿฌ% ๐—ถ๐—บ๐—ฝ๐—ฟ๐—ผ๐˜ƒ๐—ฒ๐—บ๐—ฒ๐—ป๐˜ ๐—ฐ๐—ผ๐—บ๐—ฝ๐—ฎ๐—ฟ๐—ฒ๐—ฑ ๐˜๐—ผ ๐˜€๐˜๐—ฎ๐—ป๐—ฑ๐—ฎ๐—ฟ๐—ฑ ๐—ฑ๐—ฒ๐—ฐ๐—ผ๐—ฑ๐—ถ๐—ป๐—ด!

๐Ÿค” Wouldn't decoding take longer because of this added contrasting step? ๐Ÿ‘‰ ๐—ง๐—ต๐—ฒ ๐—ฟ๐˜‚๐—ป๐˜๐—ถ๐—บ๐—ฒ ๐—ถ๐—ป๐—ฐ๐—ฟ๐—ฒ๐—ฎ๐˜€๐—ฒ ๐—ถ๐˜€ ๐—ป๐—ฒ๐—ด๐—น๐—ถ๐—ด๐—ถ๐—ฏ๐—น๐—ฒ, ๐Ÿญ ๐˜๐—ผ ๐Ÿด% ๐—ผ๐—ป๐—น๐˜†.

Paper added to my collection ๐Ÿ‘‰ m-ric/optimization-mechanics-661d543a5fc6ca1dc84284a0