Thomas Wolf PRO

thomwolf

https://thomwolf.io

AI & ML interests

NLP and open-source :-)

Articles

FineVideo: behind the scenes

Fine-tuning LLMs to 1.58bit: extreme quantization made easy

A failed experiment: Infini-Attention, and why we should keep trying?

Jack of All Trades, Master of Some, a Multi-Purpose Transformer Agent

Constitutional AI with Open LLMs

Open LLM Leaderboard: DROP deep dive

What's going on with the Open LLM Leaderboard?

Can foundation models label data like humans?

Organizations

Posts 7

Post

3701

Parents in the 1990s: Teach the kids to code
Parents now: Teach the kids to fix the code when it starts walking around 🤖✨

Post

4354

[New crazy blog post alert] We are releasing an extensive blog post on the science of creating high-quality web-scale datasets, detailing all the steps and learnings that went into our recent 15-trillion-token 🍷FineWeb release.

Inspired by the distill.pub interactive graphics papers, we set out to write the most extensive, enjoyable and in-depth tech report we could draft, so prepare for a 45-min read with interactive graphics and all.

And that's not all: in this article we also introduce 📚FineWeb-Edu, a filtered subset of Common Crawl with 1.3T tokens containing only web pages with very high educational content. To our knowledge, FineWeb-Edu outperforms all openly released web-scale datasets by a significant margin on knowledge- and reasoning-intensive benchmarks like MMLU, ARC, and OpenBookQA.

We also make a number of surprising observations on the "quality" of the internet itself, which may challenge some of the general assumptions on web data (not saying more, I'll let you draw your own conclusions ;)

HuggingFaceFW/blogpost-fineweb-v1

Papers 25

arxiv:2406.17557

arxiv:2402.19173

arxiv:2311.12983

arxiv:2311.05640

Spaces 10

My Argilla

Rocket Chat Demo

Chat

3d Bench Viz

Voice Chat With Mistral 7B

Hf Star History

Models 6

thomwolf/act-sort3

thomwolf/codeparrot-small

Text Generation • Updated Jul 27, 2021 • 11

thomwolf/codeparrot

Text Generation • Updated Jul 21, 2021 • 11 • 1

thomwolf/codeparrot-small-vocabulary

Updated Jul 21, 2021

thomwolf/vqgan_imagenet_f16_1024

Updated Jun 8, 2021 • 11

thomwolf/test-model

Updated Jan 21, 2021

Datasets 17

thomwolf/blue_sort

Updated May 31 • 202

thomwolf/data_test

Updated May 29 • 49

thomwolf/cameras_conditions

Viewer • Updated May 29 • 214 • 76

thomwolf/lerobot-sort2

Preview • Updated May 20 • 126 • 1

thomwolf/lerobot-sort3

Preview • Updated May 20 • 106

thomwolf/act-raw-sort3

Updated May 20 • 127

thomwolf/act-raw-sort2

Updated May 20 • 99

thomwolf/test-1.1

Updated Aug 27, 2023 • 6

thomwolf/test_queries

Updated Jul 10, 2023 • 9

thomwolf/very-good-dataset

Viewer • Updated Sep 17, 2021 • 6 • 15