Loubna Ben Allal

loubnabnl

AI & ML interests

LLMs, ML for code, Synthetic data

loubnabnl's activity

posted an update about 2 months ago
๐Ÿท FineWeb technical report is out and so is ๐Ÿ“š FineWeb-Edu, a 1.3 trillion tokens dataset that outperforms all other open web datasets, with remarkable improvements on educational benchmarksย such as MMLU, ARC, and OpenBookQA.

Technical report: HuggingFaceFW/blogpost-fineweb-v1
Dataset: HuggingFaceFW/fineweb-edu
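
For a quick first look, here's a minimal sketch that streams the dataset with the datasets library (the "text" field name is an assumption; check the dataset card for the full schema):

```python
from datasets import load_dataset

# Stream FineWeb-Edu rather than downloading 1.3T tokens up front
fw = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)

for sample in fw.take(2):
    print(sample["text"][:300])  # assumed field name; see the dataset card
```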

We used Llama 3 generations to train an educational quality classifier, filtering the 15 trillion tokens of FineWeb to select only those with high educational value (an approach also used in Llama 3 and Phi-3 training datasets). We're releasing both FineWeb-Edu and the classifier, along with a larger, less heavily filtered version containing 5.4 trillion tokens.
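
As a rough sketch of how such a classifier can be applied (the repo id HuggingFaceFW/fineweb-edu-classifier is an assumption based on the naming above, and the cutoff is illustrative, not the exact threshold from the report):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed repo id for the released educational-quality classifier
model_id = "HuggingFaceFW/fineweb-edu-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

text = "Photosynthesis converts light energy into chemical energy in plants."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    score = model(**inputs).logits.squeeze().item()  # regression-style score

# Illustrative cutoff: keep documents rated as educational (assumed 0-5 scale)
print(f"score={score:.2f}, keep={score >= 3.0}")
```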

You can find more details about the dataset and the experiments we ran in the FineWeb technical report. It's a 45-minute read, but it contains all the secret sauce for building high-quality web datasets.

Enjoy!
posted an update 4 months ago
We've just published a detailed blog post on the creation of the Cosmopedia dataset. We hope it provides insights into generating synthetic data at scale for pre-training. A minimal loading sketch follows the takeaways below.
https://huggingface.co/blog/cosmopedia

Here are some key takeaways:
🎯 Prompt curation is crucial: we want to cover many topics with few duplicates.
📚 Diverse resources help: different seed data, generation formats, and target audiences.
⚙️ A solid technical stack matters: tools like llm-swarm for scalable generation, plus fast model training and evaluation.

Have a good read!
posted an update 5 months ago
โญ Today weโ€™re releasing The Stack v2 & StarCoder2: a series of 3B, 7B & 15B code generation models trained on 3.3 to 4.5 trillion tokens of code:

- StarCoder2-15B matches or outperforms CodeLlama 34B, and approaches DeepSeek-33B on multiple benchmarks.
- StarCoder2-3B outperforms StarCoderBase-15B and similar-sized models.
- The Stack v2 is a 4x larger dataset than The Stack v1, resulting in 900B unique code tokens 🚀
As always, we released everything from models and datasets to curation code. Enjoy!
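
A minimal generation sketch with the smallest checkpoint (the repo id bigcode/starcoder2-3b is an assumption based on the collection below; needs transformers plus accelerate for device_map):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder2-3b"  # assumed repo id; see the collection below
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "def fibonacci(n: int) -> int:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```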

🔗 StarCoder2 collection: bigcode/starcoder2-65de6da6e87db3383572be1a
🔗 Paper: https://drive.google.com/file/d/17iGn3c-sYNiLyRSY-A85QOzgzGnGiVI3/view
🔗 Blog post: https://huggingface.co/blog/starcoder2
🔗 Code leaderboard: bigcode/bigcode-models-leaderboard