Dan Saattrup Nielsen
saattrupdan
AI & ML interests
NLP for low-resource languages.
Recent Activity
reacted to davanstrien's post with 🔥 about 4 hours ago
Introducing scandi-fine-web-cleaner https://huggingface.co/davanstrien/scandi-fine-web-cleaner, the first model trained on FineWeb-C community annotations!
FineWeb2 is a massive multilingual dataset for pre-training language models. Like any web-scale dataset, it contains low-quality content. How can we improve it?
Over the past months, an amazing community of 400+ annotators has been labelling content quality (using Argilla) across 23 languages through the FineWeb-C initiative.
Today, I'm happy to share the first classifier trained on this data.
What we've built:
- A lightweight classifier that efficiently removes low-quality content
- 90%+ precision demonstrated on Danish & Swedish
- Can process the 43M+ documents in Danish FineWeb2 with minimal compute
Why this matters: The approach can be reproduced for any of the 23 languages in FineWeb-C (https://huggingface.co/datasets/data-is-better-together/fineweb-c). We can improve training data quality at scale without massive compute resources by starting with community annotations and training small, efficient classifiers.
Want to build a classifier for your language? Check out the full blog post with code examples and implementation details: https://danielvanstrien.xyz/posts/2025/FineWeb-c/scandinavian-content-filtering-fineweb.html
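To illustrate the filtering idea described in the post above, here is a minimal Python sketch. It keeps only documents whose predicted low-quality score falls below a threshold and computes precision on a small labelled sample. The `score_low_quality` function is a hypothetical toy heuristic standing in for a trained classifier; it is not the scandi-fine-web-cleaner model, and the 0.5 threshold is an assumption chosen for illustration.

```python
from typing import Callable, Iterable


def score_low_quality(text: str) -> float:
    """Hypothetical stand-in for a trained classifier.

    Returns a pseudo-probability that the text is low quality, using a toy
    heuristic: the share of non-alphabetic characters (whitespace ignored).
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 1.0  # treat empty documents as low quality
    non_alpha = sum(1 for c in chars if not c.isalpha())
    return non_alpha / len(chars)


def filter_documents(
    docs: Iterable[str],
    scorer: Callable[[str], float],
    threshold: float = 0.5,  # assumed cut-off, tune on annotated data
) -> list[str]:
    """Keep documents whose low-quality score is below the threshold."""
    return [doc for doc in docs if scorer(doc) < threshold]


def precision(predicted: list[bool], actual: list[bool]) -> float:
    """Precision of the 'low quality' label: TP / (TP + FP)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    return tp / (tp + fp) if (tp + fp) else 0.0


docs = [
    "Dette er en almindelig dansk sætning.",
    "$$$ 1234 !!! ### 5678",
]
kept = filter_documents(docs, score_low_quality)
```

In practice the scorer would wrap a model trained on the community annotations, and the threshold would be chosen by checking precision against held-out labels rather than fixed in advance.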
liked a model about 4 hours ago
davanstrien/scandi-fine-web-cleaner
reacted to davanstrien's post about 4 hours ago
Organizations
saattrupdan's activity
How to use ModernBERT with the AutoModelForQuestionAnswering class?
3
#15 opened 21 days ago by sraj
Add text generation pipeline tag
#8 opened 6 days ago by saattrupdan
Add text generation pipeline tag
#5 opened 6 days ago by saattrupdan
Set pipeline tag to text generation
#1 opened 6 days ago by saattrupdan
Change dtype to bf16
#1 opened 18 days ago by saattrupdan
Add pipeline tag to model
#1 opened 24 days ago by saattrupdan
Translation model used?
5
#2 opened 28 days ago by saattrupdan
Sentiment label source?
#5 opened 28 days ago by saattrupdan
Finetuning datasets used?
2
#6 opened about 1 month ago by saattrupdan
Is there a way for the model to show more than 4 results?
3
#6 opened about 2 months ago by fmpapso
Vocab size does not match tokenizer config
8
#1 opened about 2 months ago by saattrupdan
Adding `safetensors` variant of this model
#2 opened about 2 months ago by SFconvertbot
Update tokenizer_config.json
#3 opened about 2 months ago by saattrupdan
Update tokenizer_config.json
#2 opened about 2 months ago by saattrupdan
Update README.md
1
#3 opened 2 months ago by ShobaD
Librarian Bot: Add language metadata for dataset
#2 opened 2 months ago by librarian-bot
Update README.md
#2 opened 3 months ago by AnnikaSimonsen
[bot] Conversion to Parquet
#1 opened 4 months ago by parquet-converter
Update README.md
1
#1 opened 4 months ago by mstepanovic
Update README.md
#1 opened 6 months ago by saattrupdan