AI & ML interests

Web as a corpus, Large Language Models, Machine Translation, Language Technologies, Natural Language Processing

Recent Activity

vmkhlv updated a dataset 2 days ago
HPLT/HPLT2.0_cleaned
ltgoslo  updated a Space 3 days ago
HPLT/README

HPLT's activity

davanstrien 
posted an update 4 days ago
I've created a v1 dataset (davanstrien/reasoning-required) and model (davanstrien/ModernBERT-based-Reasoning-Required) to help curate "wild text" data for generating reasoning examples beyond the usual code/math/science domains.

- I developed a "Reasoning Required" dataset with a 0-4 scoring system for reasoning complexity
- I used educational content from HuggingFaceFW/fineweb-edu, adding annotations for domains, reasoning types, and example questions

My approach enables a more efficient workflow: filter text with small models first, then use LLMs only on high-value content.

This significantly reduces computation costs while expanding the domain coverage of reasoning datasets.
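
A minimal sketch of the filter-then-escalate workflow described above, under stated assumptions: `two_stage_filter`, `toy_score`, and `toy_llm` are hypothetical names introduced for illustration, and the keyword heuristic is only a stand-in for a small classifier that returns a 0-4 reasoning score. It is not the actual model or pipeline from the post.

```python
from typing import Callable, Iterable, List, Tuple


def two_stage_filter(
    texts: Iterable[str],
    cheap_score: Callable[[str], int],
    expensive_step: Callable[[str], str],
    threshold: int = 3,
) -> List[Tuple[str, str]]:
    """Score every text with the cheap model; call the expensive step
    only on texts whose score reaches the threshold."""
    results = []
    for text in texts:
        if cheap_score(text) >= threshold:
            results.append((text, expensive_step(text)))
    return results


def toy_score(text: str) -> int:
    # Hypothetical stand-in for a small classifier: counts reasoning
    # cue words and caps the result at 4 to mimic a 0-4 scale.
    cues = ("because", "therefore", "however", "implies")
    lowered = text.lower()
    return min(4, sum(lowered.count(cue) for cue in cues))


def toy_llm(text: str) -> str:
    # Hypothetical stand-in for an LLM call that would generate a
    # reasoning example from high-value text.
    return f"Reasoning example drafted from: {text[:40]}"


if __name__ == "__main__":
    corpus = [
        "The sky is blue.",
        "Prices rose because supply fell; therefore demand shifted, "
        "however slowly, which implies scarcity.",
    ]
    for source, generated in two_stage_filter(corpus, toy_score, toy_llm):
        print(generated)
```

In practice the cheap scorer would be replaced by the small model and the expensive step by an LLM call; the threshold then controls how much text reaches the costly stage.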