Knut Jägersberg

KnutJaegersberg

jagersbergknut

AI & ML interests

NLP, opinion mining, narrative intelligence

Recent Activity

updated a model about 5 hours ago

KnutJaegersberg/reka-flash-3.1-Q8_0-GGUF

published a model about 5 hours ago

KnutJaegersberg/reka-flash-3.1-Q8_0-GGUF

liked a model about 5 hours ago

RekaAI/reka-flash-3.1

Posts 27

Mining LLM Pretraining Data: Topics, Skills, and Cognitive Patterns

Summary
The technical blog post analyzes the pretraining data of several large language models (LLMs), including GPT-2, Falcon, and Gemma2. It applies text mining techniques (embeddings, clustering, and LLM-based annotation) to datasets such as OpenWebText, The Pile, and C4 to identify recurring patterns in topics, sources, skills, and cognitive framing.
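
To make the embed-then-cluster step concrete, here is a minimal sketch in plain Python. It is not the post's actual pipeline: the `embed` function is a toy bag-of-words stand-in for a real embedding model, the four documents are invented for illustration, and the LLM-based annotation step is omitted.

```python
import math
from collections import Counter

def embed(text, vocab):
    """Toy stand-in for an embedding model: L2-normalised word counts."""
    counts = Counter(text.lower().split())
    vec = [float(counts[w]) for w in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))

def kmeans(vectors, k, iters=10):
    """Spherical k-means with a deterministic init (first k vectors)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assign every vector to its most similar centroid.
        labels = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                  for v in vectors]
        # Move each centroid to the normalised mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                norm = math.sqrt(sum(x * x for x in mean)) or 1.0
                centroids[c] = [x / norm for x in mean]
    return labels

# Hypothetical mini-corpus, invented for illustration only.
docs = [
    "new gpu chip speeds up software",
    "parliament votes on election law",
    "software update improves chip performance",
    "election results shift parliament majority",
]
vocab = sorted({w for d in docs for w in d.lower().split()})
vectors = [embed(d, vocab) for d in docs]
labels = kmeans(vectors, k=2)
```

With this toy corpus the two technology sentences land in one cluster and the two politics sentences in the other. In a real analysis, each cluster would then be passed to an LLM to be labelled with a topic.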

Findings show the data is dominated by topics like Technology, Politics, Health, Business, and Culture, originating from diverse sources including web scrapes, academic papers, code repositories, and news media. The data reflects the work of professionals primarily in Journalism/Media, Content Creation, Analysis/Research, Academia, and Tech/Engineering. Consequently, LLMs learn corresponding skills (e.g., Research, Critical Thinking, Communication, Domain Expertise) and task representations (e.g., Analysis, Content Creation, Compliance).

The analysis also uncovered distinct writing styles, underlying cognitive frameworks (beliefs, frames, schemas, memes), and common cognitive biases (like Confirmation Bias) embedded in the data. LLM capabilities appear to scale with data volume and task frequency, following a power law. The study concludes that LLMs are powerful data-driven simulators whose capabilities and limitations are shaped by the composition and inherent biases of their pretraining corpora, highlighting the importance of data understanding and curation.
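
One common way to check for a power-law relationship is a straight-line fit in log-log space. The sketch below assumes a hypothetical relationship of the form y = c * x^alpha between task frequency and error rate. The variable names and all numbers are synthetic and are not taken from the post.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = c * x**alpha in log-log space.

    Returns (c, alpha). All xs and ys must be strictly positive.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    # The slope of the log-log regression line is the exponent alpha.
    alpha = (sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
             / sum((a - mean_x) ** 2 for a in lx))
    # The intercept of the log-log line is log(c).
    c = math.exp(mean_y - alpha * mean_x)
    return c, alpha

# Synthetic data generated from an exact power law (illustration only).
task_frequency = [10, 100, 1_000, 10_000]
error_rate = [0.5 * f ** -0.3 for f in task_frequency]

c, alpha = fit_power_law(task_frequency, error_rate)
```

Because the synthetic points lie exactly on a power law, the fit recovers the generating parameters up to floating-point error. Real measurements would scatter around the line, and the quality of the fit would need to be checked separately.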

https://huggingface.co/blog/KnutJaegersberg/mining-llm-pretraining-data
