FastText Model for Pretraining Data Curation
This model is part of my fastText classifier collection for curating pretraining datasets.
It assigns each text to one of the domains defined in m-a-p/FineFineWeb.
The classifier can be used for LLM pretraining data curation, to improve model capability across many domains.
It is ultra fast ⚡, with a throughput of ~2000 docs/s on CPU.
Don't underestimate the "old" fastText classifier! It is a sound and scalable approach. For example, Qwen2.5-Math uses fastText to curate pretraining data, although its classifier is not open-sourced.
```python
from typing import List
import re

import fasttext
from huggingface_hub import hf_hub_download

# Download the model file from the Hugging Face Hub and load it.
model_hf = fasttext.load_model(
    hf_hub_download("kenhktsui/finefineweb-domain-fasttext-classifier", "model.bin")
)


def replace_newlines(text: str) -> str:
    # fastText predicts one line at a time, so collapse newlines into spaces.
    return re.sub("\n+", " ", text)


def predict(text_list: List[str]) -> List[dict]:
    text_list = [replace_newlines(text) for text in text_list]
    pred = model_hf.predict(text_list)
    # Strip the 9-character "__label__" prefix from each top label.
    return [{"label": l[0][9:], "score": s[0]}
            for l, s in zip(*pred)]


predict(
    [
        "Arsenal is the best team in the world",
        "Macroeconomics is a branch of economics that deals with the performance, structure, behavior, and decision-making of an economy as a whole.[1] This includes regional, national, and global economies.[2][3] Macroeconomists study topics such as output/GDP (gross domestic product) and national income, unemployment (including unemployment rates), price indices and inflation, consumption, saving, investment, energy, international trade, and international finance.",
        "Quantum entanglement is the phenomenon of a group of particles being generated, interacting, or sharing spatial proximity in a manner such that the quantum state of each particle of the group cannot be described independently of the state of the others, including when the particles are separated by a large distance. The topic of quantum entanglement is at the heart of the disparity between classical physics and quantum physics: entanglement is a primary feature of quantum mechanics not present in classical mechanics.",
        "Any program written in a high-level programming language must be translated to object code before it can be executed, so all programmers using such a language use a compiler or an interpreter, sometimes even both. Improvements to a compiler may lead to a large number of improved features in executable programs."
    ]
)
# [{'label': 'sports', 'score': 0.5640762},
#  {'label': 'economics', 'score': 0.53133816},
#  {'label': 'physics', 'score': 0.9524484},
#  {'label': 'computer_science_and_technology', 'score': 0.41515663}]
```
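For curation, predictions like these can be turned into a simple keep/drop filter. The following is a minimal sketch, not the author's pipeline: `filter_by_domain`, `target_domains`, and `min_score` are hypothetical names, and the 0.5 threshold is an arbitrary assumption. It accepts any function that returns the same output shape as `predict`, so a stub stands in for the real model here.

```python
from typing import Callable, Dict, List, Set


def filter_by_domain(
    texts: List[str],
    predict_fn: Callable[[List[str]], List[Dict]],
    target_domains: Set[str],
    min_score: float = 0.5,  # assumed threshold; tune on held-out data
) -> List[str]:
    """Keep texts whose top predicted domain is wanted and confident enough."""
    preds = predict_fn(texts)
    return [
        text
        for text, pred in zip(texts, preds)
        if pred["label"] in target_domains and pred["score"] >= min_score
    ]


def stub_predict(texts: List[str]) -> List[Dict]:
    """Stand-in for `predict`; returns fixed, made-up predictions."""
    return [
        {"label": "physics" if "quantum" in t.lower() else "sports", "score": 0.9}
        for t in texts
    ]


docs = ["Quantum entanglement links particle states.", "Arsenal won the match."]
kept = filter_by_domain(docs, stub_predict, target_domains={"physics"})
print(kept)
```

In real use, `stub_predict` would be replaced by the model-backed `predict` function. Given the per-domain scores reported for this model, a single global threshold is probably too coarse, and per-domain thresholds may work better.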
full version
| domain | precision | recall | f1-score | support |
|---|---|---|---|---|
| aerospace | 0.69 | 0.72 | 0.71 | 10000 |
| agronomy | 0.68 | 0.74 | 0.71 | 10000 |
| artistic | 0.37 | 0.24 | 0.29 | 10000 |
| astronomy | 0.67 | 0.76 | 0.71 | 10000 |
| atmospheric_science | 0.82 | 0.92 | 0.87 | 10000 |
| automotive | 0.66 | 0.74 | 0.70 | 10000 |
| beauty | 0.82 | 0.86 | 0.84 | 10000 |
| biology | 0.44 | 0.45 | 0.45 | 10000 |
| celebrity | 0.69 | 0.81 | 0.75 | 10000 |
| chemistry | 0.51 | 0.49 | 0.50 | 10000 |
| christianity | 0.80 | 0.84 | 0.82 | 10000 |
| civil_engineering | 0.58 | 0.58 | 0.58 | 10000 |
| communication_engineering | 0.63 | 0.67 | 0.65 | 10000 |
| computer_science_and_technology | 0.63 | 0.59 | 0.61 | 10000 |
| design | 0.51 | 0.42 | 0.46 | 10000 |
| drama_and_film | 0.53 | 0.53 | 0.53 | 10000 |
| economics | 0.34 | 0.26 | 0.29 | 10000 |
| electronic_science | 0.42 | 0.35 | 0.38 | 10000 |
| entertainment | 0.43 | 0.29 | 0.34 | 10000 |
| environmental_science | 0.42 | 0.35 | 0.38 | 10000 |
| fashion | 0.72 | 0.77 | 0.74 | 10000 |
| finance | 0.49 | 0.52 | 0.50 | 10000 |
| food | 0.81 | 0.86 | 0.83 | 10000 |
| gamble | 0.78 | 0.93 | 0.85 | 10000 |
| game | 0.67 | 0.67 | 0.67 | 10000 |
| geography | 0.42 | 0.33 | 0.37 | 10000 |
| health | 0.43 | 0.29 | 0.34 | 10000 |
| history | 0.64 | 0.71 | 0.67 | 10000 |
| hobby | 0.45 | 0.37 | 0.41 | 10000 |
| hydraulic_engineering | 0.95 | 0.98 | 0.96 | 10000 |
| instrument_science | 0.48 | 0.50 | 0.49 | 10000 |
| journalism_and_media_communication | 0.26 | 0.11 | 0.16 | 10000 |
| landscape_architecture | 0.78 | 0.83 | 0.80 | 10000 |
| law | 0.50 | 0.55 | 0.53 | 10000 |
| library | 0.53 | 0.51 | 0.52 | 10000 |
| literature | 0.52 | 0.53 | 0.52 | 10000 |
| materials_science | 0.49 | 0.50 | 0.50 | 10000 |
| mathematics | 0.87 | 0.90 | 0.88 | 10000 |
| mechanical_engineering | 0.48 | 0.37 | 0.42 | 10000 |
| medical | 0.41 | 0.42 | 0.41 | 10000 |
| mining_engineering | 0.84 | 0.93 | 0.89 | 10000 |
| movie | 0.59 | 0.71 | 0.64 | 10000 |
| music_and_dance | 0.75 | 0.86 | 0.80 | 10000 |
| news | 0.23 | 0.13 | 0.16 | 10000 |
| nuclear_science | 0.92 | 0.96 | 0.94 | 10000 |
| ocean_science | 0.83 | 0.92 | 0.88 | 10000 |
| optical_engineering | 0.70 | 0.78 | 0.74 | 10000 |
| painting | 0.91 | 0.96 | 0.94 | 10000 |
| pet | 0.91 | 0.95 | 0.93 | 10000 |
| petroleum_and_natural_gas_engineering | 0.92 | 0.96 | 0.94 | 10000 |
| philosophy | 0.63 | 0.66 | 0.64 | 10000 |
| photo | 0.80 | 0.85 | 0.82 | 10000 |
| physics | 0.40 | 0.35 | 0.37 | 10000 |
| politics | 0.38 | 0.41 | 0.39 | 10000 |
| psychology | 0.62 | 0.66 | 0.64 | 10000 |
| public_administration | 0.35 | 0.33 | 0.34 | 10000 |
| relationship | 0.84 | 0.88 | 0.86 | 10000 |
| sociology | 0.46 | 0.50 | 0.48 | 10000 |
| sports | 0.66 | 0.82 | 0.73 | 10000 |
| statistics | 0.60 | 0.70 | 0.65 | 10000 |
| systems_science | 0.53 | 0.53 | 0.53 | 10000 |
| textile_science | 0.81 | 0.86 | 0.83 | 10000 |
| topicality | 0.97 | 0.99 | 0.98 | 10000 |
| transportation_engineering | 0.51 | 0.52 | 0.51 | 10000 |
| travel | 0.68 | 0.72 | 0.70 | 10000 |
| urban_planning | 0.56 | 0.62 | 0.59 | 10000 |
| weapons_science | 0.97 | 0.99 | 0.98 | 10000 |
| accuracy | | | 0.64 | 670000 |
| macro avg | 0.62 | 0.64 | 0.63 | 670000 |
| weighted avg | 0.62 | 0.64 | 0.63 | 670000 |
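The macro and weighted averages coincide because every domain has the same support (10,000 examples), so weighting by support changes nothing. The small check below illustrates this on three rows copied from the table above; the helper names are illustrative only.

```python
# (domain, f1-score, support) for three rows copied from the table above
rows = [
    ("mathematics", 0.88, 10000),
    ("news", 0.16, 10000),
    ("physics", 0.37, 10000),
]


def macro_avg(rows):
    """Unweighted mean of the per-domain F1 scores."""
    return sum(f1 for _, f1, _ in rows) / len(rows)


def weighted_avg(rows):
    """Mean of the per-domain F1 scores, weighted by support."""
    total = sum(n for _, _, n in rows)
    return sum(f1 * n for _, f1, n in rows) / total


macro_f1 = macro_avg(rows)
weighted_f1 = weighted_avg(rows)
# With equal supports the two averages agree up to floating-point error.
print(abs(macro_f1 - weighted_f1) < 1e-9)
```

On a corpus with a skewed domain distribution the two would diverge, so these balanced-test-set averages should not be read as expected accuracy on raw web data.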
The classifier does not handle short text well, which is unsurprising given how little signal a short input carries.
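One possible workaround, sketched here under assumptions, is to route very short texts away from the classifier instead of trusting their labels. `split_by_length` and the 20-word cutoff are hypothetical; no specific length threshold is given for this model.

```python
from typing import List, Tuple


def split_by_length(
    texts: List[str], min_words: int = 20  # assumed cutoff, not from the model card
) -> Tuple[List[str], List[str]]:
    """Split texts into (long_enough, too_short) by whitespace word count."""
    long_enough, too_short = [], []
    for text in texts:
        if len(text.split()) >= min_words:
            long_enough.append(text)
        else:
            too_short.append(text)
    return long_enough, too_short


texts = [
    "Arsenal is the best team in the world",  # 8 words
    " ".join(["word"] * 25),                  # 25 words
]
to_classify, skipped = split_by_length(texts)
print(len(to_classify), len(skipped))
```

Texts in `skipped` could be dropped, kept unlabeled, or handled by another method, depending on the curation goal.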