finefineweb-domain-fasttext-classifier

This is part of my fastText classifier collection for curating pretraining datasets. It classifies a text into one of the domains defined in m-a-p/FineFineWeb.
It can be used for LLM pretraining data curation, to strengthen capability across many domains.
It is ultra fast ⚡, with a throughput of ~2000 doc/s on CPU.
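
Throughput depends on hardware, document length, and batch size, so it is worth measuring on your own data. The following is a minimal sketch of one way to do that; `measure_throughput`, `dummy_predict`, and the batch size of 1000 are hypothetical names and values chosen for illustration, and the stand-in predictor only exists so the sketch runs without downloading the model.

```python
import time
from typing import Callable, List, Sequence


def measure_throughput(predict_fn: Callable[[List[str]], list],
                       docs: Sequence[str],
                       batch_size: int = 1000) -> float:
    """Return documents processed per second for predict_fn over docs."""
    start = time.perf_counter()
    for i in range(0, len(docs), batch_size):
        predict_fn(list(docs[i:i + batch_size]))
    elapsed = time.perf_counter() - start
    # Guard against a zero reading from a very fast run on a coarse clock.
    return len(docs) / max(elapsed, 1e-9)


# Stand-in predictor (an assumption for this sketch, not the real model).
# Swap in the predict function from the Usage section to benchmark the classifier.
def dummy_predict(batch: List[str]) -> list:
    return [{"label": "unknown", "score": 0.0} for _ in batch]


docs = ["some document text"] * 5000
rate = measure_throughput(dummy_predict, docs)
print(f"{rate:.0f} doc/s")
```

The number printed for the stand-in predictor says nothing about the classifier itself; only a run with the real model on representative documents is meaningful.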

Don't underestimate the "old" fastText classifier! It is indeed a good and scalable practice. For example, Qwen2.5-Math leverages fastText to curate pretraining data, although its classifier is not open-sourced.

🛠️Usage

from typing import List
import re
from huggingface_hub import hf_hub_download
import fasttext


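# Download the model file from the Hugging Face Hub and load it with fastText.
model = fasttext.load_model(hf_hub_download("kenhktsui/finefineweb-domain-fasttext-classifier", "model.bin"))


def replace_newlines(text: str) -> str:
  # fastText's predict() processes one line at a time and rejects text containing "\n".
  return re.sub("\n+", " ", text)


def predict(text_list: List[str]) -> List[dict]:
  text_list = [replace_newlines(text) for text in text_list]
  pred = model.predict(text_list)
  # Strip the 9-character "__label__" prefix and keep the top-1 label and its score.
  return [{"label": l[0][9:], "score": s[0]}
           for l, s in zip(*pred)]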


predict(
  [
      "Arsenal is the best team in the world",
      "Macroeconomics is a branch of economics that deals with the performance, structure, behavior, and decision-making of an economy as a whole.[1] This includes regional, national, and global economies.[2][3] Macroeconomists study topics such as output/GDP (gross domestic product) and national income, unemployment (including unemployment rates), price indices and inflation, consumption, saving, investment, energy, international trade, and international finance.",
      "Quantum entanglement is the phenomenon of a group of particles being generated, interacting, or sharing spatial proximity in a manner such that the quantum state of each particle of the group cannot be described independently of the state of the others, including when the particles are separated by a large distance. The topic of quantum entanglement is at the heart of the disparity between classical physics and quantum physics: entanglement is a primary feature of quantum mechanics not present in classical mechanics.",
      "Any program written in a high-level programming language must be translated to object code before it can be executed, so all programmers using such a language use a compiler or an interpreter, sometimes even both. Improvements to a compiler may lead to a large number of improved features in executable programs."
  ]
)

# [{'label': 'sports', 'score': 0.5640762},
# {'label': 'economics', 'score': 0.53133816},
# {'label': 'physics', 'score': 0.9524484},
# {'label': 'computer_science_and_technology', 'score': 0.41515663}]
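
For data curation, the predictions are typically used to decide which documents to keep. Below is a minimal sketch of such a filter, working only on the list-of-dicts format returned by `predict` above. The helper `keep_domains`, the chosen target domains, and the 0.5 score threshold are hypothetical and should be tuned for your own corpus.

```python
from typing import Dict, List, Set, Union

Prediction = Dict[str, Union[str, float]]


def keep_domains(texts: List[str],
                 predictions: List[Prediction],
                 wanted: Set[str],
                 min_score: float = 0.5) -> List[str]:
    """Keep texts whose top-1 label is in `wanted` with score >= min_score."""
    return [text for text, pred in zip(texts, predictions)
            if pred["label"] in wanted and pred["score"] >= min_score]


# Predictions copied from the example output above; texts are shortened here.
texts = ["Arsenal ...", "Macroeconomics ...", "Quantum entanglement ...", "Any program ..."]
predictions = [
    {"label": "sports", "score": 0.5640762},
    {"label": "economics", "score": 0.53133816},
    {"label": "physics", "score": 0.9524484},
    {"label": "computer_science_and_technology", "score": 0.41515663},
]

kept = keep_domains(
    texts,
    predictions,
    wanted={"physics", "computer_science_and_technology"},
    min_score=0.5,
)
print(kept)  # ['Quantum entanglement ...']
```

The compiler passage is dropped here because its score (about 0.42) falls below the illustrative threshold, which shows the trade-off: a higher threshold gives cleaner domain buckets at the cost of recall.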

📊Evaluation

full version

                                       precision    recall  f1-score   support

                            aerospace       0.69      0.72      0.71     10000
                             agronomy       0.68      0.74      0.71     10000
                             artistic       0.37      0.24      0.29     10000
                            astronomy       0.67      0.76      0.71     10000
                  atmospheric_science       0.82      0.92      0.87     10000
                           automotive       0.66      0.74      0.70     10000
                               beauty       0.82      0.86      0.84     10000
                              biology       0.44      0.45      0.45     10000
                            celebrity       0.69      0.81      0.75     10000
                            chemistry       0.51      0.49      0.50     10000
                         christianity       0.80      0.84      0.82     10000
                    civil_engineering       0.58      0.58      0.58     10000
            communication_engineering       0.63      0.67      0.65     10000
      computer_science_and_technology       0.63      0.59      0.61     10000
                               design       0.51      0.42      0.46     10000
                       drama_and_film       0.53      0.53      0.53     10000
                            economics       0.34      0.26      0.29     10000
                   electronic_science       0.42      0.35      0.38     10000
                        entertainment       0.43      0.29      0.34     10000
                environmental_science       0.42      0.35      0.38     10000
                              fashion       0.72      0.77      0.74     10000
                              finance       0.49      0.52      0.50     10000
                                 food       0.81      0.86      0.83     10000
                               gamble       0.78      0.93      0.85     10000
                                 game       0.67      0.67      0.67     10000
                            geography       0.42      0.33      0.37     10000
                               health       0.43      0.29      0.34     10000
                              history       0.64      0.71      0.67     10000
                                hobby       0.45      0.37      0.41     10000
                hydraulic_engineering       0.95      0.98      0.96     10000
                   instrument_science       0.48      0.50      0.49     10000
   journalism_and_media_communication       0.26      0.11      0.16     10000
               landscape_architecture       0.78      0.83      0.80     10000
                                  law       0.50      0.55      0.53     10000
                              library       0.53      0.51      0.52     10000
                           literature       0.52      0.53      0.52     10000
                    materials_science       0.49      0.50      0.50     10000
                          mathematics       0.87      0.90      0.88     10000
               mechanical_engineering       0.48      0.37      0.42     10000
                              medical       0.41      0.42      0.41     10000
                   mining_engineering       0.84      0.93      0.89     10000
                                movie       0.59      0.71      0.64     10000
                      music_and_dance       0.75      0.86      0.80     10000
                                 news       0.23      0.13      0.16     10000
                      nuclear_science       0.92      0.96      0.94     10000
                        ocean_science       0.83      0.92      0.88     10000
                  optical_engineering       0.70      0.78      0.74     10000
                             painting       0.91      0.96      0.94     10000
                                  pet       0.91      0.95      0.93     10000
petroleum_and_natural_gas_engineering       0.92      0.96      0.94     10000
                           philosophy       0.63      0.66      0.64     10000
                                photo       0.80      0.85      0.82     10000
                              physics       0.40      0.35      0.37     10000
                             politics       0.38      0.41      0.39     10000
                           psychology       0.62      0.66      0.64     10000
                public_administration       0.35      0.33      0.34     10000
                         relationship       0.84      0.88      0.86     10000
                            sociology       0.46      0.50      0.48     10000
                               sports       0.66      0.82      0.73     10000
                           statistics       0.60      0.70      0.65     10000
                      systems_science       0.53      0.53      0.53     10000
                      textile_science       0.81      0.86      0.83     10000
                           topicality       0.97      0.99      0.98     10000
           transportation_engineering       0.51      0.52      0.51     10000
                               travel       0.68      0.72      0.70     10000
                       urban_planning       0.56      0.62      0.59     10000
                      weapons_science       0.97      0.99      0.98     10000

                             accuracy                           0.64    670000
                            macro avg       0.62      0.64      0.63    670000
                         weighted avg       0.62      0.64      0.63    670000
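
Because every class has the same support (10000), the macro and weighted averages coincide, and overall accuracy equals the macro-averaged recall (0.64 in both rows). As a reminder of how the per-class columns are defined, here is a minimal pure-Python sketch; `per_class_report` and `macro_f1` are hypothetical helpers written for illustration, the toy labels are made up, and this is not necessarily the code that produced the table above.

```python
from collections import Counter
from typing import Dict, List


def per_class_report(y_true: List[str], y_pred: List[str]) -> Dict[str, Dict[str, float]]:
    """Compute precision, recall, F1 and support for each label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[t] += 1  # true label t was missed
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {"precision": precision, "recall": recall,
                         "f1": f1, "support": actual}
    return report


def macro_f1(report: Dict[str, Dict[str, float]]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    return sum(row["f1"] for row in report.values()) / len(report)


# Toy labels, made up for this sketch.
y_true = ["sports", "sports", "physics", "physics", "news", "news"]
y_pred = ["sports", "sports", "physics", "news", "sports", "news"]
report = per_class_report(y_true, y_pred)
for label, row in report.items():
    print(label, {k: round(v, 2) for k, v in row.items()})
print("macro f1:", round(macro_f1(report), 2))
```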

⚠️Known Limitation

The classifier does not handle short text well, which is perhaps unsurprising: a short input gives it few word and n-gram features to work with.
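
One simple mitigation is to route short documents away from the classifier, or to treat their labels as low-confidence. The sketch below assumes a whitespace word count is an adequate length measure; `split_by_length` and the 20-word cutoff are hypothetical and not a recommendation derived from the evaluation above.

```python
from typing import List, Tuple


def split_by_length(texts: List[str], min_words: int = 20) -> Tuple[List[str], List[str]]:
    """Split texts into (long_enough, too_short) by whitespace word count."""
    long_enough, too_short = [], []
    for text in texts:
        if len(text.split()) >= min_words:
            long_enough.append(text)
        else:
            too_short.append(text)
    return long_enough, too_short


short_text = "Arsenal is the best team in the world"  # 8 words
long_text = " ".join(["word"] * 30)                   # 30 words
to_classify, skipped = split_by_length([short_text, long_text], min_words=20)
print(len(to_classify), len(skipped))  # 1 1
```

Only the texts in `to_classify` would then be passed to the classifier; what to do with `skipped` (drop, keep unlabeled, or label with a different method) is a separate curation decision.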
