metadata

license: odc-by
language:
  - en
library_name: fasttext
pipeline_tag: text-classification
datasets:
  - HuggingFaceFW/fineweb-edu-llama3-annotations

FineWeb-Edu FastText classifier

Model summary

This is a FastText classifier that judges the educational value of web pages. It is trained on the fineweb-edu-llama3-annotations dataset and has two objectives:

⚡ Throughput optimisation: it can classify more than 2000 examples per second on CPU, so it can be used on the fly during pretraining or to process very large datasets on CPU-only machines.
🧪 FastText vs. transformer-based model: how does this lightweight, limited-capacity model compare to the original model, HuggingFaceFW/fineweb-edu-classifier?
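To check the throughput figure on your own hardware, a rough benchmark along the following lines may help. This is only a sketch: `measure_throughput` and `dummy_predict` are hypothetical names introduced here, not part of fastText, and the stand-in classifier exists only so the snippet runs without downloading the model. Any batch-prediction callable, such as the `predict` function from the Usage section, could be passed in instead.

```python
import time
from typing import Callable, List, Sequence


def measure_throughput(predict_fn: Callable[[List[str]], Sequence],
                       texts: List[str],
                       batch_size: int = 1000) -> float:
    """Return examples per second for predict_fn over texts (hypothetical helper)."""
    if not texts:
        raise ValueError("texts must not be empty")
    start = time.perf_counter()
    n_done = 0
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        predict_fn(batch)          # only the prediction call is timed
        n_done += len(batch)
    elapsed = time.perf_counter() - start
    return n_done / max(elapsed, 1e-9)  # guard against a zero-length interval


def dummy_predict(batch: List[str]) -> List[dict]:
    # Stand-in classifier (assumption): replace with the real predict function.
    return [{"label": 0, "score": 1.0} for _ in batch]


rate = measure_throughput(dummy_predict, ["some web page text"] * 5000)
print(f"{rate:.0f} examples/second")
```

The number reported depends heavily on document length and CPU, so it is best measured on a sample of the actual corpus.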

🛠️Usage

from typing import List
import re
from huggingface_hub import hf_hub_download
import fasttext


model_hf = fasttext.load_model(hf_hub_download("kenhktsui/fineweb-edu-fasttext-classifier", "model.bin"))


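def replace_newlines(text: str) -> str:
  # fastText's predict() expects single-line input, so collapse newlines into spaces.
  return re.sub("\n+", " ", text)


def predict(text_list: List[str]) -> List[dict]:
  text_list = [replace_newlines(text) for text in text_list]
  # predict() returns (labels, scores); with the default k=1 each entry holds one item.
  pred = model_hf.predict(text_list)
  # Remove the "__label__" prefix (lstrip would strip a character set, not a prefix).
  return [{"label": int(l[0].replace("__label__", "")), "score": s[0]}
           for l, s in zip(*pred)]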


predict(["Hi"])
# Output: [{'label': 0, 'score': 1.00001}]
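One way the predicted labels might be used for data filtering is sketched below. `filter_by_label`, `_keep`, and `fake_predict` are hypothetical names, the threshold `min_label=2` is purely illustrative, and the toy predictor stands in for the real `predict` function so that the snippet runs without the model.

```python
from typing import Callable, Iterable, Iterator, List


def _keep(batch: List[str],
          predict_fn: Callable[[List[str]], List[dict]],
          min_label: int) -> List[str]:
    # Classify one batch and keep the texts at or above the threshold.
    preds = predict_fn(batch)
    return [t for t, p in zip(batch, preds) if p["label"] >= min_label]


def filter_by_label(texts: Iterable[str],
                    predict_fn: Callable[[List[str]], List[dict]],
                    min_label: int = 2,
                    batch_size: int = 1024) -> Iterator[str]:
    """Yield texts whose predicted label is at least min_label (hypothetical helper)."""
    batch: List[str] = []
    for text in texts:
        batch.append(text)
        if len(batch) == batch_size:
            yield from _keep(batch, predict_fn, min_label)
            batch = []
    if batch:  # flush the final partial batch
        yield from _keep(batch, predict_fn, min_label)


def fake_predict(batch: List[str]) -> List[dict]:
    # Toy rule for demonstration only: more words gives a higher label, capped at 5.
    return [{"label": min(len(t.split()), 5), "score": 1.0} for t in batch]


docs = ["Hi", "Photosynthesis converts light into chemical energy"]
kept = list(filter_by_label(docs, fake_predict, min_label=2))
print(kept)
```

Batching keeps memory bounded on a streamed corpus; a suitable threshold should be chosen with the per-label results in the Evaluation section in mind.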

📊Evaluation

The last 46867 samples are held out as test data. Note that this is not the exact test set used by HuggingFaceFW/fineweb-edu-classifier, so comparisons between the two models are approximate.
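A tail split of this kind could look like the sketch below. It assumes the annotations are held in their original order in a list-like object; `tail_split` is a hypothetical helper, and the small integer list only stands in for the real rows.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def tail_split(rows: Sequence[T], n_test: int) -> Tuple[List[T], List[T]]:
    """Split rows into (train, test), taking the last n_test rows as test data."""
    if not 0 < n_test < len(rows):
        raise ValueError("n_test must be between 1 and len(rows) - 1")
    return list(rows[:-n_test]), list(rows[-n_test:])


# Toy stand-in for the dataset; the model card uses n_test=46867.
train, test = tail_split(list(range(10)), n_test=3)
print(len(train), len(test))
```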

Classification Report

              precision    recall  f1-score   support

           0       0.72      0.44      0.55      5704
           1       0.73      0.87      0.80     26595
           2       0.52      0.49      0.50     10350
           3       0.48      0.33      0.39      3397
           4       0.69      0.03      0.06       819
           5       0.00      0.00      0.00         2

    accuracy                           0.68     46867
   macro avg       0.52      0.36      0.38     46867
weighted avg       0.67      0.68      0.66     46867

The table below compares the per-label F1 scores of this FastText model with those of the transformer-based model.

Label	This Model	HuggingFaceFW/fineweb-edu-classifier
0	0.55	0.59
1	0.80	0.81
2	0.50	0.59
3	0.39	0.53
4	0.06	0.44
5	0.00	0.02

Performance on labels 0, 1, and 2 is comparable to the original model. The degradation becomes noticeable at label 3 and widens further at label 4, which is due to the limited capacity of the FastText model.
So, this classifier performs reasonably well on labels 0, 1, and 2, and on label 3 with some degradation.

Confusion Matrix

       [ 2537  3098    65     4     0     0]
       [  944 23037  2491   123     0     0]
y_true [   26  4742  5048   533     1     0]
       [    4   434  1846  1105     8     0]
       [    0    38   213   544    24     0]
       [    0     0     0     0     2     0]
                       y_pred
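The per-label figures in the classification report can be recomputed from the confusion matrix above, where rows are true labels and columns are predictions. The sketch below uses plain Python; `per_class_metrics` is a hypothetical helper written for this illustration.

```python
from typing import Dict, List

# Confusion matrix copied from above: CONFUSION[true_label][predicted_label].
CONFUSION = [
    [2537, 3098, 65, 4, 0, 0],
    [944, 23037, 2491, 123, 0, 0],
    [26, 4742, 5048, 533, 1, 0],
    [4, 434, 1846, 1105, 8, 0],
    [0, 38, 213, 544, 24, 0],
    [0, 0, 0, 0, 2, 0],
]


def per_class_metrics(cm: List[List[int]]) -> List[Dict[str, float]]:
    """Compute precision, recall, F1 and support for each label."""
    n = len(cm)
    out = []
    for k in range(n):
        tp = cm[k][k]
        support = sum(cm[k])                         # row sum: true count
        predicted = sum(cm[r][k] for r in range(n))  # column sum: predicted count
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        out.append({"label": k, "precision": precision,
                    "recall": recall, "f1": f1, "support": support})
    return out


metrics = per_class_metrics(CONFUSION)
total = sum(sum(row) for row in CONFUSION)
accuracy = sum(CONFUSION[k][k] for k in range(len(CONFUSION))) / total

for m in metrics:
    print(f"{m['label']}  {m['precision']:.2f}  {m['recall']:.2f}  "
          f"{m['f1']:.2f}  {m['support']}")
print(f"accuracy {accuracy:.2f} over {total} samples")
```

Label 5 never appears as a prediction, so its precision is undefined and is reported as 0 here, matching the report.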

The model has an accuracy of 68%, and it is more likely to underpredict educational value than to overpredict it (about 18.8% of test samples versus 13.5%). This conservatism is useful when filtering large amounts of data.

Predicted - Actual Rating	Frequency	%
0	31751	67.7%
-1	8078	17.2%
+1	6130	13.1%
-2	673	1.4%
+2	189	0.4%
-3	42	0.1%
+3	4	0.0%
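The frequencies in this table follow directly from the confusion matrix: each cell contributes its count to the offset "predicted minus actual". A minimal sketch, with `offset_distribution` as a hypothetical helper name:

```python
from collections import Counter
from typing import List

# Confusion matrix from the Evaluation section: CONFUSION[true_label][predicted_label].
CONFUSION = [
    [2537, 3098, 65, 4, 0, 0],
    [944, 23037, 2491, 123, 0, 0],
    [26, 4742, 5048, 533, 1, 0],
    [4, 434, 1846, 1105, 8, 0],
    [0, 38, 213, 544, 24, 0],
    [0, 0, 0, 0, 2, 0],
]


def offset_distribution(cm: List[List[int]]) -> Counter:
    """Count samples by (predicted label - actual label)."""
    counts: Counter = Counter()
    for actual, row in enumerate(cm):
        for predicted, n in enumerate(row):
            counts[predicted - actual] += n
    return counts


offsets = offset_distribution(CONFUSION)
total = sum(offsets.values())
under = sum(n for d, n in offsets.items() if d < 0)  # predicted below actual
over = sum(n for d, n in offsets.items() if d > 0)   # predicted above actual

for d in sorted(offsets, key=lambda d: (abs(d), d)):
    if offsets[d]:
        print(f"{d:+d}\t{offsets[d]}\t{offsets[d] / total:.1%}")
print(f"underpredicted: {under / total:.1%}, overpredicted: {over / total:.1%}")
```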

Alignment with HuggingFaceFW/fineweb-edu-classifier

The Spearman rank-order correlation coefficient between the two classifiers' predictions is 0.5832 on the MiniPile test split.
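For readers unfamiliar with the statistic, the sketch below shows how a Spearman coefficient can be computed: both score lists are converted to ranks (ties receive their average rank) and the Pearson correlation of the ranks is taken. The helper names and the toy score lists are assumptions for illustration; they are not the MiniPile predictions, so the printed value is unrelated to 0.5832.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Toy labels from two hypothetical classifiers on the same five documents.
fasttext_labels = [0, 1, 1, 2, 3]
transformer_labels = [0, 1, 2, 2, 4]
print(spearman(fasttext_labels, transformer_labels))
```

In practice, `scipy.stats.spearmanr` computes the same statistic and is the more convenient choice when SciPy is available.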