anton-l HF staff committed on
Commit
63f8cb7
1 Parent(s): b20d047

Update README.md

  1. README.md +35 -4
README.md CHANGED
@@ -24,13 +24,13 @@ inputs = tokenizer(texts, return_tensors="pt", padding="longest", truncation=Tru
24
  outputs = model(**inputs)
25
  logits = outputs.logits.squeeze(-1).float().numpy()
26
  score = logits.item()
27
- record = {
28
  "text": text,
29
  "score": score,
30
  "int_score": int(round(max(0, min(score, 5))))
31
  }
32
 
33
- print(record)
34
  ```
35
 
36
  ## Training
@@ -50,7 +50,37 @@ We added a classification head with a single regression output to [Snowflake-arc
50
  - Epochs: 20
51
  - Learning Rate: 3e-4
52
  - Evaluation Metric: F1 score
53
- - Final F1 Score on validation set: 82%
54
 
55
 
56
  ## Limitations
@@ -60,4 +90,5 @@ While the FineWeb-Edu classifier performs well in distinguishing high-quality ed
60
  - Bias: The model's performance is dependent on the quality and representativeness of the training data and the LLM used for the annotation. Biases in both can affect the classifier's judgments. It might overfit to academic looking content for the higher scores and we recommend using score >= 3 as a threshold for data curation.
61
  - Context: The classifier evaluates individual web pages or extracts without considering broader context, which might impact its effectiveness in certain scenarios.
62
 
63
- The training and inference code is available on GitHub (to add).
 
 
24
  outputs = model(**inputs)
25
  logits = outputs.logits.squeeze(-1).float().numpy()
26
  score = logits.item()
27
+ result = {
28
  "text": text,
29
  "score": score,
30
  "int_score": int(round(max(0, min(score, 5))))
31
  }
32
 
33
+ print(result)
34
  ```
35
 
36
  ## Training
 
50
  - Epochs: 20
51
  - Learning Rate: 3e-4
52
  - Evaluation Metric: F1 score
53
+
54
+ **Classification report**
55
+
56
+ We treat the regression model's predictions as discrete classes to calculate the metrics on a hold-out set of 46867 Llama3-annotated samples.
57
+ ```
58
+               precision    recall  f1-score   support
59
+
60
+            0       0.75      0.49      0.59      5694
61
+            1       0.78      0.84      0.81     26512
62
+            2       0.57      0.61      0.59     10322
63
+            3       0.56      0.50      0.53      3407
64
+            4       0.58      0.35      0.44       807
65
+            5       0.33      0.01      0.02       125
66
+
67
+     accuracy                           0.71     46867
68
+    macro avg       0.60      0.47      0.50     46867
69
+ weighted avg       0.71      0.71      0.71     46867
70
+ ```
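A minimal sketch of how a continuous regression score can be mapped to one of the six discrete classes before computing such a report. It reuses the clamp-and-round expression from the README's inference snippet; that the evaluation used exactly this mapping is an assumption, and the helper name `to_int_score` and the sample scores are hypothetical.

```python
def to_int_score(score: float) -> int:
    """Clamp a raw regression score to [0, 5], then round to the nearest integer.

    Same expression as the README's `int_score` field. Whether the reported
    metrics were computed with this exact mapping is an assumption.
    """
    return int(round(max(0, min(score, 5))))


# Hypothetical raw model outputs, including values outside the [0, 5] range.
raw_scores = [-0.3, 0.4, 1.6, 2.49, 3.7, 5.8]
int_scores = [to_int_score(s) for s in raw_scores]
print(int_scores)  # [0, 0, 2, 2, 4, 5]
```

Python's built-in `round` rounds exact ties to the nearest even integer, so a score of exactly 2.5 maps to class 2 rather than 3.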
71
+
72
+ **Confusion matrix**
73
+
74
+ We verify that the predicted educational scores are indeed close to their ground truth, and are mostly impacted by the noisy annotation.
75
+ ```
76
+          2791   2858     45      0      0      0
77
+           919  22343   3180     69      1      0
78
+ y_true      3   3225   6330    757      7      0
79
+             1     66   1473   1694    173      0
80
+             0      4     98    420    283      2
81
+             0      0     18     85     21      1
82
+                         y_pred
83
+ ```
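The classification report can be cross-checked against the confusion matrix. The sketch below copies the matrix from the README, treats rows as `y_true` and columns as `y_pred` (the row sums match the reported supports), and recomputes per-class precision, recall and F1 in plain Python. The function name `per_class_metrics` is hypothetical and not part of the released code.

```python
# Confusion matrix as reported in the README: rows are y_true, columns are y_pred.
confusion = [
    [2791, 2858, 45, 0, 0, 0],
    [919, 22343, 3180, 69, 1, 0],
    [3, 3225, 6330, 757, 7, 0],
    [1, 66, 1473, 1694, 173, 0],
    [0, 4, 98, 420, 283, 2],
    [0, 0, 18, 85, 21, 1],
]


def per_class_metrics(matrix):
    """Return a (precision, recall, f1, support) tuple for each class."""
    n = len(matrix)
    results = []
    for k in range(n):
        true_positives = matrix[k][k]
        predicted = sum(matrix[i][k] for i in range(n))  # column sum
        support = sum(matrix[k])                         # row sum
        precision = true_positives / predicted if predicted else 0.0
        recall = true_positives / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1, support))
    return results


metrics = per_class_metrics(confusion)
total = sum(sum(row) for row in confusion)
accuracy = sum(confusion[k][k] for k in range(len(confusion))) / total

for label, (p, r, f1, support) in enumerate(metrics):
    print(f"{label}  {p:.2f}  {r:.2f}  {f1:.2f}  {support}")
print(f"accuracy  {accuracy:.2f}  {total}")
```

Most off-diagonal mass sits in the cells adjacent to the diagonal, which is what the README means by predictions staying close to the ground truth.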
84
 
85
 
86
  ## Limitations
 
90
  - Bias: The model's performance is dependent on the quality and representativeness of the training data and the LLM used for the annotation. Biases in both can affect the classifier's judgments. It might overfit to academic looking content for the higher scores and we recommend using score >= 3 as a threshold for data curation.
91
  - Context: The classifier evaluates individual web pages or extracts without considering broader context, which might impact its effectiveness in certain scenarios.
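As a rough illustration of the recommended threshold, the sketch below keeps only records whose score reaches 3. The records are hypothetical. Applying the threshold to the rounded `int_score` rather than the raw `score` is an assumption, and comparing the raw score instead would be stricter (a raw score of 2.6 would be dropped).

```python
THRESHOLD = 3  # recommended cut-off from the README

# Hypothetical scored records in the same shape as the README's inference output.
records = [
    {"text": "doc a", "score": 0.8},
    {"text": "doc b", "score": 3.2},
    {"text": "doc c", "score": 2.6},
]

for record in records:
    # Same clamp-and-round expression as the README's inference snippet.
    record["int_score"] = int(round(max(0, min(record["score"], 5))))

# Assumption: the threshold is applied to the discretized score.
kept = [r for r in records if r["int_score"] >= THRESHOLD]
print([r["text"] for r in kept])  # ['doc b', 'doc c']
```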
92
 
93
+ The training and inference code is available on GitHub:
94
+ https://github.com/huggingface/cosmopedia/tree/main/classification