NB Linguistic Quality Regressor

Introduction

This model rates the linguistic quality of Norwegian text intended for training corpora. It predicts a continuous score (a float from 0 to 5). The model is inspired by the classifiers used in the FineWeb project and is trained mainly on Norwegian content.

Model Architecture

The model is trained on top of nb-bert-base and uses code from Cosmopedia.

Training Data

The dataset used for training is derived from GlotCC and has been annotated using Gemini 1.5 Flash.

Purpose

The performance of large language models (LLMs) heavily depends on the quality and size of their pretraining datasets. This regressor aims to assess and enhance the linguistic quality of Norwegian textual data, contributing to better-performing Norwegian LLMs.

This model is part of a pair; the other is the NB Education Quality Regressor, which focuses on educational content.

Using the Model

For convenience, we also provide the run_regressor_bert.py script, which is based on run_edu_bert.py from Cosmopedia. You can modify this script to annotate Hugging Face datasets directly. Cosmopedia also provides Slurm scripts. We have not included these, since we have not had the opportunity to test them.
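As a rough illustration of what annotating a dataset could look like, the sketch below wraps a sequence-classification checkpoint in a scoring function and exposes a callback suitable for a batched map. It is not the contents of run_regressor_bert.py. The function names, the "text" column, the max_length value, and the assumption that the model emits a single regression logit per input are all illustrative, and the model identifier must be supplied by the caller.

```python
def load_regressor(model_id):
    """Load a tokenizer and regression model (model_id is supplied by the caller)."""
    # Imported lazily so the pure-Python helper below works without these packages.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()
    return tokenizer, model


def make_score_fn(tokenizer, model, max_length=512):
    """Return a function that maps a list of texts to a list of float scores."""
    import torch

    def score_fn(texts):
        inputs = tokenizer(
            texts,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=max_length,
        )
        with torch.no_grad():
            logits = model(**inputs).logits  # assumed shape: (batch, 1)
        return logits.squeeze(-1).float().tolist()

    return score_fn


def annotate_batch(batch, score_fn, text_column="text"):
    """Batched-map callback: returns a new 'score' column for the batch."""
    return {"score": score_fn(batch[text_column])}
```

With the datasets library, a callback of this shape could be applied as `dataset.map(lambda b: annotate_batch(b, score_fn), batched=True)`. Keeping the scoring function separate from the callback makes it easy to swap in a different model or to test the pipeline with a stub.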

Training and Evaluation Procedure

The following command was used for training. Note that train_regressor_bert.py makes a few minor changes to the original train_edu_bert.py:

 python train_regressor_bert.py --base_model_name="NbAiLab/nb-bert-base" --dataset_name="user/linguistic-annotations" --target_column="score" --checkpoint_dir="/home/user/checkpoints/scandinavian_bert/"

The following command was used for evaluation:

 python eval_regressor_bert.py --checkpoint_dir="/home/user/checkpoints/scandinavian_bert/final/" --dataset_name="user/linguistic-annotations"

Classification Report

Class	Precision	Recall	F1-score	Support
0	0.84	0.60	0.70	12209
1	0.70	0.72	0.71	24316
2	0.41	0.49	0.44	10499
3	0.38	0.51	0.43	5833
4	0.10	0.24	0.14	1342
5	0.87	0.39	0.54	5656
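The report above evaluates a regressor as if it were a six-class classifier, which requires mapping each continuous prediction to an integer class. A minimal sketch of one such mapping follows, assuming the prediction is rounded to the nearest integer and clipped to the 0 to 5 range. This is an assumption about the evaluation script, not something the card states, and the function name is hypothetical.

```python
def to_class(score, low=0, high=5):
    """Map a continuous quality score to an integer class in [low, high].

    Assumption: round to the nearest integer, then clip. Python's round()
    rounds exact halves to the nearest even integer (round(2.5) == 2).
    """
    return int(min(max(round(score), low), high))


if __name__ == "__main__":
    for raw in (-0.4, 1.2, 3.7, 6.2):
        print(raw, "->", to_class(raw))
```

Clipping matters because a regression head is unbounded: scores slightly below 0 or above 5 would otherwise fall outside the label set.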

Overall Metrics

Metric	Value
Accuracy	0.59
Macro Avg Precision	0.55
Macro Avg Recall	0.49
Macro Avg F1-score	0.50
Weighted Avg Precision	0.65
Weighted Avg Recall	0.59
Weighted Avg F1-score	0.61
Support	59855

Confusion Matrix

	Predicted 0	Predicted 1	Predicted 2	Predicted 3	Predicted 4	Predicted 5
Actual 0	7318	4278	529	63	19	2
Actual 1	1364	17602	4414	785	135	16
Actual 2	38	2615	5130	2289	369	58
Actual 3	10	333	1726	2952	664	148
Actual 4	3	83	350	476	324	106
Actual 5	6	98	479	1205	1639	2229
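The per-class precision, recall, and F1 values, as well as the overall accuracy, can be recomputed from the confusion matrix. The sketch below shows the arithmetic in plain Python, with rows as actual classes and columns as predicted classes. The function names are illustrative and not taken from the evaluation script.

```python
# Confusion matrix copied from the table above: rows = actual, columns = predicted.
CONFUSION = [
    [7318, 4278, 529, 63, 19, 2],
    [1364, 17602, 4414, 785, 135, 16],
    [38, 2615, 5130, 2289, 369, 58],
    [10, 333, 1726, 2952, 664, 148],
    [3, 83, 350, 476, 324, 106],
    [6, 98, 479, 1205, 1639, 2229],
]


def per_class_metrics(matrix):
    """Return a list of dicts with precision, recall, f1 and support per class."""
    n = len(matrix)
    rows = []
    for k in range(n):
        true_pos = matrix[k][k]
        support = sum(matrix[k])                         # all items whose actual class is k
        predicted = sum(matrix[i][k] for i in range(n))  # all items predicted as k
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        rows.append({"precision": precision, "recall": recall,
                     "f1": f1, "support": support})
    return rows


def accuracy(matrix):
    """Fraction of items on the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    return correct / total


if __name__ == "__main__":
    for label, m in enumerate(per_class_metrics(CONFUSION)):
        print(label, round(m["precision"], 2), round(m["recall"], 2),
              round(m["f1"], 2), m["support"])
    print("accuracy", round(accuracy(CONFUSION), 4))
```

Macro averages are the unweighted means of the per-class values, while weighted averages weight each class by its support.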

Evaluation Metrics

Metric	Value
Eval Loss	0.673861563205719
Eval Precision	0.5502142676492386
Eval Recall	0.49225148166352145
Eval F1 Macro	0.49616318856882935
Eval Accuracy	0.5940188789574806
Eval Runtime (s)	285.9726
Eval Samples per Second	209.303
Eval Steps per Second	3.273
Epoch	19.96

Training Runtime

Metric	Value
Train Runtime (s)	105056.8322
Train Samples per Second	102.552
Train Steps per Second	1.603
Train Loss	0.6785072675819606
Epoch	20.0

Run Summary

Metric	Value
Eval Accuracy	0.59402
Eval F1 Macro	0.49616
Eval Loss	0.67386
Eval Precision	0.55021
Eval Recall	0.49225
Eval Runtime (s)	285.9726
Eval Samples per Second	209.303
Eval Steps per Second	3.273
Total FLOPs	2.8346790572921083e+18
Train Epoch	20.0
Train Global Step	168360
Train Grad Norm	2.77268
Train Learning Rate	0.0
Train Loss	0.6201
Train Loss (Final)	0.67851
Train Runtime (s)	105056.8322
Train Samples per Second	102.552
Train Steps per Second	1.603
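A few derived quantities help put the run summary in context. The sketch below assumes that the runtime is reported in seconds and that throughput was roughly constant. The resulting figures are estimates inferred from the logged values, not numbers reported by the training run.

```python
# Values copied from the run summary above.
TRAIN_RUNTIME_S = 105056.8322   # assumed to be seconds
TRAIN_SAMPLES_PER_S = 102.552
GLOBAL_STEPS = 168360
EPOCHS = 20


def derived_training_stats(runtime_s, samples_per_s, global_steps, epochs):
    """Estimate wall-clock hours, steps per epoch and samples per step."""
    total_samples = samples_per_s * runtime_s
    return {
        "hours": runtime_s / 3600,
        "steps_per_epoch": global_steps / epochs,
        "samples_per_step": total_samples / global_steps,  # approximate effective batch size
        "samples_per_epoch": total_samples / epochs,       # approximate training-set size
    }


if __name__ == "__main__":
    stats = derived_training_stats(
        TRAIN_RUNTIME_S, TRAIN_SAMPLES_PER_S, GLOBAL_STEPS, EPOCHS
    )
    for name, value in stats.items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions the run took a little over 29 hours, with 8418 optimizer steps per epoch and roughly 64 samples per step.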