Automatic Evaluation Models for Textual Data Quality (NL & CL)

This model automatically assesses the quality of textual data on a clear, intuitive four-level scale, covering both natural language (NL) and code language (CL). Two approaches are compared: a single unified model and separate NL/CL models (see Performance below; a usage sketch follows the language list).

Classification Categories:

  • Harmful: potentially incorrect or dangerous content.
  • Low: Low-quality data with major issues.
  • Medium: Medium quality, improvable but acceptable.
  • High: Good to very good quality data, ready for use without reservation.

Supported Languages:

  • Natural Language: French 🇫🇷, English 🇬🇧, Spanish 🇪🇸
  • Code Language: Python 🐍, Java ☕, JavaScript 📜, C/C++ ⚙️
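
A minimal usage sketch with the transformers library, assuming the standard sequence-classification loading path; the label mapping is read from the model config, and trust_remote_code=True reflects the custom-code requirement noted on the hub page. Treat the details as assumptions rather than the card's official API.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Sketch only: the exact head/config details are assumptions, not taken from the card.
model_id = "TempestTeam/EuroBERT-210m-Quality"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(model_id, trust_remote_code=True)
model.eval()

def rate_quality(text: str) -> str:
    """Return the predicted quality category (Harmful/Low/Medium/High) for a snippet."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    # Assumes the config's id2label carries the four category names.
    return model.config.id2label[int(logits.argmax(dim=-1))]

print(rate_quality("def add(a, b):\n    return a + b"))
```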

Performance

  • F1-scores: Unified Model (NL + CL)

    Category   Global (NL + CL)   NL     CL
    Harmful    0.81               0.87   0.75
    Low        0.60               0.72   0.44
    Medium     0.60               0.74   0.49
    High       0.74               0.77   0.72
    Accuracy   0.70               0.78   0.62
  • F1-scores: Separate Models

    Category   Global (NL + CL)   NL     CL
    Harmful    0.83               0.89   0.78
    Low        0.59               0.71   0.46
    Medium     0.63               0.77   0.49
    High       0.76               0.79   0.73
    Accuracy   0.71               0.80   0.63

Key Performance Metrics (an evaluation sketch follows this list):

  • Unified Model (NL + CL):

    • Overall accuracy: ~70%
    • High reliability on harmful data (F1-score: 0.81)
  • Separate Models:

    • Natural Language (NL): ~80% accuracy
      • Excellent performance on harmful data (F1-score: 0.89)
    • Code Language (CL): ~63% accuracy
      • Good detection of harmful data (F1-score: 0.78)
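
Per-category scores of this shape can be reproduced with standard tooling once held-out labels and predictions are available; the scikit-learn call below is an illustrative assumption, not the card's stated evaluation pipeline.

```python
from sklearn.metrics import classification_report

# Hypothetical labels/predictions for illustration only; the card's
# actual evaluation data is not published here.
y_true = ["High", "Low", "Harmful", "Medium", "High", "Low"]
y_pred = ["High", "Medium", "Harmful", "Medium", "High", "Low"]

# Prints per-category precision/recall/F1 plus overall accuracy,
# matching the layout of the tables above.
print(classification_report(y_true, y_pred, digits=2))
```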

Common Use Cases:

  • Automatic validation of text corpora before integration into NLP or code-generation pipelines (see the filtering sketch after this list).
  • Quality assessment of community contributions (forums, Stack Overflow, GitHub).
  • Automated pre-processing to enhance NLP or code generation system performance.
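
A minimal filtering sketch for the first use case, assuming the rate_quality helper defined earlier; the acceptance threshold of Medium or better is itself an assumption.

```python
# Keep only samples rated Medium or High before pipeline ingestion.
ACCEPTED = {"Medium", "High"}

def filter_corpus(samples):
    """Yield only the samples the quality model rates as acceptable."""
    for text in samples:
        if rate_quality(text) in ACCEPTED:
            yield text

raw_corpus = ["some raw document ...", "print('hello world')"]
clean_corpus = list(filter_corpus(raw_corpus))
```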

Recommendations:

  • For specialized contexts, use the separate NL and CL models for optimal results (a routing sketch follows this list).
  • The unified model is suitable for quick assessments when the data context is unknown or mixed.
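
A routing sketch for this recommendation; the repository names for the separate NL and CL models are purely hypothetical placeholders, since the card only gives the unified model's id.

```python
from typing import Optional

# Hypothetical repository names: only the unified model id appears in this card.
NL_MODEL = "TempestTeam/EuroBERT-210m-Quality-NL"  # assumption
CL_MODEL = "TempestTeam/EuroBERT-210m-Quality-CL"  # assumption
UNIFIED_MODEL = "TempestTeam/EuroBERT-210m-Quality"

def pick_model(domain: Optional[str]) -> str:
    """Route to a specialized model when the data domain is known."""
    if domain == "natural_language":
        return NL_MODEL
    if domain == "code":
        return CL_MODEL
    return UNIFIED_MODEL  # unknown or mixed context
```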

Citation

Please cite or link back to this model on the Hugging Face Hub if you use it in your projects.
