Automatic Evaluation Models for Textual Data Quality (NL & CL)

This model automatically assesses the quality of textual data on a clear, intuitive four-level scale, covering both natural language (NL) and code language (CL). Two approaches are compared: a single unified model and separate NL/CL models (see Performance below; a usage sketch follows the language list).

Classification Categories:

  • Harmful: potentially incorrect or dangerous content.
  • Low: Low-quality data with major issues.
  • Medium: Medium quality, improvable but acceptable.
  • High: Good to very good quality data, ready for use without reservation.

Supported Languages:

  • Natural Language: French 🇫🇷, English 🇬🇧, Spanish 🇪🇸
  • Code Language: Python 🐍, Java ☕, JavaScript 📜, C/C++ ⚙️
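
A minimal usage sketch with the transformers library, assuming the standard sequence-classification loading path; the label mapping is read from the model config, and trust_remote_code=True reflects the custom-code requirement noted on the hub page. Treat the details as assumptions rather than the card's official API.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Sketch only: the exact head/config details are assumptions, not taken from the card.
model_id = "TempestTeam/EuroBERT-210m-Quality"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(model_id, trust_remote_code=True)
model.eval()

def rate_quality(text: str) -> str:
    """Return the predicted quality category (Harmful/Low/Medium/High) for a snippet."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    # Assumes the config's id2label carries the four category names.
    return model.config.id2label[int(logits.argmax(dim=-1))]

print(rate_quality("def add(a, b):\n    return a + b"))
```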

Performance

  • F1-scores: Unified Model (NL + CL)

    Category   Global (NL + CL)   NL     CL
    Harmful    0.81               0.87   0.75
    Low        0.60               0.72   0.44
    Medium     0.60               0.74   0.49
    High       0.74               0.77   0.72
    Accuracy   0.70               0.78   0.62
  • F1-scores: Separate Models

    Category   Global (NL + CL)   NL     CL
    Harmful    0.83               0.89   0.78
    Low        0.59               0.71   0.46
    Medium     0.63               0.77   0.49
    High       0.76               0.79   0.73
    Accuracy   0.71               0.80   0.63

Key Performance Metrics (an evaluation sketch follows this list):

  • Unified Model (NL + CL):

    • Overall accuracy: ~70%
    • High reliability on harmful data (F1-score: 0.81)
  • Separate Models:

    • Natural Language (NL): ~80% accuracy
      • Excellent performance on harmful data (F1-score: 0.89)
    • Code Language (CL): ~63% accuracy
      • Good detection of harmful data (F1-score: 0.78)
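
Per-category scores of this shape can be reproduced with standard tooling once held-out labels and predictions are available; the scikit-learn call below is an illustrative assumption, not the card's stated evaluation pipeline.

```python
from sklearn.metrics import classification_report

# Hypothetical labels/predictions for illustration only; the card's
# actual evaluation data is not published here.
y_true = ["High", "Low", "Harmful", "Medium", "High", "Low"]
y_pred = ["High", "Medium", "Harmful", "Medium", "High", "Low"]

# Prints per-category precision/recall/F1 plus overall accuracy,
# matching the layout of the tables above.
print(classification_report(y_true, y_pred, digits=2))
```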

Common Use Cases:

  • Automatic validation of text corpora before integration into NLP or code-generation pipelines (see the filtering sketch after this list).
  • Quality assessment of community contributions (forums, Stack Overflow, GitHub).
  • Automated pre-processing to enhance NLP or code generation system performance.
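
A minimal filtering sketch for the first use case, assuming the rate_quality helper defined earlier; the acceptance threshold of Medium or better is itself an assumption.

```python
# Keep only samples rated Medium or High before pipeline ingestion.
ACCEPTED = {"Medium", "High"}

def filter_corpus(samples):
    """Yield only the samples the quality model rates as acceptable."""
    for text in samples:
        if rate_quality(text) in ACCEPTED:
            yield text

raw_corpus = ["some raw document ...", "print('hello world')"]
clean_corpus = list(filter_corpus(raw_corpus))
```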

Recommendations:

  • For specialized contexts, use the separate NL and CL models for optimal results (a routing sketch follows this list).
  • The unified model is suitable for quick assessments when the data context is unknown or mixed.
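
A routing sketch for this recommendation; the repository names for the separate NL and CL models are purely hypothetical placeholders, since the card only gives the unified model's id.

```python
from typing import Optional

# Hypothetical repository names: only the unified model id appears in this card.
NL_MODEL = "TempestTeam/EuroBERT-210m-Quality-NL"  # assumption
CL_MODEL = "TempestTeam/EuroBERT-210m-Quality-CL"  # assumption
UNIFIED_MODEL = "TempestTeam/EuroBERT-210m-Quality"

def pick_model(domain: Optional[str]) -> str:
    """Route to a specialized model when the data domain is known."""
    if domain == "natural_language":
        return NL_MODEL
    if domain == "code":
        return CL_MODEL
    return UNIFIED_MODEL  # unknown or mixed context
```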

Citation

Please cite or link back to this model on the Hugging Face Hub if you use it in your projects.
