Text & Code Data — Quality Quantification
A suite of models and a dataset designed to assess the quality of natural language and code data.
These models automatically assess the quality of textual data on a clear and intuitive scale, adapted to both natural language (NL) and code language (CL).
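As an illustration, inference could look like the minimal sketch below, assuming the classifiers are published as standard `transformers` sequence-classification checkpoints. The repo id `user/quality-classifier-nl` is a hypothetical placeholder, not an actual item of this collection.

```python
# Minimal inference sketch. The repo id below is a hypothetical placeholder;
# substitute one of the classifiers from this collection.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="user/quality-classifier-nl",  # hypothetical repo id
    trust_remote_code=True,  # EuroBERT-based checkpoints ship custom modeling code
)

samples = [
    "The mitochondrion is the powerhouse of the cell.",
    "def add(a, b):\n    return a + b",
]
for text in samples:
    pred = classifier(text)[0]
    # Expected labels: Harmful, Low, Medium, High
    print(f"{pred['label']:>8}  score={pred['score']:.3f}")
```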
We compare two distinct approaches: a single unified model trained on both domains, and separate models for NL and CL.

Unified Model (NL + CL):

Category | Global (NL + CL) | NL | CL |
---|---|---|---|
Harmful | 0.81 | 0.87 | 0.75 |
Low | 0.60 | 0.72 | 0.44 |
Medium | 0.60 | 0.74 | 0.49 |
High | 0.74 | 0.77 | 0.72 |
Accuracy | 0.70 | 0.78 | 0.62 |

Separate Models:

Category | Global (NL + CL) | NL | CL |
---|---|---|---|
Harmful | 0.83 | 0.89 | 0.78 |
Low | 0.59 | 0.71 | 0.46 |
Medium | 0.63 | 0.77 | 0.49 |
High | 0.76 | 0.79 | 0.73 |
Accuracy | 0.71 | 0.80 | 0.63 |

The separate models edge out the unified model in nearly every category, for both NL and CL.
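The card does not name the metric behind the per-category numbers; they are consistent with per-class F1 scores alongside overall accuracy. Under that assumption, a table like the ones above could be reproduced along these lines, on toy placeholder predictions:

```python
# Sketch of computing per-category scores and accuracy, assuming the table
# reports per-class F1 (an assumption; the metric is not named in the card).
# y_true / y_pred are toy placeholders, not the collection's dataset.
from sklearn.metrics import accuracy_score, f1_score

labels = ["Harmful", "Low", "Medium", "High"]
y_true = ["High", "Low", "Harmful", "Medium", "High", "Low"]
y_pred = ["High", "Medium", "Harmful", "Medium", "High", "Low"]

for label, score in zip(labels, f1_score(y_true, y_pred, labels=labels, average=None)):
    print(f"{label:>8}: {score:.2f}")
print(f"Accuracy: {accuracy_score(y_true, y_pred):.2f}")
```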
Please cite or link back to this dataset on the Hugging Face Hub if you use it in your projects.
Base model: EuroBERT/EuroBERT-210m
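For reference, a minimal sketch of how a four-way quality classifier could be initialized from this base model, assuming its remote code exposes a sequence-classification head; the label order is an illustrative assumption:

```python
# Minimal fine-tuning setup sketch (assumptions: the EuroBERT remote code
# provides a sequence-classification head, and this label order).
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "EuroBERT/EuroBERT-210m"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    num_labels=4,
    id2label={0: "Harmful", 1: "Low", 2: "Medium", 3: "High"},  # assumed order
    label2id={"Harmful": 0, "Low": 1, "Medium": 2, "High": 3},
    trust_remote_code=True,
)
# `model` can now be fine-tuned with the standard `Trainer` API.
```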