|
--- |
|
language: es |
|
license: cc-by-4.0 |
|
tags: |
|
- spanish |
|
- roberta |
|
- bertin |
|
pipeline_tag: text-classification |
|
widget: |
|
- text: Las Líneas de Nazca son una serie de marcas trazadas en el suelo, cuya anchura oscila entre los 40 y los 110 centímetros. |
|
- text: Hace mucho tiempo, en el gran océano que baña las costas del Perú no había peces. |
|
--- |
|
|
|
# Readability ES Paragraphs for three classes |
|
|
|
Model based on the RoBERTa architecture, fine-tuned from [BERTIN](https://huggingface.co/bertin-project/bertin-roberta-base-spanish) for readability assessment of Spanish texts. |
|
|
|
## Description and performance |
|
|
|
This version of the model was trained on a mix of datasets, using paragraph-level granularity when possible. The model classifies each text into one of three complexity levels: |
|
- Basic. |
|
- Intermediate. |
|
- Advanced. |
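
The three-way classification described above can be tried with the `transformers` text-classification pipeline. The snippet below is a minimal sketch, not an official usage recipe: the Hub id in `MODEL_ID` is an assumption inferred from the naming of the sibling variants, the helper names (`load_classifier`, `best_label`, `classify`) are hypothetical, and the label strings returned depend on the model configuration, so they are passed through unchanged.

```python
# Assumed Hub id, inferred from the naming of the other variants in this card.
MODEL_ID = "hackathon-pln-es/readability-es-3class-paragraphs"


def load_classifier(model_id=MODEL_ID):
    """Build a text-classification pipeline (downloads the model on first use)."""
    # Imported lazily so the helpers below work without transformers installed.
    from transformers import pipeline

    return pipeline("text-classification", model=model_id)


def best_label(scores):
    """Return the (label, score) pair with the highest score."""
    best = max(scores, key=lambda item: item["score"])
    return best["label"], best["score"]


def classify(text, classifier):
    """Classify one paragraph and return its most likely complexity level."""
    # top_k=None asks the pipeline for the score of every class.
    scores = classifier([text], top_k=None)[0]
    return best_label(scores)


# Example (requires network access and the transformers package):
# classifier = load_classifier()
# print(classify("Hace mucho tiempo, en el gran océano no había peces.", classifier))
```

Returning the full score list before picking the maximum makes it easy to inspect how close the three levels are for borderline texts.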
|
|
|
The relationship of these categories with the Common European Framework of Reference for Languages is described in [our report](https://wandb.ai/readability-es/readability-es/reports/Texts-Readability-Analysis-for-Spanish--VmlldzoxNzU2MDUx). |
|
|
|
This model achieves a macro-averaged F1 score of 0.7881, measured on the validation set. |
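
For readers unfamiliar with the metric, the macro average gives each of the three classes equal weight regardless of how many examples it has. The following is a small illustrative sketch in plain Python; the label names and the toy predictions are made up for the example and are not taken from the validation set or the model configuration.

```python
from collections import Counter


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical label names and toy data, for illustration only.
LABELS = ["basic", "intermediate", "advanced"]
y_true = ["basic", "basic", "intermediate", "advanced", "advanced", "intermediate"]
y_pred = ["basic", "intermediate", "intermediate", "advanced", "basic", "intermediate"]

print(round(macro_f1(y_true, y_pred, LABELS), 4))
```

In this toy case the per-class F1 scores are 0.5, 0.8 and 2/3, so the macro average is their plain mean.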
|
|
|
## Model variants |
|
|
|
- [`readability-es-sentences`](https://huggingface.co/hackathon-pln-es/readability-es-sentences). Two classes, sentence-based dataset. |
|
- [`readability-es-paragraphs`](https://huggingface.co/hackathon-pln-es/readability-es-paragraphs). Two classes, paragraph-based dataset. |
|
- [`readability-es-3class-sentences`](https://huggingface.co/hackathon-pln-es/readability-es-3class-sentences). Three classes, sentence-based dataset. |
|
- `readability-es-3class-paragraphs` (this model). Three classes, paragraph-based dataset. |
|
|
|
## Datasets |
|
|
|
- [`readability-es-hackathon-pln-public`](https://huggingface.co/datasets/hackathon-pln-es/readability-es-hackathon-pln-public), composed of: |
|
* coh-metrix-esp corpus. |
|
* Various text resources scraped from websites. |
|
- Other non-public datasets: newsela-es, simplext. |
|
|
|
## Training details |
|
|
|
Please refer to [this training run](https://wandb.ai/readability-es/readability-es/runs/22apaysv/overview) for full details on hyperparameters and training regime. |
|
|
|
## Biases and Limitations |
|
|
|
- Due to the scarcity of data and the lack of a reliable gold test set, performance metrics are reported on the validation set. |
|
- One of the datasets involved is the Spanish version of newsela, which is frequently used as a reference. However, it was created by translating previous datasets, and therefore it may contain somewhat unnatural phrases. |
|
- Some of the datasets used cannot be publicly disseminated, making it more difficult to assess the existence of biases or mistakes. |
|
- Language might be biased towards the Spanish dialect spoken in Spain. Other regional variants might be underrepresented. |
|
- No effort has been made to alleviate the shortcomings and biases described in the [original implementation of BERTIN](https://huggingface.co/bertin-project/bertin-roberta-base-spanish#bias-examples-spanish). |
|
|
|
## Authors |
|
|
|
- [Laura Vásquez-Rodríguez](https://lmvasque.github.io/) |
|
- [Pedro Cuenca](https://twitter.com/pcuenq) |
|
- [Sergio Morales](https://www.fireblend.com/) |
|
- [Fernando Alva-Manchego](https://feralvam.github.io/) |
|
|
|
|