Model card.
README.md
ADDED
@@ -0,0 +1,61 @@
---
language: es
license: cc-by-4.0
tags:
- spanish
- roberta
- bertin
pipeline_tag: text-classification
widget:
- text: Las Líneas de Nazca son una serie de marcas trazadas en el suelo, cuya anchura oscila entre los 40 y los 110 centímetros.
- text: Hace mucho tiempo, en el gran océano que baña las costas del Perú no había peces.
---

# Readability ES Sentences for three classes

Model based on the RoBERTa architecture, fine-tuned from [BERTIN](https://huggingface.co/bertin-project/bertin-roberta-base-spanish) for readability assessment of Spanish texts.

## Description and performance

This version of the model was trained on a mix of datasets, using sentence-level granularity when possible. The model classifies each input into one of three complexity levels:
- Basic.
- Intermediate.
- Advanced.

The relationship between these categories and the Common European Framework of Reference for Languages (CEFR) is described in [our report](https://wandb.ai/readability-es/readability-es/reports/Texts-Readability-Analysis-for-Spanish--VmlldzoxNzU2MDUx).

This model achieves a macro-averaged F1 score of 0.7881, measured on the validation set.
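
As a reminder of what this metric measures, macro-averaged F1 computes one F1 score per class and takes their unweighted mean, so each complexity level counts equally regardless of how many examples it has. The following is a minimal pure-Python sketch of that calculation. The label strings and the toy predictions are made up for illustration; they are not taken from the validation set, and this is not the evaluation code used for the reported score.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # Convention used here: a class that never appears scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical label names and toy data, for illustration only.
LABELS = ["basic", "intermediate", "advanced"]
y_true = ["basic", "basic", "intermediate", "advanced", "advanced", "intermediate"]
y_pred = ["basic", "intermediate", "intermediate", "advanced", "basic", "intermediate"]

print(round(macro_f1(y_true, y_pred, LABELS), 4))
```

Because the mean is unweighted, a weak score on a rare class lowers the result as much as a weak score on a frequent one.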

## Model variants

- [`readability-es-sentences`](https://huggingface.co/hackathon-pln-es/readability-es-sentences). Two classes, sentence-based dataset.
- [`readability-es-paragraphs`](https://huggingface.co/hackathon-pln-es/readability-es-paragraphs). Two classes, paragraph-based dataset.
- [`readability-es-3class-sentences`](https://huggingface.co/hackathon-pln-es/readability-es-3class-sentences). Three classes, sentence-based dataset.
- `readability-es-3class-paragraphs` (this model). Three classes, paragraph-based dataset.
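
Any of these variants can presumably be loaded through the `transformers` text-classification pipeline. The sketch below shows one way to do it. The default model ID is an assumption inferred from the links to the sibling models, and the label strings returned depend on the model configuration, which this card does not document. The helper names `load_classifier`, `top_label` and `main` are hypothetical.

```python
def load_classifier(model_id="hackathon-pln-es/readability-es-3class-paragraphs"):
    """Build a text-classification pipeline (downloads the model on first use).

    The default model ID is an assumption based on the sibling model links.
    """
    # Imported lazily so the helpers below work without transformers installed.
    from transformers import pipeline
    return pipeline("text-classification", model=model_id)


def top_label(classifier, text):
    """Return (label, score) for the best prediction on a single text."""
    result = classifier(text)[0]  # the pipeline returns a list of dicts
    return result["label"], result["score"]


def main():
    classifier = load_classifier()
    text = "Hace mucho tiempo, en el gran océano que baña las costas del Perú no había peces."
    print(top_label(classifier, text))
```

Calling `main()` requires the `transformers` package and network access to fetch the weights. To try another variant, pass its ID to `load_classifier`.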

## Datasets

- [`readability-es-sentences`](https://huggingface.co/datasets/hackathon-pln-es/readability-es-sentences), composed of:
  * coh-metrix-esp corpus.
  * Various text resources scraped from websites.
- Other non-public datasets: newsela-es, simplext.

## Training details

Please refer to [this training run](https://wandb.ai/readability-es/readability-es/runs/22apaysv/overview) for full details on hyperparameters and training regime.

## Biases and Limitations

- Due to the scarcity of data and the lack of a reliable gold test set, performance metrics are reported on the validation set.
- One of the datasets involved is the Spanish version of Newsela, which is frequently used as a reference. However, it was created by translating previous datasets, and therefore it may contain somewhat unnatural phrases.
- Some of the datasets used cannot be publicly disseminated, which makes it more difficult to assess the existence of biases or mistakes.
- The language might be biased towards the Spanish dialect spoken in Spain. Other regional variants might be underrepresented.
- No effort has been made to alleviate the shortcomings and biases described in the [original implementation of BERTIN](https://huggingface.co/bertin-project/bertin-roberta-base-spanish#bias-examples-spanish).

## Authors

- [Laura Vásquez-Rodríguez](https://lmvasque.github.io/)
- [Pedro Cuenca](https://twitter.com/pcuenq)
- [Sergio Morales](https://www.fireblend.com/)
- [Fernando Alva-Manchego](https://feralvam.github.io/)