pcuenq HF staff committed on
Commit
3f593b2
1 Parent(s): c167fa9

Model card.

  1. README.md +61 -0
README.md ADDED
@@ -0,0 +1,61 @@
+ ---
+ language: es
+ license: cc-by-4.0
+ tags:
+ - spanish
+ - roberta
+ - bertin
+ pipeline_tag: text-classification
+ widget:
+ - text: Las Líneas de Nazca son una serie de marcas trazadas en el suelo, cuya anchura oscila entre los 40 y los 110 centímetros.
+ - text: Hace mucho tiempo, en el gran océano que baña las costas del Perú no había peces.
+ ---
+
+ # Readability ES Sentences for three classes
+
+ Model based on the RoBERTa architecture, fine-tuned from [BERTIN](https://huggingface.co/bertin-project/bertin-roberta-base-spanish) for readability assessment of Spanish texts.
+
+ ## Description and performance
+
+ This version of the model was trained on a mix of datasets, using sentence-level granularity when possible. The model classifies text into three complexity levels:
+ - Basic.
+ - Intermediate.
+ - Advanced.
+
+ The relationship between these categories and the Common European Framework of Reference for Languages is described in [our report](https://wandb.ai/readability-es/readability-es/reports/Texts-Readability-Analysis-for-Spanish--VmlldzoxNzU2MDUx).
+
+ This model achieves a macro-averaged F1 score of 0.7881, measured on the validation set.
+
+ ## Model variants
+
+ - [`readability-es-sentences`](https://huggingface.co/hackathon-pln-es/readability-es-sentences). Two classes, sentence-based dataset.
+ - [`readability-es-paragraphs`](https://huggingface.co/hackathon-pln-es/readability-es-paragraphs). Two classes, paragraph-based dataset.
+ - [`readability-es-3class-sentences`](https://huggingface.co/hackathon-pln-es/readability-es-3class-sentences). Three classes, sentence-based dataset.
+ - `readability-es-3class-paragraphs` (this model). Three classes, paragraph-based dataset.
+
+ ## Datasets
+
+ - [`readability-es-sentences`](https://huggingface.co/datasets/hackathon-pln-es/readability-es-sentences), composed of:
+   * coh-metrix-esp corpus.
+   * Various text resources scraped from websites.
+ - Other non-public datasets: newsela-es, simplext.
+
+ ## Training details
+
+ Please refer to [this training run](https://wandb.ai/readability-es/readability-es/runs/22apaysv/overview) for full details on hyperparameters and training regime.
+
+ ## Biases and Limitations
+
+ - Due to the scarcity of data and the lack of a reliable gold test set, performance metrics are reported on the validation set.
+ - One of the datasets involved is the Spanish version of Newsela, which is frequently used as a reference. However, it was created by translating previous datasets, and therefore it may contain somewhat unnatural phrases.
+ - Some of the datasets used cannot be publicly disseminated, making it more difficult to assess the existence of biases or mistakes.
+ - Language might be biased towards the Spanish dialect spoken in Spain. Other regional variants might be underrepresented.
+ - No effort has been made to mitigate the shortcomings and biases described in the [original implementation of BERTIN](https://huggingface.co/bertin-project/bertin-roberta-base-spanish#bias-examples-spanish).
+
+ ## Authors
+
+ - [Laura Vásquez-Rodríguez](https://lmvasque.github.io/)
+ - [Pedro Cuenca](https://twitter.com/pcuenq)
+ - [Sergio Morales](https://www.fireblend.com/)
+ - [Fernando Alva-Manchego](https://feralvam.github.io/)
+
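The card added in this commit declares `pipeline_tag: text-classification` but does not show how to call the model. The following is a minimal sketch, not part of the commit, of how it might be queried with the `transformers` pipeline API. The repository id `hackathon-pln-es/readability-es-3class-paragraphs` is an assumption inferred from the sibling links in the "Model variants" section, the helper names `classify` and `format_prediction` are hypothetical, and the label strings the model returns are not documented in the card.

```python
from typing import Any, Dict, List

# Assumed repository id: the org prefix is inferred from the sibling model links.
MODEL_ID = "hackathon-pln-es/readability-es-3class-paragraphs"


def format_prediction(text: str, result: Dict[str, Any]) -> str:
    """Render one pipeline result ({"label": ..., "score": ...}) as a line of text."""
    return f"{result['label']} ({result['score']:.2f}): {text}"


def classify(texts: List[str], model_id: str = MODEL_ID) -> List[Dict[str, Any]]:
    """Return the top predicted label and score for each input text."""
    # Imported lazily so format_prediction can be used without transformers installed.
    from transformers import pipeline

    classifier = pipeline("text-classification", model=model_id)
    # For a list of strings, the pipeline returns one {"label", "score"} dict per text.
    return classifier(texts)
```

Calling `classify(["Hace mucho tiempo, en el gran océano que baña las costas del Perú no había peces."])` downloads the model weights on first use; each result can then be passed to `format_prediction` together with its input text.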