feralvam committed on
Commit
0601979
1 Parent(s): 22e8bfc

Update links to dataset

  1. app.py +2 -2
app.py CHANGED
@@ -24,7 +24,7 @@ We aim to contribute to the development of **neural models for readability asses
 
  ### Dataset
 
- We curated a new dataset that combines existing corpora for readability assessment (i.e. [Newsela](https://newsela.com/data)) and texts scraped from webpages aimed at learners of Spanish as a second language. Texts in the Newsela corpus contain the grade level (according to the USA educational system) that they were written for. In the case of scraped texts, we selected webpages that explicitly indicated the [CEFR](https://en.wikipedia.org/wiki/Common_European_Framework_of_Reference_for_Languages) level that each text belongs to.
+ We curated a new dataset that combines corpora for readability assessment (e.g. [Newsela](https://aclanthology.org/Q15-1021/)) and text simplification (e.g. [Simplext](https://link.springer.com/article/10.1007/s10579-014-9265-4)), with texts scraped from webpages aimed at learners of Spanish as a second language (e.g. [hablacultura](https://hablacultura.com/cultura-textos-aprender-espanol/) and [kwiziq](https://spanish.kwiziq.com/learn/reading)). Texts in the Newsela corpus contain the grade level (according to the USA educational system) that they were written for. In the case of scraped texts, we selected webpages that explicitly indicated the [CEFR](https://en.wikipedia.org/wiki/Common_European_Framework_of_Reference_for_Languages) level that each text belongs to.
 
  In our dataset, each text has two readability labels, according to the following mapping:
 
@@ -34,7 +34,7 @@ In our dataset, each text has two readability labels, according to the following
  | With CERF Levels | A1, A2, B1 | B2, C1, C2 | A1, A2 | B1,B2 | C1,C2 |
  | Newsela Corpus | Versions 3-4 | Versions 0-1 | Grade Level 2-5 | Grade Level 6-8 | Grade Level 9-12 |
 
- In addition, texts in the dataset could be too long to fit in a model. As such, we created two versions of the dataset, dividing each text into [sentences](https://huggingface.co/datasets/hackathon-pln-es/readability-es-sentences) and [paragraphs](https://huggingface.co/datasets/hackathon-pln-es/readability-es-paragraphs).
+ In addition, texts in the dataset could be too long to fit in a model. As such, we created two versions of the dataset, dividing each text into sentences and paragraphs. Due to licenses attached to these datasets and webpages, some of the texts cannot be publicly-shared. The public version of the data we used is available [here](https://huggingface.co/datasets/hackathon-pln-es/readability-es-hackathon-pln-public).
 
  We also scraped several texts from the ["Corpus de Aprendices del Español" (CAES)](http://galvan.usc.es/caes/). However, due to the time constraints, we leave experiments with it for future work. This data is available [here](https://huggingface.co/datasets/hackathon-pln-es/readability-es-caes).
 
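
Note on the label mapping: the table row for CEFR levels in the diff can be read as two lookups, one with two classes and one with three. The sketch below is only an illustration of that reading and is not taken from app.py. The class names (`simple`/`complex`, `basic`/`intermediate`/`advanced`) and the function name `cefr_to_labels` are assumptions, since the table's header row falls outside the hunks shown here.

```python
# Hypothetical class names: the diff shows only the level groupings,
# not the column headers of the mapping table.
TWO_CLASS = {
    "A1": "simple", "A2": "simple", "B1": "simple",
    "B2": "complex", "C1": "complex", "C2": "complex",
}
THREE_CLASS = {
    "A1": "basic", "A2": "basic",
    "B1": "intermediate", "B2": "intermediate",
    "C1": "advanced", "C2": "advanced",
}


def cefr_to_labels(level):
    """Return the (two-class, three-class) labels for a CEFR level string."""
    key = level.strip().upper()
    if key not in TWO_CLASS:
        raise ValueError(f"unknown CEFR level: {level!r}")
    return TWO_CLASS[key], THREE_CLASS[key]
```

B1 is the level where the two groupings disagree most visibly: it sits in the lower group of the two-way split but in the middle group of the three-way split, so `cefr_to_labels("B1")` gives one label from each table.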