A model based on the RoBERTa architecture, fine-tuned from BERTIN for readability assessment of Spanish texts.
This version of the model was trained on a mix of datasets, using sentence-level granularity where possible. The model classifies texts into one of three complexity levels.
The relationship of these categories with the Common European Framework of Reference for Languages is described in our report.
This model achieves an F1 macro-average score of 0.6951, measured on the validation set.
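To make the reported metric concrete, the following is a minimal sketch of how a macro-averaged F1 score is computed: a per-class F1 is calculated for each of the three complexity levels and then averaged with equal weight per class. The labels and predictions below are hypothetical examples, not data from the actual validation set.

```python
# Hypothetical gold labels and predictions for illustration only;
# the real validation set is not public.
y_true = ["basic", "basic", "intermediate", "advanced", "advanced", "intermediate"]
y_pred = ["basic", "intermediate", "intermediate", "advanced", "basic", "intermediate"]

def f1_macro(y_true, y_pred, labels):
    """Average the per-class F1 scores, weighting every class equally."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)

score = f1_macro(y_true, y_pred, ["basic", "intermediate", "advanced"])
print(round(score, 4))
```

Because each class contributes equally regardless of its frequency, macro averaging penalizes poor performance on rare classes, which matters when complexity levels are imbalanced.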
- readability-es-sentences. Two classes, sentence-based dataset.
- readability-es-paragraphs. Two classes, paragraph-based dataset.
- readability-es-3class-sentences (this model). Three classes, sentence-based dataset.
- readability-es-3class-paragraphs. Three classes, paragraph-based dataset.
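A minimal usage sketch with the `transformers` text-classification pipeline follows. The model id below is an assumption inferred from the project's naming; adjust it if the repository lives under a different organization, and note that the exact label names returned depend on the model's configuration.

```python
from transformers import pipeline

# Assumed repository id; verify against the actual model page before use.
classifier = pipeline(
    "text-classification",
    model="hackathon-pln-es/readability-es-3class-sentences",
)

# Classify a single Spanish sentence into one of the three complexity levels.
result = classifier(
    "La fotosíntesis es el proceso mediante el cual las plantas "
    "convierten la luz en energía."
)
print(result)  # a list of {'label': ..., 'score': ...} dicts
```

Because the model was trained mostly at sentence granularity, feeding it one sentence at a time should match the training conditions more closely than passing whole paragraphs.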
readability-es-hackathon-pln-public, composed of:
- coh-metrix-esp corpus.
- Various text resources scraped from websites.
- Other non-public datasets: newsela-es, simplext.
Please refer to this training run for full details on hyperparameters and the training regime.
- Due to the scarcity of data and the lack of a reliable gold test set, performance metrics are reported on the validation set.
- One of the datasets involved is the Spanish version of Newsela, which is frequently used as a reference. However, it was created by translating existing datasets and may therefore contain somewhat unnatural phrasing.
- Some of the datasets used cannot be publicly disseminated, making it more difficult to assess the existence of biases or mistakes.
- The language might be biased towards the Spanish dialect spoken in Spain; other regional variants might be underrepresented.
- No effort has been made to alleviate the shortcomings and biases described in the original implementation of BERTIN.