
language: es
license: cc-by-4.0
tags:
  - spanish
  - roberta
  - bertin
pipeline_tag: text-classification
widget:
  - text: >-
      La cueva de Zaratustra en el Pretil de los Consejos. Rimeros de libros
      hacen escombro y cubren las paredes. Empapelan los cuatro vidrios de una
      puerta cuatro cromos espeluznantes de un novelón por entregas. En la cueva
      hacen tertulia el gato, el can, el loro y el librero. Zaratustra, abichado
      y giboso -la cara de tocino rancio y la bufanda de verde serpiente-
      promueve con su caracterización de fantoche, una aguda y dolorosa
      disonancia muy emotiva y muy moderna. Encogido en el roto pelote de su
      silla enana, con los pies entrapados y cepones en la tarima del brasero,
      guarda la tienda. Un ratón saca el hocico intrigante por un agujero.

Readability ES Paragraphs for two classes

A RoBERTa-based model, fine-tuned from BERTIN, for assessing the readability of Spanish texts.

Description and performance

This version of the model was trained on a mix of datasets, using paragraph-level granularity where possible. The model performs binary classification between the following classes:

- Simple.
- Complex.
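
As an illustration of how the binary classifier might be called, here is a minimal sketch using the `transformers` text-classification pipeline. The repository id `somosnlp-hackathon-2022/readability-es-paragraphs` is assumed, the helper names `load_classifier` and `classify_paragraphs` are hypothetical, and the exact label strings returned depend on the model's configuration, so check them before relying on them.

```python
from typing import Callable, Dict, List, Tuple

# Assumed Hub repository id for this model; adjust if it differs.
MODEL_ID = "somosnlp-hackathon-2022/readability-es-paragraphs"


def load_classifier() -> Callable[[List[str]], List[Dict]]:
    """Build a text-classification pipeline (downloads the model on first use)."""
    from transformers import pipeline  # lazy import; requires `transformers`

    pipe = pipeline("text-classification", model=MODEL_ID)
    # truncation=True cuts long paragraphs to the model's maximum input length.
    return lambda texts: pipe(texts, truncation=True)


def classify_paragraphs(
    texts: List[str], classifier: Callable[[List[str]], List[Dict]]
) -> List[Tuple[str, str, float]]:
    """Pair each paragraph with its predicted label and confidence score."""
    predictions = classifier(texts)
    return [(text, pred["label"], pred["score"]) for text, pred in zip(texts, predictions)]


# Example usage (needs network access to fetch the model):
# classifier = load_classifier()
# for text, label, score in classify_paragraphs(["El gato duerme en la cama."], classifier):
#     print(f"{label} ({score:.3f}): {text}")
```

Passing the classifier in as an argument keeps `classify_paragraphs` easy to try out with a stub before downloading the real model.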

It achieves a macro-averaged F1 score of 0.8891, measured on the validation set.
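
For readers unfamiliar with the metric, macro-averaged F1 computes an F1 score for each class separately and then takes their unweighted mean, so both classes count equally regardless of how many examples each has. The sketch below shows the computation on a tiny made-up example. The labels and predictions are illustrative only, not data from this model's validation set, and the function names are hypothetical.

```python
from typing import Sequence


def f1_for_class(y_true: Sequence[str], y_pred: Sequence[str], label: str) -> float:
    """F1 for one class: 2*TP / (2*TP + FP + FN), or 0.0 if the class never appears."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str], labels: Sequence[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, label) for label in labels) / len(labels)


# Illustrative data only (not taken from the model's validation set).
y_true = ["simple", "simple", "complex", "complex"]
y_pred = ["simple", "complex", "complex", "complex"]
score = macro_f1(y_true, y_pred, labels=["simple", "complex"])
print(f"macro F1 = {score:.4f}")  # simple: 2/3, complex: 4/5, mean = 0.7333
```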

Model variants

- readability-es-sentences. Two classes, sentence-based dataset.
- readability-es-paragraphs (this model). Two classes, paragraph-based dataset.
- readability-es-3class-sentences. Three classes, sentence-based dataset.
- readability-es-3class-paragraphs. Three classes, paragraph-based dataset.
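
The four variants differ along two axes, number of classes and text granularity, so choosing one can be expressed as a simple lookup. The sketch below is only an illustration: the `pick_variant` helper is hypothetical, and the names it returns are the variant names listed above, not verified repository paths.

```python
# Variant names as listed above, keyed by (number of classes, granularity).
VARIANTS = {
    (2, "sentences"): "readability-es-sentences",
    (2, "paragraphs"): "readability-es-paragraphs",
    (3, "sentences"): "readability-es-3class-sentences",
    (3, "paragraphs"): "readability-es-3class-paragraphs",
}


def pick_variant(num_classes: int, granularity: str) -> str:
    """Return the variant name for the requested class count and granularity."""
    try:
        return VARIANTS[(num_classes, granularity)]
    except KeyError:
        raise ValueError(
            f"No variant for {num_classes} classes at {granularity!r} granularity"
        ) from None


print(pick_variant(2, "paragraphs"))
```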

Datasets

readability-es-hackathon-pln-public, composed of:
- coh-metrix-esp corpus.
- Various text resources scraped from websites.
Other non-public datasets: newsela-es, simplext.

Training details

Please refer to this training run for full details on the hyperparameters and training regime.

Biases and Limitations

Due to the scarcity of data and the lack of a reliable gold test set, performance metrics are reported on the validation set.
One of the datasets involved is the Spanish version of Newsela, which is frequently used as a reference. However, it was created by translating earlier datasets, so it may contain somewhat unnatural phrasing.
Some of the datasets used cannot be publicly disseminated, making it more difficult to assess the existence of biases or mistakes.
The language may be biased towards the variety of Spanish spoken in Spain. Other regional variants may be underrepresented.
No effort has been made to mitigate the shortcomings and biases described for the original implementation of BERTIN.
