import gradio as gr
from transformers import pipeline
title = "Automatic Readability Assessment of Texts in Spanish"
description = """
Is a text **complex** or **simple**? Can it be understood by someone learning Spanish with a **basic**, **intermediate** or **advanced** knowledge of the language? Find out with our models below!
"""
article = """
### What's Readability Assessment?
[Automatic Readability Assessment](https://arxiv.org/abs/2105.00973) consists of determining "how difficult" it could be to read and understand a piece of text.
This could be estimated using readability formulas, such as [Flesch for English](https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests) or [similar ones for Spanish](https://www.siicsalud.com/imagenes/blancopet1.pdf).
However, their dependence on surface statistics (e.g. average sentence length) makes them unreliable.
As such, developing models that can estimate a text's readability by "looking beyond the surface" is a necessity.
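To illustrate the surface statistics these formulas rely on, here is a minimal sketch of the English Flesch Reading Ease score with a naive vowel-group syllable counter (the counter and the function names are our simplification, not a faithful implementation of any published formula):

```python
import re

def naive_syllable_count(word: str) -> int:
    # Very rough heuristic: count groups of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouáéíóúü]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    # Flesch: 206.835 - 1.015*(words/sentence) - 84.6*(syllables/word).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[\wáéíóúüñ]+", text.lower())
    syllables = sum(naive_syllable_count(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))
```

Note that the score depends only on sentence length and syllable counts: two texts with identical surface statistics receive identical scores regardless of vocabulary or syntax, which is exactly the limitation discussed above.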
### Goal
We aim to contribute to the development of **neural models for readability assessment for Spanish**, following previous work for [English](https://aclanthology.org/2021.cl-1.6/) and [Filipino](https://aclanthology.org/2021.ranlp-1.69/).
### Dataset
We curated a new dataset that combines corpora for readability assessment (e.g. [Newsela](https://aclanthology.org/Q15-1021/)) and text simplification (e.g. [Simplext](https://link.springer.com/article/10.1007/s10579-014-9265-4)), with texts scraped from webpages aimed at learners of Spanish as a second language (e.g. [hablacultura](https://hablacultura.com/cultura-textos-aprender-espanol/) and [kwiziq](https://spanish.kwiziq.com/learn/reading)). Texts in the Newsela corpus contain the grade level (according to the USA educational system) that they were written for. In the case of scraped texts, we selected webpages that explicitly indicated the [CEFR](https://en.wikipedia.org/wiki/Common_European_Framework_of_Reference_for_Languages) level that each text belongs to.
In our dataset, each text has two readability labels, according to the following mapping:
| | 2-class | | 3-class | | |
|------------------|--------------|--------------|-----------------|-----------------|------------------|
| | Simple | Complex | Basic | Intermediate | Advanced |
| With CEFR Levels | A1, A2, B1 | B2, C1, C2 | A1, A2 | B1, B2 | C1, C2 |
| Newsela Corpus | Versions 3-4 | Versions 0-1 | Grade Level 2-5 | Grade Level 6-8 | Grade Level 9-12 |
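The mapping above can be written as a small lookup (a sketch; the names below are ours and not part of the released dataset):

```python
# Sketch of the label mapping from the table above (names are illustrative).
CEFR_TO_2CLASS = {"A1": "simple", "A2": "simple", "B1": "simple",
                  "B2": "complex", "C1": "complex", "C2": "complex"}
CEFR_TO_3CLASS = {"A1": "basic", "A2": "basic",
                  "B1": "intermediate", "B2": "intermediate",
                  "C1": "advanced", "C2": "advanced"}
NEWSELA_VERSION_TO_2CLASS = {3: "simple", 4: "simple",
                             0: "complex", 1: "complex"}

def newsela_grade_to_3class(grade: int) -> str:
    # Newsela grade levels follow the USA educational system.
    if 2 <= grade <= 5:
        return "basic"
    if 6 <= grade <= 8:
        return "intermediate"
    if 9 <= grade <= 12:
        return "advanced"
    raise ValueError(f"grade {grade} outside the mapped range")
```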
In addition, texts in the dataset can be too long to fit into a model's maximum input length. As such, we created two versions of the dataset, dividing each text into sentences and paragraphs, respectively. Due to the licenses attached to these datasets and webpages, some of the texts cannot be shared publicly. The public version of the data we used is available [here](https://huggingface.co/datasets/hackathon-pln-es/readability-es-hackathon-pln-public).
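The sentence- and paragraph-level splits can be approximated with a naive splitter like the following (a sketch only; the released dataset may have been produced with different tooling):

```python
import re

def split_paragraphs(text: str) -> list[str]:
    # Treat blank lines as paragraph boundaries.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def split_sentences(paragraph: str) -> list[str]:
    # Naive split on whitespace after terminal punctuation; a real
    # pipeline would warrant a proper sentence tokenizer.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]
```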
We also scraped several texts from the ["Corpus de Aprendices del Español" (CAES)](http://galvan.usc.es/caes/). However, due to time constraints, we leave experiments with it for future work. This data is available [here](https://huggingface.co/datasets/hackathon-pln-es/readability-es-caes).
### Models
Our models are based on [BERTIN](https://huggingface.co/bertin-project). We fine-tuned [bertin-roberta-base-spanish](https://huggingface.co/bertin-project/bertin-roberta-base-spanish) on the different versions of our collected dataset. The following models are available:
- [2-class sentence-level](https://huggingface.co/hackathon-pln-es/readability-es-sentences)*
- [2-class paragraph-level](https://huggingface.co/hackathon-pln-es/readability-es-paragraphs)
- [3-class sentence-level](https://huggingface.co/hackathon-pln-es/readability-es-3class-sentences)
- [3-class paragraph-level](https://huggingface.co/hackathon-pln-es/readability-es-3class-paragraphs)*
Models showcased in the demo are marked with (*) above. More details about how we trained these models can be found in our [report](https://wandb.ai/readability-es/readability-es/reports/Texts-Readability-Analysis-for-Spanish--VmlldzoxNzU2MDUx).
### Final Remarks
- **Limitations and Biases.** The readability of a document can be affected by its domain and target audience. For example, an article in a medical journal can be more difficult to understand than a news article; however, medical professionals may have less difficulty with it than lay readers. As such, it is important to take all characteristics of a document into account when analysing the performance of our models. A deeper study of this type is left as future work. The CAES dataset, in particular, offers benefits for that type of investigation, since its metadata includes information such as the domain of the document, the years of study of the person who wrote the text, etc. However, we did not use this dataset for our current models, since its texts were produced *by* students and not *for* students, and due to the high variability in the characteristics of the writers and documents.
- **Data.** One of the main challenges in the area of readability assessment is the availability of reliable data. For Spanish, in particular, the highest-quality existing dataset is Newsela. However, it has a restrictive license that prohibits sharing its texts publicly. In addition, since these texts are translations of original English news articles, they can suffer from [translationese](https://en.wiktionary.org/wiki/translationese), making them less suitable for training models that will analyse texts produced directly in Spanish. Therefore, our first challenge was to find texts that were originally written in Spanish *and* that contained information about their readability level (i.e. the target gold label). Unfortunately, we could not find any other large publicly-available corpus with those characteristics, and decided to combine texts scraped from several webpages. This also prevented us from developing models that could estimate readability at more fine-grained levels (e.g. CEFR levels), which was our original goal. Future work will include contacting editorial groups that create texts for learners of Spanish as a second language, and establishing collaborations that could result in new language resources for the readability research community.
- **Models.** As explained before, our models are direct fine-tuned versions of [BERTIN](https://huggingface.co/bertin-project). In the future, we aim to compare our models to fine-tuned versions of [multilingual BERT](https://huggingface.co/bert-base-multilingual-cased), to analyse whether multilingual embeddings could offer additional benefits. In addition, our current setting treats Readability Assessment as a classification task. Future work includes studying models that treat the problem as a regression task or, as [recent work suggests](https://arxiv.org/abs/2203.07450), as a pair-wise ranking problem.
### Team
- [Laura Vásquez-Rodríguez](https://lmvasque.github.io/)
- [Pedro Cuenca](https://twitter.com/pcuenq/)
- [Sergio Morales](https://www.fireblend.com/)
- [Fernando Alva-Manchego](https://feralvam.github.io/)
"""
examples = [
["Esta es una frase simple.", "simple or complex?"],
["La ciencia nos enseña, en efecto, a someter nuestra razón a la verdad y a conocer y juzgar las cosas como son, es decir, como ellas mismas eligen ser y no como quisiéramos que fueran.", "simple or complex?"],
["Las Líneas de Nazca son una serie de marcas trazadas en el suelo, cuya anchura oscila entre los 40 y los 110 centímetros.", "basic, intermediate, or advanced?"],
["Hace mucho tiempo, en el gran océano que baña las costas del Perú no había peces.", "basic, intermediate, or advanced?"],
["El turismo en Costa Rica es uno de los principales sectores económicos y de más rápido crecimiento del país.", "basic, intermediate, or advanced?"],
]
# "text-classification" is the canonical task name for these fine-tuned
# classifiers; return_all_scores=True returns a score for every label.
model_binary = pipeline("text-classification", model="hackathon-pln-es/readability-es-sentences", return_all_scores=True)
model_ternary = pipeline("text-classification", model="hackathon-pln-es/readability-es-3class-paragraphs", return_all_scores=True)
def predict(text, levels):
    # levels is the index of the selected radio option:
    # 0 -> "simple or complex?" (binary model), 1 -> ternary model.
    if levels == 0:
        predicted_scores = model_binary(text)[0]
    else:
        predicted_scores = model_ternary(text)[0]
    # Convert the pipeline output (a list of {label, score} dicts) into
    # the {label: score} mapping expected by gr.outputs.Label.
    return {e["label"]: e["score"] for e in predicted_scores}
iface = gr.Interface(
    fn=predict,
    inputs=[
        gr.inputs.Textbox(lines=7, placeholder="Write a text in Spanish or choose one of the examples below.", label="Text in Spanish"),
        gr.inputs.Radio(choices=["simple or complex?", "basic, intermediate, or advanced?"], type="index", label="Readability Levels"),
    ],
    outputs=[
        gr.outputs.Label(num_top_classes=3, label="Predicted Readability Level")
    ],
    theme="huggingface",
    title=title,
    description=description,
    article=article,
    examples=examples,
    allow_flagging="never",
)
iface.launch()