import gradio as gr
from transformers import pipeline
title = "Automatic Readability Assessment of Texts in Spanish"
description = """
Is a text **complex** or **simple**? Can it be understood by someone learning Spanish with a **basic**, **intermediate** or **advanced** knowledge of the language? Find out with our models below!
"""
article = """
### What's Readability Assessment?
[Automatic Readability Assessment](https://arxiv.org/abs/2105.00973) consists of determining "how difficult" it could be to read and understand a piece of text.
This could be estimated using readability formulas, such as [Flesch for English](https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests) or [similar ones for Spanish](https://www.siicsalud.com/imagenes/blancopet1.pdf).
However, their dependence on surface statistics (e.g. average sentence length) makes them unreliable.
As such, developing models that could estimate a text's readability by "looking beyond the surface" is a necessity.
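To see how "surface-level" these formulas are, consider the classic [Flesch Reading Ease](https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests) score, which only counts words, sentences and syllables. A minimal sketch (illustrative only, not part of our models):
```python
def flesch_reading_ease(n_words: int, n_sentences: int, n_syllables: int) -> float:
    # Flesch Reading Ease: higher scores indicate easier text.
    # It relies only on two surface statistics: average sentence length and average word length.
    return 206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)
```
Two texts with identical averages receive identical scores, regardless of their vocabulary or structure.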
### Goal
We aim to contribute to the development of **neural models for readability assessment for Spanish**, following previous work for [English](https://aclanthology.org/2021.cl-1.6/) and [Filipino](https://aclanthology.org/2021.ranlp-1.69/).
### Dataset
We curated a new dataset that combines existing corpora for readability assessment (i.e. [Newsela](https://newsela.com/data)) and texts scraped from webpages aimed at learners of Spanish as a second language. Texts in the Newsela corpus contain the grade level (according to the USA educational system) that they were written for. In the case of scraped texts, we selected webpages that explicitly indicated the [CEFR](https://en.wikipedia.org/wiki/Common_European_Framework_of_Reference_for_Languages) level that each text belongs to.
In our dataset, each text has two readability labels, according to the following mapping:
| | 2-class | | 3-class | | |
|------------------|--------------|--------------|-----------------|-----------------|------------------|
| | Simple | Complex | Basic | Intermediate | Advanced |
| With CEFR Levels | A1, A2, B1 | B2, C1, C2 | A1, A2 | B1, B2 | C1, C2 |
| Newsela Corpus | Versions 3-4 | Versions 0-1 | Grade Level 2-5 | Grade Level 6-8 | Grade Level 9-12 |
In addition, texts in the dataset can be too long to fit within a model's maximum input length. As such, we created two versions of the dataset, dividing each text into [sentences](https://huggingface.co/datasets/hackathon-pln-es/readability-es-sentences) and [paragraphs](https://huggingface.co/datasets/hackathon-pln-es/readability-es-paragraphs).
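Both versions can be loaded with the `datasets` library. Below is a minimal sketch for the sentence-level version (see the dataset card for the exact splits and columns):
```python
from datasets import load_dataset

# Load the sentence-level version of the readability dataset from the Hub.
readability_es = load_dataset("hackathon-pln-es/readability-es-sentences")
print(readability_es)  # shows the available splits and columns
```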
We also scraped several texts from the ["Corpus de Aprendices del Español" (CAES)](http://galvan.usc.es/caes/). However, due to time constraints, we leave experiments with it for future work. This data is available [here](https://huggingface.co/datasets/hackathon-pln-es/readability-es-caes).
### Models
Our models are based on [BERTIN](https://huggingface.co/bertin-project). We fine-tuned [bertin-roberta-base-spanish](https://huggingface.co/bertin-project/bertin-roberta-base-spanish) on the different versions of our collected dataset. The following models are available:
- [2-class sentence-level](https://huggingface.co/hackathon-pln-es/readability-es-sentences)
- [2-class paragraph-level](https://huggingface.co/hackathon-pln-es/readability-es-paragraphs)
- [3-class sentence-level](https://huggingface.co/hackathon-pln-es/readability-es-3class-sentences)
- [3-class paragraph-level](https://huggingface.co/hackathon-pln-es/readability-es-3class-paragraphs)
More details about how we trained these models can be found in our [report](https://wandb.ai/readability-es/readability-es/reports/Texts-Readability-Analysis-for-Spanish--VmlldzoxNzU2MDUx).
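All of these models can be used directly with the `transformers` pipeline, which is also how this demo works under the hood. A minimal sketch with the 2-class sentence-level model (label names come from the model's configuration):
```python
from transformers import pipeline

# Text-classification pipeline over the fine-tuned readability model.
classifier = pipeline(
    "text-classification",
    model="hackathon-pln-es/readability-es-sentences",
    return_all_scores=True,  # return the score of every readability label
)
print(classifier("Esta es una frase simple.")[0])
```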
### Team
- [Laura Vásquez-Rodríguez](https://lmvasque.github.io/)
- [Pedro Cuenca](https://twitter.com/pcuenq/)
- [Sergio Morales](https://www.fireblend.com/)
- [Fernando Alva-Manchego](https://feralvam.github.io/)
"""
examples = [
["Esta es una frase simple.", "simple or complex?"],
["La ciencia nos enseña, en efecto, a someter nuestra razón a la verdad y a conocer y juzgar las cosas como son, es decir, como ellas mismas eligen ser y no como quisiéramos que fueran.", "simple or complex?"],
["Las Líneas de Nazca son una serie de marcas trazadas en el suelo, cuya anchura oscila entre los 40 y los 110 centímetros.", "basic, intermediate, or advanced?"],
["Hace mucho tiempo, en el gran océano que baña las costas del Perú no había peces.", "basic, intermediate, or advanced?"],
["El turismo en Costa Rica es uno de los principales sectores económicos y de más rápido crecimiento del país.", "basic, intermediate, or advanced?"],
]
# Text-classification pipelines for the fine-tuned readability models.
model_binary = pipeline("text-classification", model="hackathon-pln-es/readability-es-sentences", return_all_scores=True)
model_ternary = pipeline("text-classification", model="hackathon-pln-es/readability-es-3class-paragraphs", return_all_scores=True)
def predict(text, levels):
    # `levels` is the index of the selected Radio choice:
    # 0 -> 2-class model (simple/complex), 1 -> 3-class model (basic/intermediate/advanced).
    if levels == 0:
        predicted_scores = model_binary(text)[0]
    else:
        predicted_scores = model_ternary(text)[0]
    # Map each predicted label to its score for the Label output component.
    output_scores = {e['label']: e['score'] for e in predicted_scores}
    return output_scores
iface = gr.Interface(
fn=predict,
inputs=[
        gr.inputs.Textbox(lines=7, placeholder="Write a text in Spanish or choose one of the examples below.", label="Text in Spanish"),
gr.inputs.Radio(choices=["simple or complex?", "basic, intermediate, or advanced?"], type="index", label="Readability Levels"),
],
outputs=[
gr.outputs.Label(num_top_classes=3, label="Predicted Readability Level")
],
theme="huggingface",
    title=title, description=description, article=article, examples=examples,
allow_flagging="never",
)
iface.launch()