somosnlp-hackathon-2022/readability-es-paragraphs

	---
	language: es
	license: cc-by-4.0
	tags:
	- spanish
	- roberta
	- bertin
	pipeline_tag: text-classification
	widget:
	- text: La cueva de Zaratustra en el Pretil de los Consejos. Rimeros de libros hacen escombro y cubren las paredes. Empapelan los cuatro vidrios de una puerta cuatro cromos espeluznantes de un novelón por entregas. En la cueva hacen tertulia el gato, el can, el loro y el librero. Zaratustra, abichado y giboso -la cara de tocino rancio y la bufanda de verde serpiente- promueve con su caracterización de fantoche, una aguda y dolorosa disonancia muy emotiva y muy moderna. Encogido en el roto pelote de su silla enana, con los pies entrapados y cepones en la tarima del brasero, guarda la tienda. Un ratón saca el hocico intrigante por un agujero.
	---

	# Readability ES Paragraphs for two classes

	Model based on the RoBERTa architecture, fine-tuned from [BERTIN](https://huggingface.co/bertin-project/bertin-roberta-base-spanish) for readability assessment of Spanish texts.

	## Description and performance

	This version of the model was trained on a mix of datasets, using paragraph-level granularity when possible. The model performs binary classification between the following classes:
	- Simple.
	- Complex.
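
A minimal usage sketch with the `transformers` text-classification pipeline. It assumes the model is published on the Hub as `somosnlp-hackathon-2022/readability-es-paragraphs` (taken from this repository's path) and that `transformers` and a backend such as PyTorch are installed. The helper names `classify` and `describe` are hypothetical, and the exact label strings returned depend on the model's configuration, so they are not assumed here.

```python
# Assumed Hub ID, taken from this repository's path.
MODEL_ID = "somosnlp-hackathon-2022/readability-es-paragraphs"


def describe(prediction):
    """Format a pipeline result dict such as {'label': ..., 'score': ...}."""
    return f"{prediction['label']} ({prediction['score']:.2%})"


def classify(paragraph, model_id=MODEL_ID):
    """Return the top prediction for one Spanish paragraph.

    Downloads the model on first use, so it needs network access.
    """
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import pipeline

    classifier = pipeline("text-classification", model=model_id)
    # truncation=True cuts inputs that exceed the model's maximum length.
    return classifier(paragraph, truncation=True)[0]


# Example call (not executed here because it downloads the model):
# print(describe(classify("La cueva de Zaratustra en el Pretil de los Consejos.")))
```

Since this variant was trained at paragraph level where possible, passing a whole paragraph rather than isolated sentences is probably closer to the training conditions.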

	It achieves a macro-averaged F1 score of 0.8891, measured on the validation set.
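
For readers unfamiliar with the metric, the macro-averaged F1 score is the unweighted mean of the per-class F1 scores, so both classes count equally regardless of how many examples each has. The sketch below illustrates the computation on made-up labels. It is not the evaluation code used for this model, and the label strings `"simple"` and `"complex"` are illustrative assumptions.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores."""
    per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Toy data, invented for illustration only.
y_true = ["simple", "simple", "complex", "complex", "complex"]
y_pred = ["simple", "complex", "complex", "complex", "simple"]

score = macro_f1(y_true, y_pred, labels=["simple", "complex"])
print(f"macro F1 = {score:.4f}")
```

On this toy data the per-class scores are 0.5 for the first class and 2/3 for the second, so the macro average is 7/12.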

	## Model variants

	- [`readability-es-sentences`](https://huggingface.co/hackathon-pln-es/readability-es-sentences). Two classes, sentence-based dataset.
	- `readability-es-paragraphs` (this model). Two classes, paragraph-based dataset.
	- [`readability-es-3class-sentences`](https://huggingface.co/hackathon-pln-es/readability-es-3class-sentences). Three classes, sentence-based dataset.
	- [`readability-es-3class-paragraphs`](https://huggingface.co/hackathon-pln-es/readability-es-3class-paragraphs). Three classes, paragraph-based dataset.

	## Datasets

	- [`readability-es-hackathon-pln-public`](https://huggingface.co/datasets/hackathon-pln-es/readability-es-hackathon-pln-public), composed of:
	  * coh-metrix-esp corpus.
	  * Various text resources scraped from websites.
	- Other non-public datasets: newsela-es, simplext.

	## Training details

	Please refer to [this training run](https://wandb.ai/readability-es/readability-es/runs/2z8080pi/overview) for full details on hyperparameters and training regime.

	## Biases and Limitations

	- Due to the scarcity of data and the lack of a reliable gold test set, performance metrics are reported on the validation set.
	- One of the datasets involved is the Spanish version of Newsela, which is frequently used as a reference. However, it was created by translating previous datasets, so it may contain somewhat unnatural phrases.
	- Some of the datasets used cannot be publicly disseminated, making it more difficult to assess the existence of biases or mistakes.
	- The language might be biased towards the variety of Spanish spoken in Spain. Other regional variants might be underrepresented.
	- No effort has been made to mitigate the shortcomings and biases described in the [original implementation of BERTIN](https://huggingface.co/bertin-project/bertin-roberta-base-spanish#bias-examples-spanish).

	## Authors

	- [Laura Vásquez-Rodríguez](https://lmvasque.github.io/)
	- [Pedro Cuenca](https://twitter.com/pcuenq)
	- [Sergio Morales](https://www.fireblend.com/)
	- [Fernando Alva-Manchego](https://feralvam.github.io/)