---
language:
- es
license: apache-2.0
tags:
- Text2Text Generation
- Inclusive Language
- Text Neutralization
- pytorch
datasets:
- hackathon-pln-es/neutral-es
metrics:
- sacrebleu
model-index:
- name: es_text_neutralizer
  results:
  - task:
      type: Text2Text Generation
      name: Neutralization of texts in Spanish
    dataset:
      type: hackathon-pln-es/neutral-es
      name: neutral-es
    metrics:
    - type: sacrebleu
      value: 0.96
      name: sacrebleu
    - type: bertscore
      value: 0.98
      name: BertScoreF1
    - type: DiffBleu
      value: 0.35
      name: DiffBleu
---
|
|
|
## Model objective |
|
|
|
Spanish is a beautiful language with many ways of referring to people that neutralize gender by using resources already available in the language. For example, one can say *Todas las personas asistentes* instead of *Todos los asistentes*, which is a more inclusive way of talking about people. The purpose of this collaboratively trained model is to create a solution that reinforces the UN objective of gender equality.
|
|
|
Given an input text in Spanish, our model generates a gender-neutral version of it, rewriting any non-inclusive expressions or words.

It's a straightforward and fast solution that has a positive impact on the contemporary social landscape.
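As a minimal sketch of how the model could be called with the `transformers` library: the Hub id `hackathon-pln-es/es_text_neutralizer` is an assumption built from the model name and dataset organization in this card, and the helper `neutralize` and its `max_length` default are hypothetical. The sketch also assumes that no task prefix is needed and that PyTorch is installed.

```python
# Assumed Hub id (model name from this card plus the dataset's organization).
MODEL_ID = "hackathon-pln-es/es_text_neutralizer"


def neutralize(text: str, model_id: str = MODEL_ID, max_length: int = 128) -> str:
    """Return the model's rewrite of `text` (hypothetical helper)."""
    # Imported lazily so the module can be loaded without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

    # Tokenize the input and generate the rewritten sequence.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_length)
    output_ids = model.generate(**inputs, max_length=max_length)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `neutralize("Todos los asistentes")` downloads the weights on first use; for repeated calls it would make sense to load the tokenizer and model once and reuse them.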
|
|
|
|
|
<p align="center"> |
|
<img src="https://upload.wikimedia.org/wikipedia/commons/2/29/Gender_equality_symbol_%28clipart%29.png" width="250"/> |
|
</p> |
|
|
|
By using gender-inclusive models, we can help reduce gender bias in a language corpus, for instance by using them for data augmentation to create additional examples.
|
|
|
|
|
## Training and evaluation data |
|
|
|
The training data was compiled from several sources, drawn from a series of guidelines and manuals on the use of non-sexist language issued by the Spanish Ministry of Health, Social Services and Equality and listed in [this document](https://www.inmujeres.gob.es/servRecursos/formacion/GuiasLengNoSexista/docs/Guiaslenguajenosexista_.pdf).
|
|
|
### Compiled sources |
|
|
|
[Guía para un discurso igualitario en la Universidad de Alicante](https://ieg.ua.es/es/documentos/normativasobreigualdad/guia-para-un-discurso-igualitario-en-la-ua.pdf)
|
|
|
[Guía UC de Comunicación en Igualdad](<https://web.unican.es/unidades/igualdad/SiteAssets/igualdad/comunicacion-en-igualdad/guia%20comunicacion%20igualdad%20(web).pdf>) |
|
|
|
[Buenas prácticas para el tratamiento del lenguaje en igualdad](https://e-archivo.uc3m.es/handle/10016/22811) |
|
|
|
[Guía del lenguaje no sexista de la Universidad de Castilla-La Mancha](https://unidadigualdad.ugr.es/page/guiialenguajeuniversitarionosexista_universidaddecastillalamancha/!) |
|
|
|
[Guía de Lenguaje Para el Ámbito Educativo](https://www.educacionyfp.gob.es/va/dam/jcr:8ce318fd-c8ff-4ad2-97b4-7318c27d1682/guialenguajeambitoeducativo.pdf) |
|
|
|
[Guía para un uso igualitario y no sexista del lenguaje y de la imagen en la Universidad de Jaén](https://www.ujaen.es/servicios/uigualdad/sites/servicio_uigualdad/files/uploads/Guia_lenguaje_no_sexista.pdf)
|
|
|
[Guía de uso no sexista del vocabulario español](https://www.um.es/documents/2187255/2187763/guia-leng-no-sexista.pdf/d5b22eb9-b2e4-4f4b-82aa-8a129cdc83e3) |
|
|
|
[Guía para el uso no sexista de la lengua castellana y de imágenes en la UPV/EHU](https://www.ehu.eus/documents/1734204/1884196/Guia_uso_no_sexista_EHU.pdf)
|
|
|
[Guía de lenguaje no sexista UNED](http://portal.uned.es/pls/portal/docs/PAGE/UNED_MAIN/LAUNIVERSIDAD/VICERRECTORADOS/GERENCIA/OFICINA_IGUALDAD/CONCEPTOS%20BASICOS/GUIA_LENGUAJE.PDF) |
|
|
|
[COMUNICACIÓN AMBIENTAL CON PERSPECTIVA DE GÉNERO](https://cima.cantabria.es/documents/5710649/5729124/COMUNICACI%C3%93N+AMBIENTAL+CON+PERSPECTIVA+DE+G%C3%89NERO.pdf/ccc18730-53e3-35b9-731e-b4c43339254b) |
|
|
|
[Recomendaciones para la utilización de lenguaje no sexista](https://www.csic.es/sites/default/files/guia_para_un_uso_no_sexista_de_la_lengua_adoptada_por_csic2.pdf) |
|
|
|
[Estudio sobre lenguaje y contenido sexista en la Web](https://www.mujeresenred.net/IMG/pdf/Estudio_paginas_web_T-incluye_ok.pdf) |
|
|
|
[Nombra.en.red. En femenino y en masculino](https://www.inmujeres.gob.es/areasTematicas/educacion/publicaciones/serieLenguaje/docs/Nombra_en_red.pdf) |
|
|
|
|
|
## Model specs |
|
|
|
This model is a fine-tuned version of [spanish-t5-small](https://huggingface.co/flax-community/spanish-t5-small) on the data described above.
|
It achieves the following results on the evaluation set: |
|
- eval_bleu: 93.8347
|
- eval_f1: 0.9904
|
|
|
## Training procedure |
|
### Training hyperparameters |
|
The following hyperparameters were used during training: |
|
- learning_rate: 1e-04 |
|
- train_batch_size: 32 |
|
- seed: 42 |
|
- num_epochs: 10 |
|
- weight_decay: 0.01
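The values listed above could be passed to the Hugging Face `Seq2SeqTrainingArguments` roughly as follows. This is only a sketch: mapping `train_batch_size` to `per_device_train_batch_size` assumes single-device training, the weight decay is read as 0.01, and `output_dir`, `predict_with_generate` and the helper `build_training_arguments` are assumptions that the card does not state.

```python
# Hyperparameters from the list above, keyed by TrainingArguments parameter names.
TRAINING_HYPERPARAMETERS = {
    "learning_rate": 1e-4,
    "per_device_train_batch_size": 32,  # assumed mapping of train_batch_size
    "seed": 42,
    "num_train_epochs": 10,
    "weight_decay": 0.01,
}


def build_training_arguments(output_dir: str = "es_text_neutralizer"):
    """Build training arguments from the listed values (hypothetical helper)."""
    # Imported lazily so the dictionary can be inspected without transformers.
    from transformers import Seq2SeqTrainingArguments

    return Seq2SeqTrainingArguments(
        output_dir=output_dir,       # assumption: not given in the card
        predict_with_generate=True,  # assumption: needed for generation metrics
        **TRAINING_HYPERPARAMETERS,
    )
```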
|
|
|
|
|
## Metrics |
|
|
|
For training, we used both BLEU (the sacrebleu implementation in Hugging Face) and BertScore. The first, a standard in machine translation, was included to check the robustness of the newly generated text, while the second checks that the expected semantic similarity is preserved.
|
|
|
However, given the actual use case, we expect generated segments to be very close both to the input segments and to the label segments used in training. Take the following example:
|
|
|
inputSegment = 'De acuerdo con las informaciones anteriores , las alumnas se han quejado de la actitud de los profesores en los exámenes finales. Los representantes estudiantiles son los alumnos Juanju y Javi.' |
|
expectedOutput (label) = 'De acuerdo con las informaciones anteriores, el alumnado se ha quejado de la actitud del profesorado en los exámenes finales. Los representantes estudiantiles son los alumnos Juanju y Javi.' |
|
actualOutput = 'De acuerdo con las informaciones anteriores, el alumnado se ha quejado de la actitud del profesorado en los exámenes finales. Los representantes estudiantiles son el alumnado Juanju y Javi.' |
|
|
|
As you can see, the segments are very similar. So, instead of measuring BLEU or BertScore here, we propose an alternative metric, DiffBleu:
|
|
|
$$DiffBleu = BLEU(actualOutput - inputSegment, labels - inputSegment)$$ |
|
|
|
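Here the minus signs denote set difference: the tokens of the input segment are removed from both the model output and the label before BLEU is computed. We also evaluate DiffBleu after the model has been trained.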
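One possible reading of the DiffBleu formula is sketched below; it is not necessarily the implementation used for the reported score. The sketch assumes whitespace tokenization, keeps the remaining tokens in their original order, and scores them with `sacrebleu.sentence_bleu`, which returns a value on a 0-100 scale. The helper names `token_difference`, `diff_segments` and `diff_bleu` are hypothetical.

```python
from typing import List, Tuple


def token_difference(segment: str, reference: str) -> List[str]:
    """Tokens of `segment` that do not occur in `reference`, in original order."""
    reference_tokens = set(reference.split())
    return [tok for tok in segment.split() if tok not in reference_tokens]


def diff_segments(input_segment: str, label: str, actual_output: str) -> Tuple[str, str]:
    """Return (hypothesis, reference) with the input segment's tokens removed."""
    hypothesis = " ".join(token_difference(actual_output, input_segment))
    reference = " ".join(token_difference(label, input_segment))
    return hypothesis, reference


def diff_bleu(input_segment: str, label: str, actual_output: str) -> float:
    """BLEU over the reduced segments (assumes sacrebleu is installed)."""
    import sacrebleu  # imported lazily; only needed for the final score

    hypothesis, reference = diff_segments(input_segment, label, actual_output)
    return sacrebleu.sentence_bleu(hypothesis, [reference]).score
```

Because only the tokens that changed with respect to the input are compared, a model that simply copies its input no longer gets a high score.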
|
|
|
|
|
## Team Members |
|
|
|
- Fernando Velasco [(fermaat)](https://huggingface.co/fermaat) |
|
- Cibeles Redondo [(CibelesR)](https://huggingface.co/CibelesR) |
|
- Juan Julian Cea [(Juanju)](https://huggingface.co/Juanju) |
|
- Magdalena Kujalowicz [(MacadellaCosta)](https://huggingface.co/MacadellaCosta) |
|
- Javier Blasco [(javiblasco)](https://huggingface.co/javiblasco) |
|
|
|
|
|
|
|
|
|
Enjoy! |