---
language:
- es
license: apache-2.0
tags:
- Text2Text Generation
- Inclusive Language
- Text Neutralization
- pytorch
datasets:
- hackathon-pln-es/neutral-es
metrics:
- sacrebleu
model-index:
- name: es_text_neutralizer
  results:
  - task:
      type: Text2Text Generation
      name: Neutralization of texts in Spanish
    dataset:
      type: hackathon-pln-es/neutral-es
      name: neutral-es
    metrics:
    - type: sacrebleu
      value: 0.96
      name: sacrebleu
    - type: bertscore
      value: 0.98
      name: BertScoreF1
    - type: DiffBleu
      value: 0.35
      name: DiffBleu
---
## Model objective
Spanish offers many ways of referring to people that neutralize gender using the language's own resources. For example, one can say *Todas las personas asistentes* instead of *Todos los asistentes*, which is a more inclusive way of talking about people. The purpose of this collaboratively trained model is to provide a solution that supports the UN goal of gender equality.
Given any input, our model generates a gender-neutral sentence, correcting any non-inclusive expressions or words.
It is a straightforward and fast solution that has a positive impact on the contemporary social landscape.
<p align="center">
<img src="https://upload.wikimedia.org/wikipedia/commons/2/29/Gender_equality_symbol_%28clipart%29.png" width="250"/>
</p>
By using gender-inclusive models, we can help reduce gender bias in a language corpus, for instance by applying them as data augmentation to create alternative examples.
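As a rough illustration of the input-to-output behavior described above, the following sketch loads a sequence-to-sequence checkpoint with the `transformers` library and generates a neutralized sentence. The repository name in `MODEL_ID` is an assumption (it is not stated in this card), as are the `neutralize` helper, the `max_new_tokens` value, and the absence of a task prefix; adjust them to the actual checkpoint.

```python
# Hypothetical Hub identifier: replace it with the actual repository name.
MODEL_ID = "hackathon-pln-es/es_text_neutralizer"


def neutralize(text, model_id=MODEL_ID, max_new_tokens=128):
    """Return the model's gender-neutral rewrite of `text`.

    Assumes a T5-style checkpoint that needs no task prefix.
    """
    # Imported lazily so this module can be loaded without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

    # Encode the input sentence as PyTorch tensors.
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    # Generate the output token ids and decode the first (only) sequence.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `neutralize("Todos los asistentes")` downloads the checkpoint on first use and returns the generated text. Loading the tokenizer and model once and reusing them would be preferable when processing many sentences.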
## Training and evaluation data
The training data was compiled from a series of guidelines and manuals on the use of non-sexist language, issued by the Spanish Ministry of Health, Social Services and Equality and listed in [this document](https://www.inmujeres.gob.es/servRecursos/formacion/GuiasLengNoSexista/docs/Guiaslenguajenosexista_.pdf).
### Compiled sources
- [Guía para un discurso igualitario en la Universidad de Alicante](https://ieg.ua.es/es/documentos/normativasobreigualdad/guia-para-un-discurso-igualitario-en-la-ua.pdf)
- [Guía UC de Comunicación en Igualdad](<https://web.unican.es/unidades/igualdad/SiteAssets/igualdad/comunicacion-en-igualdad/guia%20comunicacion%20igualdad%20(web).pdf>)
- [Buenas prácticas para el tratamiento del lenguaje en igualdad](https://e-archivo.uc3m.es/handle/10016/22811)
- [Guía del lenguaje no sexista de la Universidad de Castilla-La Mancha](https://unidadigualdad.ugr.es/page/guiialenguajeuniversitarionosexista_universidaddecastillalamancha/!)
- [Guía de Lenguaje Para el Ámbito Educativo](https://www.educacionyfp.gob.es/va/dam/jcr:8ce318fd-c8ff-4ad2-97b4-7318c27d1682/guialenguajeambitoeducativo.pdf)
- [Guía para un uso igualitario y no sexista del lenguaje y de la imagen en la Universidad de Jaén](https://www.ujaen.es/servicios/uigualdad/sites/servicio_uigualdad/files/uploads/Guia_lenguaje_no_sexista.pdf)
- [Guía de uso no sexista del vocabulario español](https://www.um.es/documents/2187255/2187763/guia-leng-no-sexista.pdf/d5b22eb9-b2e4-4f4b-82aa-8a129cdc83e3)
- [Guía para el uso no sexista de la lengua castellana y de imágenes en la UPV/EHU](https://www.ehu.eus/documents/1734204/1884196/Guia_uso_no_sexista_EHU.pdf)
- [Guía de lenguaje no sexista UNED](http://portal.uned.es/pls/portal/docs/PAGE/UNED_MAIN/LAUNIVERSIDAD/VICERRECTORADOS/GERENCIA/OFICINA_IGUALDAD/CONCEPTOS%20BASICOS/GUIA_LENGUAJE.PDF)
- [Comunicación ambiental con perspectiva de género](https://cima.cantabria.es/documents/5710649/5729124/COMUNICACI%C3%93N+AMBIENTAL+CON+PERSPECTIVA+DE+G%C3%89NERO.pdf/ccc18730-53e3-35b9-731e-b4c43339254b)
- [Recomendaciones para la utilización de lenguaje no sexista](https://www.csic.es/sites/default/files/guia_para_un_uso_no_sexista_de_la_lengua_adoptada_por_csic2.pdf)
- [Estudio sobre lenguaje y contenido sexista en la Web](https://www.mujeresenred.net/IMG/pdf/Estudio_paginas_web_T-incluye_ok.pdf)
- [Nombra.en.red. En femenino y en masculino](https://www.inmujeres.gob.es/areasTematicas/educacion/publicaciones/serieLenguaje/docs/Nombra_en_red.pdf)
## Model specs
This model is a fine-tuned version of [spanish-t5-small](https://huggingface.co/flax-community/spanish-t5-small) on the data described above.
It achieves the following results on the evaluation set:
- eval_bleu: 93.8347
- eval_f1: 0.9904
## Training procedure
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-04
- train_batch_size: 32
- seed: 42
- num_epochs: 10
- weight_decay: 0.01
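For readers who want to reproduce a similar setup, the hyperparameters listed above could be mapped onto the Hugging Face `Seq2SeqTrainingArguments` class roughly as follows. This is only a sketch: the card does not say which training API was used, the `output_dir` value and helper names are hypothetical, and treating `train_batch_size` as a per-device batch size is an assumption.

```python
# Hyperparameters copied from the list in this section.
HYPERPARAMETERS = {
    "learning_rate": 1e-4,
    "train_batch_size": 32,
    "seed": 42,
    "num_epochs": 10,
    "weight_decay": 0.01,
}


def to_trainer_kwargs(hparams, output_dir="es_text_neutralizer"):
    """Translate the card's names into Seq2SeqTrainingArguments keyword names."""
    return {
        "output_dir": output_dir,  # hypothetical output directory
        "learning_rate": hparams["learning_rate"],
        # Assumption: the reported batch size is per device.
        "per_device_train_batch_size": hparams["train_batch_size"],
        "num_train_epochs": hparams["num_epochs"],
        "weight_decay": hparams["weight_decay"],
        "seed": hparams["seed"],
    }


def build_training_arguments(hparams=HYPERPARAMETERS):
    """Build the arguments object; requires the transformers library."""
    # Imported lazily so the mapping above works without transformers installed.
    from transformers import Seq2SeqTrainingArguments

    return Seq2SeqTrainingArguments(**to_trainer_kwargs(hparams))
```

Keeping the values in a plain dictionary makes it easy to log them or to reuse them with a different training loop.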
## Metrics
For training, we used both BLEU (the sacrebleu implementation in Hugging Face) and BertScore. The first, a standard in machine translation, was added to ensure the robustness of the newly generated data, while the second is kept to preserve the expected semantic similarity.
However, given the actual use case, we expect generated segments to be very close to both the input segments and the label segments used in training. As an example, consider the following:
- inputSegment = 'De acuerdo con las informaciones anteriores , las alumnas se han quejado de la actitud de los profesores en los exámenes finales. Los representantes estudiantiles son los alumnos Juanju y Javi.'
- expectedOutput (label) = 'De acuerdo con las informaciones anteriores, el alumnado se ha quejado de la actitud del profesorado en los exámenes finales. Los representantes estudiantiles son los alumnos Juanju y Javi.'
- actualOutput = 'De acuerdo con las informaciones anteriores, el alumnado se ha quejado de la actitud del profesorado en los exámenes finales. Los representantes estudiantiles son el alumnado Juanju y Javi.'

As you can see, the segments are very similar. So, instead of measuring BLEU or BertScore here, we propose an alternative metric, DiffBleu:
$$DiffBleu = BLEU(actualOutput - inputSegment, labels - inputSegment)$$
Here the minus signs denote set difference, so only the tokens that differ from the input are compared. We also evaluate DiffBleu after the model has been trained.
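A minimal, dependency-free sketch of the DiffBleu idea is shown below. It makes several assumptions that are not taken from the original training code: tokens are lowercased words with punctuation dropped, the subtraction removes every token that also occurs in the input segment (keeping order and repetitions of the remaining tokens), and a simplified sentence-level BLEU over unigrams and bigrams stands in for sacrebleu. The helper names (`tokenize`, `diff_tokens`, `simple_bleu`, `diff_bleu`) are illustrative.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercased word tokens; punctuation is dropped (a simplification)."""
    return re.findall(r"\w+", text.lower())


def diff_tokens(segment, input_segment):
    """Tokens of `segment` that never occur in `input_segment`, in order."""
    seen = set(tokenize(input_segment))
    return [tok for tok in tokenize(segment) if tok not in seen]


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(hypothesis, reference, max_n=2):
    """Simplified sentence-level BLEU on token lists (not sacrebleu)."""
    if not hypothesis or not reference:
        # Both empty means nothing had to change and nothing did.
        return 1.0 if hypothesis == reference else 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hypothesis, n)
        ref_counts = ngrams(reference, n)
        total = sum(hyp_counts.values())
        if total == 0:
            break  # hypothesis is shorter than n tokens
        # Clipped overlap: each n-gram counts at most as often as in the reference.
        overlap = sum((hyp_counts & ref_counts).values())
        if overlap == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalize hypotheses shorter than the reference.
    brevity = min(1.0, math.exp(1 - len(reference) / len(hypothesis)))
    return brevity * math.exp(sum(log_precisions) / len(log_precisions))


def diff_bleu(input_segment, actual_output, label):
    """BLEU between the parts of output and label that differ from the input."""
    return simple_bleu(
        diff_tokens(actual_output, input_segment),
        diff_tokens(label, input_segment),
    )
```

Applied to the example above, `diff_tokens` reduces the label to `el alumnado ha del profesorado`, while the actual output yields the same tokens plus a second `el alumnado`, so the extra substitution lowers the score even though the full sentences are nearly identical.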
## Team Members
- Fernando Velasco [(fermaat)](https://huggingface.co/fermaat)
- Cibeles Redondo [(CibelesR)](https://huggingface.co/CibelesR)
- Juan Julian Cea [(Juanju)](https://huggingface.co/Juanju)
- Magdalena Kujalowicz [(MacadellaCosta)](https://huggingface.co/MacadellaCosta)
- Javier Blasco [(javiblasco)](https://huggingface.co/javiblasco)
Enjoy!