Commit 73d1f1f by monsoon-nlp
Parent(s): 7068663

add code sample

Files changed:
- README.md +27 -5
- config.json +0 -1
README.md
CHANGED
@@ -1,3 +1,7 @@
|
|
1 |
# es-seq2seq-gender (encoder)
|
2 |
|
3 |
This is a seq2seq model (encoder half) to "flip" gender in Spanish sentences.
|
@@ -14,11 +18,29 @@ Intended Examples:
|
|
14 |
People's names are unchanged in this version, but you can use packages
|
15 |
such as https://pypi.org/project/gender-guesser/
|
16 |
|
17 |
## Training
|
18 |
|
19 |
-
I originally developed
|
20 |
<a href="https://github.com/MonsoonNLP/el-la">a gender flip Python script</a>
|
21 |
-
with
|
22 |
<a href="https://huggingface.co/dccuchile/bert-base-spanish-wwm-uncased">BETO</a>,
|
23 |
the Spanish-language BERT from Universidad de Chile,
|
24 |
and spaCy to parse dependencies in sentences.
|
@@ -26,7 +48,7 @@ and spaCy to parse dependencies in sentences.
|
|
26 |
More about this project: https://medium.com/ai-in-plain-english/gender-bias-in-spanish-bert-1f4d76780617
|
27 |
|
28 |
The seq2seq model is trained on gender-flipped text from that script run on the
|
29 |
-
<a href="https://huggingface.co/datasets/muchocine">muchocine dataset</a>,
|
30 |
and the first 6,853 lines from the
|
31 |
<a href="https://oscar-corpus.com/">OSCAR corpus</a>
|
32 |
(Spanish de-duped).
|
@@ -40,10 +62,10 @@ short of capturing gender diversity in the world and in the Spanish
|
|
40 |
language. Some communities prefer the plural -@s to represent
|
41 |
-os and -as, or -e and -es for gender-neutral or mixed-gender plural,
|
42 |
or use fewer gendered professional nouns (la juez and not jueza). This is not yet
|
43 |
-
embraced by the Royal Spanish Academy
|
44 |
and is not represented in the corpora and tokenizers used to build this project.
|
45 |
|
46 |
-
This seq2seq project and script could, in the future, help generate more text samples
|
47 |
and prepare NLP models to understand us all better.
|
48 |
|
49 |
#### Sources
|
|
|
1 |
+
---
|
2 |
+
language: es
|
3 |
+
---
|
4 |
+
|
5 |
# es-seq2seq-gender (encoder)
|
6 |
|
7 |
This is a seq2seq model (encoder half) to "flip" gender in Spanish sentences.
|
|
|
18 |
People's names are unchanged in this version, but you can use packages
|
19 |
such as https://pypi.org/project/gender-guesser/
|
20 |
|
21 |
+
|
22 |
+
## Sample code
|
23 |
+
|
24 |
+
https://colab.research.google.com/drive/1Ta_YkXx93FyxqEu_zJ-W23PjPumMNHe5
|
25 |
+
|
26 |
+
```
|
27 |
+
import torch
|
28 |
+
from transformers import AutoTokenizer, EncoderDecoderModel
|
29 |
+
|
30 |
+
model = EncoderDecoderModel.from_encoder_decoder_pretrained("monsoon-nlp/es-seq2seq-gender-encoder", "monsoon-nlp/es-seq2seq-gender-decoder")
|
31 |
+
tokenizer = AutoTokenizer.from_pretrained('monsoon-nlp/es-seq2seq-gender-decoder') # all are same as BETO uncased original
|
32 |
+
|
33 |
+
input_ids = torch.tensor(tokenizer.encode("la profesora vieja")).unsqueeze(0)
|
34 |
+
generated = model.generate(input_ids, decoder_start_token_id=model.config.decoder.pad_token_id)
|
35 |
+
tokenizer.decode(generated.tolist()[0])
|
36 |
+
> '[PAD] el profesor viejo profesor viejo profesor...'
|
37 |
+
```
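The sample output above starts with `[PAD]` and then loops on "profesor viejo". One way to tidy such a string is sketched below. It is a crude heuristic under stated assumptions, not part of this model or of transformers: `clean_generated` is a hypothetical helper, and it assumes that the first repeated word pair marks the point where the decoder started looping, so it can also cut legitimate text that happens to repeat a pair.

```python
def clean_generated(text, special_tokens=("[PAD]", "[CLS]", "[SEP]")):
    """Drop BERT special tokens and cut the text where a word pair first repeats.

    Hypothetical post-processing helper (an assumption, not the author's method).
    """
    words = [w for w in text.split() if w not in special_tokens]
    seen_pairs = set()
    kept = []
    for word in words:
        if kept:
            pair = (kept[-1], word)
            if pair in seen_pairs:
                # The pair already occurred: treat the previous word as the
                # start of the loop, drop it, and stop.
                kept.pop()
                break
            seen_pairs.add(pair)
        kept.append(word)
    return " ".join(kept)


print(clean_generated("[PAD] el profesor viejo profesor viejo profesor"))
# el profesor viejo
```

In transformers itself, `tokenizer.decode` accepts `skip_special_tokens=True` and `model.generate` accepts `max_length`, which may reduce the need for this kind of cleanup; whether they are enough for this model is untested here.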
|
38 |
+
|
39 |
## Training
|
40 |
|
41 |
+
I originally developed
|
42 |
<a href="https://github.com/MonsoonNLP/el-la">a gender flip Python script</a>
|
43 |
+
with
|
44 |
<a href="https://huggingface.co/dccuchile/bert-base-spanish-wwm-uncased">BETO</a>,
|
45 |
the Spanish-language BERT from Universidad de Chile,
|
46 |
and spaCy to parse dependencies in sentences.
|
|
|
48 |
More about this project: https://medium.com/ai-in-plain-english/gender-bias-in-spanish-bert-1f4d76780617
|
49 |
|
50 |
The seq2seq model is trained on gender-flipped text from that script run on the
|
51 |
+
<a href="https://huggingface.co/datasets/muchocine">muchocine dataset</a>,
|
52 |
and the first 6,853 lines from the
|
53 |
<a href="https://oscar-corpus.com/">OSCAR corpus</a>
|
54 |
(Spanish de-duped).
|
|
|
62 |
language. Some communities prefer the plural -@s to represent
|
63 |
-os and -as, or -e and -es for gender-neutral or mixed-gender plural,
|
64 |
or use fewer gendered professional nouns (la juez and not jueza). This is not yet
|
65 |
+
embraced by the Royal Spanish Academy
|
66 |
and is not represented in the corpora and tokenizers used to build this project.
|
67 |
|
68 |
+
This seq2seq project and script could, in the future, help generate more text samples
|
69 |
and prepare NLP models to understand us all better.
|
70 |
|
71 |
#### Sources
|
config.json
CHANGED
@@ -1,5 +1,4 @@
|
|
1 |
{
|
2 |
-
"_name_or_path": "dccuchile/bert-base-spanish-wwm-uncased",
|
3 |
"architectures": [
|
4 |
"BertModel"
|
5 |
],
|
|
|
1 |
{
|
|
|
2 |
"architectures": [
|
3 |
"BertModel"
|
4 |
],
|