monsoon-nlp committed on
Commit
73d1f1f
1 Parent(s): 7068663

add code sample

Files changed (2)
  1. README.md +27 -5
  2. config.json +0 -1
README.md CHANGED
@@ -1,3 +1,7 @@
 # es-seq2seq-gender (encoder)
 
 This is a seq2seq model (encoder half) to "flip" gender in Spanish sentences.
@@ -14,11 +18,29 @@ Intended Examples:
 People's names are unchanged in this version, but you can use packages
 such as https://pypi.org/project/gender-guesser/
 
 ## Training
 
- I originally developed
 <a href="https://github.com/MonsoonNLP/el-la">a gender flip Python script</a>
- with
 <a href="https://huggingface.co/dccuchile/bert-base-spanish-wwm-uncased">BETO</a>,
 the Spanish-language BERT from Universidad de Chile,
 and spaCy to parse dependencies in sentences.
@@ -26,7 +48,7 @@ and spaCy to parse dependencies in sentences.
 More about this project: https://medium.com/ai-in-plain-english/gender-bias-in-spanish-bert-1f4d76780617
 
 The seq2seq model is trained on gender-flipped text from that script run on the
- <a href="https://huggingface.co/datasets/muchocine">muchocine dataset</a>,
 and the first 6,853 lines from the
 <a href="https://oscar-corpus.com/">OSCAR corpus</a>
 (Spanish de-duped).
@@ -40,10 +62,10 @@ short of capturing gender diversity in the world and in the Spanish
 language. Some communities prefer the plural -@s to represent
 -os and -as, or -e and -es for gender-neutral or mixed-gender plural,
 or use fewer gendered professional nouns (la juez and not jueza). This is not yet
- embraced by the Royal Spanish Academy
 and is not represented in the corpora and tokenizers used to build this project.
 
- This seq2seq project and script could, in the future, help generate more text samples
 and prepare NLP models to understand us all better.
 
 #### Sources
 
+ ---
+ language: es
+ ---
+
 # es-seq2seq-gender (encoder)
 
 This is a seq2seq model (encoder half) to "flip" gender in Spanish sentences.
 
 People's names are unchanged in this version, but you can use packages
 such as https://pypi.org/project/gender-guesser/
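A minimal sketch of that approach, assuming the gender-guesser package is installed; its `Detector.get_gender` call is the package's documented API, and the example names are only illustrative:

```python
# pip install gender-guesser
import gender_guesser.detector as gender

detector = gender.Detector()

# Possible labels: 'male', 'female', 'mostly_male', 'mostly_female', 'andy', 'unknown'
for name in ["Maria", "Carlos", "Alex"]:
    print(name, detector.get_gender(name))
```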
 
+
+ ## Sample code
+
+ https://colab.research.google.com/drive/1Ta_YkXx93FyxqEu_zJ-W23PjPumMNHe5
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, EncoderDecoderModel
+
+ model = EncoderDecoderModel.from_encoder_decoder_pretrained("monsoon-nlp/es-seq2seq-gender-encoder", "monsoon-nlp/es-seq2seq-gender-decoder")
+ tokenizer = AutoTokenizer.from_pretrained('monsoon-nlp/es-seq2seq-gender-decoder') # all are the same as the BETO uncased original
+
+ # Encode a feminine phrase and generate its masculine version
+ input_ids = torch.tensor(tokenizer.encode("la profesora vieja")).unsqueeze(0)
+ generated = model.generate(input_ids, decoder_start_token_id=model.config.decoder.pad_token_id)
+ tokenizer.decode(generated.tolist()[0])
+ # > '[PAD] el profesor viejo profesor viejo profesor...'
+ ```
+
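The decoded output above keeps repeating once it runs past the input phrase. One workaround, continuing from the block above, is to cap generation length; the exact cap is an assumption rather than something this model card specifies:

```python
# Assumption: a cap slightly above the input length trims the trailing repetition
generated = model.generate(
    input_ids,
    decoder_start_token_id=model.config.decoder.pad_token_id,
    max_length=input_ids.shape[-1] + 2,
)
print(tokenizer.decode(generated.tolist()[0], skip_special_tokens=True))
```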
 ## Training
 
+ I originally developed
 <a href="https://github.com/MonsoonNLP/el-la">a gender flip Python script</a>
+ with
 <a href="https://huggingface.co/dccuchile/bert-base-spanish-wwm-uncased">BETO</a>,
 the Spanish-language BERT from Universidad de Chile,
 and spaCy to parse dependencies in sentences.
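A rough sketch of the spaCy side of that pipeline, assuming spaCy 3.x and the es_core_news_sm Spanish model; the actual flip rules live in the el-la script linked above:

```python
import spacy

# Assumes the Spanish pipeline is installed: python -m spacy download es_core_news_sm
nlp = spacy.load("es_core_news_sm")

doc = nlp("La profesora vieja habló con el ingeniero.")
for token in doc:
    # Dependency label plus morphological gender are what a flip rule can key on
    print(token.text, token.pos_, token.dep_, token.morph.get("Gender"))
```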
 
 More about this project: https://medium.com/ai-in-plain-english/gender-bias-in-spanish-bert-1f4d76780617
 
 The seq2seq model is trained on gender-flipped text from that script run on the
+ <a href="https://huggingface.co/datasets/muchocine">muchocine dataset</a>,
 and the first 6,853 lines from the
 <a href="https://oscar-corpus.com/">OSCAR corpus</a>
 (Spanish de-duped).
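A sketch of loading those two sources with the `datasets` library; the Hub IDs, the OSCAR config name `unshuffled_deduplicated_es`, and streaming support are assumptions, since the exact loading code is not published in this card:

```python
from datasets import load_dataset

# Spanish movie reviews
muchocine = load_dataset("muchocine", split="train")

# Deduplicated Spanish OSCAR; only the first 6,853 lines were used
oscar_es = load_dataset("oscar", "unshuffled_deduplicated_es", split="train", streaming=True)
oscar_lines = [row["text"] for row, _ in zip(oscar_es, range(6853))]
```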
 
 language. Some communities prefer the plural -@s to represent
 -os and -as, or -e and -es for gender-neutral or mixed-gender plural,
 or use fewer gendered professional nouns (la juez and not jueza). This is not yet
+ embraced by the Royal Spanish Academy
 and is not represented in the corpora and tokenizers used to build this project.
 
+ This seq2seq project and script could, in the future, help generate more text samples
 and prepare NLP models to understand us all better.
 
 #### Sources
config.json CHANGED
@@ -1,5 +1,4 @@
 {
- "_name_or_path": "dccuchile/bert-base-spanish-wwm-uncased",
 "architectures": [
 "BertModel"
 ],
 
 {
 "architectures": [
 "BertModel"
 ],