monsoon-nlp/es-seq2seq-gender-encoder

@@ -1,3 +1,7 @@
 # es-seq2seq-gender (encoder)
 This is a seq2seq model (encoder half) to "flip" gender in Spanish sentences.
@@ -14,11 +18,29 @@ Intended Examples:
 People's names are unchanged in this version, but you can use packages
 such as https://pypi.org/project/gender-guesser/
 ## Training
-I originally developed
 <a href="https://github.com/MonsoonNLP/el-la">a gender flip Python script</a>
-with
 <a href="https://huggingface.co/dccuchile/bert-base-spanish-wwm-uncased">BETO</a>,
 the Spanish-language BERT from Universidad de Chile,
 and spaCy to parse dependencies in sentences.
@@ -26,7 +48,7 @@ and spaCy to parse dependencies in sentences.
 More about this project: https://medium.com/ai-in-plain-english/gender-bias-in-spanish-bert-1f4d76780617
 The seq2seq model is trained on gender-flipped text from that script run on the
-<a href="https://huggingface.co/datasets/muchocine">muchocine dataset</a>,
 and the first 6,853 lines from the
 <a href="https://oscar-corpus.com/">OSCAR corpus</a>
 (Spanish, de-duplicated).
@@ -40,10 +62,10 @@ short of capturing gender diversity in the world and in the Spanish
 language. Some communities prefer the plural -@s to represent
 -os and -as, or -e and -es for gender-neutral or mixed-gender plural,
 or use fewer gendered professional nouns (la juez and not jueza). This is not yet
-embraced by the Royal Spanish Academy
 and is not represented in the corpora and tokenizers used to build this project.
-This seq2seq project and script could, in the future, help generate more text samples
 and prepare NLP models to understand us all better.
 #### Sources

+---
+language: es
+---
 # es-seq2seq-gender (encoder)
 This is a seq2seq model (encoder half) to "flip" gender in Spanish sentences.
 People's names are unchanged in this version, but you can use packages
 such as https://pypi.org/project/gender-guesser/
+## Sample code
+https://colab.research.google.com/drive/1Ta_YkXx93FyxqEu_zJ-W23PjPumMNHe5
+```
+import torch
+from transformers import AutoTokenizer, EncoderDecoderModel
+
+# Pair this encoder checkpoint with its matching decoder checkpoint
+model = EncoderDecoderModel.from_encoder_decoder_pretrained("monsoon-nlp/es-seq2seq-gender-encoder", "monsoon-nlp/es-seq2seq-gender-decoder")
+# Encoder and decoder share the tokenizer of the original BETO uncased model
+tokenizer = AutoTokenizer.from_pretrained("monsoon-nlp/es-seq2seq-gender-decoder")
+
+# Encode one sentence and add a batch dimension
+input_ids = torch.tensor(tokenizer.encode("la profesora vieja")).unsqueeze(0)
+# Decoding starts from the decoder's [PAD] token, so the output begins with [PAD]
+generated = model.generate(input_ids, decoder_start_token_id=model.config.decoder.pad_token_id)
+tokenizer.decode(generated.tolist()[0])
+# output: '[PAD] el profesor viejo profesor viejo profesor...'
+```
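The sample output above begins with the `[PAD]` start token and then loops over its last words. Below is a minimal post-processing sketch for trimming such a string, assuming you only want the first pass of the sentence. `clean_generated` is a hypothetical helper, not part of this model or of transformers, and the repeated-bigram cutoff is an assumed heuristic that could truncate a sentence that legitimately repeats a two-word phrase.

```python
def clean_generated(text, start_token="[PAD]"):
    """Strip leading start tokens and cut the text where it begins to loop.

    Hypothetical helper for illustration only: the cutoff is the first
    position where a two-word sequence repeats one seen earlier.
    """
    words = text.split()
    # Drop the decoder start token(s) left at the front of the decoded string.
    while words and words[0] == start_token:
        words.pop(0)
    seen_bigrams = set()
    for i in range(len(words) - 1):
        bigram = (words[i], words[i + 1])
        # Assumption: a bigram seen before marks the start of a repetition loop.
        if bigram in seen_bigrams:
            return " ".join(words[:i])
        seen_bigrams.add(bigram)
    return " ".join(words)


print(clean_generated("[PAD] el profesor viejo profesor viejo profesor..."))
```

Passing `max_length` or `no_repeat_ngram_size` to `model.generate` may also limit the repetition at generation time; that has not been tested against this model here.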
 ## Training
+I originally developed
 <a href="https://github.com/MonsoonNLP/el-la">a gender flip Python script</a>
+with
 <a href="https://huggingface.co/dccuchile/bert-base-spanish-wwm-uncased">BETO</a>,
 the Spanish-language BERT from Universidad de Chile,
 and spaCy to parse dependencies in sentences.
 More about this project: https://medium.com/ai-in-plain-english/gender-bias-in-spanish-bert-1f4d76780617
 The seq2seq model is trained on gender-flipped text from that script run on the
+<a href="https://huggingface.co/datasets/muchocine">muchocine dataset</a>,
 and the first 6,853 lines from the
 <a href="https://oscar-corpus.com/">OSCAR corpus</a>
 (Spanish, de-duplicated).
 language. Some communities prefer the plural -@s to represent
 -os and -as, or -e and -es for gender-neutral or mixed-gender plural,
 or use fewer gendered professional nouns (la juez and not jueza). This is not yet
+embraced by the Royal Spanish Academy
 and is not represented in the corpora and tokenizers used to build this project.
+This seq2seq project and script could, in the future, help generate more text samples
 and prepare NLP models to understand us all better.
 #### Sources

config.json CHANGED

@@ -1,5 +1,4 @@
 {
-  "_name_or_path": "dccuchile/bert-base-spanish-wwm-uncased",
   "architectures": [
     "BertModel"
   ],

 {
   "architectures": [
     "BertModel"
   ],