---
language: ar
---

# ar-seq2seq-gender (encoder)

This is the encoder half of a seq2seq model that "flips" gender in **first-person** Arabic sentences.
The model can augment your existing Arabic data, or generate counterfactuals
to test a model's decisions (would changing the gender of the subject or speaker change the output?).

Intended Examples:

- 'أنا سعيد' <=> 'انا سعيدة'
- 'ركض إلى المتجر' <=> 'ركضت إلى المتجر'
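The counterfactual test mentioned above can be sketched as follows. This is a minimal illustration, not part of the model: `counterfactual_report`, `toy_flip`, and `toy_predict` are hypothetical names, and the two toy functions only stand in for this model's generate/decode step and for whatever classifier you want to audit.

```python
from typing import Callable, List, Tuple


def counterfactual_report(
    sentences: List[str],
    flip: Callable[[str], str],
    predict: Callable[[str], str],
) -> List[Tuple[str, str, bool]]:
    """For each sentence, return (original, flipped, prediction_changed)."""
    report = []
    for sentence in sentences:
        flipped = flip(sentence)
        changed = predict(sentence) != predict(flipped)
        report.append((sentence, flipped, changed))
    return report


# Hypothetical stand-ins so the sketch runs without downloading any model.
def toy_flip(sentence: str) -> str:
    # In practice, call the gender-flip model here.
    return {"أنا سعيد": "انا سعيدة"}.get(sentence, sentence)


def toy_predict(sentence: str) -> str:
    # In practice, call the classifier under test here.
    return "positive" if "سعيد" in sentence else "neutral"


report = counterfactual_report(["أنا سعيد"], toy_flip, toy_predict)
for original, flipped, changed in report:
    print(original, "->", flipped, "| prediction changed:", changed)
```

Sentences where `prediction_changed` is true are the ones worth inspecting: the only edit was the speaker's gender, so a different prediction suggests the classifier is sensitive to it.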

People's names, gendered pronouns, gendered words (father, mother), and many other gendered terms are currently left unchanged by this model. Future versions may be trained on more data.
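To see which words the model actually changed, and which it left untouched, you can diff the input against the output. Below is a minimal sketch using Python's standard `difflib`; `changed_tokens` is a hypothetical helper, and splitting on whitespace is a simplifying assumption that ignores punctuation and clitics.

```python
import difflib
from typing import List, Tuple


def changed_tokens(source: str, flipped: str) -> List[Tuple[str, str]]:
    """Return (before, after) pairs for each whitespace-token span that differs."""
    a, b = source.split(), flipped.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes


# Sentence pair taken from the intended examples in this card.
print(changed_tokens("ركض إلى المتجر", "ركضت إلى المتجر"))
```

An empty list means the output is identical to the input.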

## Sample Code

```python
import torch
from transformers import AutoTokenizer, EncoderDecoderModel

# Load the two halves of the model from their separate repositories.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "monsoon-nlp/ar-seq2seq-gender-encoder",
    "monsoon-nlp/ar-seq2seq-gender-decoder",
    min_length=40
)
tokenizer = AutoTokenizer.from_pretrained('monsoon-nlp/ar-seq2seq-gender-decoder')  # same as MARBERT original

# Encode one sentence and add a batch dimension.
input_ids = torch.tensor(tokenizer.encode("أنا سعيدة")).unsqueeze(0)
generated = model.generate(input_ids, decoder_start_token_id=model.config.decoder.pad_token_id)

# Skip the decoder start token and keep one id per non-special input token.
tokenizer.decode(generated.tolist()[0][1 : len(input_ids[0]) - 1])
# Output: 'انا سعيد'
```

Colab notebook: https://colab.research.google.com/drive/1S0kE_2WiV82JkqKik_sBW-0TUtzUVmrV?usp=sharing
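The slice in the sample's final `tokenizer.decode` call drops the first generated id (the decoder start token) and keeps as many ids as the input has content tokens, that is, the input length minus the two special tokens that `tokenizer.encode` adds for a BERT-style tokenizer. The sketch below replays that arithmetic on made-up ids; `trim_generated` is a hypothetical helper and the id values are illustrative, not real MARBERT vocabulary entries.

```python
from typing import List


def trim_generated(generated_ids: List[int], input_length: int) -> List[int]:
    """Mirror the sample's slice: generated_ids[1 : input_length - 1]."""
    # Index 0 holds the decoder start token, so the slice starts at 1.
    # It keeps input_length - 2 ids, one per non-special input token.
    return generated_ids[1 : input_length - 1]


# Illustrative ids only: 0 = start/pad, 2 and 3 = special tokens.
input_ids = [2, 101, 102, 3]           # special, token, token, special
generated = [0, 201, 202, 3, 0, 0, 0]  # start, two tokens, then filler
print(trim_generated(generated, len(input_ids)))
```

Because the slice length is tied to the input length, it implicitly assumes the flipped sentence tokenizes to the same number of tokens as the input.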

## Training

I originally developed
<a href="https://github.com/MonsoonNLP/el-la">a gender flip Python script</a>
for Spanish sentences, using
<a href="https://huggingface.co/dccuchile/bert-base-spanish-wwm-uncased">BETO</a>
and spaCy. More about this project: https://medium.com/ai-in-plain-english/gender-bias-in-spanish-bert-1f4d76780617

The Arabic model's encoder and decoder started with weights and vocabulary from
<a href="https://github.com/UBC-NLP/marbert">MARBERT from UBC-NLP</a>,
and were trained on the
<a href="https://camel.abudhabi.nyu.edu/arabic-parallel-gender-corpus/">Arabic Parallel Gender Corpus</a>
from NYU Abu Dhabi. The text is first-person sentences from OpenSubtitles, with parallel
gender-reinflected sentences generated by Arabic speakers.
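Since the intended examples are bidirectional (`<=>`), one plausible way to prepare a parallel corpus for seq2seq training is to emit each differing sentence pair twice, once per direction. This sketch is an assumption for illustration and is not taken from the training notebook; `make_training_pairs` is a hypothetical helper, and the inline rows reuse the examples from this card rather than real corpus data.

```python
from typing import Iterable, List, Tuple


def make_training_pairs(
    parallel: Iterable[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Turn (masculine, feminine) rows into (source, target) pairs."""
    pairs = []
    for masculine, feminine in parallel:
        pairs.append((masculine, feminine))
        if masculine != feminine:
            # Add the reverse direction only when the two sides differ,
            # so gender-neutral rows are not duplicated.
            pairs.append((feminine, masculine))
    return pairs


rows = [("أنا سعيد", "انا سعيدة"), ("ركض إلى المتجر", "ركضت إلى المتجر")]
for source, target in make_training_pairs(rows):
    print(source, "->", target)
```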

Training notebook: https://colab.research.google.com/drive/1TuDfnV2gQ-WsDtHkF52jbn699bk6vJZV

## Non-binary gender

This model is useful for generating male and female text samples, but falls
short of capturing gender diversity in the world and in the Arabic
language. This subject is discussed in the bias statement of the
<a href="https://www.aclweb.org/anthology/2020.gebnlp-1.12/">Gender Reinflection paper</a>.