---
language: ar
---

# ar-seq2seq-gender (decoder)

This is the decoder half of a seq2seq model that "flips" gender in **first-person** Arabic sentences.
The model can augment your existing Arabic data, or generate counterfactuals
to test a model's decisions (would changing the gender of the subject or speaker change the output?).

Intended examples:
- 'أنا سعيد' <=> 'انا سعيدة'
- 'ركض إلى المتجر' <=> 'ركضت إلى المتجر'

The model currently leaves people's names, gendered pronouns, gendered nouns (father, mother), and many other words unchanged. Future versions may be trained on more data.
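
The counterfactual test mentioned above can be sketched in a few lines. This is a minimal illustration, not part of the released model: `find_gender_sensitive`, `toy_flip`, and `toy_predict` are hypothetical names, and the two toy functions are stand-ins for this model's output and for whatever downstream classifier you want to audit.

```python
from typing import Callable, List, Tuple


def find_gender_sensitive(
    sentences: List[str],
    flip: Callable[[str], str],
    predict: Callable[[str], str],
) -> List[Tuple[str, str, str, str]]:
    """Return (original, flipped, label, flipped_label) for every sentence
    whose predicted label changes after the gender flip."""
    changed = []
    for sentence in sentences:
        flipped = flip(sentence)
        label, flipped_label = predict(sentence), predict(flipped)
        if label != flipped_label:
            changed.append((sentence, flipped, label, flipped_label))
    return changed


# Hypothetical stand-in for the seq2seq model: a lookup of the intended examples.
def toy_flip(sentence: str) -> str:
    pairs = {"أنا سعيد": "انا سعيدة", "ركض إلى المتجر": "ركضت إلى المتجر"}
    return pairs.get(sentence, sentence)


# Hypothetical stand-in for the classifier under test: it (wrongly) keys on
# a feminine ending, so it should be flagged as gender-sensitive.
def toy_predict(sentence: str) -> str:
    return "A" if sentence.endswith("ة") else "B"


results = find_gender_sensitive(["أنا سعيد", "ركض إلى المتجر"], toy_flip, toy_predict)
for original, flipped, label, flipped_label in results:
    print(f"{original} ({label}) -> {flipped} ({flipped_label})")
```

In practice, `flip` would wrap the generation code from the sample below and `predict` would call the model being tested.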

## Sample Code

```python
import torch
from transformers import AutoTokenizer, EncoderDecoderModel

# Combine the separately published encoder and decoder halves into one model
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "monsoon-nlp/ar-seq2seq-gender-encoder",
    "monsoon-nlp/ar-seq2seq-gender-decoder",
    min_length=40
)
# Same tokenizer as the original MARBERT
tokenizer = AutoTokenizer.from_pretrained("monsoon-nlp/ar-seq2seq-gender-decoder")

input_ids = torch.tensor(tokenizer.encode("أنا سعيدة")).unsqueeze(0)
generated = model.generate(input_ids, decoder_start_token_id=model.config.decoder.pad_token_id)

# Skip the decoder start token and trim the output to the input's length
tokenizer.decode(generated.tolist()[0][1 : len(input_ids[0]) - 1])
# 'انا سعيد'
```

https://colab.research.google.com/drive/1S0kE_2WiV82JkqKik_sBW-0TUtzUVmrV?usp=sharing
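
The slice in the sample above is easy to misread, so here is a sketch of what it does using plain lists. The token IDs are made up for illustration and `trim_generated` is a hypothetical helper, not part of `transformers`. Because `min_length=40` makes generation run for at least 40 tokens, the sample keeps only the positions that line up with the input's content tokens.

```python
from typing import List


def trim_generated(generated_ids: List[int], input_len: int) -> List[int]:
    """Drop the decoder start token at position 0 and keep input_len - 2
    tokens, i.e. as many as the input has between its special tokens."""
    return generated_ids[1 : input_len - 1]


# Made-up IDs: the input is [CLS], two word tokens, [SEP].
input_ids = [2, 101, 102, 3]
# Made-up output: start token, two word tokens, then extra tokens from min_length.
generated_ids = [0, 201, 202, 3, 0, 0, 0]

trimmed = trim_generated(generated_ids, len(input_ids))
print(trimmed)  # [201, 202]
```

Note the assumption built into this slice: the output is expected to have the same number of content tokens as the input. If a reinflected sentence tokenizes to more tokens than its source, the slice would cut it short.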

## Training

I originally developed
<a href="https://github.com/MonsoonNLP/el-la">a gender flip Python script</a>
for Spanish sentences, using
<a href="https://huggingface.co/dccuchile/bert-base-spanish-wwm-uncased">BETO</a>
and spaCy. More about this project: https://medium.com/ai-in-plain-english/gender-bias-in-spanish-bert-1f4d76780617

The Arabic model's encoder and decoder started with weights and vocabulary from
<a href="https://github.com/UBC-NLP/marbert">MARBERT from UBC-NLP</a>,
and were trained on the
<a href="https://camel.abudhabi.nyu.edu/arabic-parallel-gender-corpus/">Arabic Parallel Gender Corpus</a>
from NYU Abu Dhabi. The text consists of first-person sentences from OpenSubtitles, with parallel
gender-reinflected sentences written by Arabic speakers.

Training notebook: https://colab.research.google.com/drive/1TuDfnV2gQ-WsDtHkF52jbn699bk6vJZV

## Non-binary gender

This model is useful for generating male and female text samples, but falls
short of capturing gender diversity in the world and in the Arabic
language. This subject is discussed in the bias statement of the
<a href="https://www.aclweb.org/anthology/2020.gebnlp-1.12/">Gender Reinflection paper</a>.