RoBERTa base model fine-tuned on pronoun fill masking

This is RoBERTa base fine-tuned to fill masked pronouns only. Its purpose is to post-process machine-translated text, where sentence-level translation may lack the context needed to choose the correct pronoun.

This model was trained on 10B tokens of literature: a private dataset of light novels and books, plus books1 and 20% of books3 from The Pile.

This model achieves 88% top-1 accuracy when evaluated with a sliding window of 512 tokens (84% without a sliding window).
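
The card does not say how the sliding window is built. One plausible reading, sketched here purely as an assumption, is that each masked pronoun is scored with up to 512 tokens of surrounding context rather than whatever fixed chunk it happens to fall in. The helper name `window_around` and the centering strategy are hypothetical and are not taken from the actual evaluation code.

```python
def window_around(token_ids, mask_index, max_len=512):
    """Return (window, new_mask_index) with at most max_len tokens around mask_index.

    Hypothetical helper: illustrates a sliding context window, not the
    author's actual evaluation code.
    """
    if not 0 <= mask_index < len(token_ids):
        raise IndexError("mask_index out of range")
    half = max_len // 2
    # Start half a window before the mask, clamped so the window stays inside the text.
    start = max(0, min(mask_index - half, len(token_ids) - max_len))
    end = min(len(token_ids), start + max_len)
    return token_ids[start:end], mask_index - start


tokens = list(range(2000))  # stand-in for real token ids
window, pos = window_around(tokens, mask_index=1500)
print(len(window), window[pos])
# 512 1500
```

With a real tokenizer the window would likely need to be a couple of tokens shorter to leave room for the special start and end tokens.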

How to use

Use fix_pronouns_in_text from pronoun_fixer.py:

from transformers import AutoModelForMaskedLM, AutoTokenizer, FillMaskPipeline
import pronoun_fixer


# Text produced by sentence-level machine translation, where the pronoun was
# ambiguous in the source language and came out wrong in the target language.
MTL_TEXT = """
Cadence Lee thought he was a normal girl, perhaps a little well to do, but not exceptionally so.
"""

device = 'cuda'  # use 'cpu' if no GPU is available
pronoun_checkpoint = "thefrigidliquidation/roberta-base-pronouns"
pronoun_model = AutoModelForMaskedLM.from_pretrained(pronoun_checkpoint).to(device)
pronoun_tokenizer = AutoTokenizer.from_pretrained(pronoun_checkpoint)
unmasker = FillMaskPipeline(model=pronoun_model, tokenizer=pronoun_tokenizer, device=device, top_k=10)

fixed_text = pronoun_fixer.fix_pronouns_in_text(unmasker, pronoun_tokenizer, MTL_TEXT)

print(fixed_text)
# Cadence Lee thought she was a normal girl, perhaps a little well to do, but not exceptionally so.
# now the pronoun is fixed
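
pronoun_fixer.py is not reproduced here, so the following is only a rough sketch of the kind of preprocessing such a helper might do: find each pronoun and produce a copy of the text with that one pronoun replaced by RoBERTa's `<mask>` token, ready to pass to a fill-mask pipeline. The `PRONOUNS` list and the `mask_each_pronoun` function are hypothetical names chosen for illustration and may not match the real implementation.

```python
import re

# Hypothetical, deliberately small pronoun list; the real model may cover more forms.
PRONOUNS = ["he", "she", "him", "her", "his", "hers", "they", "them", "their"]
PRONOUN_RE = re.compile(r"\b(" + "|".join(PRONOUNS) + r")\b", re.IGNORECASE)


def mask_each_pronoun(text, mask_token="<mask>"):
    """Yield (original_pronoun, masked_text) with exactly one pronoun masked per item."""
    for match in PRONOUN_RE.finditer(text):
        masked = text[:match.start()] + mask_token + text[match.end():]
        yield match.group(0), masked


sentence = "Cadence Lee thought he was a normal girl."
for pronoun, masked in mask_each_pronoun(sentence):
    print(pronoun, "->", masked)
# he -> Cadence Lee thought <mask> was a normal girl.
```

Masking one pronoun at a time keeps the rest of the sentence intact as context, which is what a masked language model relies on to rank candidate pronouns.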