---
license: cc-by-4.0
language:
- en
pipeline_tag: fill-mask
---
# Retrained XLM-RoBERTa with Social Media Data
XLM-RoBERTa [1] base model retrained with simpletransformers on social media language data from the Corpus of Global Language Use [2]. The retraining data were drawn from six English dialect regions:
| Dialect | n |
|---|---|
| Australia | 25,000 |
| Canada | 25,000 |
| Ireland | 25,000 |
| New Zealand | 25,000 |
| United Kingdom | 25,000 |
| United States of America | 25,000 |
Each dialect subset is a random sample, drawn with replacement, of texts produced between 2022-05-16 and 2023-06-05.
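For illustration, a per-dialect sample with replacement could be drawn with pandas as below. This is a minimal sketch, not the released sampling script; the input file name and the `dialect`/`text` column names are assumptions.

```python
import pandas as pd

# Hypothetical input: one row per social media text, with assumed
# "dialect" and "text" columns. The file name is illustrative only.
corpus = pd.read_csv("gclu_social_media.csv")

# Draw 25,000 texts per dialect, with replacement, reproducibly.
sample = (
    corpus.groupby("dialect", group_keys=False)
    .apply(lambda g: g.sample(n=25_000, replace=True, random_state=42))
)

# One text per line, as expected by the "simple" dataset type below.
sample["text"].to_csv("retrain/train.txt", index=False, header=False)
```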
```python
def retrain_xlmroberta():
    from simpletransformers.language_modeling import LanguageModelingModel

    train_file = "retrain/train.txt"
    test_file = "retrain/test.txt"

    model_args = {
        "reprocess_input_data": True,
        "overwrite_output_dir": True,
        "output_dir": "xlm-roberta-lm",
        "best_model_dir": "xlm-roberta-lm/best_model",
        "manual_seed": 42,
        # Evaluate on the held-out file every 500 training steps.
        "evaluate_during_training": True,
        "evaluate_during_training_steps": 500,
        "num_train_epochs": 8,
        # One training example per line in the input files.
        "dataset_type": "simple",
        "use_multiprocessing": False,
        "use_multiprocessing_for_evaluation": False,
    }

    # Continue masked language modeling from the pretrained checkpoint.
    model = LanguageModelingModel(
        "xlmroberta",
        "xlm-roberta-base",
        args=model_args,
        use_cuda=True,
        train_files=train_file,
    )
    model.train_model(train_file, eval_file=test_file)
```
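After training, the saved checkpoint can be used for fill-mask inference. A minimal sketch with the transformers pipeline, assuming the `xlm-roberta-lm/best_model` directory produced above contains both the model and tokenizer (which simpletransformers saves by default):

```python
from transformers import pipeline

# Load the retrained checkpoint saved during training above.
fill_mask = pipeline("fill-mask", model="xlm-roberta-lm/best_model")

# XLM-RoBERTa uses <mask> as its mask token.
for prediction in fill_mask("I reckon it'll rain this <mask>."):
    print(prediction["token_str"], round(prediction["score"], 3))
```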
## References
[1] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised Cross-lingual Representation Learning at Scale. arXiv:1911.02116 [cs].
[2] Jonathan Dunn. 2020. Mapping languages: the Corpus of Global Language Use. Language Resources and Evaluation, 54(4):999–1018.