---
license: cc-by-4.0
language:
- en
pipeline_tag: fill-mask
---

# Retrained XLM-RoBERTa with Social Media Data

XLM-RoBERTa [1] base model retrained with simpletransformers on social media language data from the Corpus of Global Language Use [2]. The retraining corpus is balanced across six national dialects of English:

| Dialect | Samples (n) |
| ----- | ----- |
| Australia | 25,000 |
| Canada | 25,000 |
| Ireland | 25,000 |
| New Zealand | 25,000 |
| United Kingdom | 25,000 |
| United States of America | 25,000 |

Texts were drawn as a random sample, with replacement, from material produced between 2022-05-16 and 2023-06-05 (a sampling sketch is given at the end of this card).

```
def retrain_xlmroberta():
    from simpletransformers.language_modeling import LanguageModelingModel

    train_file = "retrain/train.txt"
    test_file = "retrain/test.txt"

    model_args = {
        "reprocess_input_data": True,
        "overwrite_output_dir": True,
        "output_dir": "xlm-roberta-lm",
        "best_model_dir": "xlm-roberta-lm/best_model",
        "manual_seed": 42,
        "evaluate_during_training": True,
        "evaluate_during_training_steps": 500,
        "num_train_epochs": 8,
        "dataset_type": "simple",
        "use_multiprocessing": False,
        "use_multiprocessing_for_evaluation": False,
    }

    # Continue masked language modelling from the pretrained xlm-roberta-base checkpoint
    model = LanguageModelingModel(
        "xlmroberta",
        "xlm-roberta-base",
        args=model_args,
        use_cuda=True,
        train_files=train_file,
    )

    # Train on the dialect-balanced corpus, evaluating on the held-out test file
    model.train_model(train_file, eval_file=test_file)
```

## References

[1] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised Cross-lingual Representation Learning at Scale. arXiv:1911.02116 [cs].

[2] Jonathan Dunn. 2020. Mapping languages: the Corpus of Global Language Use. Language Resources and Evaluation, 54(4):999–1018.
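## Example Usage

Since this is a fill-mask model, the retrained checkpoint can be exercised with the Hugging Face `fill-mask` pipeline. A minimal sketch, assuming the best model was saved to `xlm-roberta-lm/best_model` by the training script above (simpletransformers saves checkpoints in the standard Hugging Face format); the example sentence is purely illustrative:

```
from transformers import pipeline

# Load the retrained checkpoint produced by the training script above
fill_mask = pipeline("fill-mask", model="xlm-roberta-lm/best_model")

# XLM-RoBERTa uses <mask> as its mask token
for prediction in fill_mask("She drove the ute to the <mask>."):
    print(prediction["token_str"], prediction["score"])
```

The same checkpoint can also be loaded directly with `AutoModelForMaskedLM.from_pretrained` and `AutoTokenizer.from_pretrained` if raw logits are needed rather than ranked predictions.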
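## Sampling Sketch

The dialect-balanced sample described above can be reproduced with a per-dialect draw with replacement. A minimal sketch, assuming the raw corpus is held in a pandas DataFrame; the `dialect` and `text` column names are assumptions for illustration, not part of the original pipeline:

```
import pandas as pd

def sample_balanced(corpus: pd.DataFrame, n: int = 25000, seed: int = 42) -> pd.DataFrame:
    # Draw n texts per dialect, with replacement (column names are hypothetical)
    return corpus.groupby("dialect", group_keys=False).sample(
        n=n, replace=True, random_state=seed
    )

def write_training_file(sample: pd.DataFrame, path: str = "retrain/train.txt") -> None:
    # The "simple" dataset type expects one text per line
    with open(path, "w", encoding="utf-8") as f:
        for text in sample["text"]:
            f.write(text.replace("\n", " ") + "\n")
```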