---
license: mit
language:
- en
- es
- it
- pt
- fr
- zh
- hi
- ar
- nl
- ko
pipeline_tag: fill-mask
tags:
- twitter
---

# XLM-RoBERTA-large-twitter

This is an XLM-RoBERTa-large model fine-tuned on a corpus of over 156 million tweets in ten languages: English, Spanish, Italian, Portuguese, French, Chinese, Hindi, Arabic, Dutch, and Korean.

The model was trained from the original XLM-RoBERTa-large checkpoint for 2 epochs with a batch size of 1024.
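
The training script itself is not part of this card. As a rough sketch only, that setup could be reproduced with the Hugging Face `Trainer` along the following lines; the toy dataset, masking probability, and optimizer defaults are assumptions, and only the base checkpoint, epoch count, and effective batch size come from the description above.

```python
from datasets import Dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-large")

# Stand-in corpus; in practice this would be the preprocessed tweet corpus.
corpus = Dataset.from_dict({"text": ["@user I love this! http", "what a day"]})
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

# Dynamic masking for the MLM objective (the 15% rate is an assumption).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="xlm-roberta-large-twitter",
    num_train_epochs=2,              # 2 epochs, per the description above
    per_device_train_batch_size=32,  # 32 x 32 accumulation steps gives an
    gradient_accumulation_steps=32,  # effective batch size of 1024 per device
)

Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=tokenized,
).train()
```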

For best results, preprocess tweets with the following function before passing them to the model; it replaces user mentions and URLs with generic placeholders:

```python
def preprocess(text):
    """Normalize a tweet: mask usernames as '@user' and links as 'http'."""
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)
```
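
As a usage sketch, `preprocess` feeds straight into a standard fill-mask pipeline. The model identifier below is a placeholder for this repository's Hub path, and the example tweet is made up.

```python
from transformers import pipeline

# Placeholder model ID; replace with this repository's actual Hub path.
fill_mask = pipeline("fill-mask", model="xlm-roberta-large-twitter")

tweet = "@NASA the launch was <mask>! details at https://example.com"
for prediction in fill_mask(preprocess(tweet)):
    print(prediction["token_str"], round(prediction["score"], 3))
```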