---
license: mit
language:
- en
- es
- it
- pt
- fr
- zh
- hi
- ar
- nl
- ko
pipeline_tag: fill-mask
tags:
- twitter
---
# XLM-RoBERTA-large-twitter

This is an XLM-RoBERTa-large model fine-tuned on a corpus of over 156 million tweets in ten languages: English, Spanish, Italian, Portuguese, French, Chinese, Hindi, Arabic, Dutch, and Korean.
The model was trained from the original XLM-RoBERTa-large checkpoint for 2 epochs with a batch size of 1024.
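
Since this is a fill-mask model, it can be loaded with the standard `transformers` pipeline. The sketch below uses a hypothetical Hub repository ID; replace it with the actual path of this model:

```python
from transformers import pipeline

# Hypothetical repository ID -- replace with the actual Hub path of this model.
fill_mask = pipeline("fill-mask", model="xlm-roberta-large-twitter")

# XLM-RoBERTa uses <mask> as its mask token.
print(fill_mask("I really <mask> this song!"))
```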

For best results, preprocess tweets with the following function before passing them to the model:
```python
def preprocess(text):
    """Replace user mentions and URLs with generic placeholders."""
    new_text = []
    for t in text.split(" "):
        # Mask @mentions with the placeholder '@user'
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        # Mask URLs with the placeholder 'http'
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)
```
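
For example, applied to a raw tweet (the handle and URL here are made up), mentions and URLs are reduced to the `@user` and `http` placeholders:

```python
tweet = "@someuser check this out https://t.co/xyz"
print(preprocess(tweet))
# -> "@user check this out http"
```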