naumov-al committed on
Commit
8763c75
1 Parent(s): 2d5fbdf

add README

Files changed (1)
  1. README.md +29 -0
README.md ADDED
@@ -0,0 +1,29 @@
+ # COVID-twitter-XLM-Roberta-large
+
+ ## Model description
+
+ This model is based on the [XLM-RoBERTa large](https://huggingface.co/xlm-roberta-large) architecture (released by Facebook; see the original [paper](https://arxiv.org/abs/1911.02116)) and was additionally trained on a corpus of unlabeled tweets.
+
+ For more details, please see our [GitHub repository](https://github.com/sag111/COVID-19-tweets-Russia).
+
+
+ ## Training data
+
+ We collected a corpus of unlabeled Twitter messages.
+
+ The data collected for the keyword "covid" was expanded with texts containing other words that often occur in hashtags about the COVID-19 pandemic: "covid", "stayhome", and "coronavirus" (here and below, keywords are given as English translations of the Russian words).
+
+ Separately, messages were collected from Twitter users in large regions of Russia. The search was performed using different word forms of 58 manually selected Russian keywords related to the topic of coronavirus infection (including "PCR", "pandemic", "self-isolation", etc.).
+
+ The unlabeled corpus includes all unique Russian-language tweets from the collected data (>1M tweets). Since modern language models are usually multilingual, about 1M more tweets in other languages were added to this corpus using the filtering procedures described above. In total, the unlabeled part of the collected data contains about 2 million messages.
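As an illustration only, the keyword filtering and deduplication described above could be sketched roughly as follows. The actual pipeline is not shown in this README: the stem list is a hypothetical English stand-in for the 58 Russian keywords, prefix matching is a crude approximation of "different word forms", and the whitespace/case normalization used to decide uniqueness is an assumption.

```python
import re

# Hypothetical keyword stems (English stand-ins); the 58 Russian keywords
# used for the real corpus are not listed in this README.
KEYWORD_STEMS = ("covid", "coronavirus", "stayhome", "pandemi", "self-isolat")

# Match a stem at the start of a word or hashtag, followed by any ending.
# This is a rough approximation of matching different word forms.
KEYWORD_RE = re.compile(
    r"(?<!\w)#?(?:" + "|".join(map(re.escape, KEYWORD_STEMS)) + r")\w*",
    re.IGNORECASE,
)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies compare equal."""
    return " ".join(text.lower().split())


def build_corpus(tweets):
    """Keep tweets that mention a keyword, dropping duplicates and preserving order."""
    seen = set()
    corpus = []
    for text in tweets:
        if not KEYWORD_RE.search(text):
            continue  # no keyword found, skip this tweet
        key = normalize(text)
        if key in seen:
            continue  # same tweet (after normalization) was already kept
        seen.add(key)
        corpus.append(text)
    return corpus
```

Calling `build_corpus` on an iterable of tweet strings returns the first occurrence of each matching tweet. A real pipeline would more likely rely on tweet IDs and a proper morphological analyzer for Russian word forms.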
+
+
+ ## BibTeX entry and citation info
+
+ Our GitHub repository: https://github.com/sag111/COVID-19-tweets-Russia
+
+ If you find our results helpful in your work, feel free to cite our publication and this repository as:
+
+ ```
+ coming soon
+ ```