julien-c HF staff committed on
Commit
91cdfee
1 Parent(s): c5e61a0

Migrate model card from transformers-repo

Read announcement at https://discuss.huggingface.co/t/announcement-all-model-cards-will-be-migrated-to-hf-co-model-repos/2755
Original file history: https://github.com/huggingface/transformers/commits/master/model_cards/digitalepidemiologylab/covid-twitter-bert/README.md

Files changed (1)
  1. README.md +33 -3
README.md CHANGED
@@ -1,13 +1,23 @@
+ ---
+ language: "en"
+ thumbnail: "https://raw.githubusercontent.com/digitalepidemiologylab/covid-twitter-bert/master/images/COVID-Twitter-BERT_small.png"
+ tags:
+ - Twitter
+ - COVID-19
+ license: mit
+ ---
+
  # COVID-Twitter-BERT (CT-BERT) v1
- BERT-large-uncased model, pretrained on a corpus of messages from Twitter about COVID-19.

- Find more info on our [GitHub page](https://github.com/digitalepidemiologylab/covid-twitter-bert).
+ :warning: _You may want to use the [v2 model](https://huggingface.co/digitalepidemiologylab/covid-twitter-bert-v2) which was trained on more recent data and yields better performance_ :warning:
+

+ BERT-large-uncased model, pretrained on a corpus of messages from Twitter about COVID-19. Find more info on our [GitHub page](https://github.com/digitalepidemiologylab/covid-twitter-bert).

  ## Overview
  This model was trained on 160M tweets collected between January 12 and April 16, 2020 containing at least one of the keywords "wuhan", "ncov", "coronavirus", "covid", or "sars-cov-2". These tweets were filtered and preprocessed to reach a final sample of 22.5M tweets (containing 40.7M sentences and 633M tokens) which were used for training.

- This model was evaluated based on downstream classification tasks, but it could be used for any other NLP task which can leverage contextual embeddings.
+ This model was evaluated based on downstream classification tasks, but it could be used for any other NLP task which can leverage contextual embeddings.

  In order to achieve best results, make sure to use the same text preprocessing as we did for pretraining. This involves replacing user mentions, urls and emojis. You can find a script on our projects [GitHub repo](https://github.com/digitalepidemiologylab/covid-twitter-bert).

@@ -17,5 +27,25 @@ tokenizer = AutoTokenizer.from_pretrained("digitalepidemiologylab/covid-twitter-
  model = AutoModel.from_pretrained("digitalepidemiologylab/covid-twitter-bert")
  ```

+ You can also use the model with the `pipeline` interface:
+
+ ```python
+ from transformers import pipeline
+ import json
+
+ pipe = pipeline(task='fill-mask', model='digitalepidemiologylab/covid-twitter-bert-v2')
+ out = pipe(f"In places with a lot of people, it's a good idea to wear a {pipe.tokenizer.mask_token}")
+ print(json.dumps(out, indent=4))
+ [
+ {
+ "sequence": "[CLS] in places with a lot of people, it's a good idea to wear a mask [SEP]",
+ "score": 0.9959408044815063,
+ "token": 7308,
+ "token_str": "mask"
+ },
+ ...
+ ]
+ ```
+
  ## References
  [1] Martin Müller, Marcel Salaté, Per E Kummervold. "COVID-Twitter-BERT: A Natural Language Processing Model to Analyse COVID-19 Content on Twitter" arXiv preprint arXiv:2005.07503 (2020).
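
The Overview in the card above says that best results depend on applying the same preprocessing as in pretraining: replacing user mentions, URLs and emojis. A minimal Python sketch of that idea follows. It is not the project's script, which lives in the GitHub repo linked in the card. The filler strings `twitteruser` and `twitterurl`, the regular expressions, and the helper names `describe_symbol` and `preprocess_tweet` are all assumptions made for illustration, and emojis are handled here with the standard library's Unicode names rather than a dedicated emoji package.

```python
import re
import unicodedata

# Filler strings are assumptions for illustration; check the project's
# preprocessing script for the values actually used during pretraining.
USER_FILLER = "twitteruser"
URL_FILLER = "twitterurl"

# Rough patterns, not a full URL or Twitter-handle grammar.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
MENTION_RE = re.compile(r"(?<!\w)@\w{1,15}")  # lookbehind skips e-mail addresses


def describe_symbol(ch: str) -> str:
    """Replace a symbol character (Unicode category 'So') with its lowercased name."""
    if unicodedata.category(ch) == "So":
        name = unicodedata.name(ch, "")
        return f" {name.lower()} " if name else " "
    return ch


def preprocess_tweet(text: str) -> str:
    """Replace URLs, user mentions and emoji-like symbols, then collapse whitespace."""
    text = URL_RE.sub(URL_FILLER, text)
    text = MENTION_RE.sub(USER_FILLER, text)
    text = "".join(describe_symbol(ch) for ch in text)
    return " ".join(text.split())


if __name__ == "__main__":
    print(preprocess_tweet("@who says wear a mask 😷 https://t.co/abc"))
```

The cleaned string would then be passed to the tokenizer in place of the raw tweet. Whatever fillers are used at inference time should match the ones used in pretraining, which is why the card points to the project's own script.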