rabindralamsal committed on
Commit
0e5603e
1 Parent(s): 88649d9

Update README.md

  1. README.md +26 -17
README.md CHANGED
@@ -1,13 +1,13 @@
1
  # CrisisTransformers
2
- CrisisTransformers is a family of pre-trained language models and sentence encoders introduced in the paper "[CrisisTransformers: Pre-trained language models and sentence encoders for crisis-related social media texts](https://arxiv.org/abs/2309.05494)". The models were trained based on the RoBERTa pre-training procedure on a massive corpus of over 15 billion word tokens sourced from tweets associated with 30+ crisis events such as disease outbreaks, natural disasters, conflicts, etc. Please refer to the associated paper for more details.
3
 
4
- CrisisTransformers were evaluated on 18 public crisis-specific datasets against strong baselines such as BERT, RoBERTa, BERTweet, etc. Our pre-trained models outperform the baselines across all 18 datasets in classification tasks, and our best-performing sentence-encoder outperforms the state-of-the-art by more than 17\% in sentence encoding tasks.
5
 
6
  ## Uses
7
- CrisisTransformers has 8 pre-trained models and a sentence encoder. The pre-trained models should be finetuned for downstream tasks just like [BERT](https://huggingface.co/bert-base-cased) and [RoBERTa](https://huggingface.co/roberta-base). The sentence encoder can be used out-of-the-box just like [Sentence-Transformers](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) for sentence encoding to facilitate tasks such as semantic search, clustering, topic modelling.
8
 
9
  ## Models and naming conventions
10
- *CT-M1* models were trained from scratch up to 40 epochs, while *CT-M2* models were initialized with pre-trained RoBERTa's weights and *CT-M3* models were initialized with pre-trained BERTweet's weights and both trained for up to 20 epochs. *OneLook* represents the checkpoint after 1 epoch, *BestLoss* represents the checkpoint with the lowest loss during training, and *Complete* represents the checkpoint after completing all epochs. SE represents sentence encoder.
11
 
12
  | pre-trained model | source |
13
  |--|--|
@@ -23,27 +23,36 @@ CrisisTransformers has 8 pre-trained models and a sentence encoder. The pre-trai
23
 
24
  | sentence encoder | source |
25
  |--|--|
26
- |CT-M1-Complete-SE|[crisistransformers/CT-M1-Complete-SE](https://huggingface.co/crisistransformers/CT-M1-Complete-SE)|
27
 
28
-
29
- ## Results
30
- Here are the main results from the associated paper.
31
-
32
- <p float="left">
33
- <a href="https://raw.githubusercontent.com/rabindralamsal/images/main/cls.png"><img width="100%" alt="classification" src="https://raw.githubusercontent.com/rabindralamsal/images/main/cls.png"></a>
34
- <a href="https://raw.githubusercontent.com/rabindralamsal/images/main/se.png"><img width="50%" alt="sentence encoding" src="https://raw.githubusercontent.com/rabindralamsal/images/main/se.png"></a>
35
- </p>
36
 
37
  ## Citation
38
- If you use CrisisTransformers, please cite the following paper:
39
  ```
40
- @misc{lamsal2023crisistransformers,
41
  title={CrisisTransformers: Pre-trained language models and sentence encoders for crisis-related social media texts},
42
  author={Rabindra Lamsal and
43
  Maria Rodriguez Read and
44
  Shanika Karunasekera},
45
- year={2023},
46
- eprint={2309.05494},
47
  archivePrefix={arXiv},
48
  primaryClass={cs.CL}
49
  }
 
1
  # CrisisTransformers
2
+ CrisisTransformers is a family of pre-trained language models and sentence encoders introduced in the papers "[CrisisTransformers: Pre-trained language models and sentence encoders for crisis-related social media texts](https://www.sciencedirect.com/science/article/pii/S0950705124005501)" and "[Semantically Enriched Cross-Lingual Sentence Embeddings for Crisis-related Social Media Texts](https://arxiv.org/abs/2403.16614)". The models were trained following the RoBERTa pre-training procedure on a corpus of over 15 billion word tokens sourced from tweets associated with 30+ crisis events, such as disease outbreaks, natural disasters, and conflicts. Please refer to the [associated paper](https://www.sciencedirect.com/science/article/pii/S0950705124005501) for more details.
3
 
4
+ CrisisTransformers were evaluated on 18 public crisis-specific datasets against strong baselines. Our pre-trained models outperform the baselines across all 18 datasets in classification tasks, and our best-performing mono-lingual sentence encoder outperforms the state-of-the-art by more than 17\% in sentence encoding tasks. The multi-lingual sentence encoders, which support 50+ languages (see the [associated paper](https://arxiv.org/abs/2403.16614)), are designed to approximate the embedding space of the best-performing mono-lingual sentence encoder.
5
 
6
  ## Uses
7
+ CrisisTransformers comprises 8 pre-trained models, 1 mono-lingual sentence encoder, and 2 multi-lingual sentence encoders. The pre-trained models should be fine-tuned for downstream tasks just like [BERT](https://huggingface.co/bert-base-cased) and [RoBERTa](https://huggingface.co/roberta-base). The sentence encoders can be used out-of-the-box just like [Sentence-Transformers](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) for sentence encoding to facilitate tasks such as semantic search, clustering, and topic modelling.
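
As a rough illustration of the out-of-the-box use described above, the following sketch ranks texts against a query by cosine similarity. The helper names (`cosine`, `semantic_search`, `load_crisis_encoder`) are hypothetical and not part of the README. The loader assumes the `sentence-transformers` package is installed and that the checkpoint loads through `SentenceTransformer`, which is an assumption here rather than something this commit states.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
# An encoder maps a batch of texts to one embedding (list of floats) per text.
Encoder = Callable[[Sequence[str]], List[List[float]]]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_search(
    query: str, docs: Sequence[str], encode: Encoder
) -> List[Tuple[str, float]]:
    """Rank docs by cosine similarity to the query, best match first."""
    query_vec = encode([query])[0]
    doc_vecs = encode(list(docs))
    scored = [(doc, cosine(query_vec, vec)) for doc, vec in zip(docs, doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def load_crisis_encoder(
    model_id: str = "crisistransformers/CT-M1-Complete-SE",
) -> Encoder:
    """Assumption: the checkpoint loads via the sentence-transformers package."""
    # Imported lazily so the ranking helpers above work without the dependency.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_id)  # downloads weights on first use
    return lambda texts: model.encode(list(texts)).tolist()
```

With the package installed, `semantic_search("bridge collapsed", tweets, load_crisis_encoder())` would return `(text, score)` pairs in descending order of similarity. Passing the encoder in as a function keeps the ranking logic independent of which of the three sentence encoders is used.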
8
 
9
  ## Models and naming conventions
10
+ *CT-M1* models were trained from scratch for up to 40 epochs. *CT-M2* models were initialized with pre-trained RoBERTa weights and *CT-M3* models with pre-trained BERTweet weights; both were trained for up to 20 epochs. *OneLook* represents the checkpoint after 1 epoch, *BestLoss* represents the checkpoint with the lowest loss during training, and *Complete* represents the checkpoint after completing all epochs. *SE* represents sentence encoder.
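
The naming convention above can be decoded mechanically. The sketch below is a hypothetical helper (`parse_ct_name` and `CTName` are illustrative names, not part of the README) that splits a `CT-M*` model name into its initialization, maximum epoch count, checkpoint type, and sentence-encoder flag. It only interprets names; it makes no claim about which combinations are actually published, and it deliberately rejects the multi-lingual encoders, which follow a different naming pattern.

```python
import re
from dataclasses import dataclass

# Initialization strategy and maximum epochs, as described in the README text.
INITIALIZATION = {
    "M1": ("from scratch", 40),
    "M2": ("pre-trained RoBERTa weights", 20),
    "M3": ("pre-trained BERTweet weights", 20),
}

# Meaning of each checkpoint tag.
CHECKPOINT = {
    "OneLook": "checkpoint after 1 epoch",
    "BestLoss": "checkpoint with the lowest training loss",
    "Complete": "checkpoint after completing all epochs",
}

_NAME = re.compile(r"^CT-(M[123])-(OneLook|BestLoss|Complete)(-SE)?$")


@dataclass(frozen=True)
class CTName:
    initialization: str
    max_epochs: int
    checkpoint: str
    sentence_encoder: bool


def parse_ct_name(name: str) -> CTName:
    """Decode a CT-M* model name; an optional 'org/' prefix is ignored."""
    short = name.split("/")[-1]
    match = _NAME.match(short)
    if match is None:
        raise ValueError(f"not a CT-M* model name: {name!r}")
    variant, checkpoint, se_suffix = match.groups()
    initialization, max_epochs = INITIALIZATION[variant]
    return CTName(
        initialization=initialization,
        max_epochs=max_epochs,
        checkpoint=CHECKPOINT[checkpoint],
        sentence_encoder=se_suffix is not None,
    )
```

For example, `parse_ct_name("crisistransformers/CT-M1-Complete-SE")` describes a sentence encoder built on the from-scratch model's final checkpoint.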
11
 
12
  | pre-trained model | source |
13
  |--|--|
 
23
 
24
  | sentence encoder | source |
25
  |--|--|
26
+ |CT-M1-Complete-SE (mono-lingual: EN)|[crisistransformers/CT-M1-Complete-SE](https://huggingface.co/crisistransformers/CT-M1-Complete-SE)|
27
+ |CT-XLMR-SE (multi-lingual)|[crisistransformers/CT-XLMR-SE](https://huggingface.co/crisistransformers/CT-XLMR-SE)|
28
+ |CT-mBERT-SE (multi-lingual)|[crisistransformers/CT-mBERT-SE](https://huggingface.co/crisistransformers/CT-mBERT-SE)|
29
 
30
+ Languages supported by the multi-lingual sentence encoders: Albanian, Arabic, Armenian, Bulgarian, Catalan, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, Estonian, Finnish, French, French (Canada), Galician, Georgian, German, Greek, Gujarati, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Kurdish (Sorani), Latvian, Lithuanian, Macedonian, Malay, Marathi, Mongolian, Myanmar (Burmese), Norwegian, Persian, Polish, Portuguese, Portuguese (Brazil), Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swedish, Thai, Turkish, Ukrainian, Urdu, and Vietnamese.
31
 
32
  ## Citation
33
+ If you use the CrisisTransformers pre-trained models or the mono-lingual sentence encoder, please cite the following paper:
34
  ```
35
+ @article{lamsal2023crisistransformers,
36
  title={CrisisTransformers: Pre-trained language models and sentence encoders for crisis-related social media texts},
37
  author={Rabindra Lamsal and
38
  Maria Rodriguez Read and
39
  Shanika Karunasekera},
40
+ journal={Knowledge-Based Systems},
41
+ pages={111916},
42
+ year={2024},
43
+ publisher={Elsevier}
44
+ }
45
+ ```
46
+
47
+ If you use the multi-lingual sentence encoders, please cite the following paper:
48
+ ```
49
+ @article{lamsal2024semantically,
50
+ title={Semantically Enriched Cross-Lingual Sentence Embeddings for Crisis-related Social Media Texts},
51
+ author={Rabindra Lamsal and
52
+ Maria Rodriguez Read and
53
+ Shanika Karunasekera},
54
+ year={2024},
55
+ eprint={2403.16614},
56
  archivePrefix={arXiv},
57
  primaryClass={cs.CL}
58
  }