nlp-thedeep committed bc1988f (1 parent: ef5ed2c): Update README.md
Files changed (1): README.md (+33 −1)
---
language:
- en
- fr
- es
- multilingual
---

# HumBert

HumBert is an [XLM-Roberta](https://huggingface.co/xlm-roberta-base) model trained on humanitarian texts: approximately 50 million textual examples (roughly 2 billion tokens) from public humanitarian reports, legal cases, and news articles.
The data were collected from three main sources: [Reliefweb](https://reliefweb.int/), [UNHCR Refworld](https://www.refworld.org/), and [Europe Media Monitor News Brief](https://emm.newsbrief.eu/).
Although XLM-Roberta was pretrained on 100 different languages, this further training was performed on only three (English, French, and Spanish), because a sufficient amount of humanitarian data could not be found in the other languages.

## Intended uses & limitations

To the best of our knowledge, HumBert is the first language model adapted to humanitarian topics, which often use very specific language, making adaptation to downstream tasks (such as disaster response text classification) more effective.
This model is primarily intended to be fine-tuned on tasks such as sequence classification or token classification.
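The sequence-classification use described above can be sketched as follows. This is an illustrative setup, not from the model card: the example texts and the two-label classification head are hypothetical, and the head is randomly initialized until you fine-tune it.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("nlp-thedeep/humbert")
# num_labels=2 is a hypothetical binary task (e.g. relevant vs. not relevant)
model = AutoModelForSequenceClassification.from_pretrained(
    "nlp-thedeep/humbert", num_labels=2
)

texts = [
    "Flooding has displaced thousands of families in the region.",
    "The workshop was rescheduled to next week.",
]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (2, num_labels)
predictions = logits.argmax(dim=-1)
```

From here the model would typically be trained with `Trainer` or a standard PyTorch loop on labeled data; the pretrained encoder weights are loaded, only the classification head starts untrained.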
## Benchmarks

Soon...

## Usage

Here is how to use this model to get the features of a given text in PyTorch:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("nlp-thedeep/humbert")
model = AutoModelForMaskedLM.from_pretrained("nlp-thedeep/humbert")

# prepare input
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors="pt")

# forward pass
output = model(**encoded_input)
```
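Since HumBert is a masked language model, it can also be queried directly with the `fill-mask` pipeline. This is a minimal sketch, not from the model card; the example sentence is our own, and XLM-Roberta-based tokenizers use `<mask>` as the mask token:

```python
from transformers import pipeline

# the pipeline returns the top_k (default 5) most likely fillers for <mask>
fill = pipeline("fill-mask", model="nlp-thedeep/humbert")
results = fill("Floods have displaced thousands of <mask> in the region.")
for r in results:
    print(r["token_str"], round(r["score"], 3))
```

Each result is a dict with the predicted token (`token_str`), its probability (`score`), and the completed sentence (`sequence`).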