license: apache-2.0
language:
- en
- fr
- es
- multilingual
widget:
- text: >-
    Critical levels of out of school children were reported, with 72% of
    respondents pointing to moderate to high numbers of primary school age not
    accessing <mask>
HumBert
HumBert (Humanitarian Bert) is an XLM-Roberta model trained on humanitarian texts: approximately 50 million textual examples (roughly 2 billion tokens) from public humanitarian reports, legal cases and news articles. Data were collected from three main sources: Reliefweb, UNHCR Refworld and Europe Media Monitor News Brief. Although XLM-Roberta was trained on 100 different languages, this fine-tuning was performed on only three (English, French and Spanish), because a sufficient amount of this kind of humanitarian data could not be found in other languages.
Intended uses
To the best of our knowledge, HumBert is the first language model adapted to humanitarian topics. These topics often use very specific language, so starting from HumBert makes adaptation to downstream tasks (such as disaster response text classification) more effective. This model is primarily intended to be fine-tuned on tasks such as sequence classification or token classification.
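As a rough illustration of the fine-tuning use case, the sketch below loads the checkpoint with a sequence classification head. The label names and the helper functions `label_maps` and `load_classifier` are hypothetical and only serve the example; the classification head is newly initialized and still has to be trained on labeled data.

```python
# Hypothetical label set, for illustration only; it is not part of this model card.
LABELS = ["health", "education", "shelter"]


def label_maps(labels):
    """Build the id2label / label2id dictionaries stored in the model config."""
    id2label = dict(enumerate(labels))
    label2id = {name: idx for idx, name in id2label.items()}
    return id2label, label2id


def load_classifier(checkpoint="nlp-thedeep/humbert", labels=LABELS):
    """Load the tokenizer and the encoder with an untrained classification head."""
    # Imported lazily so that label_maps can be used without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    id2label, label2id = label_maps(labels)
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint,
        num_labels=len(labels),
        id2label=id2label,
        label2id=label2id,
    )
    return tokenizer, model
```

Calling `load_classifier()` downloads the checkpoint and returns a tokenizer and model that can be passed to a standard training loop. For token classification, `AutoModelForTokenClassification` could be swapped in the same way.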
Benchmarks
Soon...
Usage
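Here is how to run this model on a given text in PyTorch. Because it is loaded with its masked language modeling head, the output contains token logits; pass output_hidden_states=True in the forward call to also get the hidden-state features: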
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained('nlp-thedeep/humbert')
model = AutoModelForMaskedLM.from_pretrained("nlp-thedeep/humbert")
# prepare input
text = "YOUR TEXT"
encoded_input = tokenizer(text, return_tensors='pt')
# forward pass
output = model(**encoded_input)
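To turn the masked language modeling output into readable predictions, one option is the `fill-mask` pipeline from `transformers`. The following is a minimal sketch, assuming the mask token is `<mask>` as in the widget example of this card; the helper names `require_single_mask` and `predict_mask` are hypothetical.

```python
MASK_TOKEN = "<mask>"  # assumed mask token, taken from the widget example


def require_single_mask(text, mask_token=MASK_TOKEN):
    """Return the text unchanged if it has exactly one mask token, else raise."""
    found = text.count(mask_token)
    if found != 1:
        raise ValueError(f"expected exactly one {mask_token!r}, found {found}")
    return text


def predict_mask(text, checkpoint="nlp-thedeep/humbert", top_k=5):
    """Return (token, score) pairs for the masked position."""
    # Imported lazily so that the check above works without transformers installed.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=checkpoint)
    candidates = fill(require_single_mask(text), top_k=top_k)
    return [(c["token_str"], c["score"]) for c in candidates]
```

Calling `predict_mask("... primary school age not accessing <mask>")` downloads the checkpoint on first use and returns the `top_k` candidate tokens with their scores.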