HumBert

HumBert (Humanitarian Bert) is an XLM-RoBERTa model trained on humanitarian texts: approximately 50 million textual examples (roughly 2 billion tokens) from public humanitarian reports, legal cases and news articles. The data were collected from three main sources: ReliefWeb, UNHCR Refworld and Europe Media Monitor News Brief. Although XLM-RoBERTa was trained on 100 languages, this fine-tuning covered only three (English, French and Spanish), because a sufficient amount of humanitarian data could not be found in other languages.

Intended uses

To the best of our knowledge, HumBert is the first language model adapted to humanitarian topics. Texts on these topics often use very specific language, so starting from an adapted model makes downstream tasks (such as disaster response text classification) more effective. The model is primarily intended to be fine-tuned on tasks such as sequence classification or token classification.
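
As an illustration of the fine-tuning use case, the following is a minimal sketch of loading the checkpoint with a fresh sequence-classification head. The label names and the helper functions `build_label_maps` and `load_for_classification` are hypothetical and exist only for this example; the checkpoint name is the one given in the Usage section. The classification head is newly initialized, so the model has to be trained on labeled data before its predictions mean anything.

```python
from typing import Dict, List, Tuple

MODEL_ID = "nlp-thedeep/humbert"  # checkpoint name from the Usage section

# Hypothetical label set, used only for illustration.
LABELS = ["shelter", "health", "food_security"]


def build_label_maps(labels: List[str]) -> Tuple[Dict[int, str], Dict[str, int]]:
    """Build the id -> label and label -> id mappings stored in the model config."""
    id2label = {i: name for i, name in enumerate(labels)}
    label2id = {name: i for i, name in id2label.items()}
    return id2label, label2id


def load_for_classification(labels: List[str], model_id: str = MODEL_ID):
    """Load the tokenizer and the encoder with an untrained classification head."""
    # Imported here so the label helper above works without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    id2label, label2id = build_label_maps(labels)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id,
        num_labels=len(labels),
        id2label=id2label,
        label2id=label2id,
    )
    return tokenizer, model
```

Calling `load_for_classification(LABELS)` downloads the checkpoint and returns a tokenizer and model that can be passed to a standard training loop. For token classification, `AutoModelForTokenClassification` can be substituted in the same way.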

Benchmarks

Soon...

Usage

Here is how to use this model to get the features of a given text in PyTorch:

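import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("nlp-thedeep/humbert")
model = AutoModelForMaskedLM.from_pretrained("nlp-thedeep/humbert")
# prepare input
text = "YOUR TEXT"
encoded_input = tokenizer(text, return_tensors="pt")
# forward pass without gradient tracking; hidden states are requested
# explicitly because the masked-LM head alone returns vocabulary logits
with torch.no_grad():
    output = model(**encoded_input, output_hidden_states=True)
# token-level features: last hidden layer, shape (batch, tokens, hidden_size)
features = output.hidden_states[-1]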
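
Because the checkpoint is loaded above as a masked language model, it can also be queried for mask predictions. The sketch below assumes the checkpoint keeps XLM-RoBERTa's `<mask>` token (`tokenizer.mask_token` gives the actual value); the helpers `mask_first` and `top_predictions` and the example sentence are hypothetical.

```python
from typing import List, Tuple

MODEL_ID = "nlp-thedeep/humbert"  # checkpoint name from the snippet above
MASK_TOKEN = "<mask>"  # assumption: the XLM-RoBERTa mask token is unchanged


def mask_first(text: str, word: str, mask_token: str = MASK_TOKEN) -> str:
    """Replace the first occurrence of `word` in `text` with the mask token."""
    if word not in text:
        raise ValueError(f"{word!r} does not occur in the text")
    return text.replace(word, mask_token, 1)


def top_predictions(masked_text: str, k: int = 5,
                    model_id: str = MODEL_ID) -> List[Tuple[str, float]]:
    """Return the k most likely fillers for a text containing one mask token."""
    # Imported here so mask_first can be used without transformers installed.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=model_id)
    return [(p["token_str"], p["score"]) for p in fill(masked_text, top_k=k)]
```

For example, `top_predictions(mask_first("Floods displaced thousands of families.", "families"))` downloads the checkpoint and returns candidate tokens with their scores.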