Canarim-Bert-Nheengatu is a BERT model pre-trained on Nheengatu, an indigenous language spoken in Brazil. It is intended for NLP (Natural Language Processing) tasks in Nheengatu, and in that way to support the development of resources for the language.


Nheengatu, also known as Modern Tupi and Amazonian General Language, among other names, is one of the dozens of Brazilian indigenous languages still spoken today. The name Nheengatu emerged around the mid-19th century and originally meant "good language", a compound of the noun nheenga 'language' and the adjective katú 'good'. In the ISO 639-3 standard it is represented by the code yrl, derived from yeral ('general'; geral in Portuguese), one of the names by which the language is known in Spanish.

The study of Nheengatu is of great historical importance, as it was, for two and a half centuries, in the words of José Ribamar Bessa Freire, “the main language of the Amazon”, a position it would lose to Portuguese only in the second half of the 19th century. It is perhaps the only Brazilian indigenous language whose development over more than four centuries can be traced through texts that document its various stages of evolution. (Source: Leonel Figueiredo de Alencar - CompLin)

Training Data

To train the model, an extensive collection of Nheengatu text data was gathered, extracted from various sources such as books, articles, websites, etc. The data were cleaned and prepared for model training. Below is a table with all the sources used for training the model.

Available Models

Model                    Architecture   #Layers   #Params
Canarim-Bert-Nheengatu   BERT           12        110M

How to Use

from transformers import pipeline

# Load a fill-mask pipeline backed by the pre-trained Nheengatu checkpoint.
pipe = pipeline("fill-mask", model="dominguesm/canarim-bert-nheengatu")

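# ptbr: Ele tinha febre, por isso não foi pescar.
# en: He had a fever, so he did not go fishing.
# yrl: Aé urikú takuwa yawé resewara ti usú upinaitika.
# The word "resewara" is replaced by [MASK]; the pipeline returns its highest-scoring candidates.
pipe('Aé urikú takuwa yawé [MASK] ti usú upinaitika.')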
# [{'score': 0.41232067346572876,
#   'token': 460,
#   'token_str': 'tẽ',
#   'sequence': 'Aé urikú takuwa yawé tẽ ti usú upinaitika.'},
#  {'score': 0.1178387925028801,
#   'token': 665,
#   'token_str': 'resewara',
#   'sequence': 'Aé urikú takuwa yawé resewara ti usú upinaitika.'},
#  {'score': 0.029453271999955177,
#   'token': 2168,
#   'token_str': 'artigu',
#   'sequence': 'Aé urikú takuwa yawé artigu ti usú upinaitika.'},
#  {'score': 0.027277836576104164,
#   'token': 669,
#   'token_str': 'sikuyaára',
#   'sequence': 'Aé urikú takuwa yawé sikuyaára ti usú upinaitika.'},
#  {'score': 0.020948367193341255,
#   'token': 642,
#   'token_str': 'akayu',
#   'sequence': 'Aé urikú takuwa yawé akayu ti usú upinaitika.'}]
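
The pipeline returns a list of dictionaries sorted by score, so a common follow-up is to check where the original word lands among the candidates. The sketch below works on an abbreviated copy of the output shown above (top three entries, `sequence` field omitted) and does not call the model. The helper names `rank_of` and `top_tokens` and the `min_score` threshold are illustrative assumptions, not part of the transformers API.

```python
from typing import Dict, List, Optional

# Abbreviated copy of the fill-mask output shown above (top three entries).
predictions: List[Dict] = [
    {"score": 0.41232067346572876, "token": 460, "token_str": "tẽ"},
    {"score": 0.1178387925028801, "token": 665, "token_str": "resewara"},
    {"score": 0.029453271999955177, "token": 2168, "token_str": "artigu"},
]


def rank_of(preds: List[Dict], expected: str) -> Optional[int]:
    """Return the 1-based rank of `expected` among the predictions, or None if absent."""
    for rank, pred in enumerate(preds, start=1):
        if pred["token_str"].strip() == expected:
            return rank
    return None


def top_tokens(preds: List[Dict], min_score: float = 0.1) -> List[str]:
    """Keep only the candidate tokens whose score reaches `min_score` (hypothetical cutoff)."""
    return [pred["token_str"] for pred in preds if pred["score"] >= min_score]


print(rank_of(predictions, "resewara"))  # 2
print(top_tokens(predictions))           # ['tẽ', 'resewara']
```

Here the word from the original sentence, resewara, is the second-ranked candidate, which is why looking beyond the top prediction can be useful for a masked language model.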

NLP Task Performance - POSTAG

The model was evaluated on a token classification task, part-of-speech tagging (POSTAG), using the UD_Nheengatu-CompLin dataset. The per-tag results are shown below.

                precision    recall  f1-score   support

         ADJ     0.7895    0.6522    0.7143        23
         ADP     0.9355    0.9158    0.9255        95
         ADV     0.8261    0.8172    0.8216        93
         AUX     0.9444    0.9189    0.9315        37
       CCONJ     0.7778    0.8750    0.8235         8
         DET     0.8776    0.9149    0.8958        47
        INTJ     0.5000    0.5000    0.5000         4
        NOUN     0.9257    0.9222    0.9239       270
         NUM     1.0000    0.6667    0.8000         6
        PART     0.9775    0.9062    0.9405        96
        PRON     0.9568    1.0000    0.9779       155
       PROPN     0.6429    0.4286    0.5143        21
       PUNCT     0.9963    1.0000    0.9981       267
       SCONJ     0.8000    0.7500    0.7742        32
        VERB     0.8651    0.9347    0.8986       199

   micro avg     0.9202    0.9202    0.9202      1353
   macro avg     0.8543    0.8135    0.8293      1353
weighted avg     0.9191    0.9202    0.9187      1353
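
The three average rows can be recomputed from the per-tag rows, which is a quick sanity check when reading a report like this. The sketch below copies the recall, f1-score, and support columns from the table above; the names `ROWS`, `macro_f1`, `weighted_f1`, and `micro_avg` are illustrative and not taken from the model's evaluation code. It assumes each token receives exactly one tag, in which case the micro average equals overall accuracy.

```python
# Per-tag rows copied from the report above: (tag, recall, f1-score, support).
ROWS = [
    ("ADJ", 0.6522, 0.7143, 23),
    ("ADP", 0.9158, 0.9255, 95),
    ("ADV", 0.8172, 0.8216, 93),
    ("AUX", 0.9189, 0.9315, 37),
    ("CCONJ", 0.8750, 0.8235, 8),
    ("DET", 0.9149, 0.8958, 47),
    ("INTJ", 0.5000, 0.5000, 4),
    ("NOUN", 0.9222, 0.9239, 270),
    ("NUM", 0.6667, 0.8000, 6),
    ("PART", 0.9062, 0.9405, 96),
    ("PRON", 1.0000, 0.9779, 155),
    ("PROPN", 0.4286, 0.5143, 21),
    ("PUNCT", 1.0000, 0.9981, 267),
    ("SCONJ", 0.7500, 0.7742, 32),
    ("VERB", 0.9347, 0.8986, 199),
]


def macro_f1(rows):
    """Unweighted mean of the per-tag F1 scores: every tag counts equally."""
    return sum(f1 for _, _, f1, _ in rows) / len(rows)


def weighted_f1(rows):
    """Mean of the per-tag F1 scores weighted by support: frequent tags dominate."""
    total = sum(support for _, _, _, support in rows)
    return sum(f1 * support for _, _, f1, support in rows) / total


def micro_avg(rows):
    """Correctly tagged tokens (recall * support per tag) divided by all tokens."""
    total = sum(support for _, _, _, support in rows)
    return sum(recall * support for _, recall, _, support in rows) / total


print(f"macro F1:    {macro_f1(ROWS):.4f}")
print(f"weighted F1: {weighted_f1(ROWS):.4f}")
print(f"micro avg:   {micro_avg(ROWS):.4f}")
```

Because the inputs are already rounded to four decimals, the recomputed values should match the table's 0.8293, 0.9187, and 0.9202 only to within about 0.001. The gap between the macro and weighted F1 reflects the weaker scores on low-support tags such as INTJ and PROPN.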

More details about the model and evaluation can be found at dominguesm/canarim-bert-postag-nheengatu.

