---
language:
  - yrl
license: cc-by-nc-sa-4.0
pipeline_tag: token-classification
tags:
  - named-entity-recognition
  - Transformer
  - pytorch
  - bert
  - nheengatu
metrics:
  - f1
  - precision
  - recall
model-index:
- name: canarim-bert-postag-nheengatu
  results:
  - task:
      type: named-entity-recognition
    dataset:
      type: UD_Nheengatu-CompLin
      name: UD Nheengatu CompLin
    metrics:
      - type: f1
        value: 82.93
        name: F1 Score
      - type: accuracy
        value: 92.02
        name: Accuracy
      - type: recall
        value: 81.35
        name: Recall
widget:
  - text: "Apigawa i paya waá umurari iké, sera José."
  - text: "Asú apagari nhaã apigawa supé."
  - text: "― Taukwáu ra."
  - text: "Asuí kwá mukũi apigawa-itá aintá usemu kaá kití aintá upurakí arama balata, asuí mairamé aintá usika ana iwitera rupitá-pe, ape aintá umaã siya kumã iwa-itá."
---

# Canarim-Bert-PosTag-Nheengatu

<p align="center">
  <img width="350" alt="Canarim Logo" src="https://raw.githubusercontent.com/DominguesM/canarim-bert-nheengatu/main/assets/canarim-yrl-nbg.png">
</p>

<br/>

## About

The `canarim-bert-postag-nheengatu` model is a part-of-speech tagger for the Nheengatu language, trained on the `UD_Nheengatu-CompLin` dataset available on [GitHub](https://github.com/UniversalDependencies/UD_Nheengatu-CompLin/). It builds on the [`Canarim-Bert-Nheengatu`](https://huggingface.co/dominguesm/canarim-bert-nheengatu) model and uses its tokenizer.

## Supported Tags

The model can identify the following grammatical classes:

|**tag**|**abbreviation in glossary**|**expansion of abbreviation**|
|-------|-----------------------------|-----------------------------|
|ADJ|adj.|1st class adjective|
|ADP|posp.|postposition|
|ADV|adv.|adverb|
|AUX|aux.|auxiliary|
|CCONJ|cconj.|coordinating conjunction|
|DET|det.|determiner|
|INTJ|interj.|interjection|
|NOUN|n.|1st class noun|
|NUM|num.|numeral|
|PART|part.|particle|
|PRON|pron.|1st class pronoun|
|PROPN|prop.|proper noun|
|PUNCT|punct.|punctuation|
|SCONJ|sconj.|subordinating conjunction|
|VERB|v.|1st class verb|

## Training

### Dataset

The model was trained on the [`UD_Nheengatu-CompLin`](https://github.com/UniversalDependencies/UD_Nheengatu-CompLin/) dataset, split 80/10/10 into training, evaluation, and test sets, respectively.


```
DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'text'],
        num_rows: 1068
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'text'],
        num_rows: 134
    })
    eval: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'text'],
        num_rows: 134
    })
})
```
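
The split sizes above are consistent with taking 80% of the 1,336 sentences for training and halving the remainder. The sketch below reproduces those counts; the helper name `split_indices`, the seed, and the shuffling step are assumptions for illustration, since the exact procedure used to create the split is not documented here.

```python
import random


def split_indices(n_rows, train_frac=0.8, seed=42):
    """Shuffle row indices and split them into train/eval/test (hypothetical helper)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # assumed: seeded shuffle before splitting

    n_train = int(n_rows * train_frac)
    n_eval = (n_rows - n_train) // 2  # the remainder is halved between eval and test

    return {
        "train": indices[:n_train],
        "eval": indices[n_train:n_train + n_eval],
        "test": indices[n_train + n_eval:],
    }


# 1336 rows in total, matching the sum of the three splits listed above
splits = split_indices(1336)
print({name: len(rows) for name, rows in splits.items()})
```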

### Hyperparameters

The hyperparameters used for training were:

* `learning_rate`: 3e-4
* `train_batch_size`: 16
* `eval_batch_size`: 32
* `gradient_accumulation_steps`: 1
* `weight_decay`: 0.01
* `num_train_epochs`: 10
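
These names map closely onto the `TrainingArguments` of the transformers `Trainer`. The sketch below collects them as keyword arguments and derives the number of optimizer steps implied by the 1,068 training rows. It assumes single-device training (so `train_batch_size` corresponds to `per_device_train_batch_size`) and that the final partial batch is kept; whether `Trainer` was used at all is also an assumption.

```python
import math

# Hyperparameters from the list above, keyed by the matching TrainingArguments
# names (assumption: one device, so train_batch_size is the per-device size).
training_kwargs = {
    "learning_rate": 3e-4,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 32,
    "gradient_accumulation_steps": 1,
    "weight_decay": 0.01,
    "num_train_epochs": 10,
}

num_train_rows = 1068  # size of the train split

# Batches per epoch, keeping the final partial batch
batches_per_epoch = math.ceil(
    num_train_rows / training_kwargs["per_device_train_batch_size"]
)
# One optimizer step per `gradient_accumulation_steps` batches
steps_per_epoch = math.ceil(
    batches_per_epoch / training_kwargs["gradient_accumulation_steps"]
)
total_steps = steps_per_epoch * training_kwargs["num_train_epochs"]

print(steps_per_epoch, total_steps)
```

Under these assumptions training runs for 67 steps per epoch and 670 steps in total. With transformers installed, the dictionary could be passed as `TrainingArguments(output_dir="out", **training_kwargs)`, where `output_dir` is a placeholder.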

### Results

The training and validation loss curves are shown below:

<p align="center">
  <img width="600" alt="Train Loss" src="https://raw.githubusercontent.com/DominguesM/canarim-bert-nheengatu/main/assets/postag-train-loss.png">
</p>

<p align="center">
  <img width="600" alt="Eval Loss" src="https://raw.githubusercontent.com/DominguesM/canarim-bert-nheengatu/main/assets/postag-eval-loss.png">
</p>

The model's results on the evaluation set are listed below:

```
{
  'eval_loss': 0.5337784886360168,
  'eval_precision': 0.913735899137359,
  'eval_recall': 0.913735899137359,
  'eval_f1': 0.913735899137359,
  'eval_accuracy': 0.913735899137359,
  'eval_runtime': 0.1957,
  'eval_samples_per_second': 684.883,
  'eval_steps_per_second': 25.555,
  'epoch': 10.0
}
```
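
Precision, recall, F1, and accuracy are identical here, which is what micro-averaging produces when every token receives exactly one tag: each wrong prediction counts as one false positive (for the predicted tag) and one false negative (for the gold tag). The toy example below illustrates this with made-up tags; it is not the evaluation code used for this model.

```python
def micro_scores(gold, pred):
    """Micro-averaged precision, recall, F1 and accuracy for single-label tagging."""
    labels = set(gold) | set(pred)
    tp = fp = fn = 0
    for label in labels:
        for g, p in zip(gold, pred):
            if p == label and g == label:
                tp += 1
            elif p == label:
                fp += 1  # predicted this tag, but the gold tag differs
            elif g == label:
                fn += 1  # gold is this tag, but another tag was predicted
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return precision, recall, f1, accuracy


# Made-up example: 5 tokens, 1 tagging error (VERB predicted as NOUN)
gold = ["VERB", "NOUN", "PUNCT", "VERB", "NOUN"]
pred = ["VERB", "NOUN", "PUNCT", "NOUN", "NOUN"]
print(micro_scores(gold, pred))
```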

### Metrics

The model's per-tag metrics on the test set are listed below:

```
                precision    recall  f1-score   support

         ADJ     0.7895    0.6522    0.7143        23
         ADP     0.9355    0.9158    0.9255        95
         ADV     0.8261    0.8172    0.8216        93
         AUX     0.9444    0.9189    0.9315        37
       CCONJ     0.7778    0.8750    0.8235         8
         DET     0.8776    0.9149    0.8958        47
        INTJ     0.5000    0.5000    0.5000         4
        NOUN     0.9257    0.9222    0.9239       270
         NUM     1.0000    0.6667    0.8000         6
        PART     0.9775    0.9062    0.9405        96
        PRON     0.9568    1.0000    0.9779       155
       PROPN     0.6429    0.4286    0.5143        21
       PUNCT     0.9963    1.0000    0.9981       267
       SCONJ     0.8000    0.7500    0.7742        32
        VERB     0.8651    0.9347    0.8986       199

   micro avg     0.9202    0.9202    0.9202      1353
   macro avg     0.8543    0.8135    0.8293      1353
weighted avg     0.9191    0.9202    0.9187      1353
```
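
The `macro avg` row is the unweighted mean of the per-tag scores, while `weighted avg` weights each tag by its support. This accounts for the gap between the macro F1 (0.8293) and the micro average (0.9202): rare tags such as `INTJ` and `PROPN` score low but contribute few tokens. The snippet below recomputes both F1 averages from the rounded values in the report, so the results can differ from the reported ones in the last digit.

```python
# tag -> (f1-score, support), copied from the classification report
report = {
    "ADJ": (0.7143, 23),
    "ADP": (0.9255, 95),
    "ADV": (0.8216, 93),
    "AUX": (0.9315, 37),
    "CCONJ": (0.8235, 8),
    "DET": (0.8958, 47),
    "INTJ": (0.5000, 4),
    "NOUN": (0.9239, 270),
    "NUM": (0.8000, 6),
    "PART": (0.9405, 96),
    "PRON": (0.9779, 155),
    "PROPN": (0.5143, 21),
    "PUNCT": (0.9981, 267),
    "SCONJ": (0.7742, 32),
    "VERB": (0.8986, 199),
}

total_support = sum(support for _, support in report.values())

# Macro average: every tag counts equally, regardless of frequency
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted average: each tag contributes in proportion to its support
weighted_f1 = sum(f1 * support for f1, support in report.values()) / total_support

print(f"macro F1 = {macro_f1:.4f}, weighted F1 = {weighted_f1:.4f}")
```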

<br/>

<p align="center">
  <img width="600" alt="Canarim BERT Nheengatu - POSTAG - Confusion Matrix" src="https://raw.githubusercontent.com/DominguesM/canarim-bert-nheengatu/main/assets/postag-confusion-matrix.png">
</p>

## Usage

The model works with the standard token-classification pipeline of the [transformers](https://github.com/huggingface/transformers) library. Install the library, then load the model:


```python
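from transformers import pipeline

# Model repository on the Hugging Face Hub
model_name = "dominguesm/canarim-bert-postag-nheengatu"

# POS tagging is served through the token-classification ("ner") pipeline
pipe = pipeline("ner", model=model_name)

# "average" merges subword pieces into whole words and averages their scores
pipe("Yamunhã timbiú, yapinaitika, yamunhã kaxirí.", aggregation_strategy="average")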
```

The result will be:

```json
[
  {"entity_group": "VERB", "score": 0.999668, "word": "Yamunhã", "start": 0, "end": 7},
  {"entity_group": "NOUN", "score": 0.99986947, "word": "timbiú", "start": 8, "end": 14},
  {"entity_group": "PUNCT", "score": 0.99993193, "word": ",", "start": 14, "end": 15},
  {"entity_group": "VERB", "score": 0.9995308, "word": "yapinaitika", "start": 16, "end": 27},
  {"entity_group": "PUNCT", "score": 0.9999416, "word": ",", "start": 27, "end": 28},
  {"entity_group": "VERB", "score": 0.99955815, "word": "yamunhã", "start": 29, "end": 36},
  {"entity_group": "NOUN", "score": 0.9998684, "word": "kaxirí", "start": 37, "end": 43},
  {"entity_group": "PUNCT", "score": 0.99997807, "word": ".", "start": 43, "end": 44}
]
```
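
Each entry of the aggregated output is a dictionary, so reducing it to `(word, tag)` pairs takes one comprehension. The sketch below works on a shortened copy of the result shown above (scores dropped for brevity) rather than calling the model; the `word/TAG` rendering is just one possible presentation.

```python
# Shortened copy of the pipeline output (scores omitted)
result = [
    {"entity_group": "VERB", "word": "Yamunhã", "start": 0, "end": 7},
    {"entity_group": "NOUN", "word": "timbiú", "start": 8, "end": 14},
    {"entity_group": "PUNCT", "word": ",", "start": 14, "end": 15},
    {"entity_group": "VERB", "word": "yapinaitika", "start": 16, "end": 27},
    {"entity_group": "PUNCT", "word": ",", "start": 27, "end": 28},
    {"entity_group": "VERB", "word": "yamunhã", "start": 29, "end": 36},
    {"entity_group": "NOUN", "word": "kaxirí", "start": 37, "end": 43},
    {"entity_group": "PUNCT", "word": ".", "start": 43, "end": 44},
]

# Keep only the surface form and its predicted tag
pairs = [(item["word"], item["entity_group"]) for item in result]

# Render in word/TAG notation
tagged = " ".join(f"{word}/{tag}" for word, tag in pairs)
print(tagged)
```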

## License

The license of this model follows that of the dataset used for training, which is [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode). For more information, please visit the [dataset repository](https://github.com/UniversalDependencies/UD_Nheengatu-CompLin/tree/master).


## References

```bibtex
@inproceedings{stil,
  author = {Leonel de Alencar},
  title = {Yauti: A Tool for Morphosyntactic Analysis of Nheengatu within the Universal Dependencies Framework},
  booktitle = {Anais do XIV Simpósio Brasileiro de Tecnologia da Informação e da Linguagem Humana},
  location = {Belo Horizonte/MG},
  year = {2023},
  keywords = {},
  issn = {0000-0000},
  pages = {135--145},
  publisher = {SBC},
  address = {Porto Alegre, RS, Brasil},
  doi = {10.5753/stil.2023.234131},
  url = {https://sol.sbc.org.br/index.php/stil/article/view/25445}
}
```