dominguesm committed de6f73b (parent: 486e2e9): README update

Files changed:
- README.md (+179, -4)
- README_ptbr.md (+221, -0)

README.md
CHANGED
---
language:
- yrl
license: cc-by-nc-4.0
pipeline_tag: token-classification
tags:
- named-entity-recognition
- Transformer
- pytorch
- bert
- nheengatu
metrics:
- f1
- precision
- recall
model-index:
- name: canarim-bert-postag-nheengatu
  results:
  - task:
      type: named-entity-recognition
    dataset:
      type: UD_Nheengatu-CompLin
      name: UD Nheengatu CompLin
    metrics:
    - type: f1
      value: 82.93
      name: F1 Score
    - type: accuracy
      value: 92.02
      name: Accuracy
    - type: recall
      value: 81.35
      name: Recall
widget:
- text: "Apigawa i paya waá umurari iké, sera José."
- text: "Asú apagari nhaã apigawa supé."
- text: "― Taukwáu ra."
- text: "Asuí kwá mukũi apigawa-itá aintá usemu kaá kití aintá upurakí arama balata, asuí mairamé aintá usika ana iwitera rupitá-pe, ape aintá umaã siya kumã iwa-itá."
---

# Canarim-Bert-PosTag-Nheengatu

<p align="center">
  <img width="350" alt="Canarim Logo" src="https://raw.githubusercontent.com/DominguesM/canarim-bert-nheengatu/main/assets/canarim-yrl-nbg.png">
</p>

<br/>

## About

The `canarim-bert-postag-nheengatu` model is a part-of-speech tagging model for the Nheengatu language, trained on the `UD_Nheengatu-CompLin` dataset available on [GitHub](https://github.com/UniversalDependencies/UD_Nheengatu-CompLin/). It builds on the tokenizer and the [`Canarim-Bert-Nheengatu`](https://huggingface.co/dominguesm/canarim-bert-nheengatu) model.
## Supported Tags

The model can identify the following grammatical classes:

|**tag**|**abbreviation in glossary**|**expansion of abbreviation**|
|-------|----------------------------|-----------------------------|
|ADJ|adj.|1st class adjective|
|ADP|posp.|postposition|
|ADV|adv.|adverb|
|AUX|aux.|auxiliary|
|CCONJ|cconj.|coordinating conjunction|
|DET|det.|determiner|
|INTJ|interj.|interjection|
|NOUN|n.|1st class noun|
|NUM|num.|numeral|
|PART|part.|particle|
|PRON|pron.|1st class pronoun|
|PROPN|prop.|proper noun|
|PUNCT|punct.|punctuation|
|SCONJ|sconj.|subordinating conjunction|
|VERB|v.|1st class verb|
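For downstream code it can help to have this tag set as an explicit label mapping. A minimal sketch; note the index order here is an assumption for illustration, the model's own `config.json` (`id2label`/`label2id`) is authoritative:

```python
# UPOS tags supported by the model, in the order of the table above
# (hypothetical ordering; check the model config for the real mapping).
TAGS = [
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN",
    "NUM", "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "VERB",
]

# Build id <-> label mappings from the list.
id2label = dict(enumerate(TAGS))
label2id = {tag: i for i, tag in enumerate(TAGS)}

print(len(TAGS))         # 15 classes
print(label2id["VERB"])  # 14
```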
## Training

### Dataset

The dataset used for training was [`UD_Nheengatu-CompLin`](https://github.com/UniversalDependencies/UD_Nheengatu-CompLin/), divided into 80/10/10 proportions for training, evaluation, and testing, respectively.

```
DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'text'],
        num_rows: 1068
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'text'],
        num_rows: 134
    })
    eval: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'text'],
        num_rows: 134
    })
})
```
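An 80/10/10 split like the one above can be reproduced with plain index slicing. A minimal, library-free sketch; the seed and shuffling strategy are assumptions for illustration, not the authors' exact procedure:

```python
import random

def split_80_10_10(examples, seed=42):
    """Shuffle a list of examples and split it into train/eval/test (80/10/10)."""
    rng = random.Random(seed)
    idx = list(range(len(examples)))
    rng.shuffle(idx)
    n = len(examples)
    n_train = int(n * 0.8)          # 80% for training
    n_eval = (n - n_train) // 2     # half the remainder for evaluation
    train = [examples[i] for i in idx[:n_train]]
    eval_ = [examples[i] for i in idx[n_train:n_train + n_eval]]
    test = [examples[i] for i in idx[n_train + n_eval:]]
    return train, eval_, test

# 1068 + 134 + 134 = 1336 sentences total, as in the DatasetDict above.
sentences = [f"sent-{i}" for i in range(1336)]
train, eval_, test = split_80_10_10(sentences)
print(len(train), len(eval_), len(test))  # 1068 134 134
```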
### Hyperparameters

The hyperparameters used for training were:

* `learning_rate`: 3e-4
* `train_batch_size`: 16
* `eval_batch_size`: 32
* `gradient_accumulation_steps`: 1
* `weight_decay`: 0.01
* `num_train_epochs`: 10
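In `transformers`-style training code these settings typically map onto `TrainingArguments` fields. A hedged sketch as a plain configuration dict (the `per_device_*` naming follows the usual `transformers` convention and is an assumption about how the listed values were passed):

```python
# Training configuration mirroring the hyperparameter list above.
training_config = {
    "learning_rate": 3e-4,
    "per_device_train_batch_size": 16,  # train_batch_size
    "per_device_eval_batch_size": 32,   # eval_batch_size
    "gradient_accumulation_steps": 1,
    "weight_decay": 0.01,
    "num_train_epochs": 10,
}

# With gradient_accumulation_steps = 1 on a single device, the effective
# batch size is just the per-device train batch size.
effective_batch = (training_config["per_device_train_batch_size"]
                   * training_config["gradient_accumulation_steps"])
print(effective_batch)  # 16
```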
### Results

The training and validation loss over the steps can be seen below:

<p align="center">
  <img width="600" alt="Train Loss" src="https://raw.githubusercontent.com/DominguesM/canarim-bert-nheengatu/main/assets/postag-train-loss.png">
</p>

<p align="center">
  <img width="600" alt="Eval Loss" src="https://raw.githubusercontent.com/DominguesM/canarim-bert-nheengatu/main/assets/postag-eval-loss.png">
</p>

The model's results on the evaluation set can be viewed below:
```
{
    'eval_loss': 0.5337784886360168,
    'eval_precision': 0.913735899137359,
    'eval_recall': 0.913735899137359,
    'eval_f1': 0.913735899137359,
    'eval_accuracy': 0.913735899137359,
    'eval_runtime': 0.1957,
    'eval_samples_per_second': 684.883,
    'eval_steps_per_second': 25.555,
    'epoch': 10.0
}
```
### Metrics

The model's evaluation metrics on the test set can be viewed below:
```
              precision    recall  f1-score   support

         ADJ     0.7895    0.6522    0.7143        23
         ADP     0.9355    0.9158    0.9255        95
         ADV     0.8261    0.8172    0.8216        93
         AUX     0.9444    0.9189    0.9315        37
       CCONJ     0.7778    0.8750    0.8235         8
         DET     0.8776    0.9149    0.8958        47
        INTJ     0.5000    0.5000    0.5000         4
        NOUN     0.9257    0.9222    0.9239       270
         NUM     1.0000    0.6667    0.8000         6
        PART     0.9775    0.9062    0.9405        96
        PRON     0.9568    1.0000    0.9779       155
       PROPN     0.6429    0.4286    0.5143        21
       PUNCT     0.9963    1.0000    0.9981       267
       SCONJ     0.8000    0.7500    0.7742        32
        VERB     0.8651    0.9347    0.8986       199

   micro avg     0.9202    0.9202    0.9202      1353
   macro avg     0.8543    0.8135    0.8293      1353
weighted avg     0.9191    0.9202    0.9187      1353
```
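As a sanity check, the macro average is just the unweighted mean of the per-class scores in the report above. For f1:

```python
# Per-class f1-scores from the test-set report above, ADJ through VERB.
f1_scores = [
    0.7143, 0.9255, 0.8216, 0.9315, 0.8235, 0.8958, 0.5000, 0.9239,
    0.8000, 0.9405, 0.9779, 0.5143, 0.9981, 0.7742, 0.8986,
]

# Macro average: unweighted mean over the 15 classes.
macro_f1 = sum(f1_scores) / len(f1_scores)
print(round(macro_f1, 4))  # 0.8293, matching the "macro avg" row
```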
<br/>
|
167 |
+
|
168 |
+
<p align="center">
|
169 |
+
<img width="600" alt="Canarim BERT Nheengatu - POSTAG - Confusion Matrix" src="https://raw.githubusercontent.com/DominguesM/canarim-bert-nheengatu/main/assets/postag-confusion-matrix.png">
|
170 |
+
</p>
|
171 |
+
|
172 |
+
## Usage

The use of this model follows the common standards of the [transformers](https://github.com/huggingface/transformers) library. To use it, simply install the library and load the model:

```python
from transformers import pipeline

model_name = "dominguesm/canarim-bert-postag-nheengatu"

pipe = pipeline("ner", model=model_name)

pipe("Yamunhã timbiú, yapinaitika, yamunhã kaxirí.", aggregation_strategy="average")
```

The result will be:

```json
[
  {"entity_group": "VERB", "score": 0.999668, "word": "Yamunhã", "start": 0, "end": 7},
  {"entity_group": "NOUN", "score": 0.99986947, "word": "timbiú", "start": 8, "end": 14},
  {"entity_group": "PUNCT", "score": 0.99993193, "word": ",", "start": 14, "end": 15},
  {"entity_group": "VERB", "score": 0.9995308, "word": "yapinaitika", "start": 16, "end": 27},
  {"entity_group": "PUNCT", "score": 0.9999416, "word": ",", "start": 27, "end": 28},
  {"entity_group": "VERB", "score": 0.99955815, "word": "yamunhã", "start": 29, "end": 36},
  {"entity_group": "NOUN", "score": 0.9998684, "word": "kaxirí", "start": 37, "end": 43},
  {"entity_group": "PUNCT", "score": 0.99997807, "word": ".", "start": 43, "end": 44}
]
```
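The pipeline output can be reduced to plain (word, tag) pairs for downstream use. A small sketch over the example output above; the entities are hard-coded here (scores dropped) so the snippet runs without downloading the model:

```python
# Example pipeline output for "Yamunhã timbiú, yapinaitika, yamunhã kaxirí."
# taken from the README example, keeping only the fields we need.
entities = [
    {"entity_group": "VERB", "word": "Yamunhã"},
    {"entity_group": "NOUN", "word": "timbiú"},
    {"entity_group": "PUNCT", "word": ","},
    {"entity_group": "VERB", "word": "yapinaitika"},
    {"entity_group": "PUNCT", "word": ","},
    {"entity_group": "VERB", "word": "yamunhã"},
    {"entity_group": "NOUN", "word": "kaxirí"},
    {"entity_group": "PUNCT", "word": "."},
]

# Collapse each aggregated entity into a (word, UPOS tag) pair.
tagged = [(e["word"], e["entity_group"]) for e in entities]
print(tagged[:2])  # [('Yamunhã', 'VERB'), ('timbiú', 'NOUN')]
```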
## License

The license of this model follows that of the dataset used for training, which is [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode). For more information, please visit the [dataset repository](https://github.com/UniversalDependencies/UD_Nheengatu-CompLin/tree/master).
## References

```bibtex
@inproceedings{stil,
  author    = {Leonel de Alencar},
  title     = {Yauti: A Tool for Morphosyntactic Analysis of Nheengatu within the Universal Dependencies Framework},
  booktitle = {Anais do XIV Simpósio Brasileiro de Tecnologia da Informação e da Linguagem Humana},
  location  = {Belo Horizonte/MG},
  year      = {2023},
  pages     = {135--145},
  publisher = {SBC},
  address   = {Porto Alegre, RS, Brasil},
  doi       = {10.5753/stil.2023.234131},
  url       = {https://sol.sbc.org.br/index.php/stil/article/view/25445}
}
```
README_ptbr.md
ADDED

*(Portuguese translation of the README above; the content is otherwise identical.)*