polish-roberta-base-v2-pos-tagging
This model is a fine-tuned version of sdadas/polish-roberta-base-v2 on the nkjp1m dataset. It achieves the following results on the evaluation set:
- Loss: 0.0508
- Precision: 0.9853
- Recall: 0.9858
- F1: 0.9856
- Accuracy: 0.9884
You can find the training notebook here: https://github.com/WikKam/roberta-pos-finetuning
Usage
from transformers import pipeline
nlp = pipeline("token-classification", "wkaminski/polish-roberta-base-v2-pos-tagging")
nlp("Ale dzisiaj leje")
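The pipeline returns one prediction per sub-word token, so a word split into several pieces gets several entries. A minimal sketch of collapsing these into one tag per whitespace-separated word follows. It assumes each prediction dict carries the `entity`, `start` and `end` keys (character offsets are only filled in when a fast tokenizer is used). `word_level_tags` is a hypothetical helper, not part of transformers.

```python
def word_level_tags(text, predictions):
    """Return one (word, tag) pair per whitespace-separated word.

    `predictions` is expected to be the list of dicts produced by the
    token-classification pipeline, each with "entity", "start" and "end".
    """
    # Character spans of the whitespace-separated words in the input.
    spans = []
    pos = 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        spans.append((word, start, end))
        pos = end

    tagged = []
    for word, start, end in spans:
        # Take the tag of the first sub-token whose span overlaps the word;
        # words with no overlapping prediction get None.
        tag = next(
            (p["entity"] for p in predictions
             if p["start"] < end and p["end"] > start),
            None,
        )
        tagged.append((word, tag))
    return tagged
```

With the pipeline above this would be called as `word_level_tags(text, nlp(text))`. Splitting on whitespace is a simplification: nkjp1m treats punctuation and some agglutinated forms as separate segments, so a word such as "leje." would keep only the tag of its first segment.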
Model description
This model is a part-of-speech tagger for the Polish language based on sdadas/polish-roberta-base-v2.
It supports 40 classes, each representing a flexemic class (detailed part of speech):
{
0: 'adj',
1: 'adja',
2: 'adjc',
3: 'adjp',
4: 'adv',
5: 'aglt',
6: 'bedzie',
7: 'brev',
8: 'comp',
9: 'conj',
10: 'depr',
11: 'dig',
12: 'fin',
13: 'frag',
14: 'ger',
15: 'imps',
16: 'impt',
17: 'inf',
18: 'interj',
19: 'interp',
20: 'num',
21: 'numcomp',
22: 'pact',
23: 'pacta',
24: 'pant',
25: 'part',
26: 'pcon',
27: 'ppas',
28: 'ppron12',
29: 'ppron3',
30: 'praet',
31: 'pred',
32: 'prep',
33: 'romandig',
34: 'siebie',
35: 'subst',
36: 'sym',
37: 'winien',
38: 'xxs',
39: 'xxx'
}
The tags have the same meaning as in the nkjp1m dataset:
flexeme | abbreviation | base form | example |
---|---|---|---|
noun | subst | singular nominative | profesor |
depreciative form | depr | singular nominative form of the corresponding noun | profesor |
main numeral | num | inanimate masculine nominative form | pięć, dwa |
collective numeral | numcol | inanimate masculine nominative form of the main numeral | pięć, dwa |
adjective | adj | singular nominative masculine positive form | polski |
ad-adjectival adjective | adja | singular nominative masculine positive form of the adjective | polski |
post-prepositional adjective | adjp | singular nominative masculine positive form of the adjective | polski |
predicative adjective | adjc | singular nominative masculine positive form of the adjective | zdrowy, ciekawy |
adverb | adv | positive form | dobrze, bardzo |
non-3rd person pronoun | ppron12 | singular nominative | ja |
3rd-person pronoun | ppron3 | singular nominative | on |
pronoun siebie | siebie | accusative | siebie |
non-past form | fin | infinitive | czytać |
future być | bedzie | infinitive | być |
agglutinate być | aglt | infinitive | być |
l-participle | praet | infinitive | czytać |
imperative | impt | infinitive | czytać |
impersonal | imps | infinitive | czytać |
infinitive | inf | infinitive | czytać |
contemporary adv. participle | pcon | infinitive | czytać |
anterior adv. participle | pant | infinitive | czytać |
gerund | ger | infinitive | czytać |
active adj. participle | pact | infinitive | czytać |
passive adj. participle | ppas | infinitive | czytać |
winien | winien | singular masculine form | powinien, rad |
predicative | pred | the only form of that flexeme | warto |
preposition | prep | the non-vocalic form of that flexeme | na, przez, w |
coordinating conjunction | conj | the only form of that flexeme | oraz |
subordinating conjunction | comp | the only form of that flexeme | że |
particle-adverb | qub | the only form of that flexeme | nie, -że, się |
abbreviation | brev | the full dictionary form | rok, i tak dalej |
bound word | burk | the only form of that flexeme | trochu, oścież |
interjection | interj | the only form of that flexeme | ech, kurde |
punctuation | interp | the only form of that flexeme | ;, ., (, ] |
alien | xxx | the only form of that flexeme | cool, nihil |
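The label mapping and the tag table above do not list exactly the same abbreviations. A small sketch that compares the two sets, with both copied by hand from this card, is given below. It only reports the differences and does not assume which labels correspond to each other.

```python
# Labels from the model's id-to-label mapping listed above.
MODEL_LABELS = {
    "adj", "adja", "adjc", "adjp", "adv", "aglt", "bedzie", "brev",
    "comp", "conj", "depr", "dig", "fin", "frag", "ger", "imps",
    "impt", "inf", "interj", "interp", "num", "numcomp", "pact",
    "pacta", "pant", "part", "pcon", "ppas", "ppron12", "ppron3",
    "praet", "pred", "prep", "romandig", "siebie", "subst", "sym",
    "winien", "xxs", "xxx",
}

# Abbreviations documented in the tag table above.
TABLE_TAGS = {
    "subst", "depr", "num", "numcol", "adj", "adja", "adjp", "adjc",
    "adv", "ppron12", "ppron3", "siebie", "fin", "bedzie", "aglt",
    "praet", "impt", "imps", "inf", "pcon", "pant", "ger", "pact",
    "ppas", "winien", "pred", "prep", "conj", "comp", "qub", "brev",
    "burk", "interj", "interp", "xxx",
}

# Labels the model can emit that the table does not describe.
undocumented = sorted(MODEL_LABELS - TABLE_TAGS)
# Table entries that are not among the model's labels.
not_in_model = sorted(TABLE_TAGS - MODEL_LABELS)

print("Model labels missing from the table:", undocumented)
print("Table tags missing from the model:", not_in_model)
```

Eight model labels (`dig`, `frag`, `numcomp`, `pacta`, `part`, `romandig`, `sym`, `xxs`) have no row in the table, and three table rows (`burk`, `numcol`, `qub`) are not among the model's labels, so downstream code should not rely on the table alone to interpret predictions.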
Intended uses & limitations
Although good tools for POS tagging in Polish already exist (for example Morfeusz, http://morfeusz.sgjp.pl/), I needed a Polish POS tagger that could be easily loaded in the browser. Hugging Face supports this, which is why I created this model.
Training and evaluation data
The model was trained on half of the test data of the nkjp1m dataset (~0.5 million tokens).
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 3
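The `linear` scheduler lowers the learning rate from its initial value to zero over the course of training. A minimal sketch of that schedule follows, assuming no warmup steps (none are listed above) and the 6465 total optimizer steps reported in the training results. `linear_lr` is a hypothetical helper written for illustration, not a library function.

```python
def linear_lr(step, base_lr=2e-5, total_steps=6465, warmup_steps=0):
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr during warmup (unused when warmup_steps=0).
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly so that the rate reaches 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Learning rate at the start and at the end of each of the three epochs.
for step in (0, 2155, 4310, 6465):
    print(step, linear_lr(step))
```

Under these assumptions the rate is two thirds of its initial value after the first epoch, one third after the second, and zero at the final step.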
Training results
Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 | Accuracy |
---|---|---|---|---|---|---|---|
0.0665 | 1.0 | 2155 | 0.0629 | 0.9835 | 0.9836 | 0.9836 | 0.9867 |
0.0369 | 2.0 | 4310 | 0.0539 | 0.9842 | 0.9848 | 0.9845 | 0.9876 |
0.0243 | 3.0 | 6465 | 0.0508 | 0.9853 | 0.9858 | 0.9856 | 0.9884 |
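As a sanity check on the table, F1 can be recomputed from the precision and recall columns. The sketch below assumes the reported F1 is the harmonic mean of the reported precision and recall, which holds for micro-averaged scores but is not stated in this card. The numbers are copied from the rows above.

```python
# (precision, recall, reported F1) per epoch, copied from the table above.
RESULTS = {
    1: (0.9835, 0.9836, 0.9836),
    2: (0.9842, 0.9848, 0.9845),
    3: (0.9853, 0.9858, 0.9856),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


for epoch, (p, r, reported) in RESULTS.items():
    recomputed = f1_score(p, r)
    print(f"epoch {epoch}: recomputed {recomputed:.5f}, reported {reported:.4f}")
```

Because the inputs are already rounded to four decimals, the recomputed values can differ from the reported ones by up to about 0.0001.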
Framework versions
- Transformers 4.36.0
- Pytorch 2.1.0+cu118
- Datasets 2.15.0
- Tokenizers 0.15.0