
polish-roberta-base-v2-pos-tagging

This model is a fine-tuned version of sdadas/polish-roberta-base-v2 on the nkjp1m dataset. It achieves the following results on the evaluation set:

  • Loss: 0.0508
  • Precision: 0.9853
  • Recall: 0.9858
  • F1: 0.9856
  • Accuracy: 0.9884

You can find the training notebook here: https://github.com/WikKam/roberta-pos-finetuning

Usage

from transformers import pipeline

# load the tagger from the Hugging Face Hub
nlp = pipeline("token-classification", "wkaminski/polish-roberta-base-v2-pos-tagging")

# "Ale dzisiaj leje" is Polish for, roughly, "It's really pouring today"
nlp("Ale dzisiaj leje")

Model description

This model is a part-of-speech tagger for the Polish language based on sdadas/polish-roberta-base-v2.

It supports 40 classes, each representing a flexemic class (a detailed part of speech):

{
 0: 'adj',
 1: 'adja',
 2: 'adjc',
 3: 'adjp',
 4: 'adv',
 5: 'aglt',
 6: 'bedzie',
 7: 'brev',
 8: 'comp',
 9: 'conj',
 10: 'depr',
 11: 'dig',
 12: 'fin',
 13: 'frag',
 14: 'ger',
 15: 'imps',
 16: 'impt',
 17: 'inf',
 18: 'interj',
 19: 'interp',
 20: 'num',
 21: 'numcomp',
 22: 'pact',
 23: 'pacta',
 24: 'pant',
 25: 'part',
 26: 'pcon',
 27: 'ppas',
 28: 'ppron12',
 29: 'ppron3',
 30: 'praet',
 31: 'pred',
 32: 'prep',
 33: 'romandig',
 34: 'siebie',
 35: 'subst',
 36: 'sym',
 37: 'winien',
 38: 'xxs',
 39: 'xxx'
}
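
The same mapping should also be stored in the model configuration, so it can be read programmatically instead of being hard-coded (a minimal sketch):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("wkaminski/polish-roberta-base-v2-pos-tagging")
print(config.id2label)  # {0: 'adj', 1: 'adja', ...}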

The tags have the same meaning as in the nkjp1m dataset:

| flexeme | abbreviation | base form | example |
| --- | --- | --- | --- |
| noun | subst | singular nominative | profesor |
| depreciative form | depr | singular nominative form of the corresponding noun | profesor |
| main numeral | num | inanimate masculine nominative form | pięć, dwa |
| collective numeral | numcol | inanimate masculine nominative form of the main numeral | pięć, dwa |
| adjective | adj | singular nominative masculine positive form | polski |
| ad-adjectival adjective | adja | singular nominative masculine positive form of the adjective | polski |
| post-prepositional adjective | adjp | singular nominative masculine positive form of the adjective | polski |
| predicative adjective | adjc | singular nominative masculine positive form of the adjective | zdrowy, ciekawy |
| adverb | adv | positive form | dobrze, bardzo |
| non-3rd person pronoun | ppron12 | singular nominative | ja |
| 3rd-person pronoun | ppron3 | singular nominative | on |
| pronoun siebie | siebie | accusative | siebie |
| non-past form | fin | infinitive | czytać |
| future być | bedzie | infinitive | być |
| agglutinate być | aglt | infinitive | być |
| l-participle | praet | infinitive | czytać |
| imperative | impt | infinitive | czytać |
| impersonal | imps | infinitive | czytać |
| infinitive | inf | infinitive | czytać |
| contemporary adv. participle | pcon | infinitive | czytać |
| anterior adv. participle | pant | infinitive | czytać |
| gerund | ger | infinitive | czytać |
| active adj. participle | pact | infinitive | czytać |
| passive adj. participle | ppas | infinitive | czytać |
| winien | winien | singular masculine form | powinien, rad |
| predicative | pred | the only form of that flexeme | warto |
| preposition | prep | the non-vocalic form of that flexeme | na, przez, w |
| coordinating conjunction | conj | the only form of that flexeme | oraz |
| subordinating conjunction | comp | the only form of that flexeme | że |
| particle-adverb | qub | the only form of that flexeme | nie, -że, się |
| abbreviation | brev | the full dictionary form | rok, i tak dalej |
| bound word | burk | the only form of that flexeme | trochu, oścież |
| interjection | interj | the only form of that flexeme | ech, kurde |
| punctuation | interp | the only form of that flexeme | ;, ., (, ] |
| alien | xxx | the only form of that flexeme | cool, nihil |

Intended uses & limitations

Although good POS-tagging tools already exist for Polish (e.g. Morfeusz, http://morfeusz.sgjp.pl/), I needed a Polish POS tagger that could be easily loaded inside the browser. Hugging Face models support this kind of deployment, which is why I created this model.

Training and evaluation data

The model was trained on half of the test data of the nkjp1m dataset (~0.5 million tokens).
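
The exact preprocessing lives in the linked notebook; purely for illustration, carving such a split out of the dataset could look like the sketch below (the Hub dataset id and split name are assumptions):

from datasets import load_dataset

ds = load_dataset("nkjp1m")  # dataset id assumed from the card
# take half of the test split for training, as described above
half = ds["test"].train_test_split(test_size=0.5, seed=42)
train_data, eval_data = half["train"], half["test"]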

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 2e-05
  • train_batch_size: 16
  • eval_batch_size: 16
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 3
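
The full training code is in the notebook linked above; a minimal Trainer setup matching these hyperparameters might look like this sketch (train_data and eval_data are the hypothetical splits from the previous section, tokenized and label-aligned beforehand):

from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("sdadas/polish-roberta-base-v2")
model = AutoModelForTokenClassification.from_pretrained(
    "sdadas/polish-roberta-base-v2", num_labels=40
)

args = TrainingArguments(
    output_dir="polish-roberta-base-v2-pos-tagging",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_data,  # hypothetical, prepared elsewhere
    eval_dataset=eval_data,
    tokenizer=tokenizer,
)
trainer.train()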

Training results

| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 | Accuracy |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0.0665 | 1.0 | 2155 | 0.0629 | 0.9835 | 0.9836 | 0.9836 | 0.9867 |
| 0.0369 | 2.0 | 4310 | 0.0539 | 0.9842 | 0.9848 | 0.9845 | 0.9876 |
| 0.0243 | 3.0 | 6465 | 0.0508 | 0.9853 | 0.9858 | 0.9856 | 0.9884 |
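
These metrics match what a standard seqeval-based compute_metrics for token classification reports; a sketch of such a function follows (an assumption, not necessarily the exact code from the notebook; id2label is the 40-class mapping listed in the model description):

import numpy as np
import evaluate

seqeval = evaluate.load("seqeval")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    # ignore special tokens, which carry the label -100
    true_preds = [
        [id2label[p] for p, l in zip(pred, lab) if l != -100]
        for pred, lab in zip(predictions, labels)
    ]
    true_labels = [
        [id2label[l] for p, l in zip(pred, lab) if l != -100]
        for pred, lab in zip(predictions, labels)
    ]
    results = seqeval.compute(predictions=true_preds, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }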

Framework versions

  • Transformers 4.36.0
  • Pytorch 2.1.0+cu118
  • Datasets 2.15.0
  • Tokenizers 0.15.0