
polish-roberta-base-v2-pos-tagging

This model is a fine-tuned version of sdadas/polish-roberta-base-v2 on the nkjp1m dataset. It achieves the following results on the evaluation set:

  • Loss: 0.0508
  • Precision: 0.9853
  • Recall: 0.9858
  • F1: 0.9856
  • Accuracy: 0.9884

You can find the training notebook here: https://github.com/WikKam/roberta-pos-finetuning

Usage

from transformers import pipeline

# load the tagger from the Hugging Face Hub
nlp = pipeline("token-classification", "wkaminski/polish-roberta-base-v2-pos-tagging")

# "Ale dzisiaj leje" is Polish for, roughly, "It's really pouring today"
nlp("Ale dzisiaj leje")

Model description

This model is a part-of-speech tagger for the Polish language based on sdadas/polish-roberta-base-v2.

It supports 40 classes, each representing a flexemic class (a detailed part of speech):

{
 0: 'adj',
 1: 'adja',
 2: 'adjc',
 3: 'adjp',
 4: 'adv',
 5: 'aglt',
 6: 'bedzie',
 7: 'brev',
 8: 'comp',
 9: 'conj',
 10: 'depr',
 11: 'dig',
 12: 'fin',
 13: 'frag',
 14: 'ger',
 15: 'imps',
 16: 'impt',
 17: 'inf',
 18: 'interj',
 19: 'interp',
 20: 'num',
 21: 'numcomp',
 22: 'pact',
 23: 'pacta',
 24: 'pant',
 25: 'part',
 26: 'pcon',
 27: 'ppas',
 28: 'ppron12',
 29: 'ppron3',
 30: 'praet',
 31: 'pred',
 32: 'prep',
 33: 'romandig',
 34: 'siebie',
 35: 'subst',
 36: 'sym',
 37: 'winien',
 38: 'xxs',
 39: 'xxx'
}
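
The same mapping should also be stored in the model configuration, so it can be read programmatically instead of being hard-coded (a minimal sketch):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("wkaminski/polish-roberta-base-v2-pos-tagging")
print(config.id2label)  # {0: 'adj', 1: 'adja', ...}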

The tags have the same meaning as in the nkjp1m dataset:

| flexeme | abbreviation | base form | example |
| --- | --- | --- | --- |
| noun | subst | singular nominative | profesor |
| depreciative form | depr | singular nominative form of the corresponding noun | profesor |
| main numeral | num | inanimate masculine nominative form | pięć, dwa |
| collective numeral | numcol | inanimate masculine nominative form of the main numeral | pięć, dwa |
| adjective | adj | singular nominative masculine positive form | polski |
| ad-adjectival adjective | adja | singular nominative masculine positive form of the adjective | polski |
| post-prepositional adjective | adjp | singular nominative masculine positive form of the adjective | polski |
| predicative adjective | adjc | singular nominative masculine positive form of the adjective | zdrowy, ciekawy |
| adverb | adv | positive form | dobrze, bardzo |
| non-3rd person pronoun | ppron12 | singular nominative | ja |
| 3rd-person pronoun | ppron3 | singular nominative | on |
| pronoun siebie | siebie | accusative | siebie |
| non-past form | fin | infinitive | czytać |
| future być | bedzie | infinitive | być |
| agglutinate być | aglt | infinitive | być |
| l-participle | praet | infinitive | czytać |
| imperative | impt | infinitive | czytać |
| impersonal | imps | infinitive | czytać |
| infinitive | inf | infinitive | czytać |
| contemporary adv. participle | pcon | infinitive | czytać |
| anterior adv. participle | pant | infinitive | czytać |
| gerund | ger | infinitive | czytać |
| active adj. participle | pact | infinitive | czytać |
| passive adj. participle | ppas | infinitive | czytać |
| winien | winien | singular masculine form | powinien, rad |
| predicative | pred | the only form of that flexeme | warto |
| preposition | prep | the non-vocalic form of that flexeme | na, przez, w |
| coordinating conjunction | conj | the only form of that flexeme | oraz |
| subordinating conjunction | comp | the only form of that flexeme | że |
| particle-adverb | qub | the only form of that flexeme | nie, -że, się |
| abbreviation | brev | the full dictionary form | rok, i tak dalej |
| bound word | burk | the only form of that flexeme | trochu, oścież |
| interjection | interj | the only form of that flexeme | ech, kurde |
| punctuation | interp | the only form of that flexeme | ;, ., (, ] |
| alien | xxx | the only form of that flexeme | cool, nihil |

Intended uses & limitations

Although good POS-tagging tools already exist for Polish (e.g. Morfeusz, http://morfeusz.sgjp.pl/), I needed a Polish POS tagger that could be easily loaded inside the browser. Hugging Face models support this kind of deployment, which is why I created this model.

Training and evaluation data

The model was trained on half of the test data of the nkjp1m dataset (~0.5 million tokens).
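
The exact preprocessing lives in the linked notebook; purely for illustration, carving such a split out of the dataset could look like the sketch below (the Hub dataset id and split name are assumptions):

from datasets import load_dataset

ds = load_dataset("nkjp1m")  # dataset id assumed from the card
# take half of the test split for training, as described above
half = ds["test"].train_test_split(test_size=0.5, seed=42)
train_data, eval_data = half["train"], half["test"]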

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 2e-05
  • train_batch_size: 16
  • eval_batch_size: 16
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 3
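
The full training code is in the notebook linked above; a minimal Trainer setup matching these hyperparameters might look like this sketch (train_data and eval_data are the hypothetical splits from the previous section, tokenized and label-aligned beforehand):

from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("sdadas/polish-roberta-base-v2")
model = AutoModelForTokenClassification.from_pretrained(
    "sdadas/polish-roberta-base-v2", num_labels=40
)

args = TrainingArguments(
    output_dir="polish-roberta-base-v2-pos-tagging",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_data,  # hypothetical, prepared elsewhere
    eval_dataset=eval_data,
    tokenizer=tokenizer,
)
trainer.train()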

Training results

| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 | Accuracy |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0.0665 | 1.0 | 2155 | 0.0629 | 0.9835 | 0.9836 | 0.9836 | 0.9867 |
| 0.0369 | 2.0 | 4310 | 0.0539 | 0.9842 | 0.9848 | 0.9845 | 0.9876 |
| 0.0243 | 3.0 | 6465 | 0.0508 | 0.9853 | 0.9858 | 0.9856 | 0.9884 |
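
These metrics match what a standard seqeval-based compute_metrics for token classification reports; a sketch of such a function follows (an assumption, not necessarily the exact code from the notebook; id2label is the 40-class mapping listed in the model description):

import numpy as np
import evaluate

seqeval = evaluate.load("seqeval")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    # ignore special tokens, which carry the label -100
    true_preds = [
        [id2label[p] for p, l in zip(pred, lab) if l != -100]
        for pred, lab in zip(predictions, labels)
    ]
    true_labels = [
        [id2label[l] for p, l in zip(pred, lab) if l != -100]
        for pred, lab in zip(predictions, labels)
    ]
    results = seqeval.compute(predictions=true_preds, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }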

Framework versions

  • Transformers 4.36.0
  • Pytorch 2.1.0+cu118
  • Datasets 2.15.0
  • Tokenizers 0.15.0