metadata
language:
  - pl
  - cs
  - ru
tags:
  - mT5
  - lemmatization
license: apache-2.0

SlavLemma Large

SlavLemma models are intended for lemmatization of named entities and multi-word expressions in Polish, Czech, and Russian.

They were fine-tuned from Google's mT5 models, e.g. google/mt5-large.

Usage

When using the model, prepend one of the language tokens (>>pl<<, >>cs<<, >>ru<<) to the input, based on the language of the phrase you want to lemmatize.

Sample usage:

from transformers import pipeline

pipe = pipeline(task="text2text-generation", model="amu-cai/slavlemma-large", tokenizer="amu-cai/slavlemma-large")
results = pipe([">>pl<< federalnego urzędu statystycznego"], clean_up_tokenization_spaces=True, num_beams=5)
hyp = results[0]["generated_text"]
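
For more than one phrase, the language-token step can be wrapped in a small helper. The following is a minimal sketch, not part of the released code: the names `LANG_TOKENS`, `with_language_token`, and `lemmatize` are hypothetical, and `pipe` is assumed to be a text2text-generation pipeline created as in the sample usage.

```python
# Hypothetical helpers (not part of the model repository).
# Language tokens as listed in the Usage section.
LANG_TOKENS = {"pl": ">>pl<<", "cs": ">>cs<<", "ru": ">>ru<<"}


def with_language_token(phrase, lang):
    """Prepend the language token for `lang` to `phrase`."""
    if lang not in LANG_TOKENS:
        raise ValueError(
            f"unsupported language {lang!r}; expected one of {sorted(LANG_TOKENS)}"
        )
    return f"{LANG_TOKENS[lang]} {phrase.strip()}"


def lemmatize(pipe, phrases, lang, num_beams=5):
    """Lemmatize a list of phrases that are all in the same language.

    `pipe` is any callable that behaves like the pipeline in the sample
    usage: it takes a list of strings and returns a list of dicts with a
    "generated_text" key.
    """
    inputs = [with_language_token(p, lang) for p in phrases]
    outputs = pipe(inputs, clean_up_tokenization_spaces=True, num_beams=num_beams)
    return [out["generated_text"] for out in outputs]
```

With the pipeline from the sample usage, a call would look like `lemmatize(pipe, ["federalnego urzędu statystycznego"], "pl")`. Validating the language code up front avoids silently sending the model an input without a language token.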

Evaluation results

Lemmatization Exact Match was computed on the SlavNER 2021 test sets (COVID-19 and USA 2020 Elections).

COVID-19:

| Model | pl | cs | ru |
|---|---|---|---|
| slavlemma-large | 93.76 | 89.80 | 77.30 |
| slavlemma-base | 91.00 | 86.29 | 76.10 |
| slavlemma-small | 86.80 | 80.98 | 73.83 |

USA 2020 Elections:

| Model | pl | cs | ru |
|---|---|---|---|
| slavlemma-large | 89.12 | 87.27 | 82.50 |
| slavlemma-base | 84.19 | 81.97 | 80.27 |
| slavlemma-small | 78.85 | 75.86 | 76.18 |
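
As a rough illustration of the metric, Exact Match can be computed as the percentage of predicted lemmas that are identical to the reference lemmas. The sketch below is an assumption about how such a score is typically computed, not the evaluation script used for the numbers above; the function name `exact_match`, the whitespace normalization, and the `ignore_case` option are all hypothetical, and the README does not state whether the reported scores are case-sensitive.

```python
def exact_match(predictions, references, ignore_case=False):
    """Return the percentage of predictions equal to their reference.

    Assumption: strings are compared after collapsing runs of whitespace;
    case is only ignored when `ignore_case` is True.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute exact match on an empty set")

    def normalize(text):
        text = " ".join(text.split())
        return text.casefold() if ignore_case else text

    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)


# Toy, hand-written strings for illustration only (not model outputs).
score = exact_match(["Praha", "Warszawa"], ["Praha", "Warszawy"])
```

Here one of the two toy predictions matches its reference, so `score` is 50.0.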

Citation

If you use the model, please cite the following paper:

TBD

Framework versions

  • Transformers 4.26.0
  • PyTorch 1.13.1.post200
  • Datasets 2.9.0
  • Tokenizers 0.13.2
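
To compare a local environment against the versions listed above, something like the following sketch may help. It only uses the standard-library `importlib.metadata`; the names `LISTED`, `installed_versions`, and `report` are hypothetical, and it assumes the PyPI distribution names `transformers`, `torch`, `datasets`, and `tokenizers`. Other versions may work as well; the README does not say which are required.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed in this README, keyed by PyPI distribution name.
# "torch" is the distribution name for PyTorch.
LISTED = {
    "transformers": "4.26.0",
    "torch": "1.13.1.post200",
    "datasets": "2.9.0",
    "tokenizers": "0.13.2",
}


def installed_versions(packages=LISTED):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


def report(packages=LISTED):
    """Return one human-readable line per package."""
    lines = []
    for name, have in installed_versions(packages).items():
        want = packages[name]
        status = "same" if have == want else "differs"
        lines.append(f"{name}: listed {want}, installed {have} ({status})")
    return lines
```

Calling `print("\n".join(report()))` prints one line per package, which makes version mismatches easy to spot when reproducing results.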