---
language: pl
tags:
  - T5
  - lemmatization
license: apache-2.0
---


# PoLemma Small

PoLemma models are intended for lemmatization of named entities and multi-word expressions in the Polish language.

They were fine-tuned from the allegro/plT5 models, for example [allegro/plt5-small](https://huggingface.co/allegro/plt5-small).

## Usage

Sample usage:

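```python
from transformers import pipeline

# Load the lemmatizer as a text2text-generation pipeline.
pipe = pipeline(task="text2text-generation", model="amu-cai/polemma-small", tokenizer="amu-cai/polemma-small")

# The pipeline takes a list of phrases and returns one result dict per phrase.
results = pipe(["federalnego urzędu statystycznego"], clean_up_tokenization_spaces=True, num_beams=5)
hyp = results[0]["generated_text"]
```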
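
To lemmatize several phrases in one call, the sample above can be wrapped in a small helper. This is a minimal sketch, not part of the released model: the function name `lemmatize_batch` and its `generator` parameter are hypothetical, and it assumes the pipeline returns one dict with a `generated_text` key per input phrase, as in the sample usage.

```python
from typing import Callable, Dict, List


def lemmatize_batch(
    generator: Callable[..., List[Dict[str, str]]],
    phrases: List[str],
    num_beams: int = 5,
) -> List[str]:
    """Return one lemma per input phrase, in input order.

    `generator` is assumed to behave like the text2text-generation
    pipeline from the sample usage (hypothetical helper, not an
    official API of the model).
    """
    if not phrases:
        # Skip the model call entirely for an empty batch.
        return []
    results = generator(
        phrases,
        clean_up_tokenization_spaces=True,
        num_beams=num_beams,
    )
    # Each result is expected to be a dict holding the generated lemma.
    return [res["generated_text"].strip() for res in results]


# Usage with the pipeline created in the sample above:
#   lemmas = lemmatize_batch(pipe, ["federalnego urzędu statystycznego"])
```

Passing the pipeline in as an argument keeps the helper independent of how the model is loaded, so the same function should work with the base and large variants.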


## Evaluation results

Lemmatization Exact Match was computed on the SlavNER 2021 test set.

| Model | Exact Match |
| :------ | ------: |
| [polemma-large](https://huggingface.co/amu-cai/polemma-large) | 92.61 |
| [polemma-base](https://huggingface.co/amu-cai/polemma-base) | 91.34 |
| [polemma-small](https://huggingface.co/amu-cai/polemma-small) | 88.46 |
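
For readers who want to score their own predictions, an exact-match metric can be sketched as follows. This is an illustration under stated assumptions, not the evaluation script behind the table: the function name `exact_match` is hypothetical, the comparison is case-sensitive after trimming surrounding whitespace, and the score is reported as a percentage. The normalization used in the SlavNER evaluation may differ.

```python
from typing import Sequence


def exact_match(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Percentage of predictions identical to their reference lemma.

    Assumption: strings are compared case-sensitively after stripping
    leading and trailing whitespace; no other normalization is applied.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        # Define the score of an empty evaluation set as 0.0.
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)
```

A call such as `exact_match(model_outputs, gold_lemmas)` then yields a single number comparable in form to the scores in the table, provided the same normalization is applied to both lists.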

## Citation

If you use the model, please cite the following paper:

```
@inproceedings{palka-nowakowski-2023-exploring,
    title = "Exploring the Use of Foundation Models for Named Entity Recognition and Lemmatization Tasks in {S}lavic Languages",
    author = "Pa{\l}ka, Gabriela  and
      Nowakowski, Artur",
    booktitle = "Proceedings of the 9th Workshop on Slavic Natural Language Processing 2023 (SlavicNLP 2023)",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.bsnlp-1.19",
    pages = "165--171",
    abstract = "This paper describes Adam Mickiewicz University{'}s (AMU) solution for the 4th Shared Task on SlavNER. The task involves the identification, categorization, and lemmatization of named entities in Slavic languages. Our approach involved exploring the use of foundation models for these tasks. In particular, we used models based on the popular BERT and T5 model architectures. Additionally, we used external datasets to further improve the quality of our models. Our solution obtained promising results, achieving high metrics scores in both tasks. We describe our approach and the results of our experiments in detail, showing that the method is effective for NER and lemmatization in Slavic languages. Additionally, our models for lemmatization will be available at: https://huggingface.co/amu-cai.",
}
```

### Framework versions

- Transformers 4.26.0
- PyTorch 1.13.1.post200
- Datasets 2.9.0
- Tokenizers 0.13.2