---
language:
- multilingual
- pl
- ru
- uk
- bg
- cs
- sl
datasets:
- SlavicNER
license: apache-2.0
library_name: transformers
pipeline_tag: text2text-generation
tags:
- lemmatization
widget:
- text: "pl:Polsce"
  example_title: "Polish"
- text: "cs:Velké Británii"
  example_title: "Czech"
- text: "bg:българите"
  example_title: "Bulgarian"
- text: "ru:Великобританию"
  example_title: "Russian"
- text: "sl:evropske komisije"
  example_title: "Slovene"
- text: "uk:Європейського агентства лікарських засобів"
  example_title: "Ukrainian"
---

# Model description

This is a baseline model for named entity **lemmatization**, trained on the single-topic-out split of the
[SlavicNER corpus](https://github.com/SlavicNLP/SlavicNER).

# Resources and Technical Documentation

- Paper: [Cross-lingual Named Entity Corpus for Slavic Languages](https://arxiv.org/pdf/2404.00482), published in the Proceedings of LREC-COLING 2024.
- Annotation guidelines: https://arxiv.org/pdf/2404.00482
- SlavicNER Corpus: https://github.com/SlavicNLP/SlavicNER

# Evaluation

*Will appear soon*

# Usage

You can use this model directly with a pipeline for text2text generation:

```python
from transformers import pipeline

model_name = "SlavicNLP/slavicner-lemma-single-out-large"
pipe = pipeline("text2text-generation", model_name)

# Each input is a language code, a colon, and the inflected entity mention.
texts = ["pl:Polsce", "cs:Velké Británii", "bg:българите", "ru:Великобританию", "sl:evropske komisije",
         "uk:Європейського агентства лікарських засобів"]

outputs = pipe(texts)

lemmas = [o['generated_text'] for o in outputs]
print(lemmas)
# ['Polska', 'Velká Británie', 'българи', 'Великобритания', 'evropska komisija', 'Європейське агентство лікарських засобів']
```

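The inputs in the example above all follow the pattern `<language code>:<mention>`. The sketch below shows one way to build such strings from raw mentions before passing them to the pipeline. The helper `build_input` and the constant `SUPPORTED_LANGUAGES` are hypothetical names introduced here for illustration; they are not part of the model or of `transformers`, and the prefix format is inferred from the examples on this card rather than from a formal specification.

```python
# Language codes listed in this card's metadata (assumption: these are the
# only prefixes the model was trained with).
SUPPORTED_LANGUAGES = {"pl", "ru", "uk", "bg", "cs", "sl"}


def build_input(lang: str, mention: str) -> str:
    """Format a mention as '<lang>:<mention>', the shape used in the examples above."""
    if lang not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Unsupported language code: {lang!r}")
    # Strip surrounding whitespace so the prefix is followed directly by the mention.
    return f"{lang}:{mention.strip()}"


mentions = [("pl", "Polsce"), ("cs", "Velké Británii")]
texts = [build_input(lang, mention) for lang, mention in mentions]
print(texts)  # ['pl:Polsce', 'cs:Velké Británii']
```

The resulting `texts` list can be passed to `pipe(texts)` exactly as in the example above.
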
# Citation

```latex
@inproceedings{piskorski-etal-2024-cross-lingual,
    title = "Cross-lingual Named Entity Corpus for {S}lavic Languages",
    author = "Piskorski, Jakub  and
      Marci{\'n}czuk, Micha{\l}  and
      Yangarber, Roman",
    editor = "Calzolari, Nicoletta  and
      Kan, Min-Yen  and
      Hoste, Veronique  and
      Lenci, Alessandro  and
      Sakti, Sakriani  and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italy",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.369",
    pages = "4143--4157",
    abstract = "This paper presents a corpus manually annotated with named entities for six Slavic languages {---} Bulgarian, Czech, Polish, Slovenian, Russian,
      and Ukrainian. This work is the result of a series of shared tasks, conducted in 2017{--}2023 as a part of the Workshops on Slavic Natural
      Language Processing. The corpus consists of 5,017 documents on seven topics. The documents are annotated with five classes of named entities.
      Each entity is described by a category, a lemma, and a unique cross-lingual identifier. We provide two train-tune dataset splits
      {---} single topic out and cross topics. For each split, we set benchmarks using a transformer-based neural network architecture
      with the pre-trained multilingual models {---} XLM-RoBERTa-large for named entity mention recognition and categorization,
      and mT5-large for named entity lemmatization and linking.",
}
```