Fine-tuned multilingual model for Russian-language NER
This is the model card for a fine-tuned version of Babelscape/wikineural-multilingual-ner, which is built on multilingual BERT (mBERT). I fine-tuned it on the RCC-MSU/collection3 dataset for the token-classification task. The dataset uses the BIO tagging scheme with the following labels:
label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
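To make the BIO scheme concrete, here is a minimal sketch of how a tag sequence over these labels collapses into entity spans. The helper name `bio_to_spans` and the sample sentence are illustrative assumptions, not part of the model or the dataset tooling; the sketch is also lenient in that a stray `I-` tag is treated as the start of a new span.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Collapse BIO tags into (entity_type, start, end) spans; end is exclusive."""
    spans = []
    current_type, start = None, 0
    for i, tag in enumerate(tags):
        # An I- tag continues the open span only if the entity type matches.
        if tag.startswith("I-") and current_type == tag[2:]:
            continue
        # Any other tag closes the open span, if there is one.
        if current_type is not None:
            spans.append((current_type, start, i))
            current_type = None
        # B- opens a new span; a stray I- is also treated as a span start.
        if tag != "O":
            current_type, start = tag[2:], i
    # Close a span that runs to the end of the sequence.
    if current_type is not None:
        spans.append((current_type, start, len(tags)))
    return spans


# Hypothetical, hand-tagged example sentence (not taken from the dataset).
tokens = ["Иван", "Иванов", "работает", "в", "МГУ", "в", "Москве"]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "O", "B-LOC"]

spans = bio_to_spans(tags)
entities = [(label, " ".join(tokens[s:e])) for label, s, e in spans]
print(entities)
```

Each span is a half-open token range, so `tokens[s:e]` recovers the entity text directly.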
Model Details
Fine-tuning ran for 3 epochs and produced the following metrics:
Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
---|---|---|---|---|---|---|
1 | 0.041000 | 0.032810 | 0.959569 | 0.974253 | 0.966855 | 0.993325 |
2 | 0.020800 | 0.028395 | 0.959569 | 0.974253 | 0.966855 | 0.993325 |
3 | 0.010500 | 0.029138 | 0.963239 | 0.973767 | 0.968474 | 0.993247 |
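As a quick sanity check on the table, F1 is the harmonic mean of precision and recall, so the F1 column can be recomputed from the other two. The sketch below does this for epochs 1 and 3; the function name `f1_score` is only illustrative, and the card does not say which library computed the reported metrics.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
reported = {
    1: (0.959569, 0.974253, 0.966855),
    3: (0.963239, 0.973767, 0.968474),
}

for epoch, (p, r, f1) in reported.items():
    computed = f1_score(p, r)
    print(f"epoch {epoch}: computed F1 = {computed:.6f}, reported F1 = {f1:.6f}")
```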
To reduce over-fitting on the small training set, I used a high weight decay (weight_decay = 0.1).
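For readers who want to reproduce a similar setup, the two hyperparameters stated in this card (3 epochs, weight decay 0.1) could be passed to the transformers `Trainer` roughly as follows. This is a sketch only: the output directory is a hypothetical name, and every other setting (learning rate, batch size, and so on) is left at the library default because the card does not report it.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="msu-wiki-ner-finetune",  # hypothetical path, not from the card
    num_train_epochs=3,                  # stated in this card
    weight_decay=0.1,                    # stated in this card
    # All other hyperparameters are library defaults (assumption: the
    # original run may have used different values).
)
```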
Basic usage
You can use this model through the transformers pipeline for the 'token-classification' task:
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

model_ckpt = "nesemenpolkov/msu-wiki-ner"

# Label set used for fine-tuning (BIO scheme)
label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
id2label = {i: label for i, label in enumerate(label_names)}
label2id = {v: k for k, v in id2label.items()}

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForTokenClassification.from_pretrained(
    model_ckpt,
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True
)

# aggregation_strategy="simple" groups adjacent tokens of the same entity into one span
pipe = pipeline(
    task="token-classification",
    model=model,
    tokenizer=tokenizer,
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
    aggregation_strategy="simple"
)

demo_sample = "Этот Иван Иванов, в паспорте Иванов И.И."
with torch.no_grad():
    out = pipe(demo_sample)
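With an aggregation strategy set, the pipeline returns a list of dictionaries with the keys `entity_group`, `score`, `word`, `start` and `end`. A minimal sketch of post-processing that output is shown below. The helper `group_entities`, the 0.5 threshold, and the values in `example_out` are all illustrative assumptions: `example_out` is hand-written to show the shape of the result and is not real output from this model. The sketch also assumes a fast tokenizer, so that `start` and `end` are character offsets rather than `None`.

```python
from collections import defaultdict
from typing import Dict, List


def group_entities(text: str, results: List[dict], min_score: float = 0.5) -> Dict[str, List[str]]:
    """Group entity surface forms by type, dropping low-confidence predictions."""
    grouped = defaultdict(list)
    for item in results:
        if item["score"] >= min_score:
            # Slice the original text by character offsets instead of using
            # item["word"], which may contain tokenization artifacts.
            grouped[item["entity_group"]].append(text[item["start"]:item["end"]])
    return dict(grouped)


demo_sample = "Этот Иван Иванов, в паспорте Иванов И.И."

# Hand-written stand-in for `out`; scores are made up for illustration.
example_out = [
    {"entity_group": "PER", "score": 0.99, "word": "Иван Иванов", "start": 5, "end": 16},
    {"entity_group": "PER", "score": 0.42, "word": "Иванов И. И.", "start": 29, "end": 40},
]

print(group_entities(demo_sample, example_out, min_score=0.5))
print(group_entities(demo_sample, example_out, min_score=0.0))
```

In real use, pass `out` from the pipeline call in place of `example_out`.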
Bias, Risks, and Limitations
This model is a fine-tuned version of Babelscape/wikineural-multilingual-ner, trained on the Russian-language NER dataset RCC-MSU/collection3. It may score lower on texts in other languages.
Citation
@inproceedings{tedeschi-etal-2021-wikineural-combined,
    title = "Fine-tuned multilingual model for russian language NER.",
    author = "nesemenpolkov",
    booktitle = "Detecting names in noisy and dirty data.",
    month = oct,
    year = "2024",
    address = "Moscow, Russian Federation",
}