---
library_name: transformers
tags:
- ner
- msu
- wiki
- fine-tuned
datasets:
- RCC-MSU/collection3
language:
- ru
metrics:
- precision
- recall
- f1
base_model:
- Babelscape/wikineural-multilingual-ner
pipeline_tag: token-classification
---

# Fine-tuned multilingual model for Russian language NER

This is the model card for a fine-tuned version of [Babelscape/wikineural-multilingual-ner](https://huggingface.co/Babelscape/wikineural-multilingual-ner), which is built on multilingual BERT (mBERT).

I fine-tuned it on the [RCC-MSU/collection3](https://huggingface.co/datasets/RCC-MSU/collection3) dataset for the token-classification task. The dataset uses the BIO tagging scheme with the following labels:

```python
label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
```

## Model Details

Fine-tuning ran for 3 epochs and produced the following metrics:

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
| ----- | ------------- | --------------- | --------- | ------ | -- | -------- |
| 1 | 0.041000 | 0.032810 | 0.959569 | 0.974253 | 0.966855 | 0.993325 |
| 2 | 0.020800 | 0.028395 | 0.959569 | 0.974253 | 0.966855 | 0.993325 |
| 3 | 0.010500 | 0.029138 | 0.963239 | 0.973767 | 0.968474 | 0.993247 |

To avoid overfitting on the small number of training samples, I used a high weight decay (`weight_decay = 0.1`).

## Basic usage

You can use this model with the `pipeline` API for the `token-classification` task:
```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

model_ckpt = "nesemenpolkov/msu-wiki-ner"

# Label set used during fine-tuning (BIO scheme).
label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
id2label = {i: label for i, label in enumerate(label_names)}
label2id = {v: k for k, v in id2label.items()}

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForTokenClassification.from_pretrained(
    model_ckpt,
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True
)

# aggregation_strategy="simple" merges sub-word tokens into whole entity spans.
pipe = pipeline(
    task="token-classification",
    model=model,
    tokenizer=tokenizer,
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
    aggregation_strategy="simple"
)

demo_sample = "Этот Иван Иванов, в паспорте Иванов И.И."

with torch.no_grad():
    out = pipe(demo_sample)

# Each item is a dict with the entity group, score, text, and character offsets.
print(out)
```

## Bias, Risks, and Limitations

This model is a fine-tuned version of [Babelscape/wikineural-multilingual-ner](https://huggingface.co/Babelscape/wikineural-multilingual-ner), trained on the Russian-language NER dataset [RCC-MSU/collection3](https://huggingface.co/datasets/RCC-MSU/collection3). It may score poorly on texts in other languages.

## Citation

```
@inproceedings{tedeschi-etal-2021-wikineural-combined,
    title = "Fine-tuned multilingual model for russian language NER.",
    author = "nesemenpolkov",
    booktitle = "Detecting names in noisy and dirty data.",
    month = oct,
    year = "2024",
    address = "Moscow, Russian Federation",
}
```
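
## Decoding BIO tags

The label set listed at the top of this card follows the BIO scheme: `B-` opens an entity, `I-` continues it, and `O` marks a token outside any entity. The sketch below shows one way to merge per-token tags into entity spans. It is only an illustration: `decode_bio` is a hypothetical helper that is not part of this model or of `transformers`, the tokens and tags are hand-written rather than model output, and treating a stray `I-` tag as the start of a new entity is an assumption of this sketch.

```python
from typing import List, Tuple

# Label set from this model card.
label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge BIO-tagged tokens into (entity_type, text) spans.

    Hypothetical helper written for illustration only.
    """
    spans: List[Tuple[str, str]] = []
    current_type = None
    current_tokens: List[str] = []

    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition("-")

        # An I- tag extends the open span only if the entity type matches.
        if prefix == "I" and entity_type == current_type:
            current_tokens.append(token)
            continue

        # Any other tag closes the open span, if there is one.
        if current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []

        # B- opens a new span; a stray I- is treated like B- (assumption).
        if prefix in ("B", "I"):
            current_type, current_tokens = entity_type, [token]

    # Flush a span that runs to the end of the sentence.
    if current_type is not None:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Hand-written example, not model output.
tokens = ["Иван", "Иванов", "работает", "в", "МГУ", "в", "Москве"]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "O", "B-LOC"]
assert all(tag in label_names for tag in tags)

entities = decode_bio(tokens, tags)
print(entities)
```

In the usage example, `aggregation_strategy="simple"` makes the pipeline perform this kind of grouping itself, so a helper like this is only relevant when working with raw per-token predictions.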