Model Card for liminovna/KazRusCSW_mbert

THE MODEL CARD IS CURRENTLY IN PROGRESS!

This is mBERT fine-tuned for token-level language identification. The model was trained on the liminovna/KazRusCSW-G-T dataset, which consists of texts (mainly comments) from social media, specifically Telegram and YouTube. The dataset was annotated manually.

The model predicts the following tags:

kz -- Kazakh word
ru -- Russian word
skz -- Kazakh word transliterated into Cyrillic script without Kazakh-specific characters
mixed_kz-ru, mixed_ru-kz -- words of hybrid origin, i.e. Kazakh root + Russian inflection, or vice versa
ambig -- word that exists in both Kazakh and Russian
other -- words from another language
univ -- punctuation and masks ([EMOJI], [HASHTAG], [LINK], [NUMBER])

Model Details

Model Description

This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.

Developed by: [More Information Needed]
Model type: [More Information Needed]
Language(s) (NLP): [More Information Needed]
License: [More Information Needed]
Finetuned from model [optional]: [More Information Needed]

Model Sources [optional]

Repository: [More Information Needed]
Paper [optional]: [More Information Needed]
Demo [optional]: [More Information Needed]

Uses

This model's initial purpose was to annotate the Kazakh-Russian code-switching corpus (yet to be published). The broader aim of this model is to annotate tokens in short texts such as comments or posts in the Kazakh-Russian bilingual social media segment.

Bias, Risks, and Limitations

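This is a study project, and the model's quality is uneven. As the Metrics section below shows, it is reliable on the frequent tags (kz, ru, univ; F1 above 0.96) but weak on the rare ones: mixed_kz-ru reaches an F1 of only 0.24, and mixed_ru-kz is never predicted correctly (F1 0.00).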

Recommendations

The model has only gone through limited testing.

How to Get Started with the Model

# log in to the Hugging Face Hub (the repository is gated: accept its access conditions on the model page first)
from huggingface_hub import notebook_login

notebook_login()

# all the necessary imports
import torch
from transformers import AutoModelForTokenClassification
from transformers import AutoTokenizer
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# loading the model
finetuned_model_path = 'liminovna/KazRusCSW_mbert'

# tokenizer has been supplemented with a few special tokens, such as `['[MENTION]', '[NUMBER]', '[HASHTAG]', '[EMOJI]', '[LINK]', '[EMAIL]', '\\n']`
tokenizer = AutoTokenizer.from_pretrained(finetuned_model_path)

model = AutoModelForTokenClassification.from_pretrained(finetuned_model_path).to(device)

examples = ['Сен вообще көрдіңба? Қандай ақпарат тарағанын?', 'Екеуі де күшті ғойй,особенно Тілеген прям унайды ше сөйлегені']

import torch.nn.functional as F
inputs = tokenizer(examples, return_tensors="pt", truncation=True, padding=True).to(device)
with torch.no_grad():
    res = model(**inputs)

input_ids = inputs['input_ids'] # token ids, shape (batch, seq_len)
probas, label_ids = torch.max(F.softmax(res.logits, dim=-1), dim=-1) # max class probability and its label id for each token of every example

# printing the results
for i in range(len(input_ids)): # for each example
    example_tokens = tokenizer.convert_ids_to_tokens(input_ids[i]) # tokens
    example_probas = probas[i].tolist() # probabilities
    example_labels = list(map(model.config.id2label.get, label_ids[i].tolist())) # converting tag ids into tag names

    res_words = []
    res_labels = []
    res_probas = []

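    # zip order must match the unpacking order: token, label, probability
    for t, l, p in zip(example_tokens, example_labels, example_probas):
        if t.startswith('##'): # continuation piece: glue it onto the previous word, which keeps the tag of its first piece
            res_words[-1] = res_words[-1] + t[2:]
        elif t not in ['[SEP]', '[PAD]', '[CLS]']: # ignore certain special tokens
            res_words.append(t)
            res_labels.append(l)
            res_probas.append(p)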

    print(f'Example {i}:')
    print('Tokens:', example_tokens) 
    print('Tagged words:', list(zip(res_words, res_labels, res_probas)))
    print('='*80)
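
The word-merging step in the loop above can be tried without downloading the model. The following is a minimal sketch on hand-written inputs: the function name `merge_wordpieces`, the WordPiece split, and the labels and probabilities are all hypothetical and chosen only for illustration. It shows that a merged word keeps the tag and probability of its first piece, while the predictions for `##` continuation pieces are discarded.

```python
SPECIAL = {"[CLS]", "[SEP]", "[PAD]"}

def merge_wordpieces(tokens, labels, probas):
    """Join '##' continuation pieces onto the previous word.

    Each merged word keeps the label and probability of its first piece.
    """
    words = []
    for tok, lab, p in zip(tokens, labels, probas):
        if tok in SPECIAL:
            continue  # skip padding and sentence-boundary tokens
        if tok.startswith("##") and words:
            word, first_label, first_p = words[-1]
            words[-1] = (word + tok[2:], first_label, first_p)
        else:
            words.append((tok, lab, p))
    return words

# Hypothetical split and predictions (not real model output)
tokens = ["[CLS]", "Сен", "воо", "##бще", "?", "[SEP]", "[PAD]"]
labels = ["univ", "kz", "ru", "kz", "univ", "univ", "univ"]
probas = [0.99, 0.98, 0.97, 0.60, 0.99, 0.99, 0.99]

print(merge_wordpieces(tokens, labels, probas))
# [('Сен', 'kz', 0.98), ('вообще', 'ru', 0.97), ('?', 'univ', 0.99)]
```

Note that the disagreeing `kz` prediction on `##бще` has no effect on the result. Taking the first piece's tag is the policy of the snippet above; other policies (for example, a majority vote over pieces) are possible but are not what this card uses.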

Training Details

The notebook with the training workflow is available here: https://colab.research.google.com/drive/1epAx_jsuEwBrdCQHRIaoOQildGdMvM0q?usp=sharing

Training Data

Link to the training and test datasets: liminovna/KazRusCSW-G-T

Preprocessing [optional]

emoji, phone and card numbers, links, and hashtags were masked;
newlines were replaced with '\n';
runs of whitespace were replaced with a single space ' ';
the model was trained on pre-tokenized data (see the notebook linked above).
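
Input text should be normalized in a similar way before it is passed to the model. The following is a rough sketch of the steps listed above, not the author's actual code (which is in the linked notebook). The regular expressions, the function name `preprocess`, the padding of masks with spaces, and the choice to collapse a run of emoji into a single mask are all assumptions; only the mask names and the literal `\n` replacement come from this card.

```python
import re

# Illustrative patterns only (assumptions); the real rules may differ.
LINK_RE = re.compile(r"https?://\S+|www\.\S+")
HASHTAG_RE = re.compile(r"#\w+")
# Long digit runs with optional spaces, brackets or dashes: phone/card numbers
NUMBER_RE = re.compile(r"\+?\d[\d ()-]{6,}\d")
# Coarse emoji ranges; a run of emoji becomes one mask (assumption)
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]+")

def preprocess(text: str) -> str:
    text = LINK_RE.sub("[LINK]", text)
    text = HASHTAG_RE.sub("[HASHTAG]", text)
    text = NUMBER_RE.sub("[NUMBER]", text)
    text = EMOJI_RE.sub(" [EMOJI] ", text)
    # Replace real newlines with the two-character token '\n'
    # before whitespace is collapsed (newlines count as whitespace).
    text = text.replace("\r\n", "\n").replace("\n", " \\n ")
    text = re.sub(r"\s+", " ", text)
    return text.strip()

sample = "Қара \U0001F602\U0001F602 #алматы\nhttps://t.me/x  +7 701 123 45 67"
print(preprocess(sample))
# Қара [EMOJI] [HASHTAG] \n [LINK] [NUMBER]
```

The order matters: links are masked first so that a `#fragment` inside a URL is not mistaken for a hashtag, and newlines are replaced before whitespace is collapsed.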

Metrics

              precision    recall  f1-score   support

       ambig     0.5664    0.7375    0.6407       480
          kz     0.9707    0.9559    0.9633      5448
 mixed_kz-ru     0.4615    0.1579    0.2353        38
 mixed_ru-kz     0.0000    0.0000    0.0000        15
       other     0.8992    0.7985    0.8458       268
          ru     0.9644    0.9819    0.9731      5904
         skz     0.6937    0.5133    0.5900       150
        univ     0.9990    0.9803    0.9896      3200

    accuracy                         0.9542     15503
   macro avg     0.6944    0.6407    0.6547     15503
weighted avg     0.9555    0.9542    0.9541     15503
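
The gap between the macro average (0.65) and the weighted average (0.95) comes from the rare tags: the macro average counts every tag equally, while the weighted average is dominated by kz, ru and univ. The short check below recomputes both F1 averages from the per-class rows of the report. It uses the rounded values printed above, so in general the last digit could differ from a computation on raw predictions.

```python
# (tag, f1-score, support) copied from the report above
rows = [
    ("ambig",       0.6407,  480),
    ("kz",          0.9633, 5448),
    ("mixed_kz-ru", 0.2353,   38),
    ("mixed_ru-kz", 0.0000,   15),
    ("other",       0.8458,  268),
    ("ru",          0.9731, 5904),
    ("skz",         0.5900,  150),
    ("univ",        0.9896, 3200),
]

total_support = sum(support for _, _, support in rows)

# Macro average: unweighted mean over tags
macro_f1 = sum(f1 for _, f1, _ in rows) / len(rows)

# Weighted average: each tag weighted by its number of test tokens
weighted_f1 = sum(f1 * support for _, f1, support in rows) / total_support

print(total_support, round(macro_f1, 4), round(weighted_f1, 4))
# 15503 0.6547 0.9541
```

The two mixed tags together cover only 53 of the 15503 test tokens, so they barely move the weighted average but pull the macro average down sharply.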

Citation [optional]

A link to the master's thesis will be added in the future.

BibTeX:

[More Information Needed]

APA:

[More Information Needed]

Model Card Contact

If you have any questions, feel free to start a discussion in the community section.

Model size: 0.2B params (Safetensors)
Tensor type: F32
Base model: google-bert/bert-base-cased
