Model Card for liminovna/KazRusCSW_mbert

THE MODEL CARD IS CURRENTLY IN PROGRESS!

This is a fine-tuned mBERT for token-level language identification. The model was trained on the dataset liminovna/KazRusCSW-G-T, which comprises texts (mainly comments) from social media, specifically Telegram and YouTube. The dataset was annotated manually.

The model predicts the following tags:

  • kz -- Kazakh word
  • ru -- Russian word
  • skz -- Kazakh word transliterated into Cyrillic script without the Kazakh-specific characters
  • mixed_kz-ru, mixed_ru-kz -- words of hybrid origin, i.e. Kazakh root + Russian inflection, or vice versa
  • ambig -- word that exists in both Kazakh and Russian
  • other -- words from another language
  • univ -- punctuation and masks ([EMOJI], [HASHTAG], [LINK], [NUMBER])
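
To make the difference between kz and skz concrete, here is a minimal sketch that checks whether a word contains any of the nine Cyrillic letters that Kazakh uses in addition to the Russian alphabet. The helper name `has_kazakh_specific_chars` is hypothetical and not part of the model or dataset. The check is only an illustration: it cannot identify skz words by itself, since many Kazakh and Russian words contain none of these letters.

```python
# The nine Cyrillic letters used in Kazakh but absent from the Russian alphabet.
KAZAKH_SPECIFIC = set("әғқңөұүһі")


def has_kazakh_specific_chars(word: str) -> bool:
    """Return True if the word contains at least one Kazakh-specific letter."""
    return any(ch in KAZAKH_SPECIFIC for ch in word.lower())


# "унайды" is spelled without Kazakh-specific letters (standard spelling: "ұнайды"),
# which is the kind of spelling the skz tag is meant to cover.
for word in ["көрдіңба", "ақпарат", "унайды", "вообще"]:
    print(word, has_kazakh_specific_chars(word))
```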

Model Details

Model Description

This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.

  • Developed by: [More Information Needed]
  • Model type: [More Information Needed]
  • Language(s) (NLP): [More Information Needed]
  • License: [More Information Needed]
  • Finetuned from model [optional]: [More Information Needed]

Model Sources [optional]

  • Repository: [More Information Needed]
  • Paper [optional]: [More Information Needed]
  • Demo [optional]: [More Information Needed]

Uses

This model's initial purpose was to annotate the Kazakh-Russian code-switching corpus (yet to be published). The broader aim of this model is to annotate tokens in short texts such as comments or posts in the Kazakh-Russian bilingual social media segment.

Bias, Risks, and Limitations

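This is a study project, and the quality of the model is uneven across tags. According to the metrics reported below, F1 is above 0.96 for kz, ru and univ, but low for rare tags such as mixed_kz-ru (0.2353) and mixed_ru-kz (0.0000).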

Recommendations

The model has only gone through limited testing.

How to Get Started with the Model

# log into the huggingface hub
from huggingface_hub import notebook_login

notebook_login()

# all the necessary imports
import torch
from transformers import AutoModelForTokenClassification
from transformers import AutoTokenizer
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# loading the model
finetuned_model_path = 'liminovna/KazRusCSW_mbert'

# tokenizer has been supplemented with a few special tokens, such as `['[MENTION]', '[NUMBER]', '[HASHTAG]', '[EMOJI]', '[LINK]', '[EMAIL]', '\\n']`
tokenizer = AutoTokenizer.from_pretrained(finetuned_model_path)

model = AutoModelForTokenClassification.from_pretrained(finetuned_model_path).to(device)

examples = ['Сен вообще көрдіңба? Қандай ақпарат тарағанын?', 'Екеуі де күшті ғойй,особенно Тілеген прям унайды ше сөйлегені']

import torch.nn.functional as F
inputs = tokenizer(examples, return_tensors="pt", truncation=True, padding=True).to(device)
with torch.no_grad():
    res = model(**inputs)

input_ids = inputs['input_ids'] # token ids for every example in the batch
probas, label_ids = torch.max(F.softmax(res.logits, dim=-1), dim=-1) # max class probability and predicted tag id for each token of each example

# printing the results
for i in range(len(input_ids)): # for each example
    example_tokens = tokenizer.convert_ids_to_tokens(input_ids[i]) # tokens
    example_probas = probas[i].tolist() # probabilities
    example_labels = list(map(model.config.id2label.get, label_ids[i].tolist())) # converting tag ids into tag names

    res_words = []
    res_labels = []
    res_probas = []

    # each word keeps the tag (l) and probability (p) of its first sub-token
    for t, l, p in zip(example_tokens, example_labels, example_probas):
        if t.startswith('##'): # word-piece continuation: append to the previous word
            res_words[-1] = res_words[-1] + t[2:]
        elif t not in ['[SEP]', '[PAD]', '[CLS]']: # ignore certain special tokens
            res_words.append(t)
            res_labels.append(l)
            res_probas.append(p)

    print(f'Example {i}:')
    print('Tokens:', example_tokens) 
    print('Tagged words:', list(zip(res_words, res_labels, res_probas)))
    print('='*80)
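
For corpus annotation, a per-comment tag count is a natural first summary of the tagged output. The sketch below assumes that tagged words are available as (word, label, probability) tuples. The function `tag_distribution`, its `min_proba` parameter, and the sample predictions are hypothetical and only illustrate the idea; they are not real model output.

```python
from collections import Counter


def tag_distribution(tagged_words, min_proba=0.0):
    """Count tags over (word, label, probability) tuples.

    Words whose probability is below `min_proba` are skipped.
    """
    return Counter(label for _, label, proba in tagged_words if proba >= min_proba)


# Hypothetical tagger output, for illustration only (not real model predictions)
tagged = [
    ("Сен", "kz", 0.99),
    ("вообще", "ru", 0.98),
    ("көрдіңба", "kz", 0.97),
    ("?", "univ", 0.99),
]

counts = tag_distribution(tagged)
print(counts)

# Assumption: a comment with at least one kz and one ru word is a code-switching candidate
print("code-switching candidate:", counts["kz"] > 0 and counts["ru"] > 0)
```

Raising `min_proba` trades coverage for precision, which may matter for the low-scoring tags.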

Training Details

The notebook with the training workflow is available here: https://colab.research.google.com/drive/1epAx_jsuEwBrdCQHRIaoOQildGdMvM0q?usp=sharing

Training Data

Link to the training and test datasets: liminovna/KazRusCSW-G-T

Preprocessing [optional]

  • emoji, phone and card numbers, links and hashtags have been masked;
  • newlines have been replaced with '\n';
  • sequences of whitespaces have been replaced with a single space ' ';
  • the model was trained on pre-tokenized data (see the notebook linked above).
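
The masking and normalization steps listed above can be sketched roughly as follows. This is not the preprocessing code used for training (see the notebook for that): every regular expression here, the order of the steps, and the function name `preprocess` are assumptions chosen for illustration. In particular, the emoji pattern covers only a few common Unicode blocks.

```python
import re

# All patterns below are illustrative assumptions, not the ones used for training.
LINK_RE = re.compile(r"https?://\S+|www\.\S+")
HASHTAG_RE = re.compile(r"#\w+")
# Rough emoji coverage: a few common pictograph and symbol blocks only
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
# Digit runs optionally separated by spaces or dashes (e.g. phone and card numbers)
NUMBER_RE = re.compile(r"\+?\d[\d \-]{2,}\d")


def preprocess(text: str) -> str:
    """Mask links, hashtags, emoji and long numbers, then normalize whitespace."""
    text = LINK_RE.sub("[LINK]", text)        # links first, so digits in URLs are not masked separately
    text = HASHTAG_RE.sub("[HASHTAG]", text)
    text = EMOJI_RE.sub("[EMOJI]", text)
    text = NUMBER_RE.sub("[NUMBER]", text)
    text = text.replace("\r\n", "\n").replace("\n", "\\n")  # newline -> literal backslash + n
    text = re.sub(r"\s+", " ", text)          # collapse whitespace runs into one space
    return text.strip()


print(preprocess("Сәлем!\nсмотри  https://example.com #қазақстан \U0001F642"))
```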

Metrics

              precision    recall  f1-score   support

       ambig     0.5664    0.7375    0.6407       480
          kz     0.9707    0.9559    0.9633      5448
 mixed_kz-ru     0.4615    0.1579    0.2353        38
 mixed_ru-kz     0.0000    0.0000    0.0000        15
       other     0.8992    0.7985    0.8458       268
          ru     0.9644    0.9819    0.9731      5904
         skz     0.6937    0.5133    0.5900       150
        univ     0.9990    0.9803    0.9896      3200

    accuracy                         0.9542     15503
   macro avg     0.6944    0.6407    0.6547     15503
weighted avg     0.9555    0.9542    0.9541     15503
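
The gap between the macro and weighted averages comes from class imbalance: the macro average gives every tag equal weight, so the rare, low-scoring tags (mixed_kz-ru, mixed_ru-kz, skz) pull it down, while the weighted average is dominated by kz, ru and univ. The short sketch below recomputes both F1 averages from the per-tag rows of the report; the variable names are arbitrary.

```python
# (f1-score, support) per tag, copied from the report above
report = {
    "ambig": (0.6407, 480),
    "kz": (0.9633, 5448),
    "mixed_kz-ru": (0.2353, 38),
    "mixed_ru-kz": (0.0000, 15),
    "other": (0.8458, 268),
    "ru": (0.9731, 5904),
    "skz": (0.5900, 150),
    "univ": (0.9896, 3200),
}

total_support = sum(n for _, n in report.values())

# Macro average: unweighted mean over tags
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted average: mean over tags weighted by support
weighted_f1 = sum(f1 * n for f1, n in report.values()) / total_support

print(f"total support: {total_support}")   # 15503
print(f"macro F1:      {macro_f1:.4f}")    # 0.6547
print(f"weighted F1:   {weighted_f1:.4f}") # 0.9541
```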

Citation [optional]

A link to the master's thesis will be added in the future.

BibTeX:

[More Information Needed]

APA:

[More Information Needed]

Model Card Contact

If you have any questions, feel free to start a discussion in the community section.

Model size: 0.2B params (Safetensors, tensor type F32)