Model Card for liminovna/KazRusCSW_mbert
THE MODEL CARD IS CURRENTLY IN PROGRESS!
This is a fine-tuned mBERT model for token-level language identification. It was trained on the dataset liminovna/KazRusCSW-G-T, which comprises texts (mainly comments) from social media, specifically Telegram and YouTube. The dataset was annotated manually.
The model predicts the following tags:
- kz -- Kazakh word
- ru -- Russian word
- skz -- Kazakh word written in Cyrillic script without the Kazakh-specific characters
- mixed_kz-ru, mixed_ru-kz -- words of hybrid origin, i.e. Kazakh root + Russian inflection, or vice versa
- ambig -- word that exists in both Kazakh and Russian
- other -- words from another language
- univ -- punctuation and masks ([EMOJI], [HASHTAG], [LINK], [NUMBER])
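As a rough illustration of how these tags might be consumed downstream, the sketch below counts Kazakh/Russian switch points in a sequence of tagged words. The helper names (`base_language`, `count_switches`) are hypothetical, the decision to treat `skz` as Kazakh and to skip every other tag is an assumption made for this example, and the tagged sentence is hand-written rather than actual model output.

```python
from typing import Iterable, Optional, Tuple

# Assumption for this sketch: `skz` counts as Kazakh; all tags other than
# kz/skz/ru (ambig, mixed_*, other, univ) are skipped when counting switches.
KAZAKH_TAGS = {"kz", "skz"}
RUSSIAN_TAGS = {"ru"}


def base_language(tag: str) -> Optional[str]:
    """Map a model tag to 'kz', 'ru', or None if the tag is skipped."""
    if tag in KAZAKH_TAGS:
        return "kz"
    if tag in RUSSIAN_TAGS:
        return "ru"
    return None


def count_switches(tagged_words: Iterable[Tuple[str, str]]) -> int:
    """Count how many times the language changes between kz and ru."""
    switches = 0
    previous = None
    for _word, tag in tagged_words:
        current = base_language(tag)
        if current is None:
            continue  # skipped tags do not start or end a switch
        if previous is not None and current != previous:
            switches += 1
        previous = current
    return switches


# Hand-written illustrative tags, not real predictions
tagged = [("Сен", "kz"), ("вообще", "ru"), ("көрдіңба", "kz"),
          ("?", "univ"), ("Қандай", "kz")]
print(count_switches(tagged))  # 2
```

Whether `ambig` or `mixed_*` words should count toward a switch depends on the analysis at hand, so the two tag sets are kept as plain constants that can be adjusted.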
Model Details
Model Description
This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
- Developed by: [More Information Needed]
- Model type: [More Information Needed]
- Language(s) (NLP): [More Information Needed]
- License: [More Information Needed]
- Finetuned from model [optional]: [More Information Needed]
Model Sources [optional]
- Repository: [More Information Needed]
- Paper [optional]: [More Information Needed]
- Demo [optional]: [More Information Needed]
Uses
This model's initial purpose was to annotate the Kazakh-Russian code-switching corpus (yet to be published). The broader aim of this model is to annotate tokens in short texts such as comments or posts in the Kazakh-Russian bilingual social media segment.
Bias, Risks, and Limitations
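This is a study project, and the model's quality is uneven: in the reported evaluation (see Metrics), the frequent tags kz, ru, and univ reach F1 scores above 0.96, while the rare tags mixed_kz-ru and mixed_ru-kz are predicted poorly (F1 of 0.24 and 0.00).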
Recommendations
The model has only gone through limited testing.
How to Get Started with the Model
# Log in to the Hugging Face Hub
from huggingface_hub import notebook_login
notebook_login()

# All the necessary imports
import torch
import torch.nn.functional as F
from transformers import AutoModelForTokenClassification, AutoTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the model
finetuned_model_path = 'liminovna/KazRusCSW_mbert'
# The tokenizer has been supplemented with a few special tokens, such as
# `['[MENTION]', '[NUMBER]', '[HASHTAG]', '[EMOJI]', '[LINK]', '[EMAIL]', '\\n']`
tokenizer = AutoTokenizer.from_pretrained(finetuned_model_path)
model = AutoModelForTokenClassification.from_pretrained(finetuned_model_path).to(device)

examples = ['Сен вообще көрдіңба? Қандай ақпарат тарағанын?', 'Екеуі де күшті ғойй,особенно Тілеген прям унайды ше сөйлегені']

inputs = tokenizer(examples, return_tensors="pt", truncation=True, padding=True).to(device)
with torch.no_grad():
    res = model(**inputs)

input_ids = inputs['input_ids']
# Highest class probability and its label id for every token of every example
probas, label_ids = torch.max(F.softmax(res.logits, dim=-1), dim=-1)

# Print the results
for i in range(len(input_ids)):  # for each example
    example_tokens = tokenizer.convert_ids_to_tokens(input_ids[i])  # tokens
    example_probas = probas[i].tolist()  # probabilities
    example_labels = list(map(model.config.id2label.get, label_ids[i].tolist()))  # convert tag ids to tag names
    res_words = []
    res_labels = []
    res_probas = []
    # The zip order must match the unpacking order: token, label, probability
    for t, l, p in zip(example_tokens, example_labels, example_probas):
        if t.startswith('##'):
            # WordPiece continuation: glue it to the previous word,
            # which keeps the label and probability of its first piece
            res_words[-1] += t[2:]
        elif t not in ['[SEP]', '[PAD]', '[CLS]']:  # ignore BERT's service tokens
            res_words.append(t)
            res_labels.append(l)
            res_probas.append(p)
    print(f'Example {i}:')
    print('Tokens:', example_tokens)
    print('Tagged words:', list(zip(res_words, res_labels, res_probas)))
    print('='*80)
Training Details
The notebook with the training workflow is available here: https://colab.research.google.com/drive/1epAx_jsuEwBrdCQHRIaoOQildGdMvM0q?usp=sharing
Training Data
Link to the training and test datasets: liminovna/KazRusCSW-G-T
Preprocessing [optional]
- emoji, phone and card numbers, links, and hashtags were masked;
- newlines were replaced with '\n';
- runs of whitespace were replaced with a single space ' ';
- the model was trained on already tokenized data (see the notebook linked above).
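The masking and whitespace steps above can be sketched roughly as follows. This is not the author's preprocessing code: every regular expression here, the order of the substitutions, and the function name `preprocess` are assumptions made for illustration, and the training notebook remains the authoritative source. Only the mask names and the literal `\n` token are taken from this card.

```python
import re

# Illustrative patterns only (assumptions); the real rules may differ.
LINK_RE = re.compile(r"https?://\S+|www\.\S+")
HASHTAG_RE = re.compile(r"#\w+")
# Long digit runs with optional separators, e.g. phone or card numbers
NUMBER_RE = re.compile(r"\+?\d[\d \-()]{6,}\d")
# A rough emoji range; real emoji detection usually needs a dedicated library
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]+")


def preprocess(text: str) -> str:
    """Mask links, hashtags, numbers, and emoji, then normalize whitespace."""
    text = LINK_RE.sub("[LINK]", text)        # links first: they may contain '#' or digits
    text = HASHTAG_RE.sub("[HASHTAG]", text)
    text = NUMBER_RE.sub("[NUMBER]", text)
    text = EMOJI_RE.sub("[EMOJI]", text)
    text = text.replace("\n", " \\n ")        # newline -> literal backslash-n token
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace runs
    return text


raw = "Қара \U0001F449 https://example.com  #жаңалық\nтел: +7 701 123 45 67"
print(preprocess(raw))
```

Links are masked before hashtags and numbers so that a URL containing `#` or digits ends up as a single `[LINK]` mask instead of being split into several.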
Metrics
tag           precision  recall  f1-score  support
ambig            0.5664  0.7375    0.6407      480
kz               0.9707  0.9559    0.9633     5448
mixed_kz-ru      0.4615  0.1579    0.2353       38
mixed_ru-kz      0.0000  0.0000    0.0000       15
other            0.8992  0.7985    0.8458      268
ru               0.9644  0.9819    0.9731     5904
skz              0.6937  0.5133    0.5900      150
univ             0.9990  0.9803    0.9896     3200

accuracy                           0.9542    15503
macro avg        0.6944  0.6407    0.6547    15503
weighted avg     0.9555  0.9542    0.9541    15503
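The gap between the macro and weighted averages comes from class imbalance: the macro average gives every tag equal weight, so the rare, poorly predicted tags pull it down, while the weighted average is dominated by kz, ru, and univ. As a sanity check, the short sketch below recomputes both F1 averages from the per-tag rows of the report above (the variable names are my own).

```python
# (f1-score, support) per tag, copied from the report above
report = {
    "ambig":       (0.6407, 480),
    "kz":          (0.9633, 5448),
    "mixed_kz-ru": (0.2353, 38),
    "mixed_ru-kz": (0.0000, 15),
    "other":       (0.8458, 268),
    "ru":          (0.9731, 5904),
    "skz":         (0.5900, 150),
    "univ":        (0.9896, 3200),
}

total_support = sum(support for _f1, support in report.values())

# Macro average: unweighted mean over tags
macro_f1 = sum(f1 for f1, _support in report.values()) / len(report)

# Weighted average: each tag's F1 weighted by its share of the tokens
weighted_f1 = sum(f1 * support for f1, support in report.values()) / total_support

print(total_support)          # 15503
print(round(macro_f1, 4))     # 0.6547
print(round(weighted_f1, 4))  # 0.9541
```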
Citation [optional]
A link to the master's thesis will be added in the future.
BibTeX:
[More Information Needed]
APA:
[More Information Needed]
Model Card Contact
If you have any questions, feel free to start a discussion in the community section.
Model tree for liminovna/KazRusCSW_mbert
Base model
google-bert/bert-base-cased