Azerbaijani WordPiece Tokenizer (wp_2_uncased)

This repository contains the uncased Azerbaijani WordPiece tokenizer used for AzNEOBERT pretraining.

Overview

  • Tokenizer type: WordPiece
  • Vocabulary size: 64,000
  • Casing: uncased
  • Mean fertility: 1.727
  • Backend: Hugging Face tokenizers + transformers

The tokenizer was selected after evaluating six tokenizer variants on Azerbaijani corpora using fertility as the primary selection criterion.

Training Data

The training corpus comprised:

  • ~100 GB of Azerbaijani text
  • 95.7B characters
  • 10 corpus collections

The corpus includes diverse web, news, encyclopedic, and general-domain Azerbaijani text.

Tokenization Efficiency

Tokenizer       Mean Fertility
mBERT           2.846
XLM-R           2.167
HPLT az-BERT    2.068
wp_2_uncased    1.727

Fertility is the average number of subword tokens produced per word, so lower fertility indicates more efficient tokenization.

Evaluation was performed on 47,934 Azerbaijani documents.
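
As a rough illustration of how a fertility figure like the ones above can be computed, here is a minimal sketch. It assumes fertility is measured at corpus level as total subword tokens divided by total whitespace-separated words; the exact procedure used for the reported numbers is not stated here, and the helper names mean_fertility and toy_tokenize are hypothetical.

```python
from typing import Callable, Iterable, List


def mean_fertility(
    tokenize: Callable[[str], List[str]],
    documents: Iterable[str],
) -> float:
    """Average number of subword tokens per whitespace-separated word.

    Assumption: corpus-level ratio (total tokens / total words); a
    per-document average would be a different but equally plausible choice.
    """
    total_tokens = 0
    total_words = 0
    for doc in documents:
        words = doc.split()
        total_words += len(words)
        total_tokens += sum(len(tokenize(word)) for word in words)
    if total_words == 0:
        raise ValueError("no words found in documents")
    return total_tokens / total_words


def toy_tokenize(word: str) -> List[str]:
    """Stand-in tokenizer for demonstration: splits words longer than 4 chars."""
    if len(word) > 4:
        return [word[:4], "##" + word[4:]]
    return [word]


docs = ["abcd efghij", "xy"]
score = mean_fertility(toy_tokenize, docs)
print(round(score, 3))
```

To measure a real tokenizer, pass its tokenize method (for example tok.tokenize from the Usage section) in place of toy_tokenize.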

Tokenizer Variants Evaluated

Six tokenizer variants were compared:

Family                   Variants
WordPiece                cased / uncased
SentencePiece Unigram    cased / uncased
SentencePiece BPE        cased / uncased

The final selected tokenizer was the uncased WordPiece variant (wp_2_uncased).

Usage

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "raufibishov/az-wordpiece-tokenizer"
)

text = "müqavilələrindən"

print(tok.tokenize(text))

Example output:

['müqavilələrin', '##dən']
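
To show where the ## prefix in that output comes from, the following sketch implements the greedy longest-match-first segmentation that WordPiece tokenizers conventionally apply to each word. It is a simplified illustration, not the library's implementation: toy_vocab is a hypothetical five-entry vocabulary standing in for the real 64,000-entry one, and the lowercasing and normalization performed by the actual tokenizer are omitted.

```python
from typing import List, Optional, Set


def wordpiece(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match: Optional[str] = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word maps to the unknown token.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"müqavilə", "müqavilələrin", "##lər", "##in", "##dən"}
print(wordpiece("müqavilələrindən", toy_vocab))
# ['müqavilələrin', '##dən']
```

Because the longest match wins, the word is split into two pieces rather than the four that the shorter entries in toy_vocab would allow.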

Special Tokens

Token     Purpose
[UNK]     Unknown token
[CLS]     Classification token
[SEP]     Separator token
[PAD]     Padding token
[MASK]    Masked language modeling token
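
The sketch below shows how these tokens are conventionally arranged in a BERT-style input sequence. It assumes the standard layout ([CLS] first, [SEP] after each segment, [PAD] filling up to a fixed length); whether this tokenizer's post-processing template matches that layout exactly is not confirmed here, and build_bert_input is a hypothetical helper.

```python
from typing import List, Optional


def build_bert_input(
    tokens_a: List[str],
    tokens_b: Optional[List[str]] = None,
    pad_to: int = 0,
) -> List[str]:
    """Wrap one or two token segments in the conventional BERT layout."""
    sequence = ["[CLS]"] + tokens_a + ["[SEP]"]
    if tokens_b is not None:
        sequence += tokens_b + ["[SEP]"]
    # Pad on the right until the sequence reaches the requested length.
    sequence += ["[PAD]"] * max(0, pad_to - len(sequence))
    return sequence


example = build_bert_input(["müqavilələrin", "##dən"], pad_to=6)
print(example)
# ['[CLS]', 'müqavilələrin', '##dən', '[SEP]', '[PAD]', '[PAD]']
```

During masked language modeling, selected positions in such a sequence would be replaced by [MASK], and any word that cannot be segmented would appear as [UNK].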

Compatibility

This tokenizer supports:

  • AutoTokenizer
  • PreTrainedTokenizerFast
  • BertTokenizerFast

Citation

If you use this tokenizer in research, please cite the associated AzNEOBERT work.

License

Apache-2.0
