Azerbaijani WordPiece Tokenizer (wp_2_uncased)
This repository contains the uncased Azerbaijani WordPiece tokenizer used for AzNEOBERT pretraining.
Overview
- Tokenizer type: WordPiece
- Vocabulary size: 64,000
- Casing: uncased
- Mean fertility: 1.727
- Backend: Hugging Face tokenizers + transformers
The tokenizer was selected after evaluating six tokenizer variants on Azerbaijani corpora using fertility as the primary selection criterion.
Training Data
The tokenizer was trained on a corpus of roughly:
- 100 GB of Azerbaijani text
- 95.7B characters
- 10 corpus collections
The corpus includes diverse web, news, encyclopedic, and general-domain Azerbaijani text.
Tokenization Efficiency
| Tokenizer | Mean Fertility |
|---|---|
| mBERT | 2.846 |
| XLM-R | 2.167 |
| HPLT az-BERT | 2.068 |
| wp_2_uncased | 1.727 |
Fertility is the average number of subword tokens produced per word. Lower fertility indicates more efficient tokenization.
Evaluation was performed on 47,934 Azerbaijani documents.
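The fertility figures above can be approximated with a short script. The sketch below is not the evaluation code behind the table. It assumes that a word is a whitespace-separated chunk and that the mean is taken over the whole corpus (total tokens divided by total words), neither of which this README confirms. `mean_fertility` and `toy_tokenize` are hypothetical names introduced for illustration.

```python
from typing import Callable, Iterable, List


def mean_fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Return the average number of subword tokens per whitespace-delimited word."""
    total_words = 0
    total_tokens = 0
    for text in texts:
        # Assumption: a "word" is a whitespace-separated chunk of the raw text.
        total_words += len(text.split())
        total_tokens += len(tokenize(text))
    if total_words == 0:
        raise ValueError("no words found in the input texts")
    return total_tokens / total_words


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: cuts each word into chunks of at most 4 characters."""
    pieces = []
    for word in text.lower().split():
        for start in range(0, len(word), 4):
            chunk = word[start:start + 4]
            # Mark every chunk after the first as a continuation piece.
            pieces.append(chunk if start == 0 else "##" + chunk)
    return pieces


docs = ["salam dünya", "azərbaycan dili"]
print(mean_fertility(docs, toy_tokenize))  # 8 tokens / 4 words = 2.0
```

To measure a real tokenizer, pass its `tokenize` method (for example `tok.tokenize` from the Usage section) in place of `toy_tokenize`.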
Tokenizer Variants Evaluated
Six tokenizer variants were compared:
| Family | Variants |
|---|---|
| WordPiece | cased / uncased |
| SentencePiece Unigram | cased / uncased |
| SentencePiece BPE | cased / uncased |
The final selected tokenizer was the uncased WordPiece variant (wp_2_uncased).
Usage
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained(
"raufibishov/az-wordpiece-tokenizer"
)
text = "müqavilələrindən"
print(tok.tokenize(text))
Example output:
['müqavilələrin', '##dən']
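The split above follows from how WordPiece tokenizes at inference time: it repeatedly takes the longest prefix of the remaining word that exists in the vocabulary, and marks every non-initial piece with `##`. The following is a simplified sketch of that greedy matching, not the library's implementation. It skips lowercasing and other normalization, and `toy_vocab` is a hypothetical five-entry vocabulary rather than the real one.

```python
from typing import List, Set


def wordpiece_split(word: str, vocab: Set[str], unk_token: str = "[UNK]") -> List[str]:
    """Greedy longest-match-first split of a single word into WordPiece tokens."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No vocabulary entry fits at this position: the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, not the tokenizer's real 64,000-entry vocabulary.
toy_vocab = {"müqavilə", "müqavilələrin", "##lərin", "##dən", "##n"}
print(wordpiece_split("müqavilələrindən", toy_vocab))
# ['müqavilələrin', '##dən']
```

Because the longest match wins, `müqavilələrin` is chosen over the shorter `müqavilə` even though both are in the toy vocabulary.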
Special Tokens
| Token | Purpose |
|---|---|
| [UNK] | Unknown token |
| [CLS] | Classification token |
| [SEP] | Separator token |
| [PAD] | Padding token |
| [MASK] | Masked language modeling token |
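As a rough illustration of where these tokens usually end up, the sketch below assembles a sequence in the conventional BERT layout: `[CLS]` first, `[SEP]` after each segment, and `[PAD]` up to a fixed length. This README does not state the tokenizer's post-processing template, so the layout is an assumption, and `build_bert_input` is a hypothetical helper rather than part of any library.

```python
from typing import List, Optional


def build_bert_input(
    tokens_a: List[str],
    tokens_b: Optional[List[str]] = None,
    max_length: Optional[int] = None,
) -> List[str]:
    """Wrap one or two token lists in the conventional BERT special-token layout."""
    # Assumed template: [CLS] A [SEP] for one segment, [CLS] A [SEP] B [SEP] for two.
    sequence = ["[CLS]"] + tokens_a + ["[SEP]"]
    if tokens_b is not None:
        sequence += tokens_b + ["[SEP]"]
    if max_length is not None:
        if len(sequence) > max_length:
            raise ValueError("sequence is longer than max_length")
        # Right-pad with [PAD] so every sequence in a batch has the same length.
        sequence += ["[PAD]"] * (max_length - len(sequence))
    return sequence


print(build_bert_input(["müqavilələrin", "##dən"], max_length=6))
# ['[CLS]', 'müqavilələrin', '##dən', '[SEP]', '[PAD]', '[PAD]']
```

`[MASK]` and `[UNK]` are not part of this layout: `[MASK]` replaces selected tokens during masked language modeling, and `[UNK]` stands in for words the vocabulary cannot cover.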
Compatibility
This tokenizer supports:
- AutoTokenizer
- PreTrainedTokenizerFast
- BertTokenizerFast
Citation
If you use this tokenizer in research, please cite the associated AzNEOBERT work.
License
Apache-2.0