A Byte-Pair Encoding (BPE) tokenizer trained from scratch on the Turkish (tr) subset
of the facebook/xnli dataset.
| Parameter | Value |
|---|---|
| Algorithm | BPE (ByteLevel) |
| Vocabulary size | 8,000 |
| Min frequency | 2 |
| Special tokens | `<s>`, `<pad>`, `</s>`, `<unk>`, `<mask>` |
| Corpus | facebook/xnli, `tr` configuration, all splits |
| Corpus size | 800,404 sentences |
| Normalizer | NFC Unicode |
| Pre-tokenizer | ByteLevel |
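
The training script itself is not part of this card. As a rough sketch only, the parameters in the table map onto the Hugging Face `tokenizers` library roughly as follows. The helper names `iter_xnli_sentences` and `train_bpe` are hypothetical, and feeding both the `premise` and `hypothesis` columns of every split is an assumption about how the corpus was assembled, not a documented fact.

```python
from typing import Iterable, Iterator

# Values copied from the parameter table above.
VOCAB_SIZE = 8_000
MIN_FREQUENCY = 2
SPECIAL_TOKENS = ["<s>", "<pad>", "</s>", "<unk>", "<mask>"]


def iter_xnli_sentences(language: str = "tr") -> Iterator[str]:
    """Yield sentences from every split of facebook/xnli (hypothetical helper).

    Assumption: the corpus is built from both the premise and the
    hypothesis of each example.
    """
    # Imported lazily so the constants above can be read without `datasets`.
    from datasets import load_dataset

    dataset = load_dataset("facebook/xnli", language)
    for split in dataset.values():
        for row in split:
            yield row["premise"]
            yield row["hypothesis"]


def train_bpe(sentences: Iterable[str]):
    """Train a byte-level BPE tokenizer with the settings from the table."""
    # Imported lazily so the constants above can be read without `tokenizers`.
    from tokenizers import (
        Tokenizer, decoders, models, normalizers, pre_tokenizers, trainers,
    )

    tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
    tokenizer.normalizer = normalizers.NFC()            # NFC Unicode normalization
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()  # library defaults assumed
    tokenizer.decoder = decoders.ByteLevel()

    trainer = trainers.BpeTrainer(
        vocab_size=VOCAB_SIZE,
        min_frequency=MIN_FREQUENCY,
        special_tokens=SPECIAL_TOKENS,
        # Seed the vocabulary with all 256 byte symbols so no input is out of vocabulary.
        initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
    )
    tokenizer.train_from_iterator(sentences, trainer=trainer)
    return tokenizer
```

Under these assumptions, `train_bpe(iter_xnli_sentences())` would train on the full Turkish configuration. The resulting merges may differ from the published tokenizer if the original run used other pre-tokenizer options or a different sentence order.
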

| Metric | Value |
|---|---|
| Tokens / char | 0.2581 |
| Fertility (tokens / word) | 1.9330 |
| Avg sequence length | 20.15 tokens |
| Vocabulary coverage | 1.0000 |
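
The card does not spell out how these metrics were computed. The sketch below shows one plausible set of definitions. It assumes that words are whitespace-separated, that characters are counted on the raw sentence, and that vocabulary coverage means the share of tokens that are not `<unk>`. The function `tokenizer_metrics` and the toy tokenizer are illustrative, not the authors' evaluation code.

```python
from typing import Callable, Dict, Iterable, List


def tokenizer_metrics(
    sentences: Iterable[str],
    tokenize: Callable[[str], List[str]],
    unk_token: str = "<unk>",
) -> Dict[str, float]:
    """Compute corpus-level tokenizer statistics (assumed definitions)."""
    n_sentences = n_tokens = n_chars = n_words = n_unk = 0
    for sentence in sentences:
        tokens = tokenize(sentence)
        n_sentences += 1
        n_tokens += len(tokens)
        n_chars += len(sentence)          # raw characters, spaces included
        n_words += len(sentence.split())  # whitespace-delimited words
        n_unk += sum(token == unk_token for token in tokens)

    if 0 in (n_sentences, n_tokens, n_chars, n_words):
        raise ValueError("corpus is empty or produced no tokens")

    return {
        "tokens_per_char": n_tokens / n_chars,
        "fertility": n_tokens / n_words,
        "avg_sequence_length": n_tokens / n_sentences,
        "vocabulary_coverage": 1.0 - n_unk / n_tokens,
    }


def toy_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer: split each word into chunks of up to 3 characters."""
    return [
        word[i:i + 3]
        for word in text.split()
        for i in range(0, len(word), 3)
    ]


sample = ["Merhaba dünya!", "Bu bir deneme."]
metrics = tokenizer_metrics(sample, toy_tokenize)
print(metrics)
```

To evaluate the published tokenizer instead of the toy one, `tokenize` could be set to the `tokenize` method of the loaded `transformers` tokenizer. Whether the reported figures include special tokens is not stated, so values computed this way may differ slightly from the table.
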
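
```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("NIRVLab/xnli-bpe-tr")

# return_tensors="pt" requires PyTorch; omit it to get plain Python lists.
tokens = tokenizer("Merhaba dünya!", return_tensors="pt")
```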