ViLLM-Tok: Vietnamese-English Code-Switching Tokenizer

A hybrid word/subword tokenizer designed for Vietnamese-English code-switching, used in the viLLM project.

Property	Value
Vocab size	113,011
Types	Word-level (VI syllables) + BPE (EN) + SentencePiece (EN subword)
Languages	Vietnamese, English, code-switching
Special tokens	`[UNK]`, `[BOS]`, `[EOS]`, `[PAD]`, `[SEP]`, `[MASK]`
Code-switch markers	`[VI→EN]`, `[EN→VI]` (optional)

How it works

Tokenization has four phases:

Language detection: each word is classified as VI, EN, Num, Code, or Punct
Vietnamese Viterbi: dynamic programming over syllable runs merges frequent bigrams into compound tokens (học_sinh, Việt_Nam)
English SentencePiece: subword tokenization for English words that are not in the vocabulary as whole tokens
Code-switch markers: optional [VI→EN] / [EN→VI] insertion at language boundaries
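
The Vietnamese phase can be pictured as a best-score segmentation of a syllable run. The following is a minimal sketch under stated assumptions, not the tokenizer's actual implementation: the score tables `UNIGRAM` and `COMPOUND`, the `UNKNOWN` penalty, and the function name `viterbi_merge` are all hypothetical, and the real vocabulary and weights are not documented here.

```python
import math

# Hypothetical log-scores, for illustration only.
UNIGRAM = {"học": -4.0, "sinh": -4.5, "giỏi": -5.0, "việt": -4.2, "nam": -4.4}
COMPOUND = {("học", "sinh"): -5.0, ("việt", "nam"): -4.0}
UNKNOWN = -10.0  # assumed penalty for a syllable missing from UNIGRAM


def viterbi_merge(syllables):
    """Segment a syllable run into single syllables and two-syllable compounds."""
    n = len(syllables)
    keys = [s.lower() for s in syllables]  # look up case-insensitively, keep original case in output
    best = [-math.inf] * (n + 1)  # best[i]: best score of any segmentation of syllables[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last token in that segmentation
    best[0] = 0.0
    for i in range(1, n + 1):
        # Option 1: the last token is the single syllable at position i-1.
        score = best[i - 1] + UNIGRAM.get(keys[i - 1], UNKNOWN)
        if score > best[i]:
            best[i], back[i] = score, i - 1
        # Option 2: the last token is a compound of syllables i-2 and i-1.
        pair = (keys[i - 2], keys[i - 1]) if i >= 2 else None
        if pair in COMPOUND:
            score = best[i - 2] + COMPOUND[pair]
            if score > best[i]:
                best[i], back[i] = score, i - 2
    # Follow the back-pointers from the end to recover the tokens.
    tokens, i = [], n
    while i > 0:
        j = back[i]
        tokens.append("_".join(syllables[j:i]))
        i = j
    return tokens[::-1]


print(viterbi_merge(["Học", "sinh", "giỏi"]))  # ['Học_sinh', 'giỏi']
```

Unlike a greedy left-to-right merge, the dynamic program compares every way of ending a segmentation at each position, so a compound is chosen only when it improves the total score of the whole run.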

Usage

Slow path (pure Python, no extra packages)

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "vlinhd11/villm-tokenizer",
    trust_remote_code=True,
)
enc = tok("Học sinh giỏi tiếng Việt và học lập trình Python")
# input_ids: [88220, 82593, 83224, 83597, 94470, 84066, 94455, ...]
# tokens: ['Học_sinh', 'giỏi', 'tiếng_Việt', 'và_học', 'lập_trình', '▁Python']

Fast path (Rust backend, ~4x faster)

Install the Rust-backed fast tokenizer:

pip install villm-tok-fast

Then build the tokenizer with create_fast_tokenizer, which exposes the PreTrainedTokenizerFast interface:

from villm_tok_fast import create_fast_tokenizer

hf = create_fast_tokenizer(
    "vlinhd11/villm-tokenizer",  # or local path
    add_code_switch_markers=True,
)
enc = hf("Học sinh giỏi tiếng Việt và học lập trình Python")

The fast path:

Embeds the SentencePiece vocabulary in a trie for English subword tokenization, so the Python sentencepiece package is not required
Runs the full pipeline in Rust: language detection → Viterbi VI compounds → SP subword → byte fallback
Supports batch encoding, padding, and truncation through HF's PreTrainedTokenizerFast interface
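
To illustrate the trie lookup and byte fallback named above, here is a rough Python sketch. It assumes greedy longest-match, which is a simplification (SentencePiece itself typically scores alternative segmentations), and it assumes fallback tokens use the SentencePiece-style `<0xNN>` spelling. The vocabulary and the names `build_trie` and `encode_word` are hypothetical; the Rust backend's internals are not documented here.

```python
def build_trie(pieces):
    """Build a nested-dict trie; the key None marks the end of a piece."""
    root = {}
    for piece in pieces:
        node = root
        for ch in piece:
            node = node.setdefault(ch, {})
        node[None] = True
    return root


def encode_word(word, trie):
    """Greedy longest-match over the trie, with byte fallback for unmatched characters."""
    text = "▁" + word  # SentencePiece-style word-boundary marker
    out, i = [], 0
    while i < len(text):
        node, end = trie, None
        for j in range(i, len(text)):
            node = node.get(text[j])
            if node is None:
                break
            if None in node:
                end = j + 1  # remember the longest piece that ends here
        if end is None:
            # No piece starts at this character: emit one token per UTF-8 byte.
            out.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
        else:
            out.append(text[i:end])
            i = end
    return out


# Hypothetical vocabulary, for illustration only.
TRIE = build_trie(["▁Python", "▁program", "ming", "▁", "p", "y"])

print(encode_word("programming", TRIE))  # ['▁program', 'ming']
print(encode_word("pyé", TRIE))          # ['▁', 'p', 'y', '<0xC3>', '<0xA9>']
```

Byte fallback guarantees that any input can be encoded without producing `[UNK]`, at the cost of longer sequences for characters outside the vocabulary.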

Batch encoding

# Slow path
batch = tok(["Câu thứ nhất", "Câu thứ hai"], padding=True, truncation=True)

# Fast path
batch = hf(["Câu thứ nhất", "Câu thứ hai"], padding=True, truncation=True)
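
For readers unfamiliar with the two flags, the sketch below shows in plain Python what padding to the longest sequence and truncation amount to. It is an illustration only: `pad_batch`, the pad id `0`, and right-side padding are assumptions, and the actual pad id comes from the tokenizer's `[PAD]` token.

```python
def pad_batch(batch_ids, pad_id=0, max_length=None):
    """Truncate each sequence to max_length, then right-pad to the longest one."""
    if max_length is not None:
        batch_ids = [ids[:max_length] for ids in batch_ids]
    longest = max((len(ids) for ids in batch_ids), default=0)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch_ids]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


print(pad_batch([[5, 6, 7], [8]]))
# {'input_ids': [[5, 6, 7], [8, 0, 0]], 'attention_mask': [[1, 1, 1], [1, 0, 0]]}
```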

Disable code-switch markers

# Slow
tok = AutoTokenizer.from_pretrained("vlinhd11/villm-tokenizer", trust_remote_code=True)
tok.add_code_switch_markers = False

# Fast
hf = create_fast_tokenizer("vlinhd11/villm-tokenizer", add_code_switch_markers=False)
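
To make concrete what this flag toggles, here is a small sketch of marker insertion at language boundaries. It takes tokens that are already tagged with a language, and it assumes that Num, Code, and Punct tokens neither trigger nor reset a boundary; that rule, the tag spellings, and the name `insert_markers` are hypothetical and may differ from the tokenizer's real behavior.

```python
def insert_markers(tagged, enabled=True):
    """tagged: list of (token, lang) pairs, lang in {"VI", "EN", "NUM", "CODE", "PUNCT"}."""
    out, prev = [], None  # prev: language of the most recent VI/EN token
    for token, lang in tagged:
        if lang in ("VI", "EN"):
            if enabled and prev is not None and lang != prev:
                out.append(f"[{prev}→{lang}]")  # e.g. [VI→EN]
            prev = lang
        out.append(token)
    return out


tagged = [("lập_trình", "VI"), ("▁Python", "EN")]
print(insert_markers(tagged))                 # ['lập_trình', '[VI→EN]', '▁Python']
print(insert_markers(tagged, enabled=False))  # ['lập_trình', '▁Python']
```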

Tokenization examples

Input	Tokens
`Học sinh`	`['Học_sinh']`
`xe máy Việt Nam`	`['xe_máy', 'Việt_Nam']`
`Python programming`	`['▁Python', '▁programming']`
`Học sinh giỏi tiếng Việt...`	`['Học_sinh', 'giỏi', 'tiếng_Việt', 'và_học', 'lập_trình', '[VI→EN]', '▁Python']`
`Thủ tướng Phạm Minh Chính`	`['Thủ_tướng', 'Phạm', 'Minh', 'Chính']`

Performance

Backend	Texts/sec	Speedup vs Python
Pure Python (slow)	~14k	1x
Rust batch (fast)	~55k	~3.9x
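
The speedup column follows from the throughput figures: 55k / 14k ≈ 3.9. To reproduce a measurement of this kind on your own data, a harness along the following lines could be used. It is a sketch only: `texts_per_second` is a hypothetical helper, the benchmark setup behind the table is not documented here, and the stand-in encoder below is just whitespace splitting.

```python
import time


def texts_per_second(encode, texts, repeats=3):
    """Return the best observed throughput (texts/sec) of encode(texts) over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        encode(texts)
        best = min(best, time.perf_counter() - start)
    return len(texts) / max(best, 1e-9)  # guard against a zero-length timing


# Stand-in encoder; replace with `tok` or `hf` from the usage sections.
rate = texts_per_second(lambda batch: [t.split() for t in batch], ["Câu thứ nhất"] * 1000)
print(f"{rate:,.0f} texts/sec")

# Speedup implied by the table's rounded figures.
print(round(55_000 / 14_000, 1))  # 3.9
```

Taking the best of several runs reduces the effect of warm-up and background noise; results will vary with text length and hardware.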

Notes

save_pretrained() works for the slow path. The fast path uses PreTrainedTokenizerFast(tokenizer_object=...), which cannot be serialized to tokenizer.json without losing the custom logic. To save and reload, re-create the fast tokenizer from the same base directory.
The fast backend requires the villm-tok-fast Python package (Rust native extension, Windows x64).
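
Because the fast backend is an optional native package, code that must run everywhere may want to fall back to the slow path. The following is one possible pattern, not part of the library: `fast_backend_available` and `load_tokenizer` are hypothetical helper names, and only the calls shown in the usage sections are used. Re-creating the tokenizer this way also covers the save/load note, since the fast tokenizer is rebuilt from the repository or base directory each time.

```python
import importlib.util


def fast_backend_available():
    """True if the villm_tok_fast module can be found in this environment."""
    return importlib.util.find_spec("villm_tok_fast") is not None


def load_tokenizer(repo="vlinhd11/villm-tokenizer", markers=True):
    """Prefer the Rust backend; otherwise fall back to the pure-Python tokenizer."""
    if fast_backend_available():
        from villm_tok_fast import create_fast_tokenizer
        return create_fast_tokenizer(repo, add_code_switch_markers=markers)
    from transformers import AutoTokenizer
    tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
    tok.add_code_switch_markers = markers
    return tok


print("fast backend available:", fast_backend_available())
```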

Citation

@software{villm_tokenizer,
  author = {vlinhd11},
  title = {ViLLM-Tok: Vietnamese-English Code-Switching Tokenizer},
  year = {2025},
  url = {https://huggingface.co/vlinhd11/villm-tokenizer}
}
