ViLLM-Tok: Vietnamese-English Code-Switching Tokenizer

A hybrid word/subword tokenizer designed for Vietnamese-English code-switching, used in the viLLM project.

| Property | Value |
|---|---|
| Vocab size | 113,011 |
| Types | Word-level (VI syllables) + BPE (EN) + SentencePiece (EN subword) |
| Languages | Vietnamese, English, code-switching |
| Special tokens | [UNK], [BOS], [EOS], [PAD], [SEP], [MASK] |
| Code-switch markers | [VI→EN], [EN→VI] (optional) |

How it works

Tokenization has four phases:

  1. Language detection: each word is classified as VI, EN, Num, Code, or Punct
  2. Vietnamese Viterbi: dynamic programming over runs of Vietnamese syllables merges frequent bigrams into compound tokens (học_sinh, Việt_Nam)
  3. English SentencePiece: subword tokenization for English words that are not in the direct vocab
  4. Code-switch markers: optional [VI→EN] / [EN→VI] insertion at language boundaries
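Phase 2 can be illustrated with a small dynamic program. This is a minimal sketch, not the project's implementation: the compound table, its log-probability scores, the flat single-syllable cost, and the function name `viterbi_merge` are all assumptions made for the example.

```python
import math
import unicodedata


def _norm(s):
    # Normalize to NFC and lowercase so lookups ignore case and encoding form.
    return unicodedata.normalize("NFC", s).lower()


# Hypothetical bigram compounds with made-up log-probabilities.
# The real vocabulary and scores are not given in this card.
_RAW_COMPOUNDS = {
    ("học", "sinh"): -2.0,
    ("việt", "nam"): -1.5,
    ("xe", "máy"): -2.5,
}
COMPOUND_LOGP = {(_norm(a), _norm(b)): s for (a, b), s in _RAW_COMPOUNDS.items()}
SINGLE_LOGP = -4.0  # assumed flat cost for emitting one syllable on its own


def viterbi_merge(syllables):
    """Segment a syllable run into 1- or 2-syllable tokens with the best total score."""
    n = len(syllables)
    best = [-math.inf] * (n + 1)  # best[i]: best score for syllables[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last token
    best[0] = 0.0
    for i in range(1, n + 1):
        # Option 1: the last token is a single syllable.
        score = best[i - 1] + SINGLE_LOGP
        if score > best[i]:
            best[i], back[i] = score, i - 1
        # Option 2: the last token is a known two-syllable compound.
        if i >= 2:
            key = (_norm(syllables[i - 2]), _norm(syllables[i - 1]))
            if key in COMPOUND_LOGP:
                score = best[i - 2] + COMPOUND_LOGP[key]
                if score > best[i]:
                    best[i], back[i] = score, i - 2
    # Follow the back-pointers from the end to recover the tokens.
    tokens = []
    i = n
    while i > 0:
        j = back[i]
        tokens.append("_".join(syllables[j:i]))
        i = j
    return tokens[::-1]
```

With these assumed scores, `viterbi_merge("xe máy Việt Nam".split())` merges both pairs, while a run with no known compound is returned syllable by syllable.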

Usage

Slow path (pure Python, no extra packages)

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "vlinhd11/villm-tokenizer",
    trust_remote_code=True,
)
enc = tok("Học sinh giỏi tiếng Việt và học lập trình Python")
# input_ids: [88220, 82593, 83224, 83597, 94470, 84066, 94455, ...]
# tokens: ['Học_sinh', 'giỏi', 'tiếng_Việt', 'và_học', 'lập_trình', '▁Python']

Fast path (Rust backend, ~4x faster)

Install the Rust-backed fast tokenizer:

pip install villm-tok-fast

Then build the tokenizer with create_fast_tokenizer, which exposes HF's PreTrainedTokenizerFast interface:

from villm_tok_fast import create_fast_tokenizer

hf = create_fast_tokenizer(
    "vlinhd11/villm-tokenizer",  # or local path
    add_code_switch_markers=True,
)
enc = hf("Học sinh giỏi tiếng Việt và học lập trình Python")

The fast path:

  • Embeds SentencePiece via a trie for English subword (no Python sentencepiece dependency)
  • Runs the full pipeline in Rust: language detection → Viterbi VI compounds → SP subword → byte fallback
  • Supports batch encoding, padding, truncation through HF's PreTrainedTokenizerFast interface
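To give a feel for the trie lookup and byte fallback mentioned above, here is a rough pure-Python sketch. It is not the Rust code: the `Trie` class, the `greedy_subwords` function, and the toy vocabulary are hypothetical, and greedy longest-match is a simplification (a SentencePiece unigram model picks the segmentation by score rather than by length). The `▁` word-start prefix and the `<0xNN>` byte-token spelling follow SentencePiece conventions.

```python
class Trie:
    """Character trie over subword pieces, stored as nested dicts."""

    def __init__(self, pieces):
        self.root = {}
        for piece in pieces:
            node = self.root
            for ch in piece:
                node = node.setdefault(ch, {})
            node[None] = True  # None key marks the end of a piece

    def prefixes(self, text, start):
        """Yield the end offset of every piece that matches text at `start`."""
        node = self.root
        for i in range(start, len(text)):
            node = node.get(text[i])
            if node is None:
                return
            if None in node:
                yield i + 1


def greedy_subwords(word, trie):
    """Split one word into pieces by longest match, falling back to UTF-8 bytes."""
    text = "▁" + word  # SentencePiece-style word-start marker
    out, pos = [], 0
    while pos < len(text):
        ends = list(trie.prefixes(text, pos))
        if ends:
            end = ends[-1]  # offsets are yielded in increasing order
            out.append(text[pos:end])
            pos = end
        else:
            # No piece matches here: emit the character as byte tokens.
            out.extend(f"<0x{b:02X}>" for b in text[pos].encode("utf-8"))
            pos += 1
    return out
```

For example, with the toy vocabulary `["▁Py", "▁Python", "▁pro", "gram", "ming"]`, the word `Python` comes back as the single piece `▁Python` because the longer match wins over `▁Py`.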

Batch encoding

# Slow path
batch = tok(["Câu thứ nhất", "Câu thứ hai"], padding=True, truncation=True)

# Fast path
batch = hf(["Câu thứ nhất", "Câu thứ hai"], padding=True, truncation=True)

Disable code-switch markers

# Slow
tok = AutoTokenizer.from_pretrained("vlinhd11/villm-tokenizer", trust_remote_code=True)
tok.add_code_switch_markers = False

# Fast
hf = create_fast_tokenizer("vlinhd11/villm-tokenizer", add_code_switch_markers=False)

Tokenization examples

| Input | Tokens |
|---|---|
| Học sinh | ['Học_sinh'] |
| xe máy Việt Nam | ['xe_máy', 'Việt_Nam'] |
| Python programming | ['▁Python', '▁programming'] |
| Học sinh giỏi tiếng Việt... | ['Học_sinh', 'giỏi', 'tiếng_Việt', 'và_học', 'lập_trình', '[VI→EN]', '▁Python'] |
| Thủ tướng Phạm Minh Chính | ['Thủ_tướng', 'Phạm', 'Minh', 'Chính'] |
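The [VI→EN] marker in the fourth example can be reproduced with a simple pass over language-labelled tokens. This is only a sketch of the idea: the `insert_markers` function, the `(token, label)` input format, and the choice to treat every non-VI/EN label as neutral are assumptions, not the tokenizer's actual interface.

```python
VI_TO_EN = "[VI→EN]"
EN_TO_VI = "[EN→VI]"


def insert_markers(labelled_tokens):
    """Insert a marker wherever the language flips between VI and EN.

    `labelled_tokens` is a list of (token, label) pairs. Labels other than
    "VI" and "EN" (numbers, punctuation, code) are assumed not to open or
    close a language span.
    """
    out = []
    prev = None  # last VI/EN label seen so far
    for token, label in labelled_tokens:
        if label in ("VI", "EN"):
            if prev == "VI" and label == "EN":
                out.append(VI_TO_EN)
            elif prev == "EN" and label == "VI":
                out.append(EN_TO_VI)
            prev = label
        out.append(token)
    return out


labelled = [
    ("Học_sinh", "VI"), ("giỏi", "VI"), ("tiếng_Việt", "VI"),
    ("và_học", "VI"), ("lập_trình", "VI"), ("▁Python", "EN"),
]
print(insert_markers(labelled))
```

Feeding it the tokens of that example, all labelled VI except the final `▁Python`, places a single `[VI→EN]` marker immediately before `▁Python`.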

Performance

| Backend | Texts/sec | Speedup vs pure Python |
|---|---|---|
| Pure Python (slow) | ~14k | 1x |
| Rust batch (fast) | ~55k | ~3.9x |
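To compare the two backends on your own data, a small timing helper along these lines may be enough. It is a generic sketch, not the harness behind the figures above: the `texts_per_second` name, the best-of-N strategy, and the default repeat count are assumptions.

```python
import time


def texts_per_second(encode_batch, texts, repeats=5):
    """Return the best throughput seen when calling encode_batch(texts) `repeats` times."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        encode_batch(texts)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading from the timer on trivially fast calls.
    return len(texts) / max(best, 1e-9)
```

With the objects from the Usage section, something like `texts_per_second(lambda batch: hf(batch, padding=True, truncation=True), texts)` measures the fast path, and passing `tok` instead measures the slow path. Results depend on hardware, text length, and batch size.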

Notes

  • save_pretrained() works for the slow path. The fast path wraps a custom tokenizer object in PreTrainedTokenizerFast(tokenizer_object=...), and that object cannot be serialized to tokenizer.json without losing its custom logic. To reload the fast tokenizer, re-create it from the same base directory instead of saving it.
  • The fast backend requires the villm-tok-fast Python package (Rust native extension, Windows x64).

Citation

@software{villm_tokenizer,
  author = {vlinhd11},
  title = {ViLLM-Tok: Vietnamese-English Code-Switching Tokenizer},
  year = {2025},
  url = {https://huggingface.co/vlinhd11/villm-tokenizer}
}