Fix tokenizer: EOS bug + decode skip_special_tokens=True empty string

#1
by kashif HF Staff - opened
Hugging Face Biology Research org

Tokenizer bug fixes

Bug 1: EOS appended when add_special_tokens=True

encode(add_special_tokens=True) was appending an EOS token, which breaks lighteval's tok_encode_pair invariant. Qwen3 doesn't add BOS/EOS either, so the EOS append has been removed.
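
To illustrate why a trailing EOS is a problem, here is a minimal sketch. It assumes the invariant in question is the usual prefix property: the encoding of the context alone must be a prefix of the encoding of context + continuation, so the continuation ids can be sliced off by length. The toy_encode function and EOS_ID are hypothetical stand-ins, not this repo's tokenizer.

```python
EOS_ID = 0  # hypothetical EOS id; ord() of the letters used below is never 0


def toy_encode(text, append_eos):
    """Toy character-level encoder (illustration only, not the real tokenizer)."""
    ids = [ord(ch) for ch in text]
    if append_eos:
        ids.append(EOS_ID)
    return ids


def is_prefix(prefix, full):
    """Return True if `prefix` matches the leading ids of `full`."""
    return full[:len(prefix)] == prefix


context, continuation = "ACGT", "TTGA"

# Old behaviour: EOS lands at the end of the context encoding, so the context
# ids are no longer a prefix of the (context + continuation) ids.
buggy_ctx = toy_encode(context, append_eos=True)
buggy_whole = toy_encode(context + continuation, append_eos=True)
prefix_holds_buggy = is_prefix(buggy_ctx, buggy_whole)

# Fixed behaviour: no EOS, so slicing by the context length recovers
# exactly the continuation ids.
fixed_ctx = toy_encode(context, append_eos=False)
fixed_whole = toy_encode(context + continuation, append_eos=False)
prefix_holds_fixed = is_prefix(fixed_ctx, fixed_whole)
continuation_ids = fixed_whole[len(fixed_ctx):]
```

With the EOS append removed, continuation_ids equals the encoding of the continuation on its own, which is what a pair-encoding helper relies on.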

Bug 2: decode(skip_special_tokens=True) returns empty string for pure-DNA generations

In the common generation scenario, <dna> is in the prompt and only the k-mer tokens plus </dna> are in the generated portion being decoded. The elif tid in dna_id_to_token branch treated all DNA-vocab tokens (including k-mer content) as special tokens and dropped them when skip_special_tokens=True, so decode returned an empty string instead of the DNA sequence.

Fix: only skip actual DNA special tokens (<dna>, </dna>, <oov>); always decode k-mer content tokens.
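
The difference between the old and new behaviour can be sketched as follows. This is a simplified illustration, not the code from this PR: the ids and k-mers in dna_id_to_token are made up, and decode_dna_buggy / decode_dna are hypothetical helper names that only handle DNA-vocab ids.

```python
# The three DNA special tokens named in the fix.
DNA_SPECIAL_TOKENS = {"<dna>", "</dna>", "<oov>"}

# Hypothetical id -> token map; the real vocabulary and ids will differ.
dna_id_to_token = {
    100: "<dna>",
    101: "</dna>",
    102: "<oov>",
    103: "ACG",
    104: "TTA",
}


def decode_dna_buggy(ids, skip_special_tokens):
    """Old behaviour: every DNA-vocab id is treated as special."""
    pieces = []
    for tid in ids:
        if tid in dna_id_to_token:
            if skip_special_tokens:
                continue  # drops k-mer content as well as the tags
            pieces.append(dna_id_to_token[tid])
    return "".join(pieces)


def decode_dna(ids, skip_special_tokens):
    """Fixed behaviour: only the DNA special tokens are skipped."""
    pieces = []
    for tid in ids:
        token = dna_id_to_token[tid]
        if skip_special_tokens and token in DNA_SPECIAL_TOKENS:
            continue
        pieces.append(token)  # k-mer content is always decoded
    return "".join(pieces)


# Generated portion only: <dna> was in the prompt, so it is not in these ids.
generated = [103, 104, 101]
old_text = decode_dna_buggy(generated, skip_special_tokens=True)
new_text = decode_dna(generated, skip_special_tokens=True)
```

In this toy setup old_text is empty, while new_text contains the concatenated k-mers with the closing tag stripped.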

Also: auto_dna_tags parameter added (default False)

Allows raw DNA strings to be automatically wrapped in <dna>...</dna> for k-mer tokenization. Default is False to preserve existing behaviour (metadata BPE tokens must not be auto-wrapped).
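
A rough sketch of the opt-in wrapping, under stated assumptions: the PR does not spell out how a raw DNA string is detected, so the nucleotide-only regex below is a guess, and maybe_wrap_dna is a hypothetical helper rather than the tokenizer's actual method.

```python
import re

# Assumption: "raw DNA" means a string made only of nucleotide letters.
_RAW_DNA = re.compile(r"[ACGTNacgtn]+")


def maybe_wrap_dna(text, auto_dna_tags=False):
    """Wrap raw DNA in <dna>...</dna> only when the caller opts in."""
    if auto_dna_tags and _RAW_DNA.fullmatch(text):
        return "<dna>" + text + "</dna>"
    # Default path: text (for example metadata) is passed through untouched.
    return text


default_out = maybe_wrap_dna("ACGT")
wrapped_out = maybe_wrap_dna("ACGT", auto_dna_tags=True)
metadata_out = maybe_wrap_dna("species: human", auto_dna_tags=True)
```

Keeping the default at False means callers that pass metadata strings see no change unless they explicitly enable the flag.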


LGTM!

kashif changed pull request status to open

Merging tokenizer fixes: EOS append bug, decode skip_special_tokens=True empty string, auto_dna_tags support.

kashif changed pull request status to merged
