Fix tokenizer: EOS bug + decode skip_special_tokens=True empty string

by kashif HF Staff - opened 9 days ago

base: refs/heads/main

←

from: refs/pr/1


+54

-20

kashif

Hugging Face Biology Research org 9 days ago

Tokenizer bug fixes

Bug 1: EOS appended when `add_special_tokens=True`

`encode(add_special_tokens=True)` was appending an EOS token, which breaks lighteval's `tok_encode_pair` invariant. Qwen3 does not add BOS/EOS either, so the EOS append is removed.
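To make the failure mode concrete, here is a minimal sketch of the kind of prefix slicing a pair encoder typically relies on. It is not lighteval's code: `toy_encode`, `split_pair`, `is_prefix`, and `EOS_ID` are hypothetical names, the tokenizer is a character-level stand-in, and the invariant is assumed to be that the encoding of the context is a prefix of the encoding of context plus continuation.

```python
EOS_ID = 0  # hypothetical EOS id for this toy example


def toy_encode(text, append_eos):
    """Character-level stand-in tokenizer: one id per character."""
    ids = [ord(ch) for ch in text]
    if append_eos:
        ids.append(EOS_ID)  # the behaviour this PR removes
    return ids


def split_pair(context, continuation, append_eos):
    """Encode context and context+continuation, then slice off the continuation."""
    whole = toy_encode(context + continuation, append_eos)
    ctx = toy_encode(context, append_eos)
    return ctx, whole[len(ctx):]


def is_prefix(prefix, seq):
    """True if `seq` starts with `prefix`."""
    return seq[:len(prefix)] == prefix


# Without EOS the slice recovers exactly the continuation tokens.
ctx, cont = split_pair("AC", "GT", append_eos=False)
print(cont == toy_encode("GT", append_eos=False))

# With EOS the context encoding is one token too long, so the slice
# loses the first continuation token and picks up a trailing EOS.
ctx_eos, cont_eos = split_pair("AC", "GT", append_eos=True)
print(is_prefix(ctx_eos, toy_encode("ACGT", append_eos=True)))
```

In this toy setup the EOS variant shifts the slice boundary by one, which is one plausible way an appended EOS could break a context/continuation split.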

Bug 2: `decode(skip_special_tokens=True)` returns empty string for pure-DNA generations

In the common generation scenario, `<dna>` is in the prompt and only k-mer tokens plus `</dna>` are in the generated portion being decoded. The `elif tid in dna_id_to_token` branch was treating all DNA-vocab tokens (including k-mer content) as special tokens and dropping them when `skip_special_tokens=True`, so `decode` returned an empty string instead of the DNA sequence.

Fix: only skip actual DNA special tokens (<dna>, </dna>, <oov>); always decode k-mer content tokens.
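The difference between the old and new behaviour can be sketched as follows. This is an illustration under assumptions, not the code in this PR: only the name `dna_id_to_token` comes from the description above, while the ids, the k-mers, `DNA_SPECIAL_TOKENS`, and both decode functions are hypothetical.

```python
# Tokens the fix treats as special, per the description above.
DNA_SPECIAL_TOKENS = {"<dna>", "</dna>", "<oov>"}

# Hypothetical toy DNA vocabulary: three special tokens and two 3-mers.
dna_id_to_token = {
    100: "<dna>",
    101: "</dna>",
    102: "<oov>",
    103: "ACG",
    104: "TTA",
}


def decode_buggy(ids, skip_special_tokens=False):
    """Old behaviour: every DNA-vocab id is treated as special."""
    out = []
    for tid in ids:
        if tid in dna_id_to_token:
            if skip_special_tokens:
                continue  # bug: k-mer content is dropped as well
            out.append(dna_id_to_token[tid])
    return "".join(out)


def decode_fixed(ids, skip_special_tokens=False):
    """New behaviour: skip only the DNA special tokens, keep k-mers."""
    out = []
    for tid in ids:
        if tid in dna_id_to_token:
            token = dna_id_to_token[tid]
            if skip_special_tokens and token in DNA_SPECIAL_TOKENS:
                continue
            out.append(token)
    return "".join(out)


# Generated portion only: k-mers followed by the closing tag.
generated = [103, 104, 101]
print(repr(decode_buggy(generated, skip_special_tokens=True)))
print(repr(decode_fixed(generated, skip_special_tokens=True)))
```

With the toy vocabulary, the buggy path yields an empty string for a pure-DNA generation while the fixed path returns the concatenated k-mers.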

Also: `auto_dna_tags` parameter added (default `False`)

When enabled, raw DNA strings are automatically wrapped in `<dna>...</dna>` so that they are k-mer tokenized. The default is `False` to preserve existing behaviour (metadata BPE tokens must not be auto-wrapped).
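A rough sketch of what such a flag might do is shown below. How the tokenizer actually decides that a string is raw DNA is not stated in this thread, so the `[ACGTN]+` check and the `prepare_text` helper are assumptions made purely for illustration.

```python
import re

# Assumed rule for "raw DNA": the whole string is made of A/C/G/T/N.
# The real tokenizer may use a different criterion.
RAW_DNA = re.compile(r"[ACGTN]+")


def prepare_text(text, auto_dna_tags=False):
    """Wrap raw DNA in <dna>...</dna> only when auto_dna_tags is enabled."""
    if auto_dna_tags and RAW_DNA.fullmatch(text):
        return f"<dna>{text}</dna>"
    return text  # default path: input is left untouched


print(prepare_text("ACGT"))
print(prepare_text("ACGT", auto_dna_tags=True))
print(prepare_text("species: human", auto_dna_tags=True))
```

Keeping the default at `False` means existing callers see no change in how their inputs are tokenized, which matches the stated goal of preserving current behaviour.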

tokenizer: fix EOS append bug and decode skip_special_tokens=True bug (fc796726)

tokenizer: add auto_dna_tags to dna_config.json (437e6757)

tokenizer: fix auto_dna_tags None -> False in tokenizer_config.json (e3cb1186)

loubnabnl

Hugging Face Biology Research org 9 days ago

LGTM!

kashif changed pull request status to open 9 days ago

kashif

Hugging Face Biology Research org 8 days ago

Merging tokenizer fixes: EOS append bug, decode skip_special_tokens=True empty string, auto_dna_tags support.

kashif changed pull request status to merged 8 days ago
