Fix: accept add_special_tokens in encode() for HuggingFace API compatibility

#112

Problem

The encode() method currently routes any call with unrecognized **kwargs through super().encode() and emits a logger.warning(). Because add_special_tokens is a standard parameter of HuggingFace's PreTrainedTokenizer.encode(), many tools and libraries call tokenizer.encode(text, add_special_tokens=False) and unknowingly trigger this slow path on every invocation.

For large inputs (e.g. 20,000-token prompts), this causes severe overhead:

  1. logger.warning() is called for every single encode call, flooding logs
  2. The fallback super().encode() is significantly slower than the native tiktoken path

In practice, this caused benchmark tooling to spend 8+ hours on post-processing for a 200-request run, due to the logging and slow-path overhead accumulating across thousands of tokenizer calls.

Fix

Explicitly declare add_special_tokens as a named parameter in the encode() signature. Kimi's tokenizer does not prepend or append BOS/EOS tokens regardless of this flag (verified: results are identical with add_special_tokens=True and False), so the parameter is accepted but intentionally unused. This is consistent with how many tiktoken-based tokenizers handle this standard HuggingFace kwarg.
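The pattern can be sketched with a toy stand-in. This is not the actual Kimi tokenizer code: SlowBaseTokenizer, FastTokenizer, _fast_encode, the call counters, the logger name, and the default value of add_special_tokens are all hypothetical and exist only to show how a named parameter keeps the common call on the fast path while other kwargs still fall back.

```python
import logging

logger = logging.getLogger("toy_tokenizer")  # hypothetical logger name


class SlowBaseTokenizer:
    """Hypothetical stand-in for the generic (slow) base-class encode path."""

    def encode(self, text, **kwargs):
        # Toy encoding: one integer per whitespace-separated word.
        return [len(word) for word in text.split()]


class FastTokenizer(SlowBaseTokenizer):
    """Hypothetical stand-in for a tokenizer with a native fast path."""

    def __init__(self):
        self.fast_calls = 0
        self.slow_calls = 0

    def _fast_encode(self, text):
        self.fast_calls += 1
        return [len(word) for word in text.split()]

    def encode(self, text, add_special_tokens=True, **kwargs):
        # add_special_tokens is declared by name so it is no longer swept
        # into **kwargs. It is intentionally unused: this toy tokenizer,
        # like the one described above, adds no BOS/EOS tokens either way.
        # The default of True is an assumption for this sketch.
        if kwargs:
            # Genuinely unsupported kwargs keep the warning and fallback.
            logger.warning(
                "Unsupported kwargs %s; falling back to slow path",
                sorted(kwargs),
            )
            self.slow_calls += 1
            return super().encode(text, **kwargs)
        return self._fast_encode(text)
```

With this shape, tok.encode(text, add_special_tokens=False) never reaches the warning branch, while a call such as tok.encode(text, some_other_flag=True) still does.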

Verification

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("moonshotai/Kimi-K2.5", trust_remote_code=True)

tok.encode("hello world")                           # [19180, 2695]
tok.encode("hello world", add_special_tokens=False) # [19180, 2695] ✓ — no warning, no slow path
tok.encode("hello world", add_special_tokens=True)  # [19180, 2695] ✓

Any other unexpected kwargs still fall through to super().encode() with the existing warning, preserving current behavior for genuinely unsupported parameters.
