WhisperTokenizer Arbitrary JSON File Read via Crafted .keras Archive

Vulnerability: keras_hub.models.WhisperTokenizer reads an attacker-controlled JSON file path from config.json during model deserialization with no safe mode guard.

Impact: Loading a crafted .keras file reads any JSON-formatted file accessible to the process and exposes its content in model.language_tokens. safe_mode=True does not protect this path.

CVE status: No CVE assigned. Distinct from CVE-2025-12058 and PR #2517. PR #2517 patched BytePairTokenizer but WhisperTokenizer's _load_dict() helper bypasses that guard.

Root cause

keras_hub/src/models/whisper/whisper_tokenizer.py:

def _load_dict(dict_or_path):
    if isinstance(dict_or_path, str):
        with open(dict_or_path, "r", encoding="utf-8") as f:  # no in_safe_mode() check
            dict_or_path = json.load(f)
    return dict_or_path

Called from __init__():

if language_tokens is not None:
    language_tokens = _load_dict(language_tokens)  # fires if string path
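
The effect of the unguarded helper can be seen in isolation. The sketch below copies _load_dict() as quoted above and calls it on a temporary file that stands in for a sensitive JSON file; the file name, its contents, and the token mapping are made up for illustration.

```python
import json
import os
import tempfile


def _load_dict(dict_or_path):
    # Same logic as the helper quoted above: any string is treated as a path.
    if isinstance(dict_or_path, str):
        with open(dict_or_path, "r", encoding="utf-8") as f:
            dict_or_path = json.load(f)
    return dict_or_path


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in for a credentials file on the victim's disk.
    secret_path = os.path.join(tmp, "secret.json")
    with open(secret_path, "w", encoding="utf-8") as f:
        json.dump({"type": "service_account", "project_id": "victim-proj"}, f)

    # A dict passes through unchanged; a string is opened and parsed.
    passthrough = _load_dict({"<|en|>": 0})
    leaked = _load_dict(secret_path)

print("dict input:", passthrough)
print("path input:", leaked)
```

Nothing in the helper distinguishes a path supplied by the user in code from a path that arrived through a deserialized config, which is why the guard has to live at this level.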

Compare with the correctly patched BytePairTokenizer (PR #2517):

if isinstance(vocabulary, str):
    if serialization_lib.in_safe_mode():
        raise ValueError("Requested loading a vocabulary file outside model archive...")
    with open(vocabulary, "r", encoding="utf-8") as f:
        ...

Additional bypass: vocabulary parameter

WhisperTokenizer.set_vocabulary_and_merges() also calls _load_dict(vocabulary) before passing the result to the parent class. By the time BytePairTokenizer sees the value, the path string has already been converted to a dict, so the parent's in_safe_mode() check, which only triggers on a string, never runs.
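
The ordering problem can be sketched with two stand-in classes. ParentTokenizer and ChildTokenizer are hypothetical simplifications of BytePairTokenizer and WhisperTokenizer, and the SAFE_MODE constant stands in for serialization_lib.in_safe_mode(); none of this is the library's actual code.

```python
import json
import os
import tempfile

SAFE_MODE = True  # hypothetical stand-in for serialization_lib.in_safe_mode()


def _load_dict(dict_or_path):
    if isinstance(dict_or_path, str):
        with open(dict_or_path, "r", encoding="utf-8") as f:
            dict_or_path = json.load(f)
    return dict_or_path


class ParentTokenizer:
    """Stand-in for the patched parent: guards string inputs only."""

    def set_vocabulary_and_merges(self, vocabulary, merges):
        if isinstance(vocabulary, str):
            if SAFE_MODE:
                raise ValueError("Requested loading a vocabulary file outside model archive")
            with open(vocabulary, "r", encoding="utf-8") as f:
                vocabulary = json.load(f)
        self.vocabulary = vocabulary


class ChildTokenizer(ParentTokenizer):
    """Stand-in for the subclass: resolves the path before delegating."""

    def set_vocabulary_and_merges(self, vocabulary, merges):
        vocabulary = _load_dict(vocabulary)  # the path becomes a dict here
        super().set_vocabulary_and_merges(vocabulary, merges)  # guard sees a dict


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "target.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"private_key_id": "abc123"}, f)

    parent_blocked = False
    try:
        ParentTokenizer().set_vocabulary_and_merges(path, merges=[])
    except ValueError:
        parent_blocked = True

    child = ChildTokenizer()
    child.set_vocabulary_and_merges(path, merges=[])

print("parent blocked:", parent_blocked)
print("child vocabulary:", child.vocabulary)
```

The parent rejects the path, while the child reads the same file with safe mode still on, because the type check in the parent is the only thing standing between the config value and open().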

Affected versions

Live-confirmed on keras-hub 0.25.1 (PyPI). Later releases remain affected until the helper is guarded, and earlier releases are probably affected as well, since WhisperTokenizer has been present since the initial keras-hub releases.

Reproduction

Requirements:

pip install keras==3.12.1 keras-hub tensorflow

Step 1: Create the target file (simulates a GCP service account key, Docker config, or other JSON credentials):

echo '{"type":"service_account","project_id":"victim-proj","private_key_id":"abc123"}' > /tmp/whisper_poc_target.json

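Step 2: Load the crafted archive (malicious_whisper.keras, whose config.json sets language_tokens to the path from Step 1):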

import sys
from unittest.mock import MagicMock
sys.modules.setdefault("tensorflow_text", MagicMock())

import keras
import keras_hub  # required: registers keras_hub>WhisperTokenizer

model = keras.models.load_model("malicious_whisper.keras", safe_mode=True)
print("model.language_tokens:", repr(model.language_tokens))
# Prints the parsed JSON content of /tmp/whisper_poc_target.json
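
Because a .keras file is a zip archive with a config.json member, the injected path can be spotted without deserializing the model. The following is a minimal sketch of such a check, not part of the PoC: the function names are hypothetical, the key names are the two parameters discussed in this report, and the nesting of the fake config is invented to exercise the walk.

```python
import io
import json
import zipfile

SUSPECT_KEYS = {"language_tokens", "vocabulary"}  # parameters named in this report


def find_path_strings(node, trail=""):
    """Yield (location, value) for suspect keys whose value is a string."""
    if isinstance(node, dict):
        for key, value in node.items():
            here = f"{trail}/{key}"
            if key in SUSPECT_KEYS and isinstance(value, str):
                yield here, value
            yield from find_path_strings(value, here)
    elif isinstance(node, list):
        for i, item in enumerate(node):
            yield from find_path_strings(item, f"{trail}[{i}]")


def scan_keras_archive(archive):
    """Return suspect entries from config.json inside a .keras zip archive."""
    with zipfile.ZipFile(archive) as zf:
        config = json.loads(zf.read("config.json"))
    return list(find_path_strings(config))


# Build a tiny in-memory archive with a made-up config layout.
fake_config = {
    "class_name": "Functional",
    "config": {
        "layers": [{"config": {"language_tokens": "/tmp/whisper_poc_target.json"}}]
    },
}
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("config.json", json.dumps(fake_config))
buffer.seek(0)

findings = scan_keras_archive(buffer)
print(findings)
```

A string under either key is a reason to inspect the archive by hand before loading it; a dict in the same position is not a filesystem path.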

Full self-contained PoC

See poc_whisper_file_read.py in this repo.

Suggested fix

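Add an in_safe_mode() check to _load_dict(). Since set_vocabulary_and_merges() resolves its vocabulary argument through the same helper, guarding _load_dict() covers that path too: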

import json

from keras.src.saving import serialization_lib  # Keras module that defines in_safe_mode()

def _load_dict(dict_or_path):
    if isinstance(dict_or_path, str):
        if serialization_lib.in_safe_mode():
            raise ValueError(
                "Requested loading a file outside the model archive. "
                "Pass safe_mode=False if you trust the source."
            )
        with open(dict_or_path, "r", encoding="utf-8") as f:
            dict_or_path = json.load(f)
    return dict_or_path
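
The guarded helper can be checked without installing Keras by swapping in a stub for the safe-mode query. In this sketch the module-level flag and the in_safe_mode() and try_load() functions are hypothetical stand-ins; only the body of _load_dict() mirrors the fix suggested above.

```python
import json
import os
import tempfile

_safe_mode = True  # hypothetical stand-in for Keras' safe-mode state


def in_safe_mode():
    return _safe_mode


def _load_dict(dict_or_path):
    if isinstance(dict_or_path, str):
        if in_safe_mode():
            raise ValueError(
                "Requested loading a file outside the model archive. "
                "Pass safe_mode=False if you trust the source."
            )
        with open(dict_or_path, "r", encoding="utf-8") as f:
            dict_or_path = json.load(f)
    return dict_or_path


def try_load(value, safe):
    """Run _load_dict under the given safe-mode setting; return result or error."""
    global _safe_mode
    _safe_mode = safe
    try:
        return _load_dict(value)
    except ValueError as err:
        return err


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "target.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"project_id": "victim-proj"}, f)

    blocked = try_load(path, safe=True)            # path in safe mode: rejected
    allowed = try_load(path, safe=False)           # path with safe mode off: read
    inline = try_load({"<|en|>": 0}, safe=True)    # dict input: unaffected

print(type(blocked).__name__, allowed, inline)
```

The three cases correspond to the behavior the fix should preserve: paths are refused under safe mode, explicit opt-out still works, and dict values stored in the archive load as before.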