Gibberish output on Apple Silicon (MPS and CPU) — possible code/transformers incompatibility

by Inrego - opened May 15

Summary

On an Apple Silicon (M4, macOS) machine, syvai/hviske-v5.3 emits fluent-looking but completely nonsensical multilingual token-salad on every audio chunk, regardless of input duration. The same issue affects syvai/hviske-v5.2. The older syvai/hviske-v5 works fine on the same machine with transformers==4.57.

Example output

"MusελΕ Indian circum circum circum chooseervesatt qual orderokedowane τ confer favorevoli abandonokrat ..."

Different chunks produce different gibberish, but every output has the same multilingual character (mixing Greek, Italian, Korean, CJK, etc.) and starts with a similar prefix regardless of input duration (0.22 s to 15 s). This suggests the encoder is producing degenerate features rather than the decoder getting stuck in a loop.
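
For anyone who wants to flag this kind of output automatically across many chunks, one rough heuristic is to count the Unicode scripts present in each transcription. The sketch below is only an illustration: `scripts_used` is a hypothetical helper (not part of the model or of transformers), and it approximates a character's script by the first word of its Unicode name.

```python
import unicodedata


def scripts_used(text):
    """Return the set of scripts (LATIN, GREEK, HANGUL, CJK, ...) seen in text.

    Approximation: the script is taken to be the first word of each
    alphabetic character's Unicode name.
    """
    scripts = set()
    for ch in text:
        if not ch.isalpha():
            continue  # skip spaces, digits, punctuation
        name = unicodedata.name(ch, "")
        if name:
            scripts.add(name.split()[0])
    return scripts


# Start of the example output quoted above: Latin mixed with Greek.
print(scripts_used("Mus\u03b5\u03bb\u0395 Indian circum"))
```

A clean Danish transcription should contain only LATIN (æ, ø and å are Latin letters), so more than one script in the result is a cheap signal that a chunk has gone wrong.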

Reproduction

Using the model card's own example verbatim:

import torch, librosa
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

audio, _ = librosa.load("any_danish_audio.mp3", sr=16000, mono=True)
chunk = audio[:16000 * 10]

processor = AutoProcessor.from_pretrained("syvai/hviske-v5.3", trust_remote_code=True)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "syvai/hviske-v5.3", trust_remote_code=True, dtype=torch.bfloat16
).to("mps").eval()

with torch.no_grad():
    out = model.transcribe(
        processor=processor,
        audio_arrays=[chunk],
        sample_rates=[16000],
        language="da",
        punctuation=True,
    )
print(out[0])  # → multilingual gibberish

Environment

Hardware: Mac M4, 16 GB unified memory
OS: macOS (Darwin 25.1.0)
Python: 3.14
PyTorch: 2.11.0
transformers: tested 5.4.0 and 5.8.1
All deps installed via uv in a clean venv

What we've ruled out

Variable	Tested	Result
Device: MPS bf16	✓	gibberish
Device: MPS fp32	✓	gibberish
Device: CPU fp32	✓	gibberish
transformers 5.4.0 (model-card floor)	✓	gibberish
transformers 5.8.1 (latest)	✓	gibberish
Model revision (initial upload `17aa942` and `main`)	✓	gibberish
Model variant (v5.2, v5.3 — same Conformer arch)	✓	gibberish

Because CPU + fp32 produces the same garbage, MPS bugs, memory pressure, and dtype precision issues can all be ruled out at once. Whatever is going wrong also happens in plain CPU fp32 PyTorch, which is consistent with the encoder producing degenerate features.

Possible smoking gun

On every call, transformers/audio_utils.py emits:

UserWarning: At least one mel filter has all zero values. num_mel_filters (128) may be set too high, or num_frequency_bins (257) too low

(The 257 frequency bins come from n_fft=512 in preprocessor_config.json (512 / 2 + 1 = 257), which matches the CohereAsr defaults, so this may be a benign warning, but it is worth confirming.)
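
Whether the warning is expected from the filter-bank geometry alone can be checked without loading the model. The sketch below builds a plain triangular mel filter bank in NumPy and counts filters that never overlap an FFT bin. It rests on assumptions that are not confirmed by the model's config: the HTK mel formula, a 0 to 8000 Hz range, and 16 kHz audio. The function names are hypothetical and this is not the transformers implementation.

```python
import numpy as np


def hz_to_mel(f):
    # HTK mel formula (assumption; the real feature extractor may use another scale)
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=np.float64) / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=np.float64) / 2595.0) - 1.0)


def triangular_mel_filters(n_fft, n_mels, sr, fmin=0.0, fmax=None):
    """Return a (n_fft // 2 + 1, n_mels) matrix of triangular filters."""
    fmax = sr / 2 if fmax is None else fmax
    bin_freqs = np.linspace(0.0, sr / 2, n_fft // 2 + 1)
    # n_mels + 2 edge frequencies, evenly spaced on the mel axis
    edges = mel_to_hz(np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2))
    lower, center, upper = edges[:-2], edges[1:-1], edges[2:]
    rising = (bin_freqs[:, None] - lower) / (center - lower)
    falling = (upper - bin_freqs[:, None]) / (upper - center)
    return np.clip(np.minimum(rising, falling), 0.0, None)


def count_empty_filters(filters):
    """Number of mel filters whose weight is zero at every FFT bin."""
    return int((filters.max(axis=0) == 0.0).sum())


filters = triangular_mel_filters(n_fft=512, n_mels=128, sr=16000)
print(filters.shape, count_empty_filters(filters))
```

Under these assumptions the lowest filters are narrower than the 31.25 Hz bin spacing (16000 / 512), so at least one of them falls between bins and ends up all zero. That would make the warning a property of the 128-filter / 512-point combination rather than evidence of a broken config, but it does not prove the warning is harmless for this model.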

Working configuration (for reference)

syvai/hviske-v5 produces clean Danish transcriptions on the same machine when pinned to transformers>=4.56,<5. v5.2/v5.3 are the broken ones, presumably because they use a different (Cohere-ASR Conformer) architecture with custom code in modeling_cohere_asr.py.

Question for the maintainers

Has v5.2 / v5.3 been tested on Apple Silicon, or is CUDA the only validated target? If it's expected to work on MPS/CPU, the consistent gibberish on every input (including pure CPU fp32) suggests either:

A bug in modeling_cohere_asr.py that only manifests outside CUDA, or
A config / preprocessor mismatch that happens to be hidden when CUDA-specific code paths are used.

Happy to test patches or provide more diagnostics.

Inrego

May 15

Reproduced on Linux + CUDA — root cause is `transformers` 5.x

Same gibberish symptom on a completely different platform, which rules out anything Apple-Silicon-specific:

Hardware: NVIDIA GTX 1070 (Pascal, 8 GB) — also reproduced on CPU
OS: Ubuntu 24.04
Python 3.12, PyTorch 2.6.0 + CUDA 12.4
transformers 5.4.0 and 5.8.1 → multilingual gibberish on all of: CUDA bf16, CUDA fp16, CPU fp32
transformers 4.57.6 → clean Danish output on both hviske-v5.3 and hviske-v5

Working call, unchanged from the README example; only the transformers pin differs:

import torch, librosa, numpy as np
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

processor = AutoProcessor.from_pretrained("syvai/hviske-v5.3", trust_remote_code=True)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "syvai/hviske-v5.3", trust_remote_code=True, dtype=torch.bfloat16,
).to("cuda").eval()

audio, sr = librosa.load("clip.mp3", sr=16000, mono=True)
hyp = model.transcribe(
    processor=processor, language="da",
    audio_arrays=[audio.astype(np.float32)], sample_rates=[sr],
)[0]
print(hyp)

On transformers==4.57.6: ~28× realtime on a GTX 1070 (bf16 emulated), ~5.2 GB VRAM peak, output is fluent Danish. On transformers==5.4.0/5.8.1: same code path, same audio, same weights → multilingual token-salad.
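
For comparing speed across setups, the realtime factor is simply audio duration divided by wall-clock processing time. A minimal timing sketch follows; `timed` and `realtime_factor` are hypothetical helpers, and the commented usage assumes the `model`, `processor`, `audio` and `sr` variables from the call above.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def realtime_factor(audio_seconds, wall_seconds):
    """Seconds of audio processed per second of wall-clock time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


# Usage with the call above (not executed here):
# out, elapsed = timed(model.transcribe, processor=processor, language="da",
#                      audio_arrays=[audio], sample_rates=[sr])
# print(realtime_factor(len(audio) / sr, elapsed))
```

On CUDA the first call usually includes one-off warm-up cost, so timing a second call gives a fairer number.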

One concrete difference I found between 4.x and 5.x

On transformers 5.x the fast tokenizer (TokenizersBackend) loads tokenizer_config.json with "split_special_tokens": true, which fragments every special token at encode time:

tok = processor.tokenizer  # tokenizer attached to the processor loaded above
tok.encode("<|da|>", add_special_tokens=False)
# transformers 5.x default:                      [13764, 315, 379, 3123, 379, 317]   ← 6 subword pieces
# transformers 5.x + split_special_tokens=False: [58]                                ← correct
# transformers 4.x (slow CohereAsrTokenizer):    [58]                                ← correct

…which means the decoder prompt the model sees on 5.x is < | d a | > instead of <|da|>, plus the same corruption for <|startoftranscript|>, <|pnc|>, etc.

Setting processor.tokenizer.split_special_tokens = False after loading does fix the tokenization, but on its own it does not fix the gibberish output on 5.x — there must be at least one more 5.x incompatibility downstream (encoder side, given the symptom changes with audio/dtype). I didn't bisect further once 4.57.6 worked.
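
To check every prompt token at once rather than one by one, the encode call above can be wrapped in a small loop. This is a minimal sketch: `fragmented_special_tokens` is a hypothetical helper, and it relies only on `tokenizer.encode(text, add_special_tokens=False)`, the same call used in the snippet above.

```python
def fragmented_special_tokens(tokenizer, special_tokens):
    """Return {token: ids} for each special token that does NOT map to one id."""
    broken = {}
    for token in special_tokens:
        ids = list(tokenizer.encode(token, add_special_tokens=False))
        if len(ids) != 1:
            broken[token] = ids
    return broken


# Usage with the processor from this thread (not executed here):
# prompt_tokens = ["<|startoftranscript|>", "<|da|>", "<|pnc|>"]
# print(fragmented_special_tokens(processor.tokenizer, prompt_tokens))
# processor.tokenizer.split_special_tokens = False  # workaround described above
# print(fragmented_special_tokens(processor.tokenizer, prompt_tokens))
```

An empty dict after applying the workaround confirms the tokenization is fixed. As noted above, that alone was not enough to restore clean output on 5.x.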

Suggestion

Either:

relax/correct the README's transformers>=5.4.0 to transformers>=4.56,<5 (matches what works for v5.0–v5.3), or
fix the remote-code path so the model works under 5.x — at minimum, set split_special_tokens=False in the shipped tokenizer_config.json, then track down the remaining encoder-side regression.
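
Until one of those lands, a script can fail fast when the installed transformers falls outside the range reported to work in this thread. A minimal sketch, assuming the `>=4.56,<5` range above; the helper names are hypothetical and only the major.minor part of the version string is compared.

```python
from importlib import metadata

# Range reported to work in this thread: >=4.56,<5
KNOWN_GOOD_MIN = (4, 56)
KNOWN_GOOD_MAX_EXCLUSIVE = (5, 0)


def in_known_good_range(version):
    """True if a version string such as '4.57.6' lies in >=4.56,<5."""
    major_minor = tuple(int(part) for part in version.split(".")[:2])
    return KNOWN_GOOD_MIN <= major_minor < KNOWN_GOOD_MAX_EXCLUSIVE


def check_installed(package="transformers"):
    """Return (version, ok), or None if the package is not installed."""
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
    return version, in_known_good_range(version)


print(check_installed())
```

With uv or pip, the equivalent pin is `transformers>=4.56,<5` in the project's dependency list.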

Inrego changed discussion status to closed May 15
