Gibberish output on Apple Silicon (MPS and CPU): possible code/transformers incompatibility
Summary
On an Apple Silicon (M4, macOS) machine, syvai/hviske-v5.3 emits fluent-looking but completely nonsensical multilingual token-salad on every audio chunk, regardless of input duration. The same issue affects syvai/hviske-v5.2. The older syvai/hviske-v5 works fine on the same machine with transformers==4.57.
Example output
"MusΡλΠIndian circum circum circum chooseervesatt qual orderokedowane Ο confer favorevoli abandonokrat ..."
Different chunks produce different gibberish but all share the same multilingual character (mixing Greek, Italian, Korean, CJK, etc.) and emit a similar prefix regardless of input duration (0.22 s to 15 s), suggesting the encoder is producing degenerate features rather than the decoder looping.
Reproduction
Using the model card's own example verbatim:
import torch, librosa
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

# Load the audio as 16 kHz mono and keep the first 10 seconds
audio, _ = librosa.load("any_danish_audio.mp3", sr=16000, mono=True)
chunk = audio[:16000 * 10]

processor = AutoProcessor.from_pretrained("syvai/hviske-v5.3", trust_remote_code=True)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "syvai/hviske-v5.3", trust_remote_code=True, dtype=torch.bfloat16
).to("mps").eval()

with torch.no_grad():
    out = model.transcribe(
        processor=processor,
        audio_arrays=[chunk],
        sample_rates=[16000],
        language="da",
        punctuation=True,
    )
print(out[0])  # -> multilingual gibberish
Environment
- Hardware: Mac M4, 16 GB unified memory
- OS: macOS (Darwin 25.1.0)
- Python: 3.14
- PyTorch: 2.11.0
- transformers: tested 5.4.0 and 5.8.1
- All deps installed via uv in a clean venv
What we've ruled out
| Variable | Tested | Result |
|---|---|---|
| Device: MPS bf16 | yes | gibberish |
| Device: MPS fp32 | yes | gibberish |
| Device: CPU fp32 | yes | gibberish |
| transformers 5.4.0 (model-card floor) | yes | gibberish |
| transformers 5.8.1 (latest) | yes | gibberish |
| Model revision (initial upload 17aa942 and main) | yes | gibberish |
| Model variant (v5.2, v5.3; same Conformer arch) | yes | gibberish |
The fact that CPU + fp32 produces the same garbage rules out MPS bugs, memory pressure, and dtype precision issues simultaneously. The encoder is producing degenerate features in pure CPU fp32 PyTorch.
Possible smoking gun
On every call, transformers/audio_utils.py emits:
UserWarning: At least one mel filter has all zero values. num_mel_filters (128) may be set too high, or num_frequency_bins (257) too low
(The 257 frequency bins come from n_fft=512 per preprocessor_config.json, which matches the CohereAsr defaults, so this may be a benign warning, but it is worth confirming.)
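The arithmetic behind that warning can be checked without loading the model. The sketch below counts mel filters whose triangle contains no FFT bin, assuming an HTK mel scale spanning 0 Hz to Nyquist at 16 kHz. The mel scale and frequency range actually used by the model's feature extractor are not confirmed here, and `count_empty_mel_filters` is a hypothetical helper, not a transformers API.

```python
import math


def hz_to_mel_htk(freq_hz):
    """HTK mel scale (an assumption; the model may use a different scale)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz_htk(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def count_empty_mel_filters(n_fft, num_mel_filters, sampling_rate,
                            f_min=0.0, f_max=None):
    """Count triangular mel filters that have no FFT bin strictly inside them."""
    if f_max is None:
        f_max = sampling_rate / 2.0  # assumed upper edge: Nyquist
    num_bins = n_fft // 2 + 1        # 257 bins for n_fft=512
    bin_freqs = [k * sampling_rate / n_fft for k in range(num_bins)]

    # num_mel_filters triangles need num_mel_filters + 2 edge points,
    # spaced evenly on the mel axis.
    mel_lo, mel_hi = hz_to_mel_htk(f_min), hz_to_mel_htk(f_max)
    step = (mel_hi - mel_lo) / (num_mel_filters + 1)
    edges = [mel_to_hz_htk(mel_lo + i * step) for i in range(num_mel_filters + 2)]

    empty = 0
    for i in range(num_mel_filters):
        left, right = edges[i], edges[i + 2]
        # A triangle is zero at its edges, so only interior bins get weight.
        if not any(left < f < right for f in bin_freqs):
            empty += 1
    return empty


if __name__ == "__main__":
    print(count_empty_mel_filters(n_fft=512, num_mel_filters=128, sampling_rate=16000))
```

Under these assumptions the lowest filters are narrower than the 31.25 Hz bin spacing, so at least one filter is empty and the warning would be expected from the configuration alone. That would make it a property of the preprocessor settings rather than evidence of a regression, but it does not rule out other preprocessing differences.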
Working configuration (for reference)
syvai/hviske-v5 produces clean Danish transcriptions on the same machine when pinned to transformers>=4.56,<5. v5.2/v5.3 are the broken ones, presumably because they use a different (Cohere-ASR Conformer) architecture with custom code in modeling_cohere_asr.py.
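Until the incompatibility is resolved, a script can fail fast when the installed transformers falls outside the range reported to work. This is a minimal sketch using only the standard library; the helper names are hypothetical, and the range `>=4.56,<5` is simply the pin quoted above, not a tested compatibility matrix.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def parse_major_minor(version_string):
    """Extract (major, minor) from strings such as '4.57.6' or '5.0.0rc1'."""
    match = re.match(r"(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"unrecognised version string: {version_string!r}")
    return int(match.group(1)), int(match.group(2))


def in_working_range(version_string, floor=(4, 56), ceiling_major=5):
    """True if floor <= version < ceiling_major.0 (the pin reported to work)."""
    return floor <= parse_major_minor(version_string) < (ceiling_major, 0)


def check_installed_transformers():
    """Return (installed_version, ok), or None if transformers is not installed."""
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        return None
    return installed, in_working_range(installed)


if __name__ == "__main__":
    print(check_installed_transformers())
```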
Question for the maintainers
Has v5.2 / v5.3 been tested on Apple Silicon, or is CUDA the only validated target? If it's expected to work on MPS/CPU, the consistent gibberish on every input (including pure CPU fp32) suggests either:
- A bug in modeling_cohere_asr.py that only manifests outside CUDA, or
- A config / preprocessor mismatch that happens to be hidden when CUDA-specific code paths are used.
Happy to test patches or provide more diagnostics.
Reproduced on Linux + CUDA: root cause is transformers 5.x
Same gibberish symptom on a completely different platform, which rules out anything Apple-Silicon-specific:
- Hardware: NVIDIA GTX 1070 (Pascal, 8 GB); also reproduced on CPU
- OS: Ubuntu 24.04
- Python 3.12, PyTorch 2.6.0 + CUDA 12.4
- transformers 5.4.0 and 5.8.1: multilingual gibberish on all of CUDA bf16, CUDA fp16, CPU fp32
- transformers 4.57.6: clean Danish output on both hviske-v5.3 and hviske-v5
Working call (unchanged from the README example), only the transformers pin differs:
import torch, librosa, numpy as np
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

processor = AutoProcessor.from_pretrained("syvai/hviske-v5.3", trust_remote_code=True)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "syvai/hviske-v5.3", trust_remote_code=True, dtype=torch.bfloat16,
).to("cuda").eval()

# Load the clip as 16 kHz mono float32
audio, sr = librosa.load("clip.mp3", sr=16000, mono=True)
hyp = model.transcribe(
    processor=processor, language="da",
    audio_arrays=[audio.astype(np.float32)], sample_rates=[sr],
)[0]
print(hyp)
On transformers==4.57.6: ~28× realtime on a GTX 1070 (bf16 emulated), ~5.2 GB VRAM peak, output is fluent Danish. On transformers==5.4.0/5.8.1: same code path, same audio, same weights, but the output is multilingual token-salad.
One concrete difference I found between 4.x and 5.x
On transformers 5.x the fast tokenizer (TokenizersBackend) loads tokenizer_config.json with "split_special_tokens": true, which fragments every special token at encode time:
tok.encode("<|da|>", add_special_tokens=False)
# transformers 5.x default: [13764, 315, 379, 3123, 379, 317] -> 6 subword pieces
# transformers 5.x + split_special_tokens=False: [58] -> correct
# transformers 4.x (slow CohereAsrTokenizer): [58] -> correct
...which means the decoder prompt the model sees on 5.x is < | d a | > instead of <|da|>, plus the same corruption for <|startoftranscript|>, <|pnc|>, etc.
Setting processor.tokenizer.split_special_tokens = False after loading does fix the tokenization, but on its own it does not fix the gibberish output on 5.x, so there must be at least one more 5.x incompatibility downstream (encoder side, given the symptom changes with audio/dtype). I didn't bisect further once 4.57.6 worked.
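The tokenization check above can be wrapped into a small diagnostic that reports every special token encoding to more than one id. This is a sketch only: `find_fragmented_special_tokens` and `check_hviske_tokenizer` are hypothetical helpers, the token list is limited to the three tokens named above, and the second function needs transformers plus network access to the model repo.

```python
def find_fragmented_special_tokens(tokenizer, special_tokens):
    """Return {token: ids} for tokens that do not encode to exactly one id."""
    fragmented = {}
    for token in special_tokens:
        ids = list(tokenizer.encode(token, add_special_tokens=False))
        if len(ids) != 1:
            fragmented[token] = ids
    return fragmented


def check_hviske_tokenizer(model_id="syvai/hviske-v5.3"):
    """Compare fragmentation before and after disabling split_special_tokens."""
    # Imported lazily so the helper above stays usable without transformers.
    from transformers import AutoProcessor

    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    tokens = ["<|startoftranscript|>", "<|da|>", "<|pnc|>"]

    before = find_fragmented_special_tokens(processor.tokenizer, tokens)
    # Workaround from this thread: fixes tokenization, not the gibberish output.
    processor.tokenizer.split_special_tokens = False
    after = find_fragmented_special_tokens(processor.tokenizer, tokens)
    return before, after
```

If `before` is non-empty and `after` is empty, the tokenizer setting accounts for the prompt corruption, and any remaining gibberish has to come from elsewhere.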
Suggestion
Either:
- relax/correct the README's transformers>=5.4.0 to transformers>=4.56,<5 (matches what works for v5.0 to v5.3), or
- fix the remote-code path so the model works under 5.x: at minimum, set split_special_tokens=False in the shipped tokenizer_config.json, then track down the remaining encoder-side regression.
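For anyone who wants to try the tokenizer part of the second option on a local copy of the repo, the config change can be sketched as below. `disable_split_special_tokens` is a hypothetical helper and the path is whatever local checkout you point it at. As noted above, this change alone did not remove the gibberish on 5.x.

```python
import json
from pathlib import Path


def disable_split_special_tokens(config_path):
    """Set "split_special_tokens": false in a tokenizer_config.json file.

    Returns True if the file was rewritten, False if it was already correct.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    if config.get("split_special_tokens") is False:
        return False
    config["split_special_tokens"] = False
    path.write_text(
        json.dumps(config, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
    return True
```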