NVIDIA Canary-SpeechLM (MLX Port)

This repository contains a pure MLX port of NVIDIA's Canary SpeechLM model (canary-qwen-2.5b).

The model architecture is reimplemented in MLX, including the Conformer blocks, relative-position attention layers, and projection layers, so this version runs entirely locally on Apple Silicon with no PyTorch dependency at inference time.

Features

  • No PyTorch at Inference: Pure MLX implementation aimed at fast, memory-efficient inference on macOS.
  • Fast Transcription: Real-time factor (RTF) of 0.067, i.e. about 14.8x faster than real time on Apple Silicon.
  • High-Fidelity Alignment: Intermediate outputs are validated to match PyTorch/NeMo reference feature maps within float16/float32 precision limits.
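
A check of this kind can be sketched as an element-wise tolerance comparison. The snippet below is illustrative only: the function names, the tolerance values, and the toy feature values are assumptions, not the validation code used in this repository.

```python
# Illustrative sketch: compare a ported layer's output against a reference
# output element by element. The tolerances are assumed, not the repo's.

def max_abs_diff(ported, reference):
    """Return the largest absolute difference between two flat sequences."""
    if len(ported) != len(reference):
        raise ValueError("outputs must have the same number of elements")
    return max(abs(p - r) for p, r in zip(ported, reference))


# Hypothetical per-dtype tolerances (float16 carries far fewer digits).
TOLERANCES = {"float16": 1e-2, "float32": 1e-4}


def outputs_match(ported, reference, dtype="float32"):
    """True if every element agrees within the assumed tolerance for dtype."""
    return max_abs_diff(ported, reference) <= TOLERANCES[dtype]


# Toy values standing in for flattened feature maps.
reference = [0.1250, -0.5000, 0.7500]
ported = [0.1251, -0.5002, 0.7499]
print(outputs_match(ported, reference, "float16"))
print(outputs_match(ported, reference, "float32"))
```

With these toy values the largest difference is about 2e-4, which passes the looser float16 threshold but not the float32 one.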

Performance Statistics

Measurements taken on Apple Silicon (M5 Pro):

  • Audio Duration: 3.88s
  • Feature Extraction + Conformer Encoding: 0.0506s
  • Prefill/Time-to-First-Token (TTFT): 0.0247s (2551.55 tok/s)
  • Decode Loop Generation Speed: 58.99 tok/s (up to 80.71 tok/s raw)
  • Real-Time Factor (RTF): 0.0674x (14.8x faster than real-time)
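
The real-time factor is processing time divided by audio duration, and its reciprocal is the speed-up over real time. As a quick check of the figures above (the function names are illustrative and not part of this repository):

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = time spent processing / duration of the audio."""
    return processing_seconds / audio_seconds


def speedup(rtf):
    """How many times faster than real time a given RTF corresponds to."""
    return 1.0 / rtf


audio_seconds = 3.88   # audio duration from the measurements above
rtf = 0.0674           # reported real-time factor

# Total processing time implied by the reported RTF.
total_seconds = rtf * audio_seconds
print(f"implied end-to-end time: {total_seconds:.3f}s")
print(f"speed-up over real time: {speedup(rtf):.1f}x")
```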

Installation & Setup

  1. Clone this repository:

    git clone https://huggingface.co/speechllms/canary-speechlm-mlx
    cd canary-speechlm-mlx
    
  2. Install dependencies:

    pip install mlx mlx-lm librosa transformers soundfile
    
  3. Download the base Qwen3-1.7B model, which provides the base tokenizer and weights:

    python -c "from huggingface_hub import snapshot_download; snapshot_download('Qwen/Qwen3-1.7B')"
    

Quick Usage

Run transcription directly from a WAV file:

python generate.py /path/to/audio.wav
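
To transcribe several files, one option is to call the script once per file from Python. This is a sketch that assumes generate.py is in the current directory and accepts a single WAV path as shown above; the helper names are hypothetical.

```python
import subprocess
import sys
from pathlib import Path


def build_command(wav_path):
    """Build the argv for one transcription run, mirroring the CLI call above."""
    return [sys.executable, "generate.py", str(wav_path)]


def transcribe_folder(folder):
    """Run generate.py on every .wav file in `folder`, in sorted order."""
    for wav_path in sorted(Path(folder).glob("*.wav")):
        # check=True raises CalledProcessError if the script exits non-zero.
        subprocess.run(build_command(wav_path), check=True)
```

Calling transcribe_folder("recordings") (a hypothetical folder name) would then process each file in turn. Each call starts a new process and reloads the model, so this favors simplicity over throughput.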

Record & Transcribe from Microphone

If you have ffmpeg installed on your Mac (brew install ffmpeg), you can run the interactive recording script:

chmod +x record_and_transcribe.sh
./record_and_transcribe.sh

Technical Details

The port reimplements the following components in MLX:

  1. ConvSubsampling 8x downsampling module.
  2. Conformer Block featuring depthwise 1D convolutions and relative multi-head self-attention.
  3. Transformer-XL dynamic Relative Positional Encoding (RelPositionalEncoding).
  4. LoRA adapter weight overlay on top of Qwen Causal LM.
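
Two of these components reduce to short formulas. The sketch below is illustrative and is not the repository's code: it assumes the 8x subsampling is built from three stride-2 convolutions with kernel size 3 and padding 1, and it uses the standard LoRA merge W + (alpha / r) * B A. All function names, shapes, and values are hypothetical.

```python
def subsampled_length(num_frames, num_layers=3, kernel=3, stride=2, padding=1):
    """Output length after stacked strided convolutions (assumed layout)."""
    length = num_frames
    for _ in range(num_layers):
        length = (length + 2 * padding - kernel) // stride + 1
    return length


def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return weight + (alpha / rank) * (lora_b @ lora_a).

    weight: (out, in), lora_b: (out, rank), lora_a: (rank, in).
    """
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# 388 feature frames (3.88 s at an assumed 10 ms hop) -> encoder frames.
print(subsampled_length(388))

# Toy rank-1 adapter on a 2x2 weight matrix.
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]        # shape (rank=1, in=2)
lora_b = [[0.5], [0.0]]      # shape (out=2, rank=1)
print(merge_lora(weight, lora_a, lora_b, alpha=2.0, rank=1))
```

Under these assumptions each encoder frame covers roughly eight input frames, and the adapter only changes the rows of the weight matrix that lora_b touches.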