RCIA Round 1 - Voxtral Mini Realtime vLLM ctx512

This repository is the Round 1 audio-to-text submission package for Voxtral Mini 4B Realtime.

The model weights are the unmodified BF16 checkpoint. The submission focuses on the vLLM inference configuration: four parameters tuned away from their defaults reduce P50 latency by about 21% relative to the default vLLM configuration on our 500-sample benchmark, with a 100% completion rate.

Additional background on the Round 1 exploration process is included in ROUND1_METHODOLOGY.md.

Optimization Target

This Round 1 package is intentionally optimized for P50 latency (median user experience) while keeping transcription quality and full-run stability. Tail latencies also improve versus the baseline, but less than P50:

  • P50: about -21%
  • P90: about -5%
  • P95: about -1% (near-neutral in strict reruns that mimic the organizers' setup)

In other words, the main gain is faster typical requests, with WER preserved and no regressions in completion rate on the 500-sample validation.
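
The percentile figures above can be reproduced from raw per-request latencies with a few lines of Python. The sketch below is illustrative only: the nearest-rank percentile definition, the function names, and the sample numbers are assumptions, since the benchmark harness's exact percentile method is not described here.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def relative_change_pct(baseline, candidate):
    """Signed change of candidate versus baseline, in percent (negative = faster)."""
    return (candidate - baseline) / baseline * 100.0


# Made-up latencies in milliseconds, only to show the call pattern.
latencies_ms = [500, 100, 300, 200, 400]
p50 = percentile_nearest_rank(latencies_ms, 50)
p90 = percentile_nearest_rank(latencies_ms, 90)
```

A negative value from `relative_change_pct` corresponds to the "about -21%" style used in the list above.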

What Changed vs Default vLLM Configuration

| Parameter | vLLM default | This submission | Why |
| --- | --- | --- | --- |
| max_model_len | 131072 | 512 | Voxtral audio inputs rarely exceed 300 tokens. Allocating 131k tokens wastes VRAM on KV cache blocks that are never used, adding latency at startup and request time. |
| gpu_memory_utilization | 0.90 | 0.65 | The 0.60 and 0.40 benchmark profiles were measured on an RTX 4090 24GB. The submission keeps the same absolute VRAM target, about 10 GB, which maps to 0.65 on a 16GB L4. |
| enforce_eager | false | true | CUDA graph capture fails or is unstable with the audio multimodal path in vLLM 0.20.0. Eager mode avoids the issue with no measurable throughput penalty at max_num_seqs=1. |
| max_num_seqs | 256 | 1 | Transcription is a one-request-at-a-time workload. Reducing the batch limit eliminates memory reserved for concurrent sequences that never arrive. |

Validation

Local validation was run on 500 audio samples with the same client harness used throughout the strategy search.

| Package | Samples | P50 latency | P90 latency | P95 latency | WER | Peak VRAM |
| --- | --- | --- | --- | --- | --- | --- |
| strategy-4044-vllm-ctx512 | 500/500 | 2635.99 ms | 6258.58 ms | 7725.58 ms | 0.0179907 | 14.52 GB |
| strategy-4056-vllm-ctx512-gpu040 | 500/500 | 2680.05 ms | 6334.8 ms | 7862.63 ms | 0.0179907 | 9.83 GB |
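
For readers unfamiliar with the WER column, the following is a minimal sketch of word error rate as word-level edit distance divided by the number of reference words. It is an assumption about the general metric, not the harness used here: the actual text normalization and the way per-sample scores were aggregated over the 500 samples are not stated.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

With this definition, a WER of about 0.018 means roughly 1.8 word errors per 100 reference words.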

The benchmark table above records the two measured profiles on an RTX 4090 24GB (gpu_memory_utilization=0.6 and 0.4). Rather than reusing the fraction directly on a 16GB L4, the packaged submission keeps the validated absolute VRAM budget, about 9.8-10 GB from the low-VRAM profile, which corresponds to gpu_memory_utilization=0.65 on the contest GPU.
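
The arithmetic behind this mapping can be sketched as follows. The helper names are hypothetical and the figures use nominal card capacities (24 GB and 16 GB); the exact rounding that led to 0.65 is not stated, so the result is only approximate.

```python
def vram_budget_gb(gpu_memory_utilization, total_vram_gb):
    """Approximate VRAM budget implied by a utilization fraction."""
    return gpu_memory_utilization * total_vram_gb


def utilization_for_budget(budget_gb, total_vram_gb):
    """Fraction needed to reach a given absolute budget on a given card."""
    return budget_gb / total_vram_gb


rtx4090_high = vram_budget_gb(0.60, 24)  # about 14.4 GB
rtx4090_low = vram_budget_gb(0.40, 24)   # about 9.6 GB
l4_packaged = vram_budget_gb(0.65, 16)   # about 10.4 GB
l4_naive = vram_budget_gb(0.40, 16)      # about 6.4 GB if the fraction were reused

# Fraction that would give a 10 GB budget on a 16 GB card.
l4_fraction_for_10gb = utilization_for_budget(10, 16)  # 0.625
```

Reusing 0.40 directly on the 16GB card would leave only about 6.4 GB, well below the validated footprint, which is why the fraction was raised rather than copied.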

Submission Rationale

For Round 1, we focused on the vLLM inference configuration because we did not have sufficient compute available to run serious weight-compression work until the final week, when we gained access to an RTX 4090. Given the short validation window, we prioritized a reproducible and robust submission path: keep the BF16 checkpoint format, reduce the effective runtime footprint through vLLM configuration, and validate latency, memory use, and WER across the available benchmark set. After revisiting the absolute VRAM budget on the 16GB contest L4, we kept the validated low-VRAM footprint from the 24GB RTX 4090, about 10 GB, which led to gpu_memory_utilization=0.65 instead of reusing the earlier 0.40 fraction.

Weight-compression experiments (structured pruning, FP8) were evaluated but not retained because they either failed to start reliably or degraded WER to ~21% or higher.

Why sitecustomize.py is Required

vLLM 0.20.0 does not fully support the mistral-common tokenizer backend that Voxtral uses. Three separate crashes occur on a stock vLLM 0.20.0 installation:

1. AttributeError: CachedMistralCommonBackend has no attribute is_fast
   vLLM checks tokenizer.is_fast during server startup. The MistralCommonBackend class in transformers does not expose this attribute.
   Fix: set is_fast = False on both PreTrainedTokenizerBase and MistralCommonBackend before vLLM's tokenizer code runs.

2. TypeError: from_pretrained() got unexpected keyword argument 'subfolder'
   vLLM passes subfolder, tokenizer, and _from_auto kwargs to MistralCommonBackend.from_pretrained, which does not accept them.
   Fix: wrap from_pretrained to silently drop those three kwargs.

3. AttributeError: VoxtralRealtimeProcessor has no attribute _get_num_multimodal_tokens
   When the model directory contains only model.safetensors (HF format, no consolidated.safetensors), vLLM uses the HF loader path and calls processor._get_num_multimodal_tokens() to pre-allocate the audio token buffer. VoxtralRealtimeProcessor does not implement this method.
   Fix: add a stub that returns a conservative budget of 480 tokens (max_model_len=512 minus a small text headroom).
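
The three fixes follow a common monkey-patching pattern, sketched below on stand-in classes so that it runs without transformers or vLLM installed. This is not the shipped sitecustomize.py: the helper names, the stand-in classes, and the stub's plain-integer return value are assumptions, and the real file must additionally import the actual classes and match whatever return shape vLLM expects.

```python
import functools

# Keyword arguments that vLLM passes but the tokenizer rejects (fix 2).
DROPPED_KWARGS = ("subfolder", "tokenizer", "_from_auto")
# Conservative audio-token budget quoted in fix 3.
AUDIO_TOKEN_BUDGET = 480


def patch_is_fast(tokenizer_cls):
    """Fix 1: expose is_fast = False as a class attribute."""
    tokenizer_cls.is_fast = False


def patch_from_pretrained(tokenizer_cls):
    """Fix 2: wrap from_pretrained so the unsupported kwargs are dropped."""
    original = tokenizer_cls.from_pretrained.__func__  # assumes a classmethod

    @functools.wraps(original)
    def wrapper(cls, *args, **kwargs):
        for name in DROPPED_KWARGS:
            kwargs.pop(name, None)
        return original(cls, *args, **kwargs)

    tokenizer_cls.from_pretrained = classmethod(wrapper)


def patch_num_multimodal_tokens(processor_cls):
    """Fix 3: add the missing method as a stub returning a fixed budget."""
    if not hasattr(processor_cls, "_get_num_multimodal_tokens"):
        def _get_num_multimodal_tokens(self, *args, **kwargs):
            # Assumption: a plain int; the real stub may need another type.
            return AUDIO_TOKEN_BUDGET

        processor_cls._get_num_multimodal_tokens = _get_num_multimodal_tokens


# Hypothetical stand-ins for MistralCommonBackend and VoxtralRealtimeProcessor.
class FakeTokenizer:
    @classmethod
    def from_pretrained(cls, path):
        return (cls.__name__, path)


class FakeProcessor:
    pass


patch_is_fast(FakeTokenizer)
patch_from_pretrained(FakeTokenizer)
patch_num_multimodal_tokens(FakeProcessor)
```

After patching, `FakeTokenizer.from_pretrained("/model", subfolder="")` succeeds instead of raising a TypeError, which mirrors the behavior described for the real tokenizer class.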

Python's sitecustomize.py is a standard hook for applying such runtime patches: when the file is on the module search path, the site module imports it automatically during interpreter startup, before any user code runs, so the patches are in place before vLLM loads its modules. The Dockerfile included in this repository copies the file into the Docker image at /usr/lib/python3.12/sitecustomize.py.
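
To confirm that the interpreter actually picked up a sitecustomize module, a small check such as the following can be run inside the container. The function name is hypothetical; the check relies only on the fact that the site module imports sitecustomize at startup, which is skipped when Python is started with the -S option.

```python
import sys


def loaded_sitecustomize_path():
    """Return the file path of the imported sitecustomize module, or None."""
    module = sys.modules.get("sitecustomize")
    if module is None:
        return None  # no sitecustomize was found or site import was disabled
    return getattr(module, "__file__", None)


if __name__ == "__main__":
    print("sitecustomize loaded from:", loaded_sitecustomize_path())
```

If the printed path is not the one copied by the Dockerfile, a different sitecustomize.py earlier on the search path is shadowing the patched file.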

Both files (sitecustomize.py and Dockerfile) are included here so the deployment is fully self-contained.

Deployment

Build the patched Docker image from the repository root, then run it:

# From the cloned repository root
docker build -t vllm-voxtral .
docker run --gpus all --rm \
    --shm-size 4g \
    -v "$(pwd)":/model:ro \
    -w /model \
    -p 8000:8000 \
    vllm-voxtral \
    --config vllm_config.yaml

Challenge-side canonical launch from the HF repository root:

vllm serve --config vllm_config.yaml

max_model_len: 512 is set explicitly in vllm_config.yaml because, with vLLM 0.20.0, the model's default context length (131072) can over-allocate KV cache and fail initialization on 24 GB GPUs. A value of 512 matches the validated audio workload and keeps startup reliable.

Equivalent direct-CLI form (without config file):

VLLM_DISABLE_COMPILE_CACHE=1 \
vllm serve Franck-J/rcia-voxtral-round1 \
  --compilation_config '{"cudagraph_mode":"PIECEWISE"}' \
  --max-model-len 512 \
  --gpu-memory-utilization 0.65 \
  --max-num-seqs 1 \
  --enforce-eager

Alternative without Docker: apply the patch directly to the Python installation used by vLLM (adjust the path to match your Python version):

cp sitecustomize.py /usr/lib/python3.12/sitecustomize.py
vllm serve --config vllm_config.yaml

Inference

Once the server is healthy, the OpenAI-compatible API is served at the expected endpoint:

http://localhost:8000/v1
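
A minimal readiness check against this endpoint might look like the sketch below. It uses only the standard library and assumes vLLM's usual /health route at the server root and /v1/models listing; the helper names are hypothetical, and the transcription request itself is not shown because the client harness is not described here.

```python
import json
import urllib.request
from urllib.parse import urlsplit, urlunsplit

BASE_URL = "http://localhost:8000/v1"


def health_url(base_url):
    """Build the /health URL, which lives at the server root, not under /v1."""
    parts = urlsplit(base_url)
    return urlunsplit((parts.scheme, parts.netloc, "/health", "", ""))


def models_url(base_url):
    """Build the model-listing URL under the API base path."""
    return base_url.rstrip("/") + "/models"


def is_healthy(base_url, timeout=2.0):
    """Return True only if the health route answers with HTTP 200."""
    try:
        with urllib.request.urlopen(health_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers connection errors, timeouts, and HTTP errors
        return False


def list_model_ids(base_url, timeout=5.0):
    """Return the ids of the models the server reports."""
    with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload.get("data", [])]


if __name__ == "__main__":
    if is_healthy(BASE_URL):
        print("served models:", list_model_ids(BASE_URL))
    else:
        print("server not ready yet")
```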

vLLM Configuration

model: .
dtype: bfloat16
enforce_eager: true
max_num_seqs: 1
gpu_memory_utilization: 0.65
max_model_len: 512
compilation_config:
  cudagraph_mode: PIECEWISE
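
The correspondence between this YAML and the direct-CLI form can be sketched with a small converter. The function is a hypothetical illustration, not part of vLLM: it omits the model entry (a positional argument on the command line), emits true booleans as bare flags, skips false ones, and serializes nested mappings as JSON.

```python
import json


def config_to_cli_args(config):
    """Translate a flat config mapping into vllm-serve-style CLI arguments."""
    args = []
    for key, value in config.items():
        flag = "--" + key.replace("_", "-")
        if isinstance(value, bool):
            if value:  # simplification: false booleans are simply omitted
                args.append(flag)
        elif isinstance(value, dict):
            args.extend([flag, json.dumps(value, separators=(",", ":"))])
        else:
            args.extend([flag, str(value)])
    return args


# Same values as the configuration above, minus the positional model entry.
CONFIG = {
    "dtype": "bfloat16",
    "enforce_eager": True,
    "max_num_seqs": 1,
    "gpu_memory_utilization": 0.65,
    "max_model_len": 512,
    "compilation_config": {"cudagraph_mode": "PIECEWISE"},
}

cli_args = config_to_cli_args(CONFIG)
```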

Compression And Optimization Notes

The following Round 1 candidates were explored:

  • FP8 KV-cache runtime configuration: server did not become healthy.
  • Direct Hugging Face runtime wrapper: functional, but slower than the vLLM baseline.
  • Structured pruning runtime: produced a smaller checkpoint, but WER degraded to about 21.22%.
  • TensorRT wrapper track: export contract probe was prepared, but not a validated submission runtime.
  • vLLM context and memory sweep: ctx512 with gpu_memory_utilization: 0.60 was the best measured package on RTX 4090 24GB; the published submission uses 0.65 on L4 16GB to preserve the absolute VRAM budget of the low-VRAM 0.40 profile, about 10 GB.

Because the weight-compressed candidates were not competitive, this package prioritizes technical validation and transcription quality for Round 1.

License

The base model is released under Apache-2.0. This submission keeps the same license.
