RCIA Round 1 - Voxtral Mini Realtime vLLM ctx512
This repository is the Round 1 audio-to-text submission package for Voxtral Mini 4B Realtime.
The model weights are the unmodified BF16 checkpoint. The submission focuses on vLLM inference configuration: four parameters tuned away from their defaults yield a measured 21% reduction in P50 latency relative to the vLLM default configuration on our 500-sample benchmark, with a 100% completion rate.
Additional background on the Round 1 exploration process is included in
ROUND1_METHODOLOGY.md.
Optimization Target
This Round 1 package is intentionally optimized for P50 latency (median user experience) while keeping transcription quality and full-run stability. Tail latencies also improve versus the baseline, but less than P50:
- P50: about -21%
- P90: about -5%
- P95: about -1% (near-neutral in strict reruns that mimic the organizers' setup)
In other words, the main gain is faster typical requests, with WER preserved and no regressions in completion rate on the 500-sample validation.
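As a rough illustration of how percentile figures like those above can be derived, the sketch below computes nearest-rank percentiles from per-request latencies and the signed relative change between two runs. The nearest-rank definition, the helper names, and the sample values are assumptions made for illustration; the benchmark harness used for this submission may define percentiles differently.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile of `samples` for an integer p in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def relative_change_pct(baseline, candidate):
    """Signed change of `candidate` versus `baseline`, in percent."""
    return (candidate - baseline) * 100 / baseline


# Made-up latencies in milliseconds, NOT measurements from this submission.
baseline_ms = [100, 120, 150, 200, 260, 300, 420, 500, 800, 900]
candidate_ms = [80, 95, 120, 160, 205, 240, 400, 470, 760, 890]

for p in (50, 90, 95):
    delta = relative_change_pct(percentile(baseline_ms, p), percentile(candidate_ms, p))
    print(f"P{p}: {delta:+.1f}%")
```

A negative value means the candidate configuration is faster than the baseline at that percentile.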
What Changed vs Default vLLM Configuration
| Parameter | vLLM default | This submission | Why |
|---|---|---|---|
| max_model_len | 131072 | 512 | Voxtral audio inputs rarely exceed 300 tokens. Allocating 131k tokens wastes VRAM on KV cache blocks that are never used, adding latency at startup and request time. |
| gpu_memory_utilization | 0.90 | 0.65 | The 0.60 and 0.40 benchmark profiles were measured on an RTX 4090 24GB. The submission keeps the same absolute VRAM target, about 10 GB, which maps to 0.65 on a 16GB L4. |
| enforce_eager | false | true | CUDA graph capture fails or is unstable with the audio multimodal path in vLLM 0.20.0. Eager mode avoids the issue with no measurable throughput penalty at max_num_seqs=1. |
| max_num_seqs | 256 | 1 | Transcription is a one-request-at-a-time workload. Reducing the batch limit eliminates memory reserved for concurrent sequences that never arrive. |
Validation
Local validation was run on 500 audio samples with the same client harness used throughout the strategy search.
| Package | Samples | P50 latency | P90 latency | P95 latency | WER | Peak VRAM |
|---|---|---|---|---|---|---|
| strategy-4044-vllm-ctx512 | 500/500 | 2635.99 ms | 6258.58 ms | 7725.58 ms | 0.0179907 | 14.52 GB |
| strategy-4056-vllm-ctx512-gpu040 | 500/500 | 2680.05 ms | 6334.8 ms | 7862.63 ms | 0.0179907 | 9.83 GB |
The benchmark table above records the two measured profiles on an RTX 4090 24GB (gpu_memory_utilization=0.6 and 0.4). Rather than reusing the fraction directly on a 16GB L4, the packaged submission keeps the validated absolute VRAM budget, about 9.8-10 GB from the low-VRAM profile, which corresponds to gpu_memory_utilization=0.65 on the contest GPU.
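The fraction-to-budget conversion described above can be checked with a few lines of arithmetic. This is a minimal sketch using the nominal GPU sizes stated in this README; the helper names are hypothetical, and reading 0.65 as the exact fraction rounded up for a little headroom is an interpretation, not something the benchmark logs confirm.

```python
def budget_gb(fraction, total_gb):
    """Absolute VRAM budget implied by a gpu_memory_utilization fraction."""
    return fraction * total_gb


def fraction_for(budget, total_gb):
    """gpu_memory_utilization fraction needed to reach an absolute budget."""
    return budget / total_gb


RTX_4090_GB = 24  # benchmark GPU size, as stated in this README
L4_GB = 16        # contest GPU size, as stated in this README

low_vram_budget = budget_gb(0.40, RTX_4090_GB)   # low-VRAM profile on the 4090
exact_fraction = fraction_for(10.0, L4_GB)       # ~10 GB target on the L4
packaged_budget = budget_gb(0.65, L4_GB)         # what the packaged 0.65 allows

print(f"0.40 on 24 GB -> {low_vram_budget:.1f} GB")
print(f"10 GB on 16 GB -> fraction {exact_fraction:.3f}")
print(f"0.65 on 16 GB -> {packaged_budget:.1f} GB")
```

Nominal capacity is used throughout; the memory actually usable on a given card is slightly lower, so these numbers are approximate.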
Submission Rationale
For Round 1, we focused on the vLLM inference configuration because we did not have sufficient compute available to run serious weight-compression work until the final week, when we gained access to an RTX 4090. Given the short validation window, we prioritized a reproducible and robust submission path: keep the BF16 checkpoint format, reduce the effective runtime footprint through vLLM configuration, and validate latency, memory use, and WER across the available benchmark set. After revisiting the absolute VRAM budget on the 16GB contest L4, we kept the validated low-VRAM footprint from the 24GB RTX 4090, about 10 GB, which led to gpu_memory_utilization=0.65 instead of reusing the earlier 0.40 fraction.
Weight-compression experiments (structured pruning, FP8) were evaluated but not retained because they either failed to start reliably or degraded WER to ~21% or higher.
Why sitecustomize.py is Required
vLLM 0.20.0 does not fully support the mistral-common tokenizer backend that
Voxtral uses. Three separate crashes occur on a stock vLLM 0.20.0 installation:
1. AttributeError: CachedMistralCommonBackend has no attribute is_fast
   vLLM checks tokenizer.is_fast during server startup. The
   MistralCommonBackend class in transformers does not expose this attribute.
   Fix: set is_fast = False on both PreTrainedTokenizerBase and
   MistralCommonBackend before vLLM's tokenizer code runs.
2. TypeError: from_pretrained() got unexpected keyword argument 'subfolder'
   vLLM passes subfolder, tokenizer, and _from_auto kwargs to
   MistralCommonBackend.from_pretrained, which does not accept them.
   Fix: wrap from_pretrained to silently drop those three kwargs.
3. AttributeError: VoxtralRealtimeProcessor has no attribute _get_num_multimodal_tokens
   When the model directory contains only model.safetensors (HF format, no
   consolidated.safetensors), vLLM uses the HF loader path and calls
   processor._get_num_multimodal_tokens() to pre-allocate the audio token
   buffer. VoxtralRealtimeProcessor does not implement this method.
   Fix: add a stub that returns a conservative budget of 480 tokens
   (max_model_len=512 minus a small text headroom).
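The three fixes follow common monkey-patching patterns, sketched below on stand-in classes so the example runs without transformers or vLLM installed. This is not the contents of the repository's sitecustomize.py: StandInBackend, StandInProcessor, and the patch_* helpers are hypothetical names, and returning a plain integer from the stub is a simplification, since the real method's return type is not described here.

```python
import functools

DROPPED_KWARGS = ("subfolder", "tokenizer", "_from_auto")
AUDIO_TOKEN_BUDGET = 480  # from the text: max_model_len=512 minus text headroom


class StandInBackend:
    """Hypothetical stand-in for the tokenizer backend class."""

    def __init__(self, path):
        self.path = path

    @classmethod
    def from_pretrained(cls, path, *, revision=None):
        return cls(path)


class StandInProcessor:
    """Hypothetical stand-in for the processor class."""


def patch_is_fast(cls):
    # Pattern 1: add the missing class attribute if it is not already there.
    if not hasattr(cls, "is_fast"):
        cls.is_fast = False


def patch_from_pretrained(cls):
    # Pattern 2: wrap the classmethod and discard kwargs it does not accept.
    original = cls.from_pretrained.__func__

    @functools.wraps(original)
    def wrapper(klass, *args, **kwargs):
        for name in DROPPED_KWARGS:
            kwargs.pop(name, None)
        return original(klass, *args, **kwargs)

    cls.from_pretrained = classmethod(wrapper)


def patch_token_stub(cls):
    # Pattern 3: add a stub method returning a fixed, conservative budget.
    if not hasattr(cls, "_get_num_multimodal_tokens"):
        def _get_num_multimodal_tokens(self, *args, **kwargs):
            return AUDIO_TOKEN_BUDGET

        cls._get_num_multimodal_tokens = _get_num_multimodal_tokens


patch_is_fast(StandInBackend)
patch_from_pretrained(StandInBackend)
patch_token_stub(StandInProcessor)
```

After patching, StandInBackend.from_pretrained("model", subfolder="x") no longer raises a TypeError, which mirrors the behavior the second fix is meant to achieve.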
Python's sitecustomize.py is a standard startup hook for applying runtime
patches: the interpreter's site module imports it automatically during startup,
before any user code runs, so the patches are in place before vLLM loads its
modules. The file is copied into the Docker image at
/usr/lib/python3.12/sitecustomize.py by the Dockerfile included in this
repository.
Both files (sitecustomize.py and Dockerfile) are included here so the
deployment is fully self-contained.
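To confirm that the interpreter used by vLLM actually picked up the patch file, a small check along these lines can be run with that same interpreter. It is a sketch using only the standard library; the function name sitecustomize_report is hypothetical.

```python
import importlib.util
import sys


def sitecustomize_report():
    """Report whether a sitecustomize module was imported and where it lives."""
    try:
        spec = importlib.util.find_spec("sitecustomize")
    except ValueError:
        # Raised if the module is loaded but carries no import spec.
        spec = None
    return {
        "loaded": "sitecustomize" in sys.modules,
        "path": spec.origin if spec is not None else None,
    }


report = sitecustomize_report()
print(f"loaded at startup: {report['loaded']}")
print(f"resolved path:     {report['path']}")
```

If the reported path is not the file copied from this repository, another sitecustomize.py earlier on the import path is probably shadowing it.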
Deployment
Build the patched Docker image from the repository root, then run it:
# From the cloned repository root
docker build -t vllm-voxtral .
docker run --gpus all --rm \
    --shm-size 4g \
    -v "$(pwd)":/model:ro \
    -w /model \
    -p 8000:8000 \
    vllm-voxtral \
    --config vllm_config.yaml
Challenge-side canonical launch from the HF repository root:
vllm serve --config vllm_config.yaml
max_model_len: 512 is set explicitly in vllm_config.yaml because, with
vLLM 0.20.0, the model's default context length (131072) can over-allocate
KV cache and fail initialization on 24 GB GPUs. A value of 512 matches the
validated audio workload and keeps startup reliable.
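To give a sense of scale for the context-length choice, the sketch below estimates the KV-cache footprint that one full-length sequence needs, using the usual formula of keys plus values per layer per token. The layer count, KV-head count, and head size are placeholder values chosen for round numbers, not Voxtral's actual architecture, and vLLM's real block-based allocation is more involved than this estimate.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to hold keys and values for `tokens` positions."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # K and V
    return tokens * per_token


# Placeholder architecture values (assumptions, not taken from the checkpoint).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

for context in (512, 131072):
    size = kv_cache_bytes(context, LAYERS, KV_HEADS, HEAD_DIM)
    print(f"context {context:>6}: {size / 2**20:,.0f} MiB for one full sequence")
```

With these placeholder dimensions and BF16 values, the per-sequence footprint scales linearly with context length, which is why a 256x smaller limit removes most of the KV-cache requirement.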
Equivalent direct-CLI form (without config file):
VLLM_DISABLE_COMPILE_CACHE=1 \
vllm serve Franck-J/rcia-voxtral-round1 \
    --compilation_config '{"cudagraph_mode":"PIECEWISE"}' \
    --max-model-len 512 \
    --gpu-memory-utilization 0.65 \
    --max-num-seqs 1 \
    --enforce-eager
Alternative without Docker (Option B): apply the patch directly to the Python installation used by vLLM (adjust the path to match your Python version):
cp sitecustomize.py /usr/lib/python3.12/sitecustomize.py
vllm serve --config vllm_config.yaml
Inference
Once the server is healthy, it is expected to serve requests at:
http://localhost:8000/v1
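One way to wait for the server before sending requests is to poll the /health route that the vLLM HTTP server exposes. The sketch below uses only the standard library; the function names, retry counts, and the injectable probe and sleep parameters are illustrative choices, and the host and port are taken from the docker run example above.

```python
import time
import urllib.error
import urllib.request

BASE_URL = "http://localhost:8000"  # host and port from the docker run example


def http_status(url, timeout=2.0):
    """Return the HTTP status code for `url`, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code
    except (urllib.error.URLError, OSError):
        return None


def wait_until_healthy(base_url=BASE_URL, attempts=60, delay=2.0,
                       probe=http_status, sleep=time.sleep):
    """Poll `<base_url>/health` until it answers 200 or attempts run out."""
    for _ in range(attempts):
        if probe(f"{base_url}/health") == 200:
            return True
        sleep(delay)
    return False
```

Calling wait_until_healthy() with the defaults blocks for up to about two minutes of sleep time plus probe timeouts, and returns True as soon as the server answers 200; model loading time varies, so the attempt count may need adjusting.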
vLLM Configuration
model: .
dtype: bfloat16
enforce_eager: true
max_num_seqs: 1
gpu_memory_utilization: 0.65
max_model_len: 512
compilation_config:
  cudagraph_mode: PIECEWISE
Compression And Optimization Notes
The following Round 1 candidates were explored:
- FP8 KV-cache runtime configuration: server did not become healthy.
- Direct Hugging Face runtime wrapper: functional, but slower than the vLLM baseline.
- Structured pruning runtime: produced a smaller checkpoint, but WER degraded to about 21.22%.
- TensorRT wrapper track: export contract probe was prepared, but not a validated submission runtime.
- vLLM context and memory sweep: ctx512 with gpu_memory_utilization: 0.60 was the best measured package on RTX 4090 24GB; the published submission uses 0.65 to preserve a comparable absolute VRAM budget on L4 16GB.
Because the weight-compressed candidates were not competitive, this package prioritizes technical validation and transcription quality for Round 1.
License
The base model is released under Apache-2.0. This submission keeps the same license.