Overview
ARK-ASR-0.6B INT8 ONNX is the ONNX Runtime package for the 0.6B ARK-ASR automatic speech recognition model. It is intended for local and edge-device ASR inference when a compact ONNX pipeline is preferred over loading the full Transformers checkpoint.
ARK-ASR is trained with the teacher-data adaptation plus online policy distillation recipe from AutoArk/open-audio-opd. The full 0.6B Transformers checkpoint is available as AutoArk-AI/ARK-ASR-0.6B.
Supported Languages
Chinese, English, German, Japanese, French, Korean, Spanish, Polish, Italian, Romanian, Hungarian, Czech, Dutch, Finnish, Croatian, Slovak, Slovene, Estonian, and Lithuanian.
Package Contents
This repository contains a self-contained INT8 ONNX inference package:
.
├── infer_ark_audio_onnx.py
├── README_INT8_ASR_USAGE.md
├── build/
│   └── llm_kv_fp32_qwen_native.json
└── model/
    ├── llm_kv_cpu_fp32_int8.onnx
    ├── audio_encoder_whisper_int8.onnx
    ├── audio_encoder_adapter_int8.onnx
    ├── embedding_fp32.onnx
    ├── embedding_fp32.data
    ├── runtime_manifest.json
    └── tokenizer, processor, and model configuration files
The package includes:
- INT8 ONNX files for the decoder, audio encoder, and audio adapter
- FP32 token embedding ONNX assets
- tokenizer, processor, and runtime configuration files
- a standalone ASR inference script, infer_ark_audio_onnx.py
Installation
Python 3.10 or 3.11 is recommended.
pip install onnxruntime torch transformers librosa soundfile numpy
For GPU inference, install the onnxruntime-gpu build that matches your CUDA environment.
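To confirm which ONNX Runtime build ended up in the environment, a quick check like the one below can help. It is a minimal sketch: `pick_providers` is a hypothetical helper written for this example, and whether `infer_ark_audio_onnx.py` selects execution providers this way is an assumption, not something the package documents here. The only library call used is `onnxruntime.get_available_providers()`.

```python
def pick_providers(available):
    """Return an ordered provider list: CUDA first when present, CPU as fallback.

    Hypothetical helper for illustration; not part of this package.
    """
    preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    return [name for name in preferred if name in available]


if __name__ == "__main__":
    try:
        import onnxruntime as ort
    except ImportError:
        print("onnxruntime is not installed")
    else:
        available = ort.get_available_providers()
        print("available providers:", available)
        print("preferred order:", pick_providers(available))
```

If `CUDAExecutionProvider` is missing from the list after installing onnxruntime-gpu, the installed build most likely does not match the local CUDA environment.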
Quick Start
Download or clone this repository, then run inference from the repository root:
python infer_ark_audio_onnx.py \
  --audio /path/to/audio.wav \
  --max-new-tokens 128
You can also run the script from another directory by passing --runtime-root:
python /path/to/ark-asr-0.6b-int8-onnx/infer_ark_audio_onnx.py \
  --runtime-root /path/to/ark-asr-0.6b-int8-onnx \
  --audio /path/to/audio.wav \
  --max-new-tokens 128
The script prints one transcription line.
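For a folder of recordings, the command above can be wrapped in a small driver. The sketch below uses only the flags shown in this section (`--runtime-root`, `--audio`, `--max-new-tokens`). `build_command` and `transcribe_folder` are hypothetical names, and it assumes the transcription is the last non-empty line on stdout, which may not hold if the script also logs to stdout.

```python
import subprocess
import sys
from pathlib import Path


def build_command(runtime_root, audio_path, max_new_tokens=128):
    """Assemble the CLI call for one audio file (hypothetical helper)."""
    root = Path(runtime_root)
    return [
        sys.executable,
        str(root / "infer_ark_audio_onnx.py"),
        "--runtime-root", str(root),
        "--audio", str(audio_path),
        "--max-new-tokens", str(max_new_tokens),
    ]


def transcribe_folder(runtime_root, audio_dir):
    """Run the script once per .wav file and collect the printed text."""
    results = {}
    for wav in sorted(Path(audio_dir).glob("*.wav")):
        done = subprocess.run(
            build_command(runtime_root, wav),
            capture_output=True, text=True, check=True,
        )
        # Assumption: the transcription is the last non-empty stdout line.
        lines = [line for line in done.stdout.splitlines() if line.strip()]
        results[wav.name] = lines[-1] if lines else ""
    return results
```

Each subprocess call starts a fresh Python process, so the ONNX sessions are presumably reloaded for every file. For larger batches, constructing one ArkAsrOnnxRuntime object in Python and reusing it is likely cheaper.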
Python Usage
from pathlib import Path

from infer_ark_audio_onnx import ArkAsrOnnxRuntime

runtime = ArkAsrOnnxRuntime(Path("/path/to/ark-asr-0.6b-int8-onnx"))
text = runtime.transcribe(
    audio_path="/path/to/audio.wav",
    max_new_tokens=128,
    max_audio_seconds=30,
    precision="int8",
    asr_block_token_id_from=151670,
)
print(text)
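The example passes max_audio_seconds=30, and the model expects 16 kHz audio. This card does not say what transcribe does with longer recordings. Assuming they need to be split by the caller, the sketch below computes sample ranges for fixed-length chunks. `chunk_bounds` is a hypothetical helper, not part of the package.

```python
def chunk_bounds(num_samples, sample_rate=16_000, max_seconds=30):
    """Return (start, end) sample index pairs covering the whole signal.

    Every chunk holds at most sample_rate * max_seconds samples; the last
    chunk may be shorter.
    """
    if num_samples < 0 or sample_rate <= 0 or max_seconds <= 0:
        raise ValueError("num_samples must be >= 0; rate and duration must be > 0")
    step = sample_rate * max_seconds
    return [
        (start, min(start + step, num_samples))
        for start in range(0, num_samples, step)
    ]


# A 70-second clip at 16 kHz splits into 30 s + 30 s + 10 s.
print(chunk_bounds(16_000 * 70))
```

Hard cuts at fixed offsets can split a word in two, so cutting at silences or overlapping adjacent chunks may give cleaner text. Each slice would still need to be written to a file, since transcribe takes an audio_path.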
Decoding Behavior
The inference script filters non-text control tokens by default while preserving the EOS token for normal generation stopping. The filter covers:
- special tokens from tokenizer.all_special_ids, except eos_token_id
- added vocabulary entries that look like control tokens
- non-ASR text-range token IDs greater than or equal to 151670
See README_INT8_ASR_USAGE.md for the full local usage guide and decoding details.
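The three filter rules listed in this section can be sketched in plain Python. This is an illustration under stated assumptions, not the script's actual code: the `<|...|>` pattern stands in for "looks like a control token" (the real heuristic is not given here), and `build_blocked_ids`, `is_allowed`, and `greedy_pick` are hypothetical names.

```python
import re

ASR_BLOCK_FROM = 151670  # threshold taken from this section
# Assumption: control tokens are written as <|name|>.
CONTROL_TOKEN = re.compile(r"^<\|.+\|>$")


def build_blocked_ids(special_ids, added_vocab, eos_token_id):
    """Collect special and control-like token IDs, keeping EOS usable."""
    blocked = set(special_ids)
    blocked.update(
        token_id for token, token_id in added_vocab.items()
        if CONTROL_TOKEN.match(token)
    )
    blocked.discard(eos_token_id)
    return blocked


def is_allowed(token_id, blocked, eos_token_id, block_from=ASR_BLOCK_FROM):
    """EOS always passes; other IDs must be unblocked and below block_from."""
    if token_id == eos_token_id:
        return True
    return token_id not in blocked and token_id < block_from


def greedy_pick(logits, blocked, eos_token_id):
    """Pick the highest-scoring token among the allowed IDs."""
    allowed = [i for i in range(len(logits)) if is_allowed(i, blocked, eos_token_id)]
    return max(allowed, key=lambda i: logits[i])
```

With a Transformers tokenizer, the inputs would come from `tokenizer.all_special_ids`, `tokenizer.get_added_vocab()`, and `tokenizer.eos_token_id`.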
Model Details
- Task: automatic speech recognition
- Format: INT8 ONNX Runtime package
- Base model: ARK-ASR-0.6B
- Sampling rate: 16 kHz
- License: Apache-2.0
- Training and evaluation code: AutoArk/open-audio-opd
Citation
@misc{lin2026dataefficientopd,
title={Data-Efficient On-Policy Distillation for Automatic Speech Recognition},
author={Lin, Yu and Wang, Yiming and Cai, Runyuan and Zeng, Xiaodong},
year={2026},
eprint={2605.28139},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2605.28139}
}