πŸŽ™οΈ X-ASR-zh-en

Chinese-English offline-streaming unified ASR model artifacts for low-latency deployment.

Participating Institutions: Shanghai Jiao Tong University, Shanghai Innovation Institute, Fudan University, Huazhong University of Science and Technology

🌐 GitHub Project | 🪐 Hugging Face Space | 🎧 Online Demo | 🚀 Deployment Guide

📄 X-ASR-zh-en Technical Report: Coming Soon


πŸ” Model Card Scope | πŸ“¦ Repository Contents | πŸ“Š Evaluation | ⬇️ Download | πŸš€ Deployment


πŸ” Model Card Scope

🧩 X-ASR Series

X-ASR is a series of automatic speech recognition models built with the icefall framework. The series focuses on streaming ASR and low-latency deployment, while also supporting offline recognition. The broader project roadmap, source organization, issue tracking, and bilingual documentation are maintained on the GitHub project page.

🤖 X-ASR-zh-en

X-ASR-zh-en is trained on approximately 1 million hours of open-source and collected speech data. It is an offline-streaming unified transducer ASR model based on the Zipformer architecture and supports both offline decoding and true streaming decoding. Streaming variants are provided for four chunk sizes: 160 ms, 480 ms, 960 ms, and 1920 ms. The model outputs punctuation and casing and can be deployed with sherpa-onnx.

[Figure: Zipformer architecture]

✨ HF-Specific Notes

This Hugging Face repository is the model artifact page for X-ASR-zh-en.

| What this page provides | What the GitHub page provides |
| --- | --- |
| Downloadable model artifacts | Project-level overview |
| ONNX encoder / decoder / joiner files | Bilingual README and release notes |
| sherpa-onnx deployment entry point | Source layout and issue tracking |
| Model-card metadata, tags, license, and metrics | Development history and contribution workflow |

📦 Repository Contents

| Path | Purpose |
| --- | --- |
| deployment/ | Deployment-ready sherpa-onnx runtime files and examples |
| deployment/models/ | Exported streaming ONNX model variants |
| deployment/infer_and_client/ | WebSocket server, inference wrapper, and test client |
| figure/ | Architecture figure and demo preview media |
| demo/ | Demo video asset |
| zipformer/ | Training/export reference files for the Zipformer-based setup |

🧩 Model Variants

Each streaming variant contains a matched encoder, decoder, joiner, and tokens.txt. Do not mix files across model folders.

| Directory | Encoder | Decoder | Joiner | Intended chunk |
| --- | --- | --- | --- | --- |
| deployment/models/chunk-160ms-model | encoder-160ms.onnx | decoder-160ms.onnx | joiner-160ms.onnx | 160 ms |
| deployment/models/chunk-480ms-model | encoder-480ms.onnx | decoder-480ms.onnx | joiner-480ms.onnx | 480 ms |
| deployment/models/chunk-960ms-model | encoder-960ms.onnx | decoder-960ms.onnx | joiner-960ms.onnx | 960 ms |
| deployment/models/chunk-1920ms-model | encoder-1920ms.onnx | decoder-1920ms.onnx | joiner-1920ms.onnx | 1920 ms |
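
Because every variant folder follows the same naming scheme, the matched file set for a chunk size can be derived rather than typed by hand. The sketch below is a hypothetical helper: the names `variant_files` and `missing_files` and the default `root` path are assumptions for illustration, not part of this repository. It builds the four paths from the table above and rejects chunk sizes that are not released.

```python
from pathlib import Path

# Chunk sizes released in this repository (see the table above).
RELEASED_CHUNKS_MS = (160, 480, 960, 1920)


def variant_files(chunk_ms: int, root: str = "deployment/models") -> dict:
    """Return the matched tokens/encoder/decoder/joiner paths for one variant.

    Hypothetical helper: it only mirrors the directory layout listed in this
    model card and does not touch the filesystem.
    """
    if chunk_ms not in RELEASED_CHUNKS_MS:
        raise ValueError(f"no released variant for {chunk_ms} ms")
    folder = Path(root) / f"chunk-{chunk_ms}ms-model"
    return {
        "tokens": folder / "tokens.txt",
        "encoder": folder / f"encoder-{chunk_ms}ms.onnx",
        "decoder": folder / f"decoder-{chunk_ms}ms.onnx",
        "joiner": folder / f"joiner-{chunk_ms}ms.onnx",
    }


def missing_files(files: dict) -> list:
    """List the entries of `files` that do not exist on disk."""
    return [name for name, path in files.items() if not Path(path).is_file()]


if __name__ == "__main__":
    files = variant_files(480)
    for name, path in files.items():
        print(f"{name}: {path}")
    print("missing:", missing_files(files))
```

Deriving all four paths from one chunk size makes it hard to mix files across folders by accident, and `missing_files` gives a quick check that a download is complete before starting a server.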

⭐ Highlights

| Category | Description |
| --- | --- |
| Framework | icefall / k2 |
| Architecture | Zipformer transducer |
| Runtime | sherpa-onnx |
| Languages | Chinese and English |
| Training scale | Approximately 1 million hours of open-source and collected speech data |
| Recognition modes | Offline decoding and true streaming decoding |
| Streaming chunks | 160 ms, 480 ms, 960 ms, 1920 ms |
| Text output | Supports punctuation and casing |

📊 Evaluation

The following results are for the current X-ASR-zh-en release. Values are WER/CER percentages; lower is better. All results are reported with greedy search.

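| Mode | Chunk size | LibriSpeech clean | LibriSpeech other | GigaSpeech | WenetSpeech net | WenetSpeech meeting |
| --- | --- | --- | --- | --- | --- | --- |
| Streaming | 160 ms | 3.91 | 10.17 | 10.97 | 9.45 | 12.04 |
| Streaming | 480 ms | 3.14 | 7.57 | 9.77 | 7.38 | 9.31 |
| Streaming | 960 ms | 3.12 | 7.22 | 9.62 | 6.96 | 8.84 |
| Streaming | 1920 ms | 2.84 | 6.47 | 9.46 | 6.42 | 8.03 |
| Offline | - | 2.69 | 5.76 | 9.23 | 5.96 | 7.20 |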

Note: Offline decoding gives the best result in every benchmark column; among the streaming modes, the 1920 ms chunk size is best in every column.
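
One way to read the table is as a latency/accuracy trade-off: how much error each chunk size adds relative to offline decoding. The sketch below recomputes that relative increase from the numbers above. The dictionary layout and the function name `relative_increase` are illustrative choices, not something shipped with the model.

```python
# WER/CER (%) copied from the table above, in column order:
# LibriSpeech clean, LibriSpeech other, GigaSpeech,
# WenetSpeech net, WenetSpeech meeting.
RESULTS = {
    "160 ms": [3.91, 10.17, 10.97, 9.45, 12.04],
    "480 ms": [3.14, 7.57, 9.77, 7.38, 9.31],
    "960 ms": [3.12, 7.22, 9.62, 6.96, 8.84],
    "1920 ms": [2.84, 6.47, 9.46, 6.42, 8.03],
    "offline": [2.69, 5.76, 9.23, 5.96, 7.20],
}


def relative_increase(mode: str, baseline: str = "offline") -> list:
    """Per-column error increase of `mode` over `baseline`, in percent."""
    return [
        100.0 * (m - b) / b
        for m, b in zip(RESULTS[mode], RESULTS[baseline])
    ]


if __name__ == "__main__":
    for mode in RESULTS:
        cells = "  ".join(f"{x:6.1f}%" for x in relative_increase(mode))
        print(f"{mode:>8}: {cells}")
```

For example, on LibriSpeech clean the 160 ms variant is roughly 45% worse than offline decoding in relative terms, while the 1920 ms variant is within about 6%.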

Public Benchmark Model Comparison

The following table compares representative ASR models on the same public benchmark columns. Models are ranked by AVG, the unweighted mean of the five listed columns; lower is better. Parameter sizes are shown when provided by the source sheet.

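| Rank | Model | Params | LibriSpeech clean | LibriSpeech other | GigaSpeech | WenetSpeech net | WenetSpeech meeting | AVG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Qwen3-ASR | 1.7B | 1.65 | 3.45 | 8.56 | 5.29 | 5.46 | 4.882 |
| 2 | Qwen3-ASR | 0.6B | 2.18 | 4.54 | 8.94 | 5.97 | 6.88 | 5.702 |
| 3 | X-ASR-zh-en (offline) | 0.16B | 2.56 | 5.56 | 9.17 | 5.83 | 7.06 | 6.036 |
| 4 | SenseVoice-small | 234M | 3.16 | 7.21 | 11.24 | 5.73 | 6.47 | 6.762 |
| 5 | VibeVoice-ASR | 9B | 2.18 | 5.65 | 9.49 | 14.45 | 17.19 | 9.792 |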
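
The AVG column and the ranking can be reproduced from the five benchmark columns. The sketch below recomputes both from the values in the table above; the `ROWS` layout and the function name `rank_by_avg` are illustrative, and the only assumption is that AVG is a plain arithmetic mean, which matches the listed values.

```python
from statistics import mean

# (model, params, [LibriSpeech clean, LibriSpeech other, GigaSpeech,
#                  WenetSpeech net, WenetSpeech meeting]), from the table above.
ROWS = [
    ("Qwen3-ASR", "1.7B", [1.65, 3.45, 8.56, 5.29, 5.46]),
    ("Qwen3-ASR", "0.6B", [2.18, 4.54, 8.94, 5.97, 6.88]),
    ("X-ASR-zh-en (offline)", "0.16B", [2.56, 5.56, 9.17, 5.83, 7.06]),
    ("SenseVoice-small", "234M", [3.16, 7.21, 11.24, 5.73, 6.47]),
    ("VibeVoice-ASR", "9B", [2.18, 5.65, 9.49, 14.45, 17.19]),
]


def rank_by_avg(rows):
    """Return (rank, model, params, avg) tuples sorted by ascending AVG."""
    ordered = sorted(rows, key=lambda row: mean(row[2]))
    return [
        (rank, name, params, round(mean(values), 3))
        for rank, (name, params, values) in enumerate(ordered, start=1)
    ]


if __name__ == "__main__":
    for rank, name, params, avg in rank_by_avg(ROWS):
        print(f"{rank}  {name:<24}{params:<7}{avg:.3f}")
```

Because the mean is unweighted, a single weak column moves a model's rank noticeably: VibeVoice-ASR matches Qwen3-ASR 0.6B on LibriSpeech clean but ranks last because of its WenetSpeech numbers.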

GigaSpeechBench Vertical Domain Evaluation

The following results report GigaSpeechBench vertical-domain performance for the current X-ASR-zh-en release. Values are WER/CER percentages; lower is better. Domain abbreviations follow the GigaSpeechBench vertical-domain labels.

CH

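| Mode | Chunk size | ARG | AIT | ART | BIO | ECM | ENG | ENT | FIN | HUM | LAW | MED | MIL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Streaming | 160 ms | 9.88 | 6.76 | 4.39 | 7.32 | 4.13 | 3.58 | 8.45 | 3.23 | 10.42 | 6.58 | 4.25 | 2.55 |
| Streaming | 480 ms | 8.67 | 6.17 | 3.60 | 6.22 | 3.78 | 3.04 | 7.04 | 2.78 | 9.43 | 5.84 | 3.76 | 2.11 |
| Streaming | 960 ms | 8.00 | 5.69 | 3.44 | 6.10 | 3.69 | 2.88 | 6.71 | 2.72 | 9.07 | 5.58 | 3.69 | 2.11 |
| Streaming | 1920 ms | 7.24 | 5.58 | 3.27 | 5.82 | 3.48 | 2.74 | 6.55 | 2.57 | 8.59 | 4.97 | 3.53 | 1.94 |
| Offline | - | 6.56 | 4.54 | 2.77 | 5.04 | 2.99 | 2.32 | 6.02 | 1.94 | 7.64 | 4.20 | 2.90 | 1.68 |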

EN

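| Mode | Chunk size | ARG | AIT | ART | BIO | ECM | ENG | ENT | FIN | HUM | LAW | MED | MIL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Streaming | 160 ms | 5.29 | 8.57 | 8.55 | 7.31 | 4.33 | 5.01 | 16.25 | 5.58 | 7.36 | 13.39 | 6.03 | 6.20 |
| Streaming | 480 ms | 4.62 | 8.40 | 7.73 | 6.12 | 4.19 | 4.65 | 14.50 | 5.21 | 6.79 | 11.51 | 5.59 | 6.02 |
| Streaming | 960 ms | 4.58 | 8.35 | 7.45 | 6.00 | 4.13 | 4.44 | 13.99 | 5.12 | 6.58 | 10.86 | 5.52 | 6.04 |
| Streaming | 1920 ms | 4.33 | 8.32 | 6.90 | 5.89 | 4.00 | 4.37 | 13.61 | 4.98 | 6.39 | 10.52 | 5.45 | 5.78 |
| Offline | - | 4.09 | 8.28 | 6.73 | 5.48 | 4.12 | 4.30 | 12.30 | 4.94 | 6.17 | 10.41 | 5.35 | 5.61 |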

🎧 Demo

A sherpa-onnx based online demo is available through the Online Demo link at the top of this page. A demo video is provided in the demo/ directory of this repository.

⬇️ Download

Download with HF CLI

hf download GilgameshWind/X-ASR-zh-en \
  --local-dir ./x-asr-zh-en

Clone with Git LFS

git lfs install
git clone https://huggingface.co/GilgameshWind/X-ASR-zh-en
cd X-ASR-zh-en
git lfs pull
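
The same download can be scripted from Python. The sketch below assumes the `huggingface_hub` package is installed and uses its `snapshot_download` function with `allow_patterns` to fetch a single variant folder instead of the whole repository. The helper names `patterns_for_chunk` and `download_variant` are illustrative and not part of this repository.

```python
def patterns_for_chunk(chunk_ms: int) -> list:
    """Glob patterns that select one streaming variant folder."""
    return [f"deployment/models/chunk-{chunk_ms}ms-model/*"]


def download_variant(chunk_ms: int, local_dir: str = "./x-asr-zh-en") -> str:
    """Download one variant folder and return the local directory path.

    Requires network access and the huggingface_hub package.
    """
    # Imported lazily so patterns_for_chunk works without the package.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="GilgameshWind/X-ASR-zh-en",
        local_dir=local_dir,
        allow_patterns=patterns_for_chunk(chunk_ms),
    )


if __name__ == "__main__":
    print(patterns_for_chunk(160))
```

Calling `download_variant(160)` would fetch only `deployment/models/chunk-160ms-model/`. The server and client scripts under `deployment/infer_and_client/` are not matched by that pattern, so add further patterns if they are needed.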

🚀 Deployment

The recommended runtime is sherpa-onnx. The shortest path is to use the deployment package in this repository.

cd deployment
python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -r requirements.txt

Start a CPU streaming server with the 160 ms model:

python infer_and_client/sherpa_streaming_server.py \
  --host 0.0.0.0 \
  --port 8766 \
  --tokens models/chunk-160ms-model/tokens.txt \
  --encoder models/chunk-160ms-model/encoder-160ms.onnx \
  --decoder models/chunk-160ms-model/decoder-160ms.onnx \
  --joiner models/chunk-160ms-model/joiner-160ms.onnx \
  --provider cpu \
  --sample-rate 16000 \
  --feature-dim 80 \
  --num-threads 1 \
  --decoding-method greedy_search \
  --model-type zipformer2 \
  --enable-endpoint-detection 0 \
  --text-format none
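
For quick experiments without the WebSocket server, the same model files can be loaded in-process. The sketch below assumes the `sherpa-onnx` Python package and NumPy are installed and uses `sherpa_onnx.OnlineRecognizer.from_transducer` with the same tokens, encoder, decoder, joiner, sample rate, feature dimension, thread count, and decoding method as the server command above. The helper names, the 100 ms feed size, and the half second of tail padding are assumptions for illustration, not the behavior of the repository's server script.

```python
import wave


def transducer_kwargs(model_dir: str, chunk_ms: int) -> dict:
    """Keyword arguments mirroring the server flags shown above."""
    return {
        "tokens": f"{model_dir}/tokens.txt",
        "encoder": f"{model_dir}/encoder-{chunk_ms}ms.onnx",
        "decoder": f"{model_dir}/decoder-{chunk_ms}ms.onnx",
        "joiner": f"{model_dir}/joiner-{chunk_ms}ms.onnx",
        "num_threads": 1,
        "sample_rate": 16000,
        "feature_dim": 80,
        "decoding_method": "greedy_search",
        "provider": "cpu",
    }


def read_wav_mono16(path: str):
    """Read a 16-bit mono WAV; return (sample_rate, float32 samples in [-1, 1))."""
    import numpy as np

    with wave.open(path, "rb") as f:
        if f.getnchannels() != 1 or f.getsampwidth() != 2:
            raise ValueError("expected a 16-bit mono WAV file")
        pcm = np.frombuffer(f.readframes(f.getnframes()), dtype=np.int16)
        return f.getframerate(), pcm.astype(np.float32) / 32768.0


def transcribe_streaming(wav_path, model_dir, chunk_ms=160, feed_ms=100):
    """Feed a WAV file to the streaming recognizer in small pieces."""
    import numpy as np
    import sherpa_onnx

    recognizer = sherpa_onnx.OnlineRecognizer.from_transducer(
        **transducer_kwargs(model_dir, chunk_ms)
    )
    sample_rate, samples = read_wav_mono16(wav_path)
    stream = recognizer.create_stream()
    step = max(1, sample_rate * feed_ms // 1000)  # samples per feed
    for start in range(0, len(samples), step):
        stream.accept_waveform(sample_rate, samples[start:start + step])
        while recognizer.is_ready(stream):
            recognizer.decode_stream(stream)
    # Assumed tail padding so the last frames are flushed through the model.
    tail = np.zeros(int(0.5 * sample_rate), dtype=np.float32)
    stream.accept_waveform(sample_rate, tail)
    stream.input_finished()
    while recognizer.is_ready(stream):
        recognizer.decode_stream(stream)
    return recognizer.get_result(stream)
```

A call such as `transcribe_streaming("/path/to/test.wav", "models/chunk-160ms-model")` would return the recognized text. The `--model-type`, `--enable-endpoint-detection`, and `--text-format` flags from the server command are left out here; how they map onto the Python API is not covered by this model card.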

Test it with a WAV file:

python infer_and_client/sherpa_streaming_client.py \
  --server-uri ws://127.0.0.1:8766 \
  --wav /path/to/test.wav \
  --chunk-ms 100 \
  --simulate-realtime 1
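
The client flags suggest how audio is paced: `--chunk-ms 100` at the server's 16 kHz sample rate corresponds to 1600 samples per message, and real-time simulation presumably means waiting one chunk duration between sends. The sketch below illustrates that arithmetic only. It is an assumption about what the flags mean, not the client's actual implementation, and `send` is a hypothetical callback standing in for the WebSocket write.

```python
import time


def chunk_bounds(num_samples: int, sample_rate: int = 16000, chunk_ms: int = 100):
    """Return (start, end) sample indices for consecutive chunks."""
    step = sample_rate * chunk_ms // 1000
    if step <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    return [
        (start, min(start + step, num_samples))
        for start in range(0, num_samples, step)
    ]


def paced_send(samples, send, sample_rate=16000, chunk_ms=100,
               simulate_realtime=True, sleep=time.sleep):
    """Send `samples` chunk by chunk, optionally pausing to mimic live audio.

    `send` is a hypothetical callback; `sleep` is injectable for testing.
    """
    for start, end in chunk_bounds(len(samples), sample_rate, chunk_ms):
        send(samples[start:end])
        if simulate_realtime:
            # Wait for as long as the chunk would take to be spoken.
            sleep((end - start) / sample_rate)


if __name__ == "__main__":
    # 0.25 s of silence at 16 kHz, sent without pausing.
    paced_send([0.0] * 4000, lambda c: print(len(c), "samples"),
               simulate_realtime=False)
```

Note that the client's send size (100 ms here) is independent of the model's chunk size (160 ms in the server example): the server can buffer incoming audio until it has enough to decode.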

For complete runtime options, see deployment/README.md.

⚠️ Intended Use and Limitations

  • This release is intended for Chinese-English ASR research, evaluation, demos, and deployment experiments.
  • The current release focuses on streaming and offline-streaming unified recognition.
  • Production latency depends on hardware, concurrency, audio chunking, endpointing, and server configuration.
  • The technical report with training details, evaluation protocol, ablations, and additional analysis is coming soon.

📄 Citation

The X-ASR-zh-en technical report is coming soon. Please cite the report once it is released. For now, refer to this model card and the GitHub project page.

📜 License

This model is released under the Apache-2.0 License.

πŸ™ Acknowledgements

This model is trained with icefall and deployed with sherpa-onnx.
