language: en
datasets:
- librispeech
metrics:
- wer
pipeline_tag: automatic-speech-recognition
tags:
- transcription
- audio
- speech
- chunkformer
- asr
- automatic-speech-recognition
- long-form transcription
- librispeech
license: cc-by-nc-4.0
model-index:
- name: ChunkFormer-Large-En-Libri-960h
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: test-clean
      type: librispeech
      args: en
    metrics:
    - name: Test WER
      type: wer
      value: 2.69
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: test-other
      type: librispeech
      args: en
    metrics:
    - name: Test WER
      type: wer
      value: 6.91
ChunkFormer-Large-En-Libri-960h: ChunkFormer-Large pretrained on 960 hours of the LibriSpeech dataset
ATTENTION: Input audio must be mono (1 channel) with a 16 kHz (16,000 Hz) sample rate.
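The requirement above can be checked before decoding. The following is a minimal sketch that uses only Python's standard `wave` module, so it assumes the input is an uncompressed PCM WAV file; the helper name `check_wav_format` is hypothetical and is not part of the ChunkFormer repository.

```python
import wave

REQUIRED_CHANNELS = 1         # mono
REQUIRED_SAMPLE_RATE = 16000  # Hz


def check_wav_format(path):
    """Return a list of problems found in the WAV header (empty if none)."""
    # Hypothetical helper: reads only the header, not the audio samples.
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_rate = wav.getframerate()

    problems = []
    if channels != REQUIRED_CHANNELS:
        problems.append(f"expected 1 channel, found {channels}")
    if sample_rate != REQUIRED_SAMPLE_RATE:
        problems.append(f"expected 16000 Hz, found {sample_rate} Hz")
    return problems
```

An empty list means the file already meets the stated format; otherwise the audio needs to be downmixed or resampled with a tool of your choice before it is passed to the model.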
Model Description
ChunkFormer-Large-En-Libri-960h is an English Automatic Speech Recognition (ASR) model based on the ChunkFormer architecture, introduced at ICASSP 2025. The model has been fine-tuned on 960 hours of LibriSpeech, a widely used dataset for ASR research.
Documentation and Implementation
The documentation and implementation of ChunkFormer are publicly available on GitHub at https://github.com/khanld/chunkformer.
Benchmark Results
We evaluate the models using Word Error Rate (WER, in %; lower is better). To ensure a fair comparison, all models are trained exclusively with the WeNet framework.
No. | Model | Test-Clean | Test-Other | Avg. |
---|---|---|---|---|
1 | ChunkFormer | 2.69 | 6.91 | 4.80 |
2 | Efficient Conformer | 2.71 | 6.95 | 4.83 |
3 | Conformer | 2.77 | 6.93 | 4.85 |
4 | Squeezeformer | 2.87 | 7.16 | 5.02 |
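For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the model output, divided by the number of reference words. The sketch below illustrates that definition only; it is not the scoring script used for the table, the function name `word_error_rate` is hypothetical, and it assumes both strings are already normalized and whitespace-tokenized.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One word of a four-word reference is missing from the hypothesis.
print(word_error_rate("this is a test", "this is test"))  # 0.25
```

Multiplying the returned fraction by 100 gives a percentage on the same scale as the table.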
Quick Usage
To use the ChunkFormer model for English Automatic Speech Recognition, follow these steps:
- Download the ChunkFormer Repository
git clone https://github.com/khanld/chunkformer.git
cd chunkformer
pip install -r requirements.txt
- Download the Model Checkpoint from Hugging Face
pip install huggingface_hub
huggingface-cli download khanhld/chunkformer-large-en-libri-960h --local-dir "./chunkformer-large-en-libri-960h"
or
git lfs install
git clone https://huggingface.co/khanhld/chunkformer-large-en-libri-960h
Either command downloads the model checkpoint into a chunkformer-large-en-libri-960h folder in the current directory.
- Run the model
# --total_batch_duration is in seconds (default: 1800)
python decode.py \
--model_checkpoint path/to/local/chunkformer-large-en-libri-960h \
--long_form_audio path/to/audio.wav \
--total_batch_duration 14400 \
--chunk_size 64 \
--left_context_size 128 \
--right_context_size 128
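When several files need to be transcribed, the same command can be assembled from Python. This is a minimal sketch that uses only the flags shown above; the helper names `build_decode_command` and `run_decode` are hypothetical, the default of 1800 for `total_batch_duration` comes from the comment in the command, and the remaining defaults are simply copied from the example values rather than taken from `decode.py`.

```python
import subprocess
import sys


def build_decode_command(model_checkpoint, audio_path,
                         total_batch_duration=1800,
                         chunk_size=64,
                         left_context_size=128,
                         right_context_size=128):
    """Build the decode.py argument list using only the flags shown above."""
    return [
        sys.executable, "decode.py",
        "--model_checkpoint", str(model_checkpoint),
        "--long_form_audio", str(audio_path),
        "--total_batch_duration", str(total_batch_duration),  # seconds
        "--chunk_size", str(chunk_size),
        "--left_context_size", str(left_context_size),
        "--right_context_size", str(right_context_size),
    ]


def run_decode(model_checkpoint, audio_path, **options):
    """Run decode.py; must be called from the cloned chunkformer directory."""
    command = build_decode_command(model_checkpoint, audio_path, **options)
    return subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.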
Example Output:
[00:00:01.200] - [00:00:02.400]: this is a transcription example
[00:00:02.500] - [00:00:03.700]: testing the long-form audio
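If the transcript is needed in structured form, lines like the ones above can be parsed into start time, end time, and text. The following sketch assumes the output follows exactly the `[HH:MM:SS.mmm] - [HH:MM:SS.mmm]: text` layout of the example; the name `parse_segments` is hypothetical, and lines that do not match the pattern are skipped.

```python
import re

# Matches "[HH:MM:SS.mmm] - [HH:MM:SS.mmm]: text" as in the example output.
SEGMENT_PATTERN = re.compile(
    r"^\[(\d+):(\d{2}):(\d{2}\.\d+)\] - \[(\d+):(\d{2}):(\d{2}\.\d+)\]: (.*)$"
)


def _to_seconds(hours, minutes, seconds):
    """Convert the three timestamp fields to seconds as a float."""
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def parse_segments(output_text):
    """Return a list of (start_seconds, end_seconds, text) tuples."""
    segments = []
    for line in output_text.splitlines():
        match = SEGMENT_PATTERN.match(line.strip())
        if match is None:
            continue  # not a transcript segment line
        h1, m1, s1, h2, m2, s2, text = match.groups()
        segments.append(
            (_to_seconds(h1, m1, s1), _to_seconds(h2, m2, s2), text)
        )
    return segments
```

Each tuple can then be written out as subtitles or joined into a plain transcript.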
Advanced usage is described in the ChunkFormer repository: https://github.com/khanld/chunkformer
Citation
If you use this work in your research, please cite:
@inproceedings{chunkformer,
title={ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription},
author={Khanh Le and Tuan Vu Ho and Dung Tran and Duc Thanh Chau},
booktitle={ICASSP},
year={2025}
}