metadata

language: en
datasets:
  - librispeech
metrics:
  - wer
pipeline_tag: automatic-speech-recognition
tags:
  - transcription
  - audio
  - speech
  - chunkformer
  - asr
  - automatic-speech-recognition
  - long-form transcription
  - librispeech
license: cc-by-nc-4.0
model-index:
  - name: ChunkFormer-Large-En-Libri-960h
    results:
      - task:
          name: Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: test-clean
          type: librispeech
          args: en
        metrics:
          - name: Test WER
            type: wer
            value: 2.69
      - task:
          name: Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: test-other
          type: librispeech
          args: en
        metrics:
          - name: Test WER
            type: wer
            value: 6.91

ChunkFormer-Large-En-Libri-960h: ChunkFormer-Large Pretrained on 960 Hours of LibriSpeech

ATTENTION: Input audio must be mono (1 channel) with a sample rate of 16,000 Hz (16 kHz).
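A quick way to verify this requirement before decoding is to read the WAV header. The following is a minimal sketch using only Python's standard `wave` module; the function names `wav_format` and `is_supported` are hypothetical helpers, not part of ChunkFormer, and the `wave` module only reads PCM-encoded WAV files.

```python
import wave

EXPECTED_CHANNELS = 1       # mono
EXPECTED_RATE_HZ = 16_000   # 16 kHz


def wav_format(path):
    """Return (channels, sample_rate_hz) read from a PCM WAV file header."""
    with wave.open(path, "rb") as wav:
        return wav.getnchannels(), wav.getframerate()


def is_supported(path):
    """Return True if the file is mono and sampled at 16 kHz."""
    return wav_format(path) == (EXPECTED_CHANNELS, EXPECTED_RATE_HZ)
```

If `is_supported("path/to/audio.wav")` returns `False`, the file needs to be downmixed and/or resampled with an audio tool of your choice before it is passed to the model.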

Model Description
Documentation and Implementation
Benchmark Results
Quick Usage
Citation
Contact

Model Description

ChunkFormer-Large-En-Libri-960h is an English Automatic Speech Recognition (ASR) model based on the ChunkFormer architecture, introduced at ICASSP 2025. The model was trained on 960 hours of LibriSpeech, a widely used dataset for ASR research.

Documentation and Implementation

The documentation and implementation of ChunkFormer are publicly available at https://github.com/khanld/chunkformer.

Benchmark Results

We evaluate the models using Word Error Rate (WER, in %; lower is better). To ensure a fair comparison, all models are trained exclusively with the WeNet framework.

No.	Model	Test-Clean	Test-Other	Avg.
1	ChunkFormer	2.69	6.91	4.80
2	Efficient Conformer	2.71	6.95	4.83
3	Conformer	2.77	6.93	4.85
4	Squeezeformer	2.87	7.16	5.02
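For readers unfamiliar with the metric, WER is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below is a plain-Python illustration of that definition for a single utterance; it is not the scoring script used for the table, and it assumes whitespace tokenization with no text normalization.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For example, `word_error_rate("this is a transcription example", "this is transcription example")` counts one deletion over five reference words. The function returns a fraction; multiply by 100 to compare with the percentages in the table.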

Quick Usage

To use the ChunkFormer model for English Automatic Speech Recognition, follow these steps:

Download the ChunkFormer Repository

git clone https://github.com/khanld/chunkformer.git
cd chunkformer
pip install -r requirements.txt

Download the Model Checkpoint from Hugging Face

Option 1: use the Hugging Face CLI.

pip install huggingface_hub
huggingface-cli download khanhld/chunkformer-large-en-libri-960h --local-dir "./chunkformer-large-en-libri-960h"

Option 2: clone the model repository with Git LFS.

git lfs install
git clone https://huggingface.co/khanhld/chunkformer-large-en-libri-960h

Either option places the model checkpoint in a chunkformer-large-en-libri-960h folder inside the current directory (the chunkformer directory, if you are following these steps in order).
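The download can also be scripted from Python. This is a minimal sketch that uses `snapshot_download` from the `huggingface_hub` package with the same repository ID and target folder as the CLI command; the wrapper name `download_checkpoint` is a hypothetical helper, and calling it requires network access.

```python
REPO_ID = "khanhld/chunkformer-large-en-libri-960h"
LOCAL_DIR = "./chunkformer-large-en-libri-960h"


def download_checkpoint(repo_id=REPO_ID, local_dir=LOCAL_DIR):
    """Download every file of the model repository and return the local folder path."""
    # Imported lazily so this module can be loaded without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


# Usage (requires network access):
#     checkpoint_dir = download_checkpoint()
```

The returned path can then be passed as the checkpoint directory when running the model.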

Run the model

# --total_batch_duration is in seconds (default: 1800)
python decode.py \
    --model_checkpoint path/to/local/chunkformer-large-en-libri-960h \
    --long_form_audio path/to/audio.wav \
    --total_batch_duration 14400 \
    --chunk_size 64 \
    --left_context_size 128 \
    --right_context_size 128

Example Output:

[00:00:01.200] - [00:00:02.400]: this is a transcription example
[00:00:02.500] - [00:00:03.700]: testing the long-form audio
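For scripted use, the command and its output can be handled from Python. The sketch below assembles the `decode.py` argument list from the flags shown above and parses transcript lines in the example format into numeric timestamps. Both function names are hypothetical helpers, the line format is inferred only from the example output, and whether `decode.py` writes these lines to standard output is an assumption.

```python
import re


def build_decode_command(checkpoint_dir, audio_path, total_batch_duration=1800,
                         chunk_size=64, left_context_size=128, right_context_size=128):
    """Return the decode.py argument list, using only the flags shown above."""
    return [
        "python", "decode.py",
        "--model_checkpoint", checkpoint_dir,
        "--long_form_audio", audio_path,
        "--total_batch_duration", str(total_batch_duration),  # in seconds
        "--chunk_size", str(chunk_size),
        "--left_context_size", str(left_context_size),
        "--right_context_size", str(right_context_size),
    ]


# Matches lines such as "[00:00:01.200] - [00:00:02.400]: some text".
SEGMENT_RE = re.compile(
    r"^\[(\d+):(\d{2}):(\d{2}\.\d+)\] - \[(\d+):(\d{2}):(\d{2}\.\d+)\]: (.*)$"
)


def _to_seconds(hours, minutes, seconds):
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def parse_segments(text):
    """Parse transcript lines into (start_seconds, end_seconds, text) tuples."""
    segments = []
    for line in text.splitlines():
        match = SEGMENT_RE.match(line.strip())
        if match is None:
            continue  # skip anything that is not a transcript line
        h1, m1, s1, h2, m2, s2, words = match.groups()
        segments.append((_to_seconds(h1, m1, s1), _to_seconds(h2, m2, s2), words))
    return segments
```

The list returned by `build_decode_command` can be passed to `subprocess.run` from inside the chunkformer directory. The context and chunk sizes default to the values used in the command above, which are not necessarily the defaults of `decode.py`.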

Advanced usage is described in the ChunkFormer repository (https://github.com/khanld/chunkformer).

Citation

If you use this work in your research, please cite:

@inproceedings{chunkformer,
  title={ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription},
  author={Khanh Le and Tuan Vu Ho and Dung Tran and Duc Thanh Chau},
  booktitle={ICASSP},
  year={2025}
}

Contact

khanhld218@gmail.com
