---
language: en
datasets:
  - librispeech
metrics:
  - wer
pipeline_tag: automatic-speech-recognition
tags:
  - transcription
  - audio
  - speech
  - chunkformer
  - asr
  - automatic-speech-recognition
  - long-form transcription
  - librispeech
license: cc-by-nc-4.0
model-index:
  - name: ChunkFormer-Large-En-Libri-960h
    results:
      - task:
          name: Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: test-clean
          type: librispeech
          args: en
        metrics:
          - name: Test WER
            type: wer
            value: 2.69
      - task:
          name: Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: test-other
          type: librispeech
          args: en
        metrics:
          - name: Test WER
            type: wer
            value: 6.91
---

# ChunkFormer-Large-En-Libri-960h: ChunkFormer-Large Pretrained on 960 Hours of LibriSpeech

License: CC BY-NC 4.0 | [GitHub](https://github.com/khanld/chunkformer)

> **ATTENTION**: Input audio must be MONO (1 channel) at a 16,000 Hz sample rate.
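If your audio is stereo or sampled at a different rate, the sketch below shows one way to convert it with `torchaudio`. This is an illustration, not part of this repo's documented workflow, and the file names are placeholders:

```python
import torchaudio

# Convert an arbitrary audio file to what ChunkFormer expects:
# MONO (1 channel) at a 16,000 Hz sample rate.
waveform, sample_rate = torchaudio.load("input.wav")  # shape: (channels, samples)
if waveform.size(0) > 1:
    waveform = waveform.mean(dim=0, keepdim=True)     # downmix to mono
if sample_rate != 16000:
    resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
    waveform = resampler(waveform)
torchaudio.save("audio_16k_mono.wav", waveform, 16000)
```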

## Table of Contents

  1. Model Description
  2. Documentation and Implementation
  3. Benchmark Results
  4. Usage
  5. Citation
  6. Contact

## Model Description

ChunkFormer-Large-En-Libri-960h is an English Automatic Speech Recognition (ASR) model based on the ChunkFormer architecture, introduced at ICASSP 2025. The model has been fine-tuned on 960 hours of LibriSpeech, a widely used dataset for ASR research.


## Documentation and Implementation

The documentation and implementation of ChunkFormer are publicly available in the [ChunkFormer GitHub repository](https://github.com/khanld/chunkformer).


## Benchmark Results

We evaluate the models using Word Error Rate (WER). To ensure a fair comparison, all models were trained exclusively with the WeNet framework.

| No. | Model | test-clean WER (%) | test-other WER (%) | Avg. |
|-----|-------|--------------------|--------------------|------|
| 1 | ChunkFormer | 2.69 | 6.91 | 4.80 |
| 2 | Efficient Conformer | 2.71 | 6.95 | 4.83 |
| 3 | Conformer | 2.77 | 6.93 | 4.85 |
| 4 | Squeezeformer | 2.87 | 7.16 | 5.02 |
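For reference, WER counts word-level substitutions, deletions, and insertions against a reference transcript. A toy computation with the third-party `jiwer` package (not the evaluation script used for the numbers above) looks like this:

```python
# pip install jiwer
import jiwer

reference = "this is a transcription example"
hypothesis = "this is transcription example"  # one deleted word
print(jiwer.wer(reference, hypothesis))       # 1 error / 5 words = 0.2
```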

## Usage

To use the ChunkFormer model for English Automatic Speech Recognition, follow these steps:

1. **Download the ChunkFormer repository**

```bash
git clone https://github.com/khanld/chunkformer.git
cd chunkformer
pip install -r requirements.txt
```
2. **Download the model checkpoint from Hugging Face**

```bash
pip install huggingface_hub
huggingface-cli download khanhld/chunkformer-large-en-libri-960h --local-dir "./chunkformer-large-en-libri-960h"
```

or

```bash
git lfs install
git clone https://huggingface.co/khanhld/chunkformer-large-en-libri-960h
```

This downloads the model checkpoint into the `chunkformer-large-en-libri-960h` folder inside your `chunkformer` directory.
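Equivalently, the checkpoint can be fetched from Python with `huggingface_hub` (the same package installed above; the target directory mirrors the CLI command):

```python
from huggingface_hub import snapshot_download

# Download the full model repo into a local folder.
snapshot_download(
    repo_id="khanhld/chunkformer-large-en-libri-960h",
    local_dir="./chunkformer-large-en-libri-960h",
)
```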

3. **Run the model**

```bash
# --total_batch_duration is given in seconds; the default is 1800
python decode.py \
    --model_checkpoint path/to/local/chunkformer-large-en-libri-960h \
    --long_form_audio path/to/audio.wav \
    --total_batch_duration 14400 \
    --chunk_size 64 \
    --left_context_size 128 \
    --right_context_size 128
```
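As rough intuition for these flags (an assumption on our part, not stated in this card): with the 10 ms feature frames and 8x subsampling common to Conformer-style encoders, the sizes above translate to seconds roughly as follows; check the ChunkFormer paper for the exact figures.

```python
# ASSUMPTION: 10 ms features with 8x subsampling => 80 ms per encoder frame.
frame_sec = 0.01 * 8
chunk, left, right = 64, 128, 128  # values from the command above
print(f"chunk  ~ {chunk * frame_sec:.2f} s")                   # ~5.12 s
print(f"window ~ {(left + chunk + right) * frame_sec:.2f} s")  # ~25.60 s
```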

Example output:

```
[00:00:01.200] - [00:00:02.400]: this is a transcription example
[00:00:02.500] - [00:00:03.700]: testing the long-form audio
```
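If you need the segments programmatically, here is a small sketch that parses this output into (start, end, text) tuples, assuming decode.py keeps the exact layout shown above:

```python
import re

# Matches "[HH:MM:SS.mmm] - [HH:MM:SS.mmm]: text"
SEGMENT = re.compile(r"\[(\d+:\d+:\d+\.\d+)\] - \[(\d+:\d+:\d+\.\d+)\]: (.*)")

def parse_segments(lines):
    segments = []
    for line in lines:
        match = SEGMENT.match(line.strip())
        if match:
            segments.append(match.groups())  # (start, end, text)
    return segments

print(parse_segments([
    "[00:00:01.200] - [00:00:02.400]: this is a transcription example",
]))
```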

Advanced usage can be found in the [ChunkFormer repository](https://github.com/khanld/chunkformer).


## Citation

If you use this work in your research, please cite:

```bibtex
@inproceedings{chunkformer,
  title={ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription},
  author={Khanh Le and Tuan Vu Ho and Dung Tran and Duc Thanh Chau},
  booktitle={ICASSP},
  year={2025}
}
```

## Contact