---
language: en
datasets:
- librispeech
metrics:
- wer
pipeline_tag: automatic-speech-recognition
tags:
- transcription
- audio
- speech
- chunkformer
- asr
- automatic-speech-recognition
- long-form transcription
- librispeech
license: cc-by-nc-4.0
model-index:
- name: ChunkFormer-Large-En-Libri-960h
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: test-clean
      type: librispeech
      args: en
    metrics:
    - name: Test WER
      type: wer
      value: 2.69
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: test-other
      type: librispeech
      args: en
    metrics:
    - name: Test WER
      type: wer
      value: 6.91
---
# **ChunkFormer-Large-En-Libri-960h: Pretrained ChunkFormer-Large on 960 hours of the LibriSpeech dataset**
<style>
img {
display: inline;
}
</style>
[License: CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/)
[GitHub](https://github.com/khanld/chunkformer)
[Paper](https://arxiv.org/abs/2502.14673)
[Model Description](#description)
**!!!ATTENTION: Input audio must be MONO (1 channel) at a 16,000 Hz sampling rate**
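If your audio is stereo or recorded at a different sampling rate, convert it before decoding. Below is a minimal sketch using `torchaudio` (an assumption; any resampling tool such as ffmpeg or sox works equally well):
```python
# Illustrative preprocessing sketch (not part of the official repo):
# convert an arbitrary audio file to mono 16 kHz before decoding.
import torchaudio

waveform, sample_rate = torchaudio.load("input.wav")  # shape: (channels, samples)

# Downmix to mono by averaging channels, if needed.
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)

# Resample to 16 kHz, if needed.
if sample_rate != 16000:
    waveform = torchaudio.transforms.Resample(sample_rate, 16000)(waveform)

torchaudio.save("audio_mono16k.wav", waveform, 16000)
```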
---
## Table of contents
1. [Model Description](#description)
2. [Documentation and Implementation](#implementation)
3. [Benchmark Results](#benchmark)
4. [Usage](#usage)
5. [Citation](#citation)
6. [Contact](#contact)
---
<a name = "description" ></a>
## Model Description
**ChunkFormer-Large-En-Libri-960h** is an English Automatic Speech Recognition (ASR) model based on the **ChunkFormer** architecture, introduced at **ICASSP 2025**. The model was fine-tuned on the full 960 hours of LibriSpeech, a widely used dataset for ASR research.
---
<a name = "implementation" ></a>
## Documentation and Implementation
The [documentation and implementation](https://github.com/khanld/chunkformer) of ChunkFormer are publicly available.
---
<a name = "benchmark" ></a>
## Benchmark Results
We evaluate the models using **Word Error Rate (WER)**. To ensure a fair comparison, all models were trained exclusively with the [**WeNet**](https://github.com/wenet-e2e/wenet) framework.
| No. | Model                    | Test-Clean | Test-Other | Avg. |
|-----|--------------------------|------------|------------|------|
| 1 | **ChunkFormer** | 2.69 | 6.91 | 4.80 |
| 2 | **Efficient Conformer** | 2.71 | 6.95 | 4.83 |
| 3 | **Conformer** | 2.77 | 6.93 | 4.85 |
| 4 | **Squeezeformer** | 2.87 | 7.16 | 5.02 |
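For reference, WER counts word-level substitutions, deletions, and insertions against a reference transcript. The official numbers above come from the WeNet evaluation pipeline; the snippet below is only an illustrative sketch using the third-party `jiwer` package (an assumption, not the tooling used for the table):
```python
# Illustrative WER computation with the third-party `jiwer` package.
from jiwer import wer

reference = "this is a transcription example"
hypothesis = "this is transcription example"

# One deletion out of five reference words -> WER of 20%.
print(f"WER: {wer(reference, hypothesis):.2%}")
```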
---
<a name = "usage" ></a>
## Quick Usage
To use the ChunkFormer model for English Automatic Speech Recognition, follow these steps:
1. **Download the ChunkFormer Repository**
```bash
git clone https://github.com/khanld/chunkformer.git
cd chunkformer
pip install -r requirements.txt
```
2. **Download the Model Checkpoint from Hugging Face**
```bash
pip install huggingface_hub
huggingface-cli download khanhld/chunkformer-large-en-libri-960h --local-dir "./chunkformer-large-en-libri-960h"
```
or
```bash
git lfs install
git clone https://huggingface.co/khanhld/chunkformer-large-en-libri-960h
```
This downloads the model checkpoint into the `chunkformer-large-en-libri-960h` folder inside your `chunkformer` directory.
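If you prefer scripting the download in Python instead of using the CLI, `snapshot_download` from `huggingface_hub` is equivalent (a minimal sketch, assuming the same repo id and target folder as above):
```python
# Download the full model repository to a local folder.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="khanhld/chunkformer-large-en-libri-960h",
    local_dir="./chunkformer-large-en-libri-960h",
)
```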
3. **Run the model**
```bash
# --total_batch_duration is given in seconds (default: 1800)
python decode.py \
    --model_checkpoint path/to/local/chunkformer-large-en-libri-960h \
    --long_form_audio path/to/audio.wav \
    --total_batch_duration 14400 \
    --chunk_size 64 \
    --left_context_size 128 \
    --right_context_size 128
```
Example Output:
```
[00:00:01.200] - [00:00:02.400]: this is a transcription example
[00:00:02.500] - [00:00:03.700]: testing the long-form audio
```
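The timestamped lines above are plain text, so they are easy to post-process. The helper below is a hypothetical sketch (not part of the repo) that parses each line into `(start_seconds, end_seconds, text)`:
```python
# Hypothetical parser for decode.py's timestamped output lines.
import re

LINE_RE = re.compile(
    r"\[(\d{2}):(\d{2}):(\d{2})\.(\d{3})\]\s*-\s*"
    r"\[(\d{2}):(\d{2}):(\d{2})\.(\d{3})\]:\s*(.*)"
)

def to_seconds(h, m, s, ms):
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000

def parse_line(line):
    """Return (start_seconds, end_seconds, text), or None if no match."""
    m = LINE_RE.match(line.strip())
    if not m:
        return None
    g = m.groups()
    return to_seconds(*g[0:4]), to_seconds(*g[4:8]), g[8]

print(parse_line("[00:00:01.200] - [00:00:02.400]: this is a transcription example"))
# (1.2, 2.4, 'this is a transcription example')
```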
**Advanced Usage** can be found [HERE](https://github.com/khanld/chunkformer/tree/main?tab=readme-ov-file#usage)
---
<a name = "citation" ></a>
## Citation
If you use this work in your research, please cite:
```bibtex
@inproceedings{chunkformer,
title={ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription},
  author={Khanh Le and Tuan Vu Ho and Dung Tran and Duc Thanh Chau},
booktitle={ICASSP},
year={2025}
}
```
---
<a name = "contact"></a>
## Contact
- khanhld218@gmail.com
- [GitHub](https://github.com/khanld)
- [LinkedIn](https://www.linkedin.com/in/khanhld257/)