---
language: en
datasets:
- librispeech
metrics:
- wer
pipeline_tag: automatic-speech-recognition
tags:
- transcription
- audio
- speech
- chunkformer
- asr
- automatic-speech-recognition
- long-form transcription
- librispeech
license: cc-by-nc-4.0
model-index:
- name: ChunkFormer-Large-En-Libri-960h
  results:
  - task: 
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: test-clean
      type: librispeech
      args: en
    metrics:
       - name: Test WER
         type: wer
         value: 2.69
  - task: 
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: test-other
      type: librispeech
      args: en
    metrics:
       - name: Test WER
         type: wer
         value: 6.91
---

# **ChunkFormer-Large-En-Libri-960h: Pretrained ChunkFormer-Large on 960 hours of the LibriSpeech dataset**
<style>
img {
 display: inline;
}
</style>
[![License: CC BY-NC 4.0](https://img.shields.io/badge/License-CC%20BY--NC%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc/4.0/)
[![GitHub](https://img.shields.io/badge/GitHub-ChunkFormer-blue)](https://github.com/khanld/chunkformer)
[![Paper](https://img.shields.io/badge/Paper-ICASSP%202025-green)](https://arxiv.org/abs/2502.14673)
[![Model size](https://img.shields.io/badge/Params-110M-lightgrey#model-badge)](#description)

**!!!ATTENTION: Input audio must be MONO (1 channel) with a 16,000 Hz sample rate**
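
If your audio is not already mono 16 kHz, it can be converted with ffmpeg (a minimal sketch, assuming ffmpeg is installed; file names are placeholders):
```bash
# Inspect the channel count and sample rate of the input
ffprobe -hide_banner input.wav
# Convert to mono (1 channel) at a 16,000 Hz sample rate
ffmpeg -i input.wav -ac 1 -ar 16000 output_16k_mono.wav
```
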
---
## Table of contents
1. [Model Description](#description)
2. [Documentation and Implementation](#implementation)
3. [Benchmark Results](#benchmark)
4. [Usage](#usage)
5. [Citation](#citation)
6. [Contact](#contact)

---
<a name = "description" ></a>
## Model Description
**ChunkFormer-Large-En-Libri-960h** is an English Automatic Speech Recognition (ASR) model based on the **ChunkFormer** architecture, introduced at **ICASSP 2025**. The model has been fine-tuned on 960 hours of LibriSpeech, a widely used dataset for ASR research.

---
<a name = "implementation" ></a>
## Documentation and Implementation
The documentation and [implementation](https://github.com/khanld/chunkformer) of ChunkFormer are publicly available.

---
<a name = "benchmark" ></a>
## Benchmark Results
We evaluate the models using **Word Error Rate (WER)**. To ensure a fair comparison, all models are trained exclusively with the [**WeNet**](https://github.com/wenet-e2e/wenet) framework.

| No. | Model                   | Test-Clean | Test-Other | Avg. |
|-----|-------------------------|------------|------------|------|
| 1   | **ChunkFormer**         | 2.69       | 6.91       | 4.80 |
| 2   | **Efficient Conformer** | 2.71       | 6.95       | 4.83 |
| 3   | **Conformer**           | 2.77       | 6.93       | 4.85 |
| 4   | **Squeezeformer**       | 2.87       | 7.16       | 5.02 |
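
To reproduce a WER score on your own transcripts, a standard scorer such as [jiwer](https://github.com/jitsi/jiwer) can be used (a sketch only, not the evaluation script used for the table above; the strings are placeholders):
```bash
pip install jiwer
# Compare a reference transcript against a model hypothesis
python -c "import jiwer; print(jiwer.wer('the reference transcript', 'the hypothesis transcript'))"
```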



---
<a name = "usage" ></a>
## Quick Usage
To use the ChunkFormer model for English Automatic Speech Recognition, follow these steps:

1. **Download the ChunkFormer Repository**
```bash
git clone https://github.com/khanld/chunkformer.git
cd chunkformer
pip install -r requirements.txt   
```
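
Optionally, confirm the environment is ready before moving on (a quick sanity check, assuming the repository's requirements include PyTorch):
```bash
python -c "import torch; print(torch.__version__)"
```
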
2. **Download the Model Checkpoint from Hugging Face**
```bash
pip install huggingface_hub
huggingface-cli download khanhld/chunkformer-large-en-libri-960h --local-dir "./chunkformer-large-en-libri-960h"
```
or
```bash
git lfs install
git clone https://huggingface.co/khanhld/chunkformer-large-en-libri-960h
```
Both commands download the model checkpoint into a `chunkformer-large-en-libri-960h` folder inside your `chunkformer` directory.

3. **Run the model**
```bash
# --total_batch_duration is in seconds; the default is 1800
python decode.py \
    --model_checkpoint path/to/local/chunkformer-large-en-libri-960h \
    --long_form_audio path/to/audio.wav \
    --total_batch_duration 14400 \
    --chunk_size 64 \
    --left_context_size 128 \
    --right_context_size 128
```
Example Output:
```
[00:00:01.200] - [00:00:02.400]: this is a transcription example
[00:00:02.500] - [00:00:03.700]: testing the long-form audio
```
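
To transcribe many files at once, the same command can be wrapped in a shell loop (a sketch, assuming `decode.py` prints transcripts to stdout as in the example output above; paths are placeholders):
```bash
# Transcribe every WAV file in a folder, saving one .txt transcript per file
for f in path/to/audio/*.wav; do
    python decode.py \
        --model_checkpoint path/to/local/chunkformer-large-en-libri-960h \
        --long_form_audio "$f" \
        --chunk_size 64 \
        --left_context_size 128 \
        --right_context_size 128 > "${f%.wav}.txt"
done
```
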
**Advanced Usage** can be found [HERE](https://github.com/khanld/chunkformer/tree/main?tab=readme-ov-file#usage)


---
<a name = "citation" ></a>
## Citation
If you use this work in your research, please cite:

```bibtex
@inproceedings{chunkformer,
  title={ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription},
  author={Khanh Le and Tuan Vu Ho and Dung Tran and Duc Thanh Chau},
  booktitle={ICASSP},
  year={2025}
}
```

---
<a name = "contact"></a>
## Contact
- khanhld218@gmail.com
- [![GitHub](https://img.shields.io/badge/github-%23121011.svg?style=for-the-badge&logo=github&logoColor=white)](https://github.com/khanld)
- [![LinkedIn](https://img.shields.io/badge/linkedin-%230077B5.svg?style=for-the-badge&logo=linkedin&logoColor=white)](https://www.linkedin.com/in/khanhld257/)