smajumdar94, vnoroozi committed
Commit
61f9be7
1 Parent(s): b0f57de

Add reference to the cache-aware paper. (#2)

- Add reference to the cache-aware paper. (7a4ebdeb9844203deb5ceea1631603a3ed32c949)


Co-authored-by: Vahid Noroozi <vnoroozi@users.noreply.huggingface.co>

Files changed (1)
  1. README.md +9 -6
README.md CHANGED
@@ -11,9 +11,9 @@ datasets:
  - National-Singapore-Corpus-Part-1
  - National-Singapore-Corpus-Part-6
  - vctk
- - VoxPopuli-(EN)
- - Europarl-ASR-(EN)
- - Multilingual-LibriSpeech-(2000-hours)
+ - VoxPopuli-EN
+ - Europarl-ASR-EN
+ - Multilingual-LibriSpeech-2000hours
  - mozilla-foundation/common_voice_8_0
  - MLCommons/peoples_speech
  thumbnail: null
@@ -66,19 +66,19 @@ img {

  This collection contains large-size versions of cache-aware FastConformer-Hybrid (around 114M parameters) with multiple look-ahead support, trained on a large scale english speech.
  These models are trained for streaming ASR, which be used for streaming applications with a variety of latencies (0ms, 80ms, 480s, 1040ms).
- These are the worst latency and average latency of the model for each case would be half of these numbers.
+ These are the worst latency and average latency of the model for each case would be half of these numbers. You may find more detail and evalution results [here](https://arxiv.org/abs/2312.17279) [5].


  ## Model Architecture

- These models are cache-aware versions of Hybrid FastConfomer which are trained for streaming ASR. You may find more info on cache-aware models here: [Cache-aware Streaming Conformer](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#cache-aware-streaming-conformer).
+ These models are cache-aware versions of Hybrid FastConfomer which are trained for streaming ASR. You may find more info on cache-aware models here: [Cache-aware Streaming Conformer](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#cache-aware-streaming-conformer) [5].
  The models are trained with multiple look-aheads which makes the model to be able to support different latencies.
  To learn on how to switch between different look-ahead, you may read the documentation on the cache-aware models.

  FastConformer [4] is an optimized version of the Conformer model [1], and
  you may find more information on the details of FastConformer here: [Fast-Conformer Model](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#fast-conformer).

- The model is trained in a multitask setup with joint Transducer and CTC decoder loss. You can find more about Hybrid Transducer-CTC training here: [Hybrid Transducer-CTC](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#hybrid-transducer-ctc).
+ The model is trained in a multitask setup with joint Transducer and CTC decoder loss [5]. You can find more about Hybrid Transducer-CTC training here: [Hybrid Transducer-CTC](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#hybrid-transducer-ctc).
  You may also find more on how to switch between the Transducer and CTC decoders in the documentation.


@@ -226,3 +226,6 @@ Check out [Riva live demo](https://developer.nvidia.com/riva#demos).
  [3] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)

  [4] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://arxiv.org/abs/2305.05084)
+
+ [5] [Stateful Conformer with Cache-based Inference for Streaming Automatic Speech Recognition
+ ](https://arxiv.org/abs/2312.17279)