- audio
---

# NVIDIA NEST Large En

The NEST framework is designed for speech self-supervised learning; the resulting model can be used as a frozen speech feature extractor or as weight initialization for downstream speech processing tasks. The NEST-L model has about 115M parameters and is trained on an English dataset of roughly 100K hours. <br>
This model is ready for commercial/non-commercial use. <br>

### License
License to use this model is covered by the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license. By downloading the public and release version of the model, you accept the terms and conditions of the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license.

## References
[1] [NEST: Self-supervised Fast Conformer as All-purpose Seasoning to Speech Processing Tasks](https://arxiv.org/abs/2408.13106) <br>
[2] [NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo) <br>
[3] [Stateful Conformer with Cache-based Inference for Streaming Automatic Speech Recognition](https://arxiv.org/abs/2312.17279) <br>
[5] [Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens](https://arxiv.org/abs/2409.06656) <br>
[6] [Leveraging Pretrained ASR Encoders for Effective and Efficient End-to-End Speech Intent Classification and Slot Filling](https://arxiv.org/abs/2307.07057) <br>

## Model Architecture

**Architecture Type:** NEST [1] <br>
- Augmentor: Speaker/noise augmentation
- Loss: Cross-entropy on masked positions <br>
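Conceptually, this objective is a standard masked-prediction cross-entropy: given the speaker/noise-augmented input, the model predicts the quantized target token at each masked frame. A sketch (notation is ours, not taken from [1]):

$$\mathcal{L} = -\sum_{t \in \mathcal{M}} \log p_\theta\left(q_t \mid \tilde{x}\right)$$

where $\tilde{x}$ is the augmented audio, $\mathcal{M}$ is the set of masked frame positions, and $q_t$ is the target token at frame $t$.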

### Input
**Input Type(s):** Audio <br>
**Input Format(s):** .wav files <br>
**Input Parameters:** One-Dimensional (1D) <br>
**Other Properties Related to Input:** 16000 Hz mono-channel audio <br>
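
For example, arbitrary audio can be converted to this format before it reaches the model; a minimal sketch using torchaudio (torchaudio and the file name are our assumptions, not requirements of NeMo):

```python
import torchaudio

# Load audio, downmix to mono, and resample to 16000 Hz
# to match the expected input format (hypothetical example file).
wav, sr = torchaudio.load("example.wav")   # wav: [channels, samples]
wav = wav.mean(dim=0, keepdim=True)        # downmix to mono
wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=16000)
```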

### Output
**Output Type(s):** Audio features <br>
**Output Format:** Audio embeddings <br>
**Output Parameters:** Feature sequence (2D) <br>
**Other Properties Related to Output:** Audio feature sequence of shape [D, T], where D is the feature dimension and T is the number of time frames <br>

## Model Version(s)
`ssl_en_nest_large_v1.0` <br>

## How to Use the Model
The model is available for use in the NVIDIA NeMo Framework [2], and can be used as weight initialization for downstream tasks or as a frozen feature extractor.

### Loading the whole model
```python
from nemo.collections.asr.models import EncDecDenoiseMaskedTokenPredModel
nest_model = EncDecDenoiseMaskedTokenPredModel.from_pretrained(model_name="nvidia/ssl_en_nest_large_v1.0")
```
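
Once loaded, the checkpoint can optionally be saved locally and restored without re-downloading; a small sketch using the standard NeMo `ModelPT` methods `save_to()`/`restore_from()` (the file name is hypothetical):

```python
# Save a local copy of the checkpoint, then restore it later offline.
nest_model.save_to("ssl_en_nest_large_v1.0.nemo")
nest_model = EncDecDenoiseMaskedTokenPredModel.restore_from("ssl_en_nest_large_v1.0.nemo")
nest_model.eval()  # disable dropout etc. before extracting features
```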

### Using NEST as weight initialization for downstream tasks
```bash
# Using ASR as an example:
python <NeMo Root>/examples/asr/asr_ctc/speech_to_text_ctc_bpe.py \
    ... \
    exp_manager.wandb_logger_kwargs.project="<Name of project>"
```
More details can be found at [maybe_init_from_pretrained_checkpoint()](https://github.com/NVIDIA/NeMo/blob/main/nemo/core/classes/modelPT.py#L1236).
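
The overrides elided above typically include an initialization entry consumed by `maybe_init_from_pretrained_checkpoint()`; a minimal sketch (the exact override key is our assumption, not copied from this README):

```bash
# Hypothetical minimal override: initialize the downstream model
# from the pretrained NEST checkpoint before fine-tuning.
python <NeMo Root>/examples/asr/asr_ctc/speech_to_text_ctc_bpe.py \
    ... \
    +init_from_pretrained_model="nvidia/ssl_en_nest_large_v1.0"
```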

### Using NEST as Frozen Feature Extractor
NEST can also be used as a frozen feature extractor for downstream tasks. For example, in the case of speaker verification, embeddings can be extracted from different layers of the NEST model, and a learned weighted combination of those embeddings can be used as input to the speaker verification model.
Please refer to this example [script](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/speech_pretraining/downstream/speech_classification_mfa_train.py) and [config](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/ssl/nest/multi_layer_feat/nest_titanet_small.yaml) for details.
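
Conceptually, the learned weighted combination is a softmax-normalized sum over per-layer embeddings; a plain-PyTorch sketch (the class name and shapes are illustrative; the actual implementation lives in the linked script and config):

```python
import torch
import torch.nn as nn

class WeightedLayerSum(nn.Module):
    """Learned convex combination of per-layer NEST embeddings (sketch)."""
    def __init__(self, num_layers: int):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, layer_feats: torch.Tensor) -> torch.Tensor:
        # layer_feats: [num_layers, batch, D, T] stacked encoder outputs
        w = torch.softmax(self.layer_weights, dim=0)
        return (w[:, None, None, None] * layer_feats).sum(dim=0)  # [batch, D, T]
```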

### Extracting Audio Features from NEST

NEST supports extracting audio features from multiple layers of its encoder:
```bash
python <NeMo Root>/scripts/ssl/extract_features.py \
    ...
```

## Training
The [NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo) [2] was used to train the model, with this example [script](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/speech_pretraining/masked_token_pred_pretrain.py) and [config](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/ssl/nest/nest_fast-conformer.yaml).
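
A hypothetical launch command for this pretraining recipe (the Hydra config path and manifest overrides are our assumptions, inferred from the linked script and config layout):

```bash
# Sketch: pretrain NEST with the linked script/config; manifests are placeholders.
python <NeMo Root>/examples/asr/speech_pretraining/masked_token_pred_pretrain.py \
    --config-path="../conf/ssl/nest" \
    --config-name="nest_fast-conformer" \
    model.train_ds.manifest_filepath="<path to train manifest>" \
    model.validation_ds.manifest_filepath="<path to validation manifest>"
```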

## Training Datasets
- [LibriLight](https://github.com/facebookresearch/libri-light)
  - Data Collection Method: Human
  - Labeling Method: Hybrid: Automated, Human
<br>

## Inference
**Engine:** NVIDIA NeMo <br>
**Test Hardware:** <br>
* A6000 <br>

Model | Intent Acc | SLURP F1
---|---|---
ssl_en_nest_large_v1.0 | 89.79 | 79.61
ssl_en_nest_xlarge_v1.0 | 89.04 | 80.31

## Software Integration

**Runtime Engine(s):**
* [NeMo-2.0] <br>

**Supported Operating System(s):**
* [Windows] <br>

## Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards [Insert Link to Model Card++ here].