nvidia/ssl_en_nest_large_v1.0

Self-supervised Learning

steveheh committed 10 days ago

Commit 0258c09 • 1 Parent(s): a969af3

Update README.md

Files changed (1)

README.md +15 -2

@@ -13,6 +13,15 @@ tags:
 # NVIDIA NEST Large En
 The NEST framework is designed for speech self-supervised learning, which can be used as a frozen speech feature extractor or as weight initialization for downstream speech processing tasks. The NEST-L model has about 115M parameters and is trained on an English dataset of roughly 100K hours.  <br>
 This model is ready for commercial/non-commercial use.  <br>
@@ -29,9 +38,13 @@ License to use this model is covered by the [CC-BY-4.0](https://creativecommons.
 ## Model Architecture
-**Architecture Type:** NEST [1]  <br>
-**Network Architecture:**
 - Encoder: FastConformer (18 layers)
 - Decoder: Linear classifier
 - Masking: Random block masking

 # NVIDIA NEST Large En
+<style>
+img {
+ display: inline;
+}
+</style>
+[![Model architecture](https://img.shields.io/badge/Model_Arch-FastConformer-lightgrey#model-badge)](#model-architecture)
+| [![Model size](https://img.shields.io/badge/Params-115M-lightgrey#model-badge)](#model-architecture)
 The NEST framework is designed for speech self-supervised learning, which can be used as a frozen speech feature extractor or as weight initialization for downstream speech processing tasks. The NEST-L model has about 115M parameters and is trained on an English dataset of roughly 100K hours.  <br>
 This model is ready for commercial/non-commercial use.  <br>
 ## Model Architecture
+The [NEST](https://arxiv.org/abs/2408.13106) framework comprises several building blocks, as illustrated in the left part of the following figure. Once trained, the NEST encoder can be used as weight initialization or feature extractor for downstream speech processing tasks.
+<div align="center">
+    <img src="nest-model.png" width="750" />
+</div>
+**Architecture Details:**
 - Encoder: FastConformer (18 layers)
 - Decoder: Linear classifier
 - Masking: Random block masking
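
The "Random block masking" entry above can be illustrated with a minimal sketch. This is not the NEST implementation: the function name `random_block_mask`, the non-overlapping block layout, and the `block_size` and `mask_ratio` values are all assumptions chosen for illustration. It only shows the general idea of masking contiguous blocks of frames rather than individual frames.

```python
import random


def random_block_mask(num_frames, block_size=4, mask_ratio=0.5, seed=None):
    """Return a list of booleans of length num_frames; True marks a masked frame.

    Hypothetical helper: block_size and mask_ratio are illustrative values,
    not the settings used to train the model described in this card.
    """
    rng = random.Random(seed)
    # Split the frame sequence into consecutive, non-overlapping blocks.
    num_blocks = (num_frames + block_size - 1) // block_size
    # Pick a fixed fraction of those blocks uniformly at random.
    num_masked = round(num_blocks * mask_ratio)
    masked_blocks = set(rng.sample(range(num_blocks), num_masked))
    # A frame is masked when the block it belongs to was selected.
    return [(i // block_size) in masked_blocks for i in range(num_frames)]


if __name__ == "__main__":
    mask = random_block_mask(16, block_size=4, mask_ratio=0.5, seed=0)
    # Print the mask as a compact string, "#" for masked and "." for visible.
    print("".join("#" if m else "." for m in mask))
```

In a self-supervised setup of this kind, the encoder would see the features with the masked frames hidden, and the linear classifier would be trained to predict targets for those frames.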