Taejin committed on
Commit
fffa5b3
1 Parent(s): 7a6c836

Update README.md

Files changed (1)
  1. README.md +5 -4
README.md CHANGED
@@ -7,7 +7,7 @@ The model files in this repository are the models used in this paper [The CHiME-
  NeMo Team’s DASR System](https://arxiv.org/pdf/2310.12378.pdf).
 
  ## 1. Voice Activity Detection (VAD) Model:
- ### **MarbleNet_frame_VAD_chime7_Acrobat.nemo**
  - This model is based on [NeMo MarbleNet VAD model](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/speech_classification/models.html#marblenet-vad).
  - For validation, we use a dataset comprising the CHiME-6 development subset as well as 50 hours of simulated audio data.
  - The simulated data is generated using the [NeMo multi-speaker data simulator](https://github.com/NVIDIA/NeMo/blob/main/tutorials/tools/Multispeaker_Simulator.ipynb)
@@ -17,13 +17,13 @@ on [VoxCeleb1&2 datasets](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.ht
 
 
  ## 2. Speaker Diarization Model: Multi-scale Diarization Decoder (MSDD-v2)
- ### **MSDD_v2_PALO_100ms_intrpl_3scales.nemo**
 
  Our DASR system is based on the speaker diarization system using the multi-scale diarization decoder (MSDD).
  - MSDD Reference: [Park et al. (2022)](https://arxiv.org/pdf/2203.15974.pdf)
  - The MSDD-v2 speaker diarization system employs a multi-scale embedding approach and uses the TitaNet speaker embedding extractor.
  - TitaNet Reference: [Koluguri et al. (2022)](https://arxiv.org/abs/2110.04410)
- - TitaNet Model is included in this .nemo checkpoint file.
  - Unlike the system that uses a multi-layer LSTM architecture, we employ a four-layer Transformer architecture with a hidden size of 384.
  - This neural model generates logit values indicating speaker existence.
  - Our diarization model is trained on approximately 3,000 hours of simulated audio mixture data from the same multi-speaker data simulator used in VAD model training, drawing from VoxCeleb1&2 and LibriSpeech datasets.
@@ -32,7 +32,7 @@ on [VoxCeleb1&2 datasets](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.ht
 
 
  ## 3. Automatic Speech Recognition (ASR) model
- ### **FastConformerXL-RNNT-chime7-GSS-finetuned.nemo**
  - This ASR model is based on [NeMo FastConformer XL model](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#fast-conformer).
  - Single-channel audio generated using a multi-channel front-end (Guided Source Separation, GSS) is transcribed using a 0.6B parameter Conformer-based transducer (RNNT) model.
  - Model Reference: [Gulati et al. (2020)](https://arxiv.org/abs/2005.08100)
@@ -47,6 +47,7 @@ on [VoxCeleb1&2 datasets](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.ht
  ## 4. Language Model for ASR Decoding: KenLM Model
  ### **[**ASR_LM_chime7_only.kenlm**](https://huggingface.co/chime-dasr/nemo_baseline_models/blob/main/ASR_LM_chime7_only.kenlm)**
 
  - We apply a word-piece level N-gram language model using byte-pair-encoding (BPE) tokens.
  - This approach utilizes the SentencePiece and KenLM toolkits, based on the transcription of CHiME-7 train and dev sets.
  - SentencePiece: [Kudo and Richardson (2018)](https://arxiv.org/abs/1808.06226)
 
  NeMo Team’s DASR System](https://arxiv.org/pdf/2310.12378.pdf).
 
  ## 1. Voice Activity Detection (VAD) Model:
+ ### **[**MarbleNet_frame_VAD_chime7_Acrobat.nemo**](https://huggingface.co/chime-dasr/nemo_baseline_models/blob/main/MarbleNet_frame_VAD_chime7_Acrobat.nemo)**
  - This model is based on [NeMo MarbleNet VAD model](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/speech_classification/models.html#marblenet-vad).
  - For validation, we use a dataset comprising the CHiME-6 development subset as well as 50 hours of simulated audio data.
  - The simulated data is generated using the [NeMo multi-speaker data simulator](https://github.com/NVIDIA/NeMo/blob/main/tutorials/tools/Multispeaker_Simulator.ipynb)
 
 
 
  ## 2. Speaker Diarization Model: Multi-scale Diarization Decoder (MSDD-v2)
+ ### **[**MSDD_v2_PALO_100ms_intrpl_3scales.nemo**](https://huggingface.co/chime-dasr/nemo_baseline_models/blob/main/MSDD_v2_PALO_100ms_intrpl_3scales.nemo)**
 
  Our DASR system is based on the speaker diarization system using the multi-scale diarization decoder (MSDD).
  - MSDD Reference: [Park et al. (2022)](https://arxiv.org/pdf/2203.15974.pdf)
  - The MSDD-v2 speaker diarization system employs a multi-scale embedding approach and uses the TitaNet speaker embedding extractor.
  - TitaNet Reference: [Koluguri et al. (2022)](https://arxiv.org/abs/2110.04410)
+ - TitaNet Model is included in [MSDD-v2 .nemo checkpoint file](https://huggingface.co/chime-dasr/nemo_baseline_models/blob/main/MSDD_v2_PALO_100ms_intrpl_3scales.nemo).
  - Unlike the system that uses a multi-layer LSTM architecture, we employ a four-layer Transformer architecture with a hidden size of 384.
  - This neural model generates logit values indicating speaker existence.
  - Our diarization model is trained on approximately 3,000 hours of simulated audio mixture data from the same multi-speaker data simulator used in VAD model training, drawing from VoxCeleb1&2 and LibriSpeech datasets.
 
 
 
  ## 3. Automatic Speech Recognition (ASR) model
+ ### **[**FastConformerXL-RNNT-chime7-GSS-finetuned.nemo**](https://huggingface.co/chime-dasr/nemo_baseline_models/blob/main/FastConformerXL-RNNT-chime7-GSS-finetuned.nemo)**
  - This ASR model is based on [NeMo FastConformer XL model](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#fast-conformer).
  - Single-channel audio generated using a multi-channel front-end (Guided Source Separation, GSS) is transcribed using a 0.6B parameter Conformer-based transducer (RNNT) model.
  - Model Reference: [Gulati et al. (2020)](https://arxiv.org/abs/2005.08100)
 
  ## 4. Language Model for ASR Decoding: KenLM Model
  ### **[**ASR_LM_chime7_only.kenlm**](https://huggingface.co/chime-dasr/nemo_baseline_models/blob/main/ASR_LM_chime7_only.kenlm)**
 
+ - This KenLM model is trained solely on CHiME7-DASR datasets (Mixer6, CHiME6, DipCo).
  - We apply a word-piece level N-gram language model using byte-pair-encoding (BPE) tokens.
  - This approach utilizes the SentencePiece and KenLM toolkits, based on the transcription of CHiME-7 train and dev sets.
  - SentencePiece: [Kudo and Richardson (2018)](https://arxiv.org/abs/1808.06226)
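
The heading changes in this commit all follow one pattern: each model file name becomes a bold markdown link to its blob URL in the repository. A minimal Python sketch of that pattern is shown here for illustration. The helper name `linked_heading` and the constant names are hypothetical and are not part of the repository; the base URL and file names are copied from the diff.

```python
# Hypothetical helper that reproduces the heading-link pattern in this commit.
# The base URL and file names come from the diff; the function is illustrative.
REPO_BLOB_URL = "https://huggingface.co/chime-dasr/nemo_baseline_models/blob/main"

MODEL_FILES = [
    "MarbleNet_frame_VAD_chime7_Acrobat.nemo",
    "MSDD_v2_PALO_100ms_intrpl_3scales.nemo",
    "FastConformerXL-RNNT-chime7-GSS-finetuned.nemo",
    "ASR_LM_chime7_only.kenlm",
]


def linked_heading(filename: str, level: int = 3) -> str:
    """Return a markdown heading whose bold file name links to its blob URL."""
    url = f"{REPO_BLOB_URL}/{filename}"
    # A single pair of parentheses around the URL keeps the markdown link valid.
    return f"{'#' * level} **[**{filename}**]({url})**"


if __name__ == "__main__":
    for name in MODEL_FILES:
        print(linked_heading(name))
```

Generating the headings from one list keeps the link syntax uniform across all four model sections.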