Titouan P committed
Commit 9b2c1bc • Parent(s): 64bf71a
upload
Browse files:
- README.md (+57 -0)
- checkpoint_best.pt (+3 -0)
- config.json (+75 -0)
- preprocessor_config.json (+8 -0)
- pytorch_model.bin (+3 -0)
README.md ADDED
---
language: "fr"
thumbnail:
tags:
- wav2vec2
license: "apache-2.0"
---

# LeBenchmark: wav2vec2 large model trained on 3K hours of French speech

LeBenchmark provides an ensemble of wav2vec2 models pretrained on different French datasets containing spontaneous, read, and broadcast speech. For more information on the different benchmarks that can be used to evaluate the wav2vec2 models, please refer to our paper at: [Not Available yet]()

## wav2vec2-FR-M-Large: model and data descriptions

We release four different models that can be found under our HuggingFace organisation. Two different wav2vec2 architectures, *Base* and *Large*, are coupled with our small (*S*) and medium (*M*) corpora. A larger one should come later. In short:

- [wav2vec2-FR-M-Large](#): Large wav2vec2 trained on 2.9K hours of French speech (1.8K Males / 1.0K Females / 0.1K unknown).
- [wav2vec2-FR-M-Base](https://huggingface.co/LeBenchmark/wav2vec2-FR-M-base): Base wav2vec2 trained on 2.9K hours of French speech (1.8K Males / 1.0K Females / 0.1K unknown).
- [wav2vec2-FR-S-Large](https://huggingface.co/LeBenchmark/wav2vec2-FR-S-large): Large wav2vec2 trained on 1K hours of French speech (0.5K Males / 0.5K Females).
- [wav2vec2-FR-S-Base](https://huggingface.co/LeBenchmark/wav2vec2-FR-S-base): Base wav2vec2 trained on 1K hours of French speech (0.5K Males / 0.5K Females).

## Intended uses & limitations

Pretrained wav2vec2 models are distributed under the Apache-2.0 license. Hence, they can be reused extensively without strict limitations. However, benchmarks and data may be linked to corpora that are not completely open-sourced.

## Fine-tune with Fairseq for ASR with CTC

As our wav2vec2 models were trained with Fairseq, they can be used in the different tools that it provides to fine-tune the model for ASR with CTC. The full procedure has been nicely summarized in [this blog post](https://huggingface.co/blog/fine-tune-wav2vec2-english).

Please note that due to the nature of CTC, speech-to-text results aren't expected to be state-of-the-art. Moreover, future features might appear depending on the involvement of Fairseq and HuggingFace in this part.
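One reason CTC outputs are limited is the frame-independent prediction followed by a collapse rule (merge repeats, drop blanks). A minimal sketch of that collapse step, using hypothetical token ids rather than this model's actual vocabulary:

```python
def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Collapse a per-frame argmax sequence CTC-style:
    merge consecutive repeated ids, then drop blank tokens."""
    out = []
    prev = None
    for i in frame_ids:
        if i != prev and i != blank_id:
            out.append(i)
        prev = i
    return out

# A blank between two identical ids keeps them distinct:
print(ctc_greedy_collapse([0, 0, 3, 3, 0, 3, 5, 5]))  # [3, 3, 5]
```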

## Integrate to SpeechBrain for ASR, Speaker, Source Separation ...

Pretrained wav2vec models recently gained in popularity. At the same time, the [SpeechBrain toolkit](https://speechbrain.github.io) came out, proposing a new and simpler way of dealing with state-of-the-art speech and deep-learning technologies.

While it is currently in beta, SpeechBrain offers two different ways of nicely integrating wav2vec2 models that were trained with Fairseq, i.e., our LeBenchmark models!

1. Extract wav2vec2 features on-the-fly (with a frozen wav2vec2 encoder) to be combined with any speech-related architecture. Examples are: E2E ASR with CTC+Att+Language Models; Speaker Recognition or Verification; Source Separation ...
2. *Experimental:* To fully benefit from wav2vec2, the best solution remains to fine-tune the model while training your downstream task. This is very simple within SpeechBrain, as just a flag needs to be turned on. Thus, our wav2vec2 models can be fine-tuned while training your favourite ASR pipeline or speaker recognizer.

**If interested, simply follow this [tutorial](https://colab.research.google.com/drive/17Hu1pxqhfMisjkSgmM2CnZxfqDyn2hSY?usp=sharing)**

## Referencing LeBenchmark

```
Reference to come
```
checkpoint_best.pt ADDED

version https://git-lfs.github.com/spec/v1
oid sha256:1cc0e9d86344a3f70a054859cedc9b5a8bf4782363f19a1199192f8553c9c49a
size 3808803846
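The weight files in this commit are Git LFS pointer files (plain `key value` lines per the LFS pointer format), not the binaries themselves. A small sketch of reading the `oid` and `size` out of such a pointer before downloading:

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into a dict with
    'version', 'oid', and 'size' (size cast to int)."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    fields["size"] = int(fields["size"])
    return fields

pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:1cc0e9d86344a3f70a054859cedc9b5a8bf4782363f19a1199192f8553c9c49a
size 3808803846"""

info = parse_lfs_pointer(pointer)
print(info["size"])  # 3808803846 (~3.8 GB, matching checkpoint_best.pt)
```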
config.json ADDED
{
    "activation_dropout": 0.0,
    "apply_spec_augment": true,
    "architectures": [
        "Wav2Vec2Model"
    ],
    "attention_dropout": 0.1,
    "bos_token_id": 1,
    "conv_bias": true,
    "conv_dim": [512, 512, 512, 512, 512, 512, 512],
    "conv_kernel": [10, 3, 3, 3, 3, 2, 2],
    "conv_stride": [5, 2, 2, 2, 2, 2, 2],
    "ctc_loss_reduction": "sum",
    "ctc_zero_infinity": false,
    "do_stable_layer_norm": true,
    "eos_token_id": 2,
    "feat_extract_activation": "gelu",
    "feat_extract_dropout": 0.0,
    "feat_extract_norm": "layer",
    "feat_proj_dropout": 0.1,
    "final_dropout": 0.0,
    "gradient_checkpointing": false,
    "hidden_act": "gelu",
    "hidden_dropout": 0.1,
    "hidden_size": 1024,
    "initializer_range": 0.02,
    "intermediate_size": 4096,
    "layer_norm_eps": 1e-05,
    "layerdrop": 0.1,
    "mask_channel_length": 10,
    "mask_channel_min_space": 1,
    "mask_channel_other": 0.0,
    "mask_channel_prob": 0.0,
    "mask_channel_selection": "static",
    "mask_feature_length": 10,
    "mask_feature_prob": 0.0,
    "mask_time_length": 10,
    "mask_time_min_space": 1,
    "mask_time_other": 0.0,
    "mask_time_prob": 0.075,
    "mask_time_selection": "static",
    "model_type": "wav2vec2",
    "num_attention_heads": 16,
    "num_conv_pos_embedding_groups": 16,
    "num_conv_pos_embeddings": 128,
    "num_feat_extract_layers": 7,
    "num_hidden_layers": 24,
    "pad_token_id": 0,
    "transformers_version": "4.5.1",
    "vocab_size": 32
}
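The `conv_kernel` and `conv_stride` lists above determine how many frames the convolutional feature extractor produces for a raw waveform: with no padding, each layer maps a length `L` to `(L - kernel) // stride + 1`. A small sketch of that computation (the function name is ours, not a library API):

```python
def feat_extract_output_length(num_samples,
                               kernels=(10, 3, 3, 3, 3, 2, 2),
                               strides=(5, 2, 2, 2, 2, 2, 2)):
    """Frames emitted by the conv feature extractor for a raw
    waveform, per config.json's conv_kernel / conv_stride
    (unpadded conv: out = (in - kernel) // stride + 1 per layer)."""
    length = num_samples
    for k, s in zip(kernels, strides):
        length = (length - k) // s + 1
    return length

# One second of 16 kHz audio -> roughly one frame every 20 ms
print(feat_extract_output_length(16000))  # 49
```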
preprocessor_config.json ADDED
{
    "do_normalize": true,
    "feature_size": 1,
    "padding_side": "right",
    "padding_value": 0.0,
    "return_attention_mask": true,
    "sampling_rate": 16000
}
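`"do_normalize": true` means raw 16 kHz waveforms are normalized to zero mean and unit variance before entering the model. A minimal sketch of that normalization (an illustration of the usual wav2vec2 input preprocessing, not the exact HuggingFace implementation):

```python
def normalize_waveform(samples):
    """Zero-mean, unit-variance normalization of a raw waveform,
    as implied by "do_normalize": true. The small epsilon guards
    against division by zero on silent inputs (our choice here)."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    return [(x - mean) / (var ** 0.5 + 1e-7) for x in samples]

normed = normalize_waveform([0.1, -0.2, 0.3, 0.0])
print(sum(normed))  # ~0.0 after normalization
```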
pytorch_model.bin ADDED

version https://git-lfs.github.com/spec/v1
oid sha256:419e0a322a43382fb891f68f95bf36b51f91e90d6f44adbc1ce1968a73707de3
size 1261920069