yangwang825 committed on
Commit 333f23f · verified · 1 Parent(s): bc4c590

Upload config

Files changed (3)
  1. README.md +199 -0
  2. config.json +126 -0
  3. configuration_wav2vec2_spkreg.py +344 -0
README.md ADDED
@@ -0,0 +1,199 @@
+ ---
+ library_name: transformers
+ tags: []
+ ---
+
+ # Model Card for Model ID
+
+ <!-- Provide a quick summary of what the model is/does. -->
+
+
+
+ ## Model Details
+
+ ### Model Description
+
+ <!-- Provide a longer summary of what this model is. -->
+
+ This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
+
+ - **Developed by:** [More Information Needed]
+ - **Funded by [optional]:** [More Information Needed]
+ - **Shared by [optional]:** [More Information Needed]
+ - **Model type:** [More Information Needed]
+ - **Language(s) (NLP):** [More Information Needed]
+ - **License:** [More Information Needed]
+ - **Finetuned from model [optional]:** [More Information Needed]
+
+ ### Model Sources [optional]
+
+ <!-- Provide the basic links for the model. -->
+
+ - **Repository:** [More Information Needed]
+ - **Paper [optional]:** [More Information Needed]
+ - **Demo [optional]:** [More Information Needed]
+
+ ## Uses
+
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+
+ ### Direct Use
+
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+
+ [More Information Needed]
+
+ ### Downstream Use [optional]
+
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+
+ [More Information Needed]
+
+ ### Out-of-Scope Use
+
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+
+ [More Information Needed]
+
+ ## Bias, Risks, and Limitations
+
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
+
+ [More Information Needed]
+
+ ### Recommendations
+
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+
+ ## How to Get Started with the Model
+
+ Use the code below to get started with the model.
+
+ [More Information Needed]
+
+ ## Training Details
+
+ ### Training Data
+
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+
+ [More Information Needed]
+
+ ### Training Procedure
+
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+
+ #### Preprocessing [optional]
+
+ [More Information Needed]
+
+
+ #### Training Hyperparameters
+
+ - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+
+ #### Speeds, Sizes, Times [optional]
+
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+
+ [More Information Needed]
+
+ ## Evaluation
+
+ <!-- This section describes the evaluation protocols and provides the results. -->
+
+ ### Testing Data, Factors & Metrics
+
+ #### Testing Data
+
+ <!-- This should link to a Dataset Card if possible. -->
+
+ [More Information Needed]
+
+ #### Factors
+
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+
+ [More Information Needed]
+
+ #### Metrics
+
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
+
+ [More Information Needed]
+
+ ### Results
+
+ [More Information Needed]
+
+ #### Summary
+
+
+
+ ## Model Examination [optional]
+
+ <!-- Relevant interpretability work for the model goes here -->
+
+ [More Information Needed]
+
+ ## Environmental Impact
+
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+
+ - **Hardware Type:** [More Information Needed]
+ - **Hours used:** [More Information Needed]
+ - **Cloud Provider:** [More Information Needed]
+ - **Compute Region:** [More Information Needed]
+ - **Carbon Emitted:** [More Information Needed]
+
+ ## Technical Specifications [optional]
+
+ ### Model Architecture and Objective
+
+ [More Information Needed]
+
+ ### Compute Infrastructure
+
+ [More Information Needed]
+
+ #### Hardware
+
+ [More Information Needed]
+
+ #### Software
+
+ [More Information Needed]
+
+ ## Citation [optional]
+
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+
+ **BibTeX:**
+
+ [More Information Needed]
+
+ **APA:**
+
+ [More Information Needed]
+
+ ## Glossary [optional]
+
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+
+ [More Information Needed]
+
+ ## More Information [optional]
+
+ [More Information Needed]
+
+ ## Model Card Authors [optional]
+
+ [More Information Needed]
+
+ ## Model Card Contact
+
+ [More Information Needed]
config.json ADDED
@@ -0,0 +1,126 @@
+ {
+   "activation_dropout": 0.0,
+   "adapter_attn_dim": null,
+   "adapter_kernel_size": 3,
+   "adapter_stride": 2,
+   "add_adapter": false,
+   "apply_spec_augment": true,
+   "architectures": [
+     "Wav2Vec2ForPreTraining"
+   ],
+   "attention_dropout": 0.1,
+   "auto_map": {
+     "AutoConfig": "configuration_wav2vec2_spkreg.Wav2Vec2SpkRegConfig"
+   },
+   "bos_token_id": 1,
+   "classifier_proj_size": 256,
+   "codevector_dim": 256,
+   "contrastive_logits_temperature": 0.1,
+   "conv_bias": false,
+   "conv_dim": [
+     512,
+     512,
+     512,
+     512,
+     512,
+     512,
+     512
+   ],
+   "conv_kernel": [
+     10,
+     3,
+     3,
+     3,
+     3,
+     2,
+     2
+   ],
+   "conv_stride": [
+     5,
+     2,
+     2,
+     2,
+     2,
+     2,
+     2
+   ],
+   "ctc_loss_reduction": "sum",
+   "ctc_zero_infinity": false,
+   "diversity_loss_weight": 0.1,
+   "do_stable_layer_norm": false,
+   "easy_margin": false,
+   "eos_token_id": 2,
+   "feat_extract_activation": "gelu",
+   "feat_extract_norm": "group",
+   "feat_proj_dropout": 0.1,
+   "feat_quantizer_dropout": 0.0,
+   "final_dropout": 0.0,
+   "freeze_feat_extract_train": true,
+   "gradient_checkpointing": true,
+   "hidden_act": "gelu",
+   "hidden_dropout": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "label_smoothing": 0.0,
+   "layer_norm_eps": 1e-05,
+   "layerdrop": 0.0,
+   "loss_fct": "cross_entropy",
+   "margin": 0.35,
+   "mask_channel_length": 10,
+   "mask_channel_min_space": 1,
+   "mask_channel_other": 0.0,
+   "mask_channel_prob": 0.0,
+   "mask_channel_selection": "static",
+   "mask_feature_length": 10,
+   "mask_feature_min_masks": 0,
+   "mask_feature_prob": 0.0,
+   "mask_time_length": 10,
+   "mask_time_min_masks": 2,
+   "mask_time_min_space": 1,
+   "mask_time_other": 0.0,
+   "mask_time_prob": 0.05,
+   "mask_time_selection": "static",
+   "model_type": "wav2vec2_spkreg",
+   "no_mask_channel_overlap": false,
+   "no_mask_time_overlap": false,
+   "num_adapter_layers": 3,
+   "num_attention_heads": 12,
+   "num_codevector_groups": 2,
+   "num_codevectors_per_group": 320,
+   "num_conv_pos_embedding_groups": 16,
+   "num_conv_pos_embeddings": 128,
+   "num_feat_extract_layers": 7,
+   "num_hidden_layers": 12,
+   "num_negatives": 100,
+   "output_hidden_size": 768,
+   "pad_token_id": 0,
+   "proj_codevector_dim": 256,
+   "reduction": "mean",
+   "scale": 30.0,
+   "tdnn_dilation": [
+     1,
+     2,
+     3,
+     1,
+     1
+   ],
+   "tdnn_dim": [
+     512,
+     512,
+     512,
+     512,
+     1500
+   ],
+   "tdnn_kernel": [
+     5,
+     3,
+     3,
+     1,
+     1
+   ],
+   "transformers_version": "4.46.2",
+   "use_weighted_layer_sum": false,
+   "vocab_size": 32,
+   "xvector_output_dim": 512
+ }
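The `conv_kernel`, `conv_stride`, `tdnn_kernel` and `tdnn_dilation` lists in this config determine how raw samples map to encoder frames. The following is a minimal sketch of that arithmetic, assuming unpadded 1D convolutions (output length `(n - kernel) // stride + 1` per layer) and stride-1 dilated TDNN layers; neither assumption is stated in this commit, and the helper names `frames_for`, `receptive_field` and `tdnn_context` are hypothetical.

```python
from functools import reduce
import operator

# Values copied from config.json in this commit.
CONV_KERNEL = [10, 3, 3, 3, 3, 2, 2]
CONV_STRIDE = [5, 2, 2, 2, 2, 2, 2]
TDNN_KERNEL = [5, 3, 3, 1, 1]
TDNN_DILATION = [1, 2, 3, 1, 1]


def frames_for(num_samples, kernels=CONV_KERNEL, strides=CONV_STRIDE):
    """Number of encoder frames, assuming unpadded convolutions."""
    length = num_samples
    for kernel, stride in zip(kernels, strides):
        length = (length - kernel) // stride + 1
    return length


def receptive_field(kernels=CONV_KERNEL, strides=CONV_STRIDE):
    """Number of input samples that influence one encoder frame."""
    field, jump = 1, 1
    for kernel, stride in zip(kernels, strides):
        field += (kernel - 1) * jump
        jump *= stride
    return field


def tdnn_context(kernels=TDNN_KERNEL, dilations=TDNN_DILATION):
    """Encoder frames seen by one TDNN output, assuming stride-1 layers."""
    return 1 + sum((k - 1) * d for k, d in zip(kernels, dilations))


# Same product that the config exposes as `inputs_to_logits_ratio`.
hop = reduce(operator.mul, CONV_STRIDE, 1)

print(hop)                # 320
print(receptive_field())  # 400
print(frames_for(16000))  # 49
print(tdnn_context())     # 15
```

If the input is sampled at 16 kHz (an assumption; the sampling rate is not part of this config), these numbers correspond to one frame every 20 ms, each covering 25 ms of audio.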
configuration_wav2vec2_spkreg.py ADDED
@@ -0,0 +1,344 @@
+ """Wav2Vec2 model configuration"""
+
+ import functools
+ import operator
+
+ from transformers.configuration_utils import PretrainedConfig
+ from transformers.utils import logging
+
+
+ logger = logging.get_logger(__name__)
+
+
+ class Wav2Vec2SpkRegConfig(PretrainedConfig):
+     r"""
+     This is the configuration class to store the configuration of a [`Wav2Vec2Model`]. It is used to instantiate a
+     Wav2Vec2 model according to the specified arguments, defining the model architecture. Instantiating a configuration
+     with the defaults will yield a similar configuration to that of the Wav2Vec2
+     [facebook/wav2vec2-base-960h](https://huggingface.co/facebook/wav2vec2-base-960h) architecture.
+
+     Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+     documentation from [`PretrainedConfig`] for more information.
+
+
+     Args:
+         vocab_size (`int`, *optional*, defaults to 32):
+             Vocabulary size of the Wav2Vec2 model. Defines the number of different tokens that can be represented by
+             the `inputs_ids` passed when calling [`Wav2Vec2Model`] or [`TFWav2Vec2Model`]. Vocabulary size of the
+             model. Defines the different tokens that can be represented by the *inputs_ids* passed to the forward
+             method of [`Wav2Vec2Model`].
+         hidden_size (`int`, *optional*, defaults to 768):
+             Dimensionality of the encoder layers and the pooler layer.
+         num_hidden_layers (`int`, *optional*, defaults to 12):
+             Number of hidden layers in the Transformer encoder.
+         num_attention_heads (`int`, *optional*, defaults to 12):
+             Number of attention heads for each attention layer in the Transformer encoder.
+         intermediate_size (`int`, *optional*, defaults to 3072):
+             Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
+         hidden_act (`str` or `function`, *optional*, defaults to `"gelu"`):
+             The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
+             `"relu"`, `"selu"` and `"gelu_new"` are supported.
+         hidden_dropout (`float`, *optional*, defaults to 0.1):
+             The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
+         activation_dropout (`float`, *optional*, defaults to 0.1):
+             The dropout ratio for activations inside the fully connected layer.
+         attention_dropout (`float`, *optional*, defaults to 0.1):
+             The dropout ratio for the attention probabilities.
+         final_dropout (`float`, *optional*, defaults to 0.1):
+             The dropout probability for the final projection layer of [`Wav2Vec2ForCTC`].
+         layerdrop (`float`, *optional*, defaults to 0.1):
+             The LayerDrop probability. See the [LayerDrop paper](https://arxiv.org/abs/1909.11556) for more
+             details.
+         initializer_range (`float`, *optional*, defaults to 0.02):
+             The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
+         layer_norm_eps (`float`, *optional*, defaults to 1e-05):
+             The epsilon used by the layer normalization layers.
+         feat_extract_norm (`str`, *optional*, defaults to `"group"`):
+             The norm to be applied to 1D convolutional layers in the feature encoder. One of `"group"` for group
+             normalization of only the first 1D convolutional layer or `"layer"` for layer normalization of all 1D
+             convolutional layers.
+         feat_proj_dropout (`float`, *optional*, defaults to 0.0):
+             The dropout probability for output of the feature encoder.
+         feat_extract_activation (`str`, *optional*, defaults to `"gelu"`):
+             The non-linear activation function (function or string) in the 1D convolutional layers of the feature
+             extractor. If string, `"gelu"`, `"relu"`, `"selu"` and `"gelu_new"` are supported.
+         feat_quantizer_dropout (`float`, *optional*, defaults to 0.0):
+             The dropout probability for quantized feature encoder states.
+         conv_dim (`Tuple[int]` or `List[int]`, *optional*, defaults to `(512, 512, 512, 512, 512, 512, 512)`):
+             A tuple of integers defining the number of input and output channels of each 1D convolutional layer in the
+             feature encoder. The length of *conv_dim* defines the number of 1D convolutional layers.
+         conv_stride (`Tuple[int]` or `List[int]`, *optional*, defaults to `(5, 2, 2, 2, 2, 2, 2)`):
+             A tuple of integers defining the stride of each 1D convolutional layer in the feature encoder. The length
+             of *conv_stride* defines the number of convolutional layers and has to match the length of *conv_dim*.
+         conv_kernel (`Tuple[int]` or `List[int]`, *optional*, defaults to `(10, 3, 3, 3, 3, 2, 2)`):
+             A tuple of integers defining the kernel size of each 1D convolutional layer in the feature encoder. The
+             length of *conv_kernel* defines the number of convolutional layers and has to match the length of
+             *conv_dim*.
+         conv_bias (`bool`, *optional*, defaults to `False`):
+             Whether the 1D convolutional layers have a bias.
+         num_conv_pos_embeddings (`int`, *optional*, defaults to 128):
+             Number of convolutional positional embeddings. Defines the kernel size of 1D convolutional positional
+             embeddings layer.
+         num_conv_pos_embedding_groups (`int`, *optional*, defaults to 16):
+             Number of groups of 1D convolutional positional embeddings layer.
+         do_stable_layer_norm (`bool`, *optional*, defaults to `False`):
+             Whether to apply *stable* layer norm architecture of the Transformer encoder. `do_stable_layer_norm is
+             True` corresponds to applying layer norm before the attention layer, whereas `do_stable_layer_norm is
+             False` corresponds to applying layer norm after the attention layer.
+         apply_spec_augment (`bool`, *optional*, defaults to `True`):
+             Whether to apply *SpecAugment* data augmentation to the outputs of the feature encoder. For reference see
+             [SpecAugment: A Simple Data Augmentation Method for Automatic Speech
+             Recognition](https://arxiv.org/abs/1904.08779).
+         mask_time_prob (`float`, *optional*, defaults to 0.05):
+             Percentage (between 0 and 1) of all feature vectors along the time axis which will be masked. The masking
+             procedure generates `mask_time_prob*len(time_axis)/mask_time_length` independent masks over the axis. If
+             reasoning from the probability of each feature vector to be chosen as the start of the vector span to be
+             masked, *mask_time_prob* should be `prob_vector_start*mask_time_length`. Note that overlap may decrease the
+             actual percentage of masked vectors. This is only relevant if `apply_spec_augment is True`.
+         mask_time_length (`int`, *optional*, defaults to 10):
+             Length of vector span along the time axis.
+         mask_time_min_masks (`int`, *optional*, defaults to 2):
+             The minimum number of masks of length `mask_time_length` generated along the time axis, each time step,
+             irrespective of `mask_time_prob`. Only relevant if `mask_time_prob*len(time_axis)/mask_time_length <
+             mask_time_min_masks`.
+         mask_feature_prob (`float`, *optional*, defaults to 0.0):
+             Percentage (between 0 and 1) of all feature vectors along the feature axis which will be masked. The
+             masking procedure generates `mask_feature_prob*len(feature_axis)/mask_feature_length` independent masks over
+             the axis. If reasoning from the probability of each feature vector to be chosen as the start of the vector
+             span to be masked, *mask_feature_prob* should be `prob_vector_start*mask_feature_length`. Note that overlap
+             may decrease the actual percentage of masked vectors. This is only relevant if `apply_spec_augment is
+             True`.
+         mask_feature_length (`int`, *optional*, defaults to 10):
+             Length of vector span along the feature axis.
+         mask_feature_min_masks (`int`, *optional*, defaults to 0):
+             The minimum number of masks of length `mask_feature_length` generated along the feature axis, each time
+             step, irrespective of `mask_feature_prob`. Only relevant if
+             `mask_feature_prob*len(feature_axis)/mask_feature_length < mask_feature_min_masks`.
+         num_codevectors_per_group (`int`, *optional*, defaults to 320):
+             Number of entries in each quantization codebook (group).
+         num_codevector_groups (`int`, *optional*, defaults to 2):
+             Number of codevector groups for product codevector quantization.
+         contrastive_logits_temperature (`float`, *optional*, defaults to 0.1):
+             The temperature *kappa* in the contrastive loss.
+         feat_quantizer_dropout (`float`, *optional*, defaults to 0.0):
+             The dropout probability for the output of the feature encoder that's used by the quantizer.
+         num_negatives (`int`, *optional*, defaults to 100):
+             Number of negative samples for the contrastive loss.
+         codevector_dim (`int`, *optional*, defaults to 256):
+             Dimensionality of the quantized feature vectors.
+         proj_codevector_dim (`int`, *optional*, defaults to 256):
+             Dimensionality of the final projection of both the quantized and the transformer features.
+         diversity_loss_weight (`float`, *optional*, defaults to 0.1):
+             The weight of the codebook diversity loss component.
+         ctc_loss_reduction (`str`, *optional*, defaults to `"sum"`):
+             Specifies the reduction to apply to the output of `torch.nn.CTCLoss`. Only relevant when training an
+             instance of [`Wav2Vec2ForCTC`].
+         ctc_zero_infinity (`bool`, *optional*, defaults to `False`):
+             Whether to zero infinite losses and the associated gradients of `torch.nn.CTCLoss`. Infinite losses mainly
+             occur when the inputs are too short to be aligned to the targets. Only relevant when training an instance
+             of [`Wav2Vec2ForCTC`].
+         use_weighted_layer_sum (`bool`, *optional*, defaults to `False`):
+             Whether to use a weighted average of layer outputs with learned weights. Only relevant when using an
+             instance of [`Wav2Vec2ForSequenceClassification`].
+         classifier_proj_size (`int`, *optional*, defaults to 256):
+             Dimensionality of the projection before token mean-pooling for classification.
+         tdnn_dim (`Tuple[int]` or `List[int]`, *optional*, defaults to `(512, 512, 512, 512, 1500)`):
+             A tuple of integers defining the number of output channels of each 1D convolutional layer in the *TDNN*
+             module of the *XVector* model. The length of *tdnn_dim* defines the number of *TDNN* layers.
+         tdnn_kernel (`Tuple[int]` or `List[int]`, *optional*, defaults to `(5, 3, 3, 1, 1)`):
+             A tuple of integers defining the kernel size of each 1D convolutional layer in the *TDNN* module of the
+             *XVector* model. The length of *tdnn_kernel* has to match the length of *tdnn_dim*.
+         tdnn_dilation (`Tuple[int]` or `List[int]`, *optional*, defaults to `(1, 2, 3, 1, 1)`):
+             A tuple of integers defining the dilation factor of each 1D convolutional layer in the *TDNN* module of the
+             *XVector* model. The length of *tdnn_dilation* has to match the length of *tdnn_dim*.
+         xvector_output_dim (`int`, *optional*, defaults to 512):
+             Dimensionality of the *XVector* embedding vectors.
+         add_adapter (`bool`, *optional*, defaults to `False`):
+             Whether a convolutional network should be stacked on top of the Wav2Vec2 Encoder. Can be very useful for
+             warm-starting Wav2Vec2 for SpeechEncoderDecoder models.
+         adapter_kernel_size (`int`, *optional*, defaults to 3):
+             Kernel size of the convolutional layers in the adapter network. Only relevant if `add_adapter is True`.
+         adapter_stride (`int`, *optional*, defaults to 2):
+             Stride of the convolutional layers in the adapter network. Only relevant if `add_adapter is True`.
+         num_adapter_layers (`int`, *optional*, defaults to 3):
+             Number of convolutional layers that should be used in the adapter network. Only relevant if `add_adapter is
+             True`.
+         adapter_attn_dim (`int`, *optional*):
+             Dimension of the attention adapter weights to be used in each attention block. An example of a model using
+             attention adapters is [facebook/mms-1b-all](https://huggingface.co/facebook/mms-1b-all).
+         output_hidden_size (`int`, *optional*):
+             Dimensionality of the encoder output layer. If not defined, this defaults to `hidden_size`. Only relevant
+             if `add_adapter is True`.
+
+     Example:
+
+     ```python
+     >>> from transformers import Wav2Vec2Config, Wav2Vec2Model
+
+     >>> # Initializing a Wav2Vec2 facebook/wav2vec2-base-960h style configuration
+     >>> configuration = Wav2Vec2Config()
+
+     >>> # Initializing a model (with random weights) from the facebook/wav2vec2-base-960h style configuration
+     >>> model = Wav2Vec2Model(configuration)
+
+     >>> # Accessing the model configuration
+     >>> configuration = model.config
+     ```"""
+
+     model_type = "wav2vec2_spkreg"
+
+     def __init__(
+         self,
+         vocab_size=32,
+         hidden_size=768,
+         num_hidden_layers=12,
+         num_attention_heads=12,
+         intermediate_size=3072,
+         hidden_act="gelu",
+         hidden_dropout=0.1,
+         activation_dropout=0.1,
+         attention_dropout=0.1,
+         feat_proj_dropout=0.0,
+         feat_quantizer_dropout=0.0,
+         final_dropout=0.1,
+         layerdrop=0.1,
+         initializer_range=0.02,
+         layer_norm_eps=1e-5,
+         feat_extract_norm="group",
+         feat_extract_activation="gelu",
+         conv_dim=(512, 512, 512, 512, 512, 512, 512),
+         conv_stride=(5, 2, 2, 2, 2, 2, 2),
+         conv_kernel=(10, 3, 3, 3, 3, 2, 2),
+         conv_bias=False,
+         num_conv_pos_embeddings=128,
+         num_conv_pos_embedding_groups=16,
+         do_stable_layer_norm=False,
+         apply_spec_augment=True,
+         mask_time_prob=0.05,
+         mask_time_length=10,
+         mask_time_min_masks=2,
+         mask_feature_prob=0.0,
+         mask_feature_length=10,
+         mask_feature_min_masks=0,
+         num_codevectors_per_group=320,
+         num_codevector_groups=2,
+         contrastive_logits_temperature=0.1,
+         num_negatives=100,
+         codevector_dim=256,
+         proj_codevector_dim=256,
+         diversity_loss_weight=0.1,
+         ctc_loss_reduction="sum",
+         ctc_zero_infinity=False,
+         use_weighted_layer_sum=False,
+         classifier_proj_size=256,
+         tdnn_dim=(512, 512, 512, 512, 1500),
+         tdnn_kernel=(5, 3, 3, 1, 1),
+         tdnn_dilation=(1, 2, 3, 1, 1),
+         xvector_output_dim=512,
+         pad_token_id=0,
+         bos_token_id=1,
+         eos_token_id=2,
+         add_adapter=False,
+         adapter_kernel_size=3,
+         adapter_stride=2,
+         num_adapter_layers=3,
+         output_hidden_size=None,
+         adapter_attn_dim=None,
+         loss_fct: str = "cross_entropy",  # cross_entropy, additive_margin, additive_angular_margin
+         label_smoothing: float = 0.0,
+         scale: float = 30.0,
+         margin: float = 0.35,
+         easy_margin: bool = False,
+         reduction: str = "mean",
+         **kwargs,
+     ):
+         super().__init__(**kwargs, pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id)
+         self.hidden_size = hidden_size
+         self.feat_extract_norm = feat_extract_norm
+         self.feat_extract_activation = feat_extract_activation
+         self.conv_dim = list(conv_dim)
+         self.conv_stride = list(conv_stride)
+         self.conv_kernel = list(conv_kernel)
+         self.conv_bias = conv_bias
+         self.num_conv_pos_embeddings = num_conv_pos_embeddings
+         self.num_conv_pos_embedding_groups = num_conv_pos_embedding_groups
+         self.num_feat_extract_layers = len(self.conv_dim)
+         self.num_hidden_layers = num_hidden_layers
+         self.intermediate_size = intermediate_size
+         self.hidden_act = hidden_act
+         self.num_attention_heads = num_attention_heads
+         self.hidden_dropout = hidden_dropout
+         self.attention_dropout = attention_dropout
+         self.activation_dropout = activation_dropout
+         self.feat_proj_dropout = feat_proj_dropout
+         self.final_dropout = final_dropout
+         self.layerdrop = layerdrop
+         self.layer_norm_eps = layer_norm_eps
+         self.initializer_range = initializer_range
+         self.vocab_size = vocab_size
+         self.do_stable_layer_norm = do_stable_layer_norm
+         self.use_weighted_layer_sum = use_weighted_layer_sum
+
+         if (
+             (len(self.conv_stride) != self.num_feat_extract_layers)
+             or (len(self.conv_kernel) != self.num_feat_extract_layers)
+             or (len(self.conv_dim) != self.num_feat_extract_layers)
+         ):
+             raise ValueError(
+                 "Configuration for convolutional layers is incorrect. It is required that `len(config.conv_dim)` =="
+                 " `len(config.conv_stride)` == `len(config.conv_kernel)`, but is `len(config.conv_dim) ="
+                 f" {len(self.conv_dim)}`, `len(config.conv_stride) = {len(self.conv_stride)}`,"
+                 f" `len(config.conv_kernel) = {len(self.conv_kernel)}`."
+             )
+
+         # fine-tuning config parameters for SpecAugment: https://arxiv.org/abs/1904.08779
+         self.apply_spec_augment = apply_spec_augment
+         self.mask_time_prob = mask_time_prob
+         self.mask_time_length = mask_time_length
+         self.mask_time_min_masks = mask_time_min_masks
+         self.mask_feature_prob = mask_feature_prob
+         self.mask_feature_length = mask_feature_length
+         self.mask_feature_min_masks = mask_feature_min_masks
+
+         # parameters for pretraining with codevector quantized representations
+         self.num_codevectors_per_group = num_codevectors_per_group
+         self.num_codevector_groups = num_codevector_groups
+         self.contrastive_logits_temperature = contrastive_logits_temperature
+         self.feat_quantizer_dropout = feat_quantizer_dropout
+         self.num_negatives = num_negatives
+         self.codevector_dim = codevector_dim
+         self.proj_codevector_dim = proj_codevector_dim
+         self.diversity_loss_weight = diversity_loss_weight
+
+         # ctc loss
+         self.ctc_loss_reduction = ctc_loss_reduction
+         self.ctc_zero_infinity = ctc_zero_infinity
+
+         # adapter
+         self.add_adapter = add_adapter
+         self.adapter_kernel_size = adapter_kernel_size
+         self.adapter_stride = adapter_stride
+         self.num_adapter_layers = num_adapter_layers
+         self.output_hidden_size = output_hidden_size or hidden_size
+         self.adapter_attn_dim = adapter_attn_dim
+
+         # SequenceClassification-specific parameter. Feel free to ignore for other classes.
+         self.classifier_proj_size = classifier_proj_size
+
+         # XVector-specific parameters. Feel free to ignore for other classes.
+         self.tdnn_dim = list(tdnn_dim)
+         self.tdnn_kernel = list(tdnn_kernel)
+         self.tdnn_dilation = list(tdnn_dilation)
+         self.xvector_output_dim = xvector_output_dim
+
+         # Loss function parameters. Feel free to ignore for other classes.
+         self.loss_fct = loss_fct
+         self.label_smoothing = label_smoothing
+         self.scale = scale
+         self.margin = margin
+         self.easy_margin = easy_margin
+         self.reduction = reduction
+
+     @property
+     def inputs_to_logits_ratio(self):
+         return functools.reduce(operator.mul, self.conv_stride, 1)
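The config stores `loss_fct`, `scale`, `margin` and `easy_margin`, but the loss code that consumes them is not part of this commit. As a rough sketch, assuming `additive_margin` refers to the common AM-Softmax formulation (subtract `margin` from the target-class cosine, multiply every cosine by `scale`, then apply cross-entropy), the per-sample computation could look like the following. `additive_margin_loss` is a hypothetical helper written for illustration, not the repository's implementation.

```python
import math


def additive_margin_loss(cosines, target, scale=30.0, margin=0.35):
    """Cross-entropy over scaled cosines with a margin on the target class.

    `cosines` holds cos(theta_j) between one embedding and each class weight.
    Assumed AM-Softmax form; not taken from the repository's modeling code.
    Defaults mirror `scale` and `margin` in the config above.
    """
    logits = [
        scale * (c - margin if j == target else c)
        for j, c in enumerate(cosines)
    ]
    # Numerically stable log-sum-exp over all classes.
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[target]


cosines = [0.8, 0.1, -0.2]
with_margin = additive_margin_loss(cosines, target=0)
without_margin = additive_margin_loss(cosines, target=0, margin=0.0)
print(with_margin > without_margin)  # True
```

With `margin=0.0` and `scale=1.0` the function reduces to plain softmax cross-entropy over the cosines, which is one way to sanity-check a sketch like this.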