---
license: apache-2.0
tags:
- generated_from_trainer
datasets:
- librispeech_asr
metrics:
- f1
base_model: facebook/wav2vec2-xls-r-300m
model-index:
- name: weights
results: []
---
# wav2vec2-large-xlsr-53-gender-recognition-librispeech
This model is a fine-tuned version of [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) on the LibriSpeech clean-100 subset for speaker gender recognition.
It achieves the following results on the evaluation set:
- Loss: 0.0061
- F1: 0.9993
### Inference
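For a quick single-file check, the model can be loaded through the Transformers `audio-classification` pipeline (a minimal sketch; `sample.wav` is a placeholder path):

```python
from transformers import pipeline

# Quick sanity check on one file; "sample.wav" is a placeholder path.
classifier = pipeline(
    "audio-classification",
    model="alefiury/wav2vec2-large-xlsr-53-gender-recognition-librispeech",
)
print(classifier("sample.wav"))  # e.g. [{"label": "female", "score": ...}, ...]
```

For batched inference over many files, a custom collator with a plain PyTorch `DataLoader` gives more control over resampling, truncation, and padding: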
```python
import numpy as np
import torch
import torchaudio
from typing import Dict, List, Optional, Union

from torch.utils.data import DataLoader
from transformers import (
    AutoFeatureExtractor,
    AutoModelForAudioClassification,
    Wav2Vec2FeatureExtractor,
)


class DataCollator:
    def __init__(
        self,
        processor: Wav2Vec2FeatureExtractor,
        sampling_rate: int = 16000,
        padding: Union[bool, str] = True,
        max_length: Optional[int] = None,
        pad_to_multiple_of: Optional[int] = None,
        label2id: Optional[Dict] = None,
        max_audio_len: int = 5,
    ):
        self.processor = processor
        self.sampling_rate = sampling_rate
        self.padding = padding
        self.max_length = max_length
        self.pad_to_multiple_of = pad_to_multiple_of
        self.label2id = label2id
        self.max_audio_len = max_audio_len

    def __call__(self, features: List[Dict[str, str]]) -> Dict[str, torch.Tensor]:
        # Each feature carries the path to an audio file under "input_values".
        input_features = []

        for feature in features:
            speech_array, sampling_rate = torchaudio.load(feature["input_values"])

            # Downmix to mono.
            speech_array = torch.mean(speech_array, dim=0, keepdim=True)

            # Resample to the target sampling rate if needed.
            if sampling_rate != self.sampling_rate:
                transform = torchaudio.transforms.Resample(sampling_rate, self.sampling_rate)
                speech_array = transform(speech_array)
                sampling_rate = self.sampling_rate

            # Truncate to at most max_audio_len seconds.
            effective_size_len = sampling_rate * self.max_audio_len
            if speech_array.shape[-1] > effective_size_len:
                speech_array = speech_array[:, :effective_size_len]

            speech_array = speech_array.squeeze().numpy()

            input_tensor = self.processor(speech_array, sampling_rate=sampling_rate).input_values
            input_tensor = np.squeeze(input_tensor)

            input_features.append({"input_values": input_tensor})

        # Pad all examples in the batch to a common length and return PyTorch tensors.
        batch = self.processor.pad(
            input_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )

        return batch


def predict(test_dataloader: DataLoader, model: torch.nn.Module) -> List[int]:
    # Minimal stand-in for the original `predict` helper (not shown in the
    # source): runs the model and collects the argmax class index per example.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    model.eval()

    preds = []
    with torch.no_grad():
        for batch in test_dataloader:
            input_values = batch["input_values"].to(device)
            attention_mask = batch["attention_mask"].to(device) if "attention_mask" in batch else None
            logits = model(input_values, attention_mask=attention_mask).logits
            preds.extend(torch.argmax(logits, dim=-1).cpu().tolist())

    return preds


label2id = {"female": 0, "male": 1}
id2label = {0: "female", 1: "male"}
num_labels = 2

feature_extractor = AutoFeatureExtractor.from_pretrained(
    "alefiury/wav2vec2-large-xlsr-53-gender-recognition-librispeech"
)
model = AutoModelForAudioClassification.from_pretrained(
    pretrained_model_name_or_path="alefiury/wav2vec2-large-xlsr-53-gender-recognition-librispeech",
    num_labels=num_labels,
    label2id=label2id,
    id2label=id2label,
)

data_collator = DataCollator(
    feature_extractor,
    sampling_rate=16000,
    padding=True,
    label2id=label2id,
)

# `test_dataset` should be a sequence of dicts whose "input_values" entries
# are paths to audio files (see the example after this block).
test_dataloader = DataLoader(
    dataset=test_dataset,
    batch_size=16,
    collate_fn=data_collator,
    shuffle=False,
    num_workers=10,
)

preds = predict(test_dataloader=test_dataloader, model=model)
```
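The collator above reads audio from disk, so `test_dataset` only needs to yield dicts with file paths. A minimal stand-in (the paths are hypothetical placeholders):

```python
# Hypothetical placeholder paths; any readable audio files work here.
test_dataset = [
    {"input_values": "audio/sample_0001.wav"},
    {"input_values": "audio/sample_0002.wav"},
]
```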
## Training and evaluation data
The model was trained on the LibriSpeech clean-100 subset, split into 70% training, 10% validation, and 20% test data.
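A split along these lines can be reproduced with the `datasets` library; the snippet below is a sketch only, since the exact seed and split procedure of the original run are not recorded in this card:

```python
from datasets import load_dataset

# Load the clean-100 training split of LibriSpeech (~100 h of read speech).
ds = load_dataset("librispeech_asr", "clean", split="train.100")

# Carve off 30% for validation + test, then split that holdout into one
# third validation (10% overall) and two thirds test (20% overall).
split = ds.train_test_split(test_size=0.3, seed=42)  # seed is assumed
holdout = split["test"].train_test_split(test_size=2 / 3, seed=42)

train_ds = split["train"]   # 70%
val_ds = holdout["train"]   # 10%
test_ds = holdout["test"]   # 20%
```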
### Training hyperparameters
The following hyperparameters were used during training (a `TrainingArguments` sketch mirroring them follows the list):
- learning_rate: 3e-05
- train_batch_size: 4
- eval_batch_size: 4
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 16
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 1
- mixed_precision_training: Native AMP
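Reconstructed as `transformers.TrainingArguments`, these settings would look roughly as follows; this is a sketch, and `output_dir` is assumed from the model-index name rather than taken from the original training script:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="weights",            # assumed from the model-index name
    learning_rate=3e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,   # 4 x 4 = total train batch size of 16
    num_train_epochs=1,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    seed=42,
    fp16=True,                       # "Native AMP" mixed precision
)
# Adam with betas=(0.9, 0.999) and epsilon=1e-08 is the Trainer default.
```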
### Training results
| Training Loss | Epoch | Step | Validation Loss | F1 |
|:-------------:|:-----:|:----:|:---------------:|:------:|
| 0.002 | 1.0 | 1248 | 0.0061 | 0.9993 |
### Framework versions
- Transformers 4.28.0
- Pytorch 2.0.0+cu118
- Tokenizers 0.13.3