---
license: apache-2.0
language: en
tags:
- generated_from_trainer
datasets:
- speech_commands
metrics:
- accuracy
model-index:
- name: wav2vec2-conformer-rel-pos-large-finetuned-speech-commands
  results:
  - task:
      type: audio-classification
      name: audio classification
    dataset:
      type: speech_commands
      name: speech_commands
      split: v0.02
    metrics:
    - type: accuracy
      value: 0.9724
      name: accuracy
---
# wav2vec2-conformer-rel-pos-large-finetuned-speech-commands
### Model description
This model is a fine-tuned version of [facebook/wav2vec2-conformer-rel-pos-large](https://huggingface.co/facebook/wav2vec2-conformer-rel-pos-large) on the [speech_commands](https://huggingface.co/datasets/speech_commands) dataset.
It achieves the following results on the evaluation set:
- Loss: 0.5245
- Accuracy: 0.9724
### Intended uses & limitations
The model can spot one of the following keywords: "Yes", "No", "Up", "Down", "Left", "Right", "On", "Off", "Stop", "Go", "Zero", "One", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight", "Nine", "Bed", "Bird", "Cat", "Dog", "Happy", "House", "Marvin", "Sheila", "Tree", "Wow", "Backward", "Forward", "Follow", "Learn", "Visual".
The repository includes sample files that I recorded (WAV, 16 kHz sampling rate, mono). The simplest way to use the model is with the `pipeline` API:
```
>>> from transformers import pipeline
>>> p = pipeline("audio-classification", model="juliensimon/wav2vec2-conformer-rel-pos-large-finetuned-speech-commands")
>>> p("up16k.wav")
[{'score': 0.7008192539215088, 'label': 'up'}, {'score': 0.04346614331007004, 'label': 'off'}, {'score': 0.029526518657803535, 'label': 'left'}, {'score': 0.02905120886862278, 'label': 'stop'}, {'score': 0.027142534032464027, 'label': 'on'}]
>>> p("stop16k.wav")
[{'score': 0.6969656944274902, 'label': 'stop'}, {'score': 0.03391443192958832, 'label': 'up'}, {'score': 0.027382319793105125, 'label': 'seven'}, {'score': 0.020835857838392258, 'label': 'five'}, {'score': 0.018051736056804657, 'label': 'down'}]
>>> p("marvin16k.wav")
[{'score': 0.5276530981063843, 'label': 'marvin'}, {'score': 0.04645705968141556, 'label': 'down'}, {'score': 0.038583893328905106, 'label': 'backward'}, {'score': 0.03578080236911774, 'label': 'wow'}, {'score': 0.03178196772933006, 'label': 'bird'}]
```
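If you want to check the exact set of classes the classifier head distinguishes (the underlying dataset can define extra classes beyond the keywords listed above), you can inspect the model config through the pipeline. A quick sketch, reusing the `p` object from the snippet above:
```
# Sketch: list the classes the classifier was trained with,
# reusing the pipeline object `p` from the snippet above
labels = p.model.config.id2label
print(len(labels), "classes")
print(sorted(labels.values()))
```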
You can also classify the sample files with the `Auto` API:
```
>>> import torch, librosa
>>> from transformers import AutoModelForAudioClassification, Wav2Vec2FeatureExtractor
>>> feature_extractor = Wav2Vec2FeatureExtractor()
>>> model = AutoModelForAudioClassification.from_pretrained("juliensimon/wav2vec2-conformer-rel-pos-large-finetuned-speech-commands")
>>> audio, rate = librosa.load("up16k.wav", sr=16000)
>>> inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt")
>>> logits = model(inputs['input_values'])
>>> logits
SequenceClassifierOutput(loss=None, logits=tensor([[-0.4635, -1.0112, 4.7935, 0.8528, 1.6265, 0.6456, 1.5423, 2.0132,
1.6103, 0.5847, -2.2526, 0.8839, 0.8163, -1.5655, -1.4160, -0.4196,
-0.1097, -1.8827, 0.6609, -0.2022, 0.0971, -0.6205, 0.4492, 0.0926,
-2.4848, 0.2630, -0.4584, -2.4327, -1.1654, 0.3897, -0.3374, -1.2418,
-0.1045, 0.2827, -1.5667, -0.0963]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
>>> classes = torch.softmax(logits.logits, dim=1)
>>> torch.set_printoptions(precision=3, sci_mode=False)
>>> classes
tensor([[ 0.004, 0.002, 0.701, 0.014, 0.030, 0.011,
0.027, 0.043, 0.029, 0.010, 0.001, 0.014,
0.013, 0.001, 0.001, 0.004, 0.005, 0.001,
0.011, 0.005, 0.006, 0.003, 0.009, 0.006,
0.000, 0.008, 0.004, 0.001, 0.002, 0.009,
0.004, 0.002, 0.005, 0.008, 0.001, 0.005]],
grad_fn=<SoftmaxBackward0>)
>>> top_class = torch.argmax(logits.logits, dim=1)
>>> top_class
tensor([2])
>>> model.config.id2label[top_class.numpy()[0]]
'up'
```
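If you prefer output shaped like the pipeline's, here is a small follow-on sketch (reusing `logits` and `model` from the snippet above) that ranks the top 5 labels by probability:
```
# Rank the top 5 labels by probability, similar to the pipeline output
probs = torch.softmax(logits.logits, dim=1)[0]
top5 = torch.topk(probs, k=5)
for score, idx in zip(top5.values.tolist(), top5.indices.tolist()):
    print(f"{model.config.id2label[idx]}: {score:.3f}")
```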
### Training and evaluation data
The model was trained and evaluated on the v0.02 subset of Speech Commands, using the full training and validation splits.
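For reference, here is a sketch of loading the same data with the `datasets` library (the `v0.02` configuration with the standard `train` and `validation` splits; recent `datasets` releases may also require `trust_remote_code=True`):
```
# Sketch: load the Speech Commands v0.02 subset used for training and evaluation
from datasets import load_dataset

train_ds = load_dataset("speech_commands", "v0.02", split="train")
eval_ds = load_dataset("speech_commands", "v0.02", split="validation")
```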
### Training procedure
The model was fine-tuned on [Amazon SageMaker](https://aws.amazon.com/sagemaker), using an [ml.p3dn.24xlarge](https://aws.amazon.com/fr/ec2/instance-types/p3/) instance (8 NVIDIA V100 GPUs). Total training time for 10 epochs was 4.5 hours.
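For readers who want to reproduce a similar setup, the sketch below shows how such a job could be launched with the SageMaker Hugging Face estimator. The entry-point script, IAM role and container versions are placeholders, not the exact configuration used for this model:
```
# Hypothetical launcher sketch; script name, role and container versions are placeholders
import sagemaker
from sagemaker.huggingface import HuggingFace

role = sagemaker.get_execution_role()  # assumes a SageMaker notebook environment

estimator = HuggingFace(
    entry_point="run_audio_classification.py",  # placeholder training script
    instance_type="ml.p3dn.24xlarge",           # 8 x NVIDIA V100, as used here
    instance_count=1,
    role=role,
    transformers_version="4.17",                # pick a DLC combination available to you
    pytorch_version="1.10",
    py_version="py38",
    hyperparameters={
        "model_name_or_path": "facebook/wav2vec2-conformer-rel-pos-large",
        "dataset_name": "speech_commands",
        "dataset_config_name": "v0.02",
        "num_train_epochs": 10,
    },
)
estimator.fit()
```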
#### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 3e-05
- train_batch_size: 256
- eval_batch_size: 256
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 1024
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 10
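As a rough guide, this is how the values above map onto `TrainingArguments` in a Trainer-based script (the output directory and evaluation/logging settings are illustrative, and the effective batch size also depends on the number of GPUs):
```
# Sketch: the hyperparameters above expressed as TrainingArguments (illustrative only)
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="wav2vec2-conformer-rel-pos-large-finetuned-speech-commands",
    learning_rate=3e-5,
    per_device_train_batch_size=256,
    per_device_eval_batch_size=256,
    gradient_accumulation_steps=4,
    num_train_epochs=10,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    seed=42,
    evaluation_strategy="epoch",
    logging_strategy="epoch",
)
```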
#### Training results
| Training Loss | Epoch | Step | Validation Loss | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:--------:|
| 2.2901 | 1.0 | 83 | 2.0542 | 0.8875 |
| 1.8375 | 2.0 | 166 | 1.5610 | 0.9316 |
| 1.4957 | 3.0 | 249 | 1.1850 | 0.9558 |
| 1.1917 | 4.0 | 332 | 0.9159 | 0.9695 |
| 1.0449 | 5.0 | 415 | 0.7624 | 0.9687 |
| 0.9319 | 6.0 | 498 | 0.6444 | 0.9715 |
| 0.8559 | 7.0 | 581 | 0.5806 | 0.9711 |
| 0.8199 | 8.0 | 664 | 0.5394 | 0.9721 |
| 0.7949 | 9.0 | 747 | 0.5245 | 0.9724 |
| 0.7975 | 10.0 | 830 | 0.5256 | 0.9721 |
#### Framework versions
- Transformers 4.20.1
- Pytorch 1.11.0+cu102
- Datasets 2.3.2
- Tokenizers 0.12.1