---
license: apache-2.0
---
# UK & Ireland Accent Classification Model

This model classifies the accent of a speaker from the UK or Ireland as one of the following:
* Irish English
* Midlands English
* Northern English
* Scottish English
* Southern English
* Welsh English

The model uses transfer learning: the pretrained [Yamnet](https://tfhub.dev/google/yamnet/1) model serves as a feature extractor, and a dense classifier is trained on the features it produces.

### Yamnet Model
Yamnet is an audio event classifier trained on the AudioSet dataset to predict audio events from the AudioSet ontology. It is available on TensorFlow Hub.
Yamnet accepts a 1-D tensor of mono audio samples at a sample rate of 16 kHz, with values in the range `[-1.0, 1.0]`.   
As output, the model returns a 3-tuple, where `N` is the number of frames that Yamnet splits the audio into:   
- Scores of shape `(N, 521)`: one score per frame for each of the 521 AudioSet classes.
- Embeddings of shape `(N, 1024)`.
- The log-mel spectrogram of the entire waveform.

We will use the embeddings, which are the features extracted from the audio samples, as the input to our dense model. 
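
As a minimal sketch of this step, the snippet below scales 16-bit PCM samples into the range Yamnet expects and then asks Yamnet for its embeddings. The helper names `pcm16_to_waveform`, `load_yamnet` and `extract_embeddings` are hypothetical and are not part of this repository. The sketch assumes the audio has already been resampled to 16 kHz, and that `tensorflow` and `tensorflow_hub` are installed.

```python
import numpy as np

YAMNET_URL = "https://tfhub.dev/google/yamnet/1"
YAMNET_SAMPLE_RATE = 16_000  # Hz; audio at another rate must be resampled first


def pcm16_to_waveform(pcm: np.ndarray) -> np.ndarray:
    """Scale 16-bit PCM samples to float32 in [-1.0, 1.0) and mix down to mono."""
    samples = np.asarray(pcm, dtype=np.float32) / 32768.0
    if samples.ndim == 2:  # (num_samples, num_channels) -> (num_samples,)
        samples = samples.mean(axis=1)
    return samples.astype(np.float32)


def load_yamnet():
    """Load Yamnet from TensorFlow Hub (downloads the model on first use)."""
    import tensorflow_hub as hub  # imported lazily; needs tensorflow as well

    return hub.load(YAMNET_URL)


def extract_embeddings(yamnet, waveform: np.ndarray) -> np.ndarray:
    """Return the (N, 1024) frame embeddings for a 16 kHz mono waveform."""
    scores, embeddings, log_mel_spectrogram = yamnet(waveform)
    return embeddings.numpy()
```

Typical use would be `extract_embeddings(load_yamnet(), pcm16_to_waveform(pcm))`, loading Yamnet once and reusing it for every clip.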

### Dense Model
The dense model consists of:
- An input layer that receives the embedding output of the Yamnet classifier.
- 4 dense hidden layers and 4 dropout layers.
- An output dense layer.
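
A classifier of this shape could be sketched in Keras roughly as follows. Only the layer counts come from the description above. The hidden widths, the dropout rate, the ReLU and softmax activations, and the alternating dense/dropout ordering are all assumptions made for illustration, and `build_classifier` is a hypothetical name; the model plot below shows the actual architecture.

```python
NUM_CLASSES = 6        # the six accents listed at the top of this README
EMBEDDING_DIM = 1024   # size of one Yamnet embedding
HIDDEN_UNITS = (512, 256, 128, 64)  # assumed widths; not stated in this README
DROPOUT_RATE = 0.3                  # assumed rate; not stated in this README


def build_classifier(hidden_units=HIDDEN_UNITS, dropout_rate=DROPOUT_RATE):
    """Build a dense classifier over Yamnet embeddings (hypothetical sizes)."""
    import tensorflow as tf  # imported lazily; requires tensorflow

    layers = [tf.keras.Input(shape=(EMBEDDING_DIM,))]
    for units in hidden_units:
        # One dropout layer after each hidden dense layer (assumed ordering).
        layers.append(tf.keras.layers.Dense(units, activation="relu"))
        layers.append(tf.keras.layers.Dropout(dropout_rate))
    # One output unit per accent class.
    layers.append(tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"))
    return tf.keras.Sequential(layers)
```

Calling `build_classifier().summary()` prints the resulting stack of four dense/dropout pairs followed by the output layer.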

<details>
<summary>View Model Plot</summary>

![Model Image](./model.png)

</details>

### Results
The model achieved the following results:

Results    | Training  | Validation 
-----------|-----------|------------
Accuracy   | 55%       | 51%
AUC        | 0.9090    | 0.8911 
d-prime    | 1.887     | 1.743 

The confusion matrix for the validation set is:
![Confusion Matrix](./confusion_matrix.png)
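
The d-prime values in the results table closely match the common conversion from AUC, `d' = sqrt(2) * Z(AUC)`, where `Z` is the inverse CDF of the standard normal distribution. That the reported values were computed this way is an assumption; the sketch below only shows the relationship, and `d_prime_from_auc` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist


def d_prime_from_auc(auc: float) -> float:
    """Convert an AUC in (0, 1) to d-prime via d' = sqrt(2) * Z(AUC)."""
    return sqrt(2.0) * NormalDist().inv_cdf(auc)


# AUC values taken from the results table above.
print(f"training:   {d_prime_from_auc(0.9090):.3f}")
print(f"validation: {d_prime_from_auc(0.8911):.3f}")
```

An AUC of 0.5 (chance level) maps to a d-prime of 0, and d-prime grows without bound as AUC approaches 1.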

---
## Dataset

The dataset used is the
[Crowdsourced high-quality UK and Ireland English Dialect speech data set](https://openslr.org/83/),
which consists of 17,877 high-quality WAV audio files.

This dataset includes over 31 hours of recordings from 120 volunteers who self-identify as
native speakers of English from Southern England, the Midlands, Northern England, Wales, Scotland and Ireland.

For more information, please refer to the link above or to the following paper:
[Open-source Multi-speaker Corpora of the English Accents in the British Isles](https://aclanthology.org/2020.lrec-1.804.pdf)

---
## Demo
A demo is available in HuggingFace Spaces ...