Update README.md
README.md
CHANGED
@@ -17,16 +17,17 @@ The model implements transfer learning feature extraction using [Yamnet](https:/
|
|
17 |
Yamnet is an audio event classifier trained on the AudioSet dataset to predict audio events from the AudioSet ontology. It is available on TensorFlow Hub.
|
18 |
Yamnet accepts a 1-D tensor of audio samples with a sample rate of 16 kHz.
|
19 |
As output, the model returns a 3-tuple:
|
20 |
-
-
|
21 |
-
-
|
22 |
-
-
|
|
|
23 |
We will use the embeddings, which are the features extracted from the audio samples, as the input to our dense model.
|
24 |
|
25 |
## Dense Model
|
26 |
The dense model that we used consists of:
|
27 |
-
- An input layer which is embedding output of the Yamnet
|
28 |
-
- 4
|
29 |
-
- An output dense layer
|
30 |
|
31 |
<details>
|
32 |
<summary>View Model Plot</summary>
|
@@ -36,20 +37,17 @@ The dense model that we used consists of:
|
|
36 |
</details>
|
37 |
|
38 |
## Dataset
|
39 |
-
|
40 |
-
|
41 |
-
|
42 |
-
|
43 |
-
|
44 |
-
|
45 |
-
|
46 |
-
|
47 |
-
|
48 |
-
|
49 |
-
|
50 |
-
publisher = {European Language Resources Association (ELRA)},
|
51 |
-
url = {https://www.aclweb.org/anthology/2020.lrec-1.804},\n\ ISBN = {979-10-95546-34-4},
|
52 |
-
}
|
53 |
|
54 |
# Demo
|
55 |
A demo is available in HuggingFace Spaces ...
|
|
|
17 |
Yamnet is an audio event classifier trained on the AudioSet dataset to predict audio events from the AudioSet ontology. It is available on TensorFlow Hub.
|
18 |
Yamnet accepts a 1-D tensor of audio samples with a sample rate of 16 kHz.
|
19 |
As output, the model returns a 3-tuple:
|
20 |
+
- Scores of shape `(N, 521)` representing the scores of the 521 classes.
|
21 |
+
- Embeddings of shape `(N, 1024)`.
|
22 |
+
- The log-mel spectrogram of the entire audio frame.
|
23 |
+
|
24 |
We will use the embeddings, which are the features extracted from the audio samples, as the input to our dense model.
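
As a rough illustration of the 3-tuple described above, the sketch below loads Yamnet from TensorFlow Hub and keeps only the embeddings. The handle `https://tfhub.dev/google/yamnet/1` and the helper names `make_silent_waveform` and `extract_embeddings` are assumptions made for this example and do not come from this repository; a silent clip stands in for real audio.

```python
import numpy as np

# Assumed TensorFlow Hub handle for Yamnet; adjust if the project pins another version.
YAMNET_HANDLE = "https://tfhub.dev/google/yamnet/1"
SAMPLE_RATE = 16000  # Yamnet expects 16 kHz audio


def make_silent_waveform(seconds: float) -> np.ndarray:
    """Return a 1-D float32 waveform of silence (hypothetical stand-in for real audio)."""
    return np.zeros(int(seconds * SAMPLE_RATE), dtype=np.float32)


def extract_embeddings(waveform: np.ndarray):
    """Run Yamnet on a 1-D waveform and return only the (N, 1024) embeddings."""
    # Imported lazily so the helper above works without TensorFlow installed.
    import tensorflow_hub as hub

    yamnet = hub.load(YAMNET_HANDLE)
    # The model returns the 3-tuple: scores, embeddings, log-mel spectrogram.
    scores, embeddings, log_mel_spectrogram = yamnet(waveform)
    return embeddings


if __name__ == "__main__":
    features = extract_embeddings(make_silent_waveform(3.0))
    print(features.shape)  # (N, 1024), where N depends on the clip length
```

Real recordings would need to be converted to mono 16 kHz float32 samples before being passed in.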
|
25 |
|
26 |
## Dense Model
|
27 |
The dense model that we used consists of:
|
28 |
+
- An input layer that takes the embedding output of the Yamnet classifier.
|
29 |
+
- 4 dense hidden layers and 4 dropout layers.
|
30 |
+
- An output dense layer.
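
The layer list above could be sketched in Keras roughly as follows. The hidden layer widths, the dropout rate, the activations, and the choice of 6 output classes (one per region named in the Dataset section) are assumptions made for illustration; the actual values are in the model plot and the training code, not in this text. `layer_plan` and `build_dense_model` are hypothetical helper names.

```python
EMBEDDING_DIM = 1024  # width of each Yamnet embedding, per the README


def layer_plan(hidden_units, dropout_rate, num_classes):
    """Describe the stack as (layer_type, setting) pairs, from input to output."""
    plan = []
    for units in hidden_units:
        plan.append(("dense", units))
        plan.append(("dropout", dropout_rate))
    plan.append(("dense", num_classes))  # output layer
    return plan


def build_dense_model(hidden_units=(256, 128, 64, 32), dropout_rate=0.2, num_classes=6):
    """Build a Keras model following the plan (all sizes are assumed, not from the source)."""
    # Imported lazily so layer_plan can be inspected without TensorFlow installed.
    from tensorflow import keras

    model = keras.Sequential([keras.Input(shape=(EMBEDDING_DIM,))])
    plan = layer_plan(hidden_units, dropout_rate, num_classes)
    for kind, setting in plan[:-1]:
        if kind == "dense":
            model.add(keras.layers.Dense(setting, activation="relu"))
        else:
            model.add(keras.layers.Dropout(setting))
    # Softmax output over the assumed accent classes.
    model.add(keras.layers.Dense(num_classes, activation="softmax"))
    return model


if __name__ == "__main__":
    build_dense_model().summary()
```

Calling `layer_plan` on its own is a quick way to check that the stack has four dense hidden layers, four dropout layers, and one output layer before building the model.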
|
31 |
|
32 |
<details>
|
33 |
<summary>View Model Plot</summary>
|
|
|
37 |
</details>
|
38 |
|
39 |
## Dataset
|
40 |
+
|
41 |
+
The dataset used is the
|
42 |
+
[Crowdsourced high-quality UK and Ireland English Dialect speech data set](https://openslr.org/83/)
|
43 |
+
which consists of a total of 17,877 high-quality audio wav files.
|
44 |
+
|
45 |
+
This dataset includes over 31 hours of recording from 120 volunteers who self-identify as
|
46 |
+
native speakers of Southern England, Midlands, Northern England, Wales, Scotland and Ireland.
|
47 |
+
|
48 |
+
For more info, please refer to the above link or to the following paper:
|
49 |
+
[Open-source Multi-speaker Corpora of the English Accents in the British Isles](https://aclanthology.org/2020.lrec-1.804.pdf)
|
50 |
+
|
|
|
|
|
|
|
51 |
|
52 |
# Demo
|
53 |
A demo is available in HuggingFace Spaces ...
|