# YAMNet

YAMNet is a pretrained deep net that predicts 521 audio event classes based on
the [AudioSet-YouTube corpus](http://g.co/audioset), employing the
[Mobilenet_v1](https://arxiv.org/pdf/1704.04861.pdf) depthwise-separable
convolution architecture.

This directory contains the Keras code to construct the model, and example code
for applying the model to input sound files.

## Installation

YAMNet depends on the following Python packages:

* [`numpy`](http://www.numpy.org/)
* [`resampy`](http://resampy.readthedocs.io/en/latest/)
* [`tensorflow`](http://www.tensorflow.org/)
* [`pysoundfile`](https://pysoundfile.readthedocs.io/)

These are all easily installable via, e.g., `pip install numpy` (as in the
example command sequence below).

Any reasonably recent version of these packages should work. TensorFlow should
be at least version 1.8 to ensure Keras support is included. Note that while
the code works fine with TensorFlow v1.x or v2.x, we explicitly enable v1.x
behavior.

YAMNet also requires downloading the following data file:

* [YAMNet model weights](https://storage.googleapis.com/audioset/yamnet.h5),
  stored as Keras weights in HDF5 format.

After downloading this file into the same directory as this README, the
installation can be tested by running `python yamnet_test.py`, which
runs some synthetic signals through the model and checks the outputs.

Here's a sample installation and test session:

```shell
# Upgrade pip first. Also make sure wheel is installed.
python -m pip install --upgrade pip wheel

# Install dependencies.
pip install numpy resampy tensorflow soundfile

# Clone TensorFlow models repo into a 'models' directory.
git clone https://github.com/tensorflow/models.git
cd models/research/audioset/yamnet
# Download data file into same directory as code.
curl -O https://storage.googleapis.com/audioset/yamnet.h5

# Installation ready, let's test it.
python yamnet_test.py
# If we see "Ran 4 tests ... OK ...", then we're all set.
```

## Usage

You can run the model over existing sound files using `inference.py`:

```shell
python inference.py input_sound.wav
```
The code reports the five highest-scoring classes, based on scores averaged over
all the frames of the input.  You can access greater detail by modifying the
example code in `inference.py`.
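
For programmatic use, the steps in `inference.py` look roughly like the sketch
below. The helper names (`yamnet_frames_model`, `class_names`) and the exact
model call are assumptions based on this directory's layout; check `yamnet.py`
and `inference.py` for the signatures in your checkout.

```python
# Minimal sketch of programmatic inference (hypothetical helper names;
# see yamnet.py / inference.py for the exact API in your checkout).
import numpy as np
import resampy
import soundfile as sf

import params
import yamnet as yamnet_model

# Build the model and load the downloaded weights and class map.
yamnet = yamnet_model.yamnet_frames_model(params)
yamnet.load_weights('yamnet.h5')
class_names = yamnet_model.class_names('yamnet_class_map.csv')

# Read a wav file and convert it to mono float samples at 16 kHz.
wav_data, sr = sf.read('input_sound.wav', dtype=np.int16)
waveform = wav_data / 32768.0
if waveform.ndim > 1:
    waveform = waveform.mean(axis=1)
if sr != 16000:
    waveform = resampy.resample(waveform, sr, 16000)

# The first model output holds per-patch class scores, shape (num_patches, 521).
outputs = yamnet.predict(np.reshape(waveform, [1, -1]), steps=1)
scores = outputs[0]
mean_scores = np.mean(scores, axis=0)
for i in np.argsort(mean_scores)[::-1][:5]:
    print('{:30s} {:.3f}'.format(class_names[i], mean_scores[i]))
```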

See the Jupyter notebook `yamnet_visualization.ipynb` for an example of
displaying the per-frame model output scores.


## About the Model

The YAMNet code layout is as follows:

* `yamnet.py`: Model definition in Keras.
* `params.py`: Hyperparameters.  You can usefully modify `PATCH_HOP_SECONDS`.
* `features.py`: Audio feature extraction helpers.
* `inference.py`: Example code to classify input wav files.
* `yamnet_test.py`: Simple test of the YAMNet installation.

### Input: Audio Features

See `features.py`.

As with our previous release
[VGGish](https://github.com/tensorflow/models/tree/master/research/audioset/vggish),
YAMNet was trained with audio features computed as follows:

* All audio is resampled to 16 kHz mono.
* A spectrogram is computed using magnitudes of the Short-Time Fourier Transform
  with a window size of 25 ms, a window hop of 10 ms, and a periodic Hann
  window.
* A mel spectrogram is computed by mapping the spectrogram to 64 mel bins
  covering the range 125-7500 Hz.
* A stabilized log mel spectrogram is computed by applying
  log(mel-spectrum + 0.001) where the offset is used to avoid taking a logarithm
  of zero.
* These features are then framed into 50%-overlapping examples of 0.96 seconds,
  where each example covers 64 mel bands and 96 frames of 10 ms each.
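
For illustration, this feature pipeline can be approximated with `librosa`
(which is *not* a YAMNet dependency; the reference implementation lives in
`features.py` in TensorFlow, and details such as the FFT length and padding
may differ slightly):

```python
# Rough approximation of the feature pipeline using librosa for clarity.
# librosa is NOT a YAMNet dependency; features.py is the reference.
import numpy as np
import librosa

def log_mel_patches(waveform, sr=16000):
    """waveform: mono float array, already resampled to 16 kHz."""
    # Magnitude mel spectrogram: 25 ms (400-sample) Hann window, 10 ms
    # (160-sample) hop, 64 mel bins covering 125-7500 Hz.
    mel = librosa.feature.melspectrogram(
        y=waveform, sr=sr, n_fft=512, win_length=400, hop_length=160,
        window='hann', n_mels=64, fmin=125.0, fmax=7500.0, power=1.0)
    # Stabilized log mel spectrogram, shape (num_frames, 64).
    log_mel = np.log(mel + 0.001).T
    # Frame into 0.96 s patches (96 frames) with 50% overlap (48-frame hop).
    # Needs at least ~0.975 s of audio to produce one patch.
    patches = [log_mel[i:i + 96]
               for i in range(0, len(log_mel) - 95, 48)]
    return np.stack(patches)  # shape (num_patches, 96, 64)
```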

These 96x64 patches are then fed into the Mobilenet_v1 model to yield a 3x2
array of activations for 1024 kernels at the top of the convolution.  These are
averaged to give a 1024-dimension embedding, then put through a single logistic
layer to get the 521 per-class output scores corresponding to the 960 ms input
waveform segment.  (Because of the window framing, you need at least 975 ms of
input waveform to get the first frame of output scores.)
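
The pooling-plus-logistic head described here corresponds to a Keras fragment
like the following (a sketch of the head only; see `yamnet.py` for the full
model, including the Mobilenet_v1 body):

```python
import tensorflow as tf

# Head of the network: average the 3 x 2 x 1024 convolutional activations
# into a 1024-d embedding, then one logistic (sigmoid) layer produces the
# 521 per-class scores for a single 0.96 s patch.
conv_activations = tf.keras.Input(shape=(3, 2, 1024))
embedding = tf.keras.layers.GlobalAveragePooling2D()(conv_activations)  # (1024,)
scores = tf.keras.layers.Dense(521, activation='sigmoid')(embedding)    # (521,)
head = tf.keras.Model(conv_activations, [scores, embedding])
```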

### Class vocabulary

The file `yamnet_class_map.csv` describes the audio event classes associated
with each of the 521 outputs of the network.  Its format is:

```text
index,mid,display_name
```

where `index` is the model output index (0..520), `mid` is the machine
identifier for that class (e.g. `/m/09x0r`), and `display_name` is a
human-readable description of the class (e.g. `Speech`).
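
The map is a plain CSV with a header row, so it can be read with the standard
`csv` module, for example:

```python
import csv

# Read yamnet_class_map.csv into a list where position i holds the
# display name for model output index i.
with open('yamnet_class_map.csv', newline='') as f:
    reader = csv.DictReader(f)  # columns: index, mid, display_name
    class_names = [row['display_name'] for row in reader]

assert len(class_names) == 521
print(class_names[0])  # e.g. 'Speech' (mid /m/09x0r)
```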

The original AudioSet data release had 527 classes.  This model drops six of
them on the recommendation of our Fairness reviewers to avoid potentially
offensive mislabelings.  We dropped the gendered versions (Male/Female) of
Speech and Singing.  We also dropped Battle cry and Funny music.

### Performance

On the 20,366-segment AudioSet eval set, over the 521 included classes, the
balanced average d-prime is 2.318, balanced mAP is 0.306, and the balanced
average lwlrap is 0.393.

According to our calculations, the classifier has 3.7M weights and performs
69.2M multiplies for each 960 ms input frame.
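
For reference, d-prime as commonly reported for AudioSet classifiers is a
monotone transform of per-class ROC AUC, d' = sqrt(2) * Phi^(-1)(AUC). A quick
conversion (using `scipy`, which YAMNet itself does not require):

```python
import numpy as np
from scipy.stats import norm

def d_prime_from_auc(auc):
    """d' = sqrt(2) * inverse-normal-CDF(AUC)."""
    return np.sqrt(2.0) * norm.ppf(auc)

def auc_from_d_prime(d_prime):
    return norm.cdf(d_prime / np.sqrt(2.0))

# A d-prime of 2.318 corresponds to a ROC AUC of roughly 0.95.
print(auc_from_d_prime(2.318))
```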

### Contact information

This model repository is maintained by [Manoj Plakal](https://github.com/plakal) and [Dan Ellis](https://github.com/dpwe).