@@ -0,0 +1,76 @@
1 |
# Contributor Covenant Code of Conduct
2 |
3 |
## Our Pledge
4 |
5 |
In the interest of fostering an open and welcoming environment, we as
6 |
contributors and maintainers pledge to making participation in our project and
7 |
our community a harassment-free experience for everyone, regardless of age, body
8 |
size, disability, ethnicity, sex characteristics, gender identity and expression,
9 |
level of experience, education, socio-economic status, nationality, personal
10 |
appearance, race, religion, or sexual identity and orientation.
11 |
12 |
## Our Standards
13 |
14 |
Examples of behavior that contributes to creating a positive environment include:
15 |
16 |
17 |
* Using welcoming and inclusive language
18 |
* Being respectful of differing viewpoints and experiences
19 |
* Gracefully accepting constructive criticism
20 |
* Focusing on what is best for the community
21 |
* Showing empathy towards other community members
22 |
23 |
Examples of unacceptable behavior by participants include:
24 |
25 |
* The use of sexualized language or imagery and unwelcome sexual attention or advances
26 |
27 |
* Trolling, insulting/derogatory comments, and personal or political attacks
28 |
* Public or private harassment
29 |
* Publishing others' private information, such as a physical or electronic
30 |
address, without explicit permission
31 |
* Other conduct which could reasonably be considered inappropriate in a
32 |
professional setting
33 |
34 |
## Our Responsibilities
35 |
36 |
Project maintainers are responsible for clarifying the standards of acceptable
37 |
behavior and are expected to take appropriate and fair corrective action in
38 |
response to any instances of unacceptable behavior.
39 |
40 |
Project maintainers have the right and responsibility to remove, edit, or
41 |
reject comments, commits, code, wiki edits, issues, and other contributions
42 |
that are not aligned to this Code of Conduct, or to ban temporarily or
43 |
permanently any contributor for other behaviors that they deem inappropriate,
44 |
threatening, offensive, or harmful.
45 |
46 |
## Scope
47 |
48 |
This Code of Conduct applies both within project spaces and in public spaces
49 |
when an individual is representing the project or its community. Examples of
50 |
representing a project or community include using an official project e-mail
51 |
address, posting via an official social media account, or acting as an appointed
52 |
representative at an online or offline event. Representation of a project may be
53 |
further defined and clarified by project maintainers.
54 |
55 |
## Enforcement
56 |
57 |
Instances of abusive, harassing, or otherwise unacceptable behavior may be
58 |
reported by contacting the project team at All
59 |
complaints will be reviewed and investigated and will result in a response that
60 |
is deemed necessary and appropriate to the circumstances. The project team is
61 |
obligated to maintain confidentiality with regard to the reporter of an incident.
62 |
Further details of specific enforcement policies may be posted separately.
63 |
64 |
Project maintainers who do not follow or enforce the Code of Conduct in good
65 |
faith may face temporary or permanent repercussions as determined by other
66 |
members of the project's leadership.
67 |
68 |
## Attribution
69 |
70 |
This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
71 |
available at
72 |
73 |
74 |
75 |
For answers to common questions about this code of conduct, see
76 |
@@ -0,0 +1,113 @@
1 |
[![Mailing list : test](]( [![Mailing list : test](]( [![License: CC BY-NC 4.0](](
2 |
3 |
[![Open In Colab](](
4 |
5 |
6 |
7 |
8 |
<h1 align="center">Silero VAD</h1>
9 |
10 |
11 |
**Silero VAD** - pre-trained enterprise-grade [Voice Activity Detector]( (also see our [STT models](
12 |
13 |
14 |
15 |
<p align="center">
16 |
<img src="" />
17 |
18 |
19 |
20 |
21 |
<summary>Real Time Example</summary>
22 |
23 |
24 |
25 |
26 |
27 |
28 |
<h2 align="center">Key Features</h2>
29 |
30 |
31 |
- **Stellar accuracy**
32 |
33 |
Silero VAD has [excellent results]( on speech detection tasks.
34 |
35 |
- **Fast**
36 |
37 |
One audio chunk (30+ ms) [takes]( less than **1ms** to be processed on a single CPU thread. Using batching or GPU can also improve performance considerably. Under certain conditions ONNX may even run up to 4-5x faster.
38 |
39 |
- **Lightweight**
40 |
41 |
JIT model is around one megabyte in size.
42 |
43 |
- **General**
44 |
45 |
Silero VAD was trained on huge corpora that include over **100** languages, and it performs well on audio from different domains with various background noise and quality levels.
46 |
47 |
- **Flexible sampling rate**
48 |
49 |
Silero VAD [supports]( **8000 Hz** and **16000 Hz** [sampling rates](
50 |
51 |
- **Flexible chunk size**
52 |
53 |
The model was trained on **30 ms** chunks. Longer chunks are supported directly; other sizes may work as well.
54 |
55 |
- **Highly Portable**
56 |
57 |
Silero VAD reaps benefits from the rich ecosystems built around **PyTorch** and **ONNX** and runs wherever these runtimes are available.
58 |
59 |
- **No Strings Attached**
60 |
61 |
Published under a permissive license (MIT), Silero VAD has zero strings attached: no telemetry, no keys, no registration, no built-in expiration, and no vendor lock-in.
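
The sampling-rate and chunk-size figures in the list above translate into sample counts with simple arithmetic. The following is a minimal sketch of that conversion; the helper names `chunk_samples` and `chunk_duration_ms` are hypothetical and are not part of the Silero VAD API.

```python
# Hypothetical helpers for illustration only; not part of the Silero VAD API.
SUPPORTED_RATES = (8000, 16000)  # sampling rates named in the feature list


def chunk_samples(duration_ms: float, sample_rate: int) -> int:
    """Number of audio samples in a chunk of the given duration."""
    if sample_rate not in SUPPORTED_RATES:
        raise ValueError(f"unsupported sampling rate: {sample_rate}")
    return int(round(duration_ms * sample_rate / 1000))


def chunk_duration_ms(num_samples: int, sample_rate: int) -> float:
    """Duration in milliseconds of a chunk with the given sample count."""
    return 1000.0 * num_samples / sample_rate


print(chunk_samples(30, 8000))        # 240
print(chunk_samples(30, 16000))       # 480
print(chunk_duration_ms(512, 16000))  # 32.0
```

A 30 ms chunk is therefore 240 samples at 8000 Hz and 480 samples at 16000 Hz, while a 512-sample window at 16000 Hz lasts 32 ms.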
62 |
63 |
64 |
<h2 align="center">Typical Use Cases</h2>
65 |
66 |
67 |
- Voice activity detection for IoT / edge / mobile use cases
68 |
- Data cleaning and preparation, voice detection in general
69 |
- Telephony and call-center automation, voice bots
70 |
- Voice interfaces
71 |
72 |
73 |
<h2 align="center">Links</h2>
74 |
75 |
76 |
77 |
- [Examples and Dependencies](
78 |
- [Quality Metrics](
79 |
- [Performance Metrics](
80 |
- [Versions and Available Models](
81 |
- [Further reading](
82 |
- [FAQ](
83 |
84 |
85 |
<h2 align="center">Get In Touch</h2>
86 |
87 |
88 |
Try our models, create an [issue](, start a [discussion](, join our Telegram [chat](, [email]( us, read our [news](
89 |
90 |
Please see our [wiki]( and [tiers]( for relevant information and [email]( us directly.
91 |
92 |
93 |
94 |
95 |
@misc{Silero VAD,
96 |
author = {Silero Team},
97 |
title = {Silero VAD: pre-trained enterprise-grade Voice Activity Detector (VAD), Number Detector and Language Classifier},
98 |
year = {2021},
99 |
publisher = {GitHub},
100 |
journal = {GitHub repository},
101 |
howpublished = {\url{}},
102 |
commit = {insert_some_commit_here},
103 |
email = {}
104 |
105 |
106 |
107 |
108 |
<h2 align="center">Examples and VAD-based Community Apps</h2>
109 |
110 |
111 |
- Example of VAD ONNX Runtime model usage in [C++](
112 |
113 |
- Voice activity detection for the [browser]( using ONNX Runtime Web
@@ -0,0 +1,84 @@
1 |
# Silero-VAD Dataset
2 |
3 |
> The dataset was created with the support of the Foundation for Assistance to Innovation (Фонд содействия инновациям) as part of the federal project "Artificial Intelligence" of the national program "Digital Economy of the Russian Federation".

The links below point to `.feather` files containing open audio datasets annotated with Silero VAD, together with a short description of each dataset and loading examples. `.feather` files can be opened with the `pandas` library:
7 |
8 |
import pandas as pd
9 |
dataframe = pd.read_feather(PATH_TO_FEATHER_FILE)
10 |
11 |
12 |
Each annotated `.feather` file contains the following columns:
- `speech_timings` - the annotation for the given audio. It is a list of dictionaries of the form `{'start': START_SECOND, 'end': END_SECOND}`, where `START_SECOND` and `END_SECOND` are the start and end times of speech in seconds. The number of dictionaries equals the number of speech segments found in the audio;
- `language` - the ISO code of the audio's language.

The columns that describe how to download the audio files differ between datasets and are described for each dataset below.

**All data is annotated at a time resolution of ~30 milliseconds (`num_samples` = 512)**
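
To illustrate the `speech_timings` format described above, the sketch below sums the speech duration of one row and maps each segment to annotation-frame indices. It assumes a 16 kHz sampling rate (so that 512 samples last 32 ms, i.e. roughly 30 ms); the example values and the helper names are hypothetical.

```python
from typing import Dict, List, Tuple

SAMPLE_RATE = 16000  # assumption: 512 samples per ~30 ms frame implies 16 kHz
NUM_SAMPLES = 512    # annotation step stated in the note above

Timing = Dict[str, float]


def total_speech_seconds(speech_timings: List[Timing]) -> float:
    """Sum of (end - start) over all speech segments, in seconds."""
    return sum(t["end"] - t["start"] for t in speech_timings)


def to_frame_ranges(speech_timings: List[Timing]) -> List[Tuple[int, int]]:
    """Convert second-based segments to (first_frame, last_frame) indices."""
    def frame(seconds: float) -> int:
        # Work in whole samples to avoid floating-point drift.
        return round(seconds * SAMPLE_RATE) // NUM_SAMPLES

    return [(frame(t["start"]), frame(t["end"])) for t in speech_timings]


# Hypothetical value of one `speech_timings` cell.
example = [{"start": 0.5, "end": 2.0}, {"start": 3.2, "end": 4.0}]

print(round(total_speech_seconds(example), 3))  # 2.3
print(to_frame_ranges(example))                 # [(15, 62), (100, 125)]
```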
19 |
20 |
| Name | Hours | Languages | Link | License | md5sum |
| --- | --- | --- | --- | --- | --- |
| **** | 53,138 | 1,596 | [URL]( | [Custom]( | ea404eeaf2cd283b8223f63002be11f9 |
| **** | 9,743 | 6,171[^1] | [URL]( | CC BY-NC-SA 4.0 | 3c5c0f31b0abd9fe94ddbe8b1e2eb326 |
| **VoxLingua107** | 6,628 | 107 | [URL]( | CC BY 4.0 | 5dfef33b4d091b6d399cfaf3d05f2140 |
| **Common Voice** | 30,329 | 120 | [URL]( | CC0 | 5e30a85126adf74a5fd1496e6ac8695d |
| **MLS** | 50,709 | 8 | [URL]( | CC BY 4.0 | a339d0e94bdf41bba3c003756254ac4e |
| **Total** | **150,547** | **6,171+** | | | |
28 |
29 |
30 |
31 |
[Link to the annotated `.feather` file](

- The `audio_link` column contains links to the individual audio files.
34 |
35 |
36 |
37 |
[Link to the annotated `.feather` file](

- The `folder_link` column contains download links to the `.zip` archive for a specific language. Note: archive links are repeated, since each archive may contain many audio files.
- The `audio_path` column contains the path to a specific audio file after unpacking the corresponding archive from the `folder_link` column.

``The number of unique ISO codes in this dataset does not match the actual number of languages represented, since some closely related languages may be encoded with the same ISO code.``
43 |
44 |
## VoxLingua107
45 |
46 |
[Link to the annotated `.feather` file](

- The `folder_link` column contains download links to the `.zip` archive for a specific language. Note: archive links are repeated, since each archive may contain many audio files.
- The `audio_path` column contains the path to a specific audio file after unpacking the corresponding archive from the `folder_link` column.
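
Because the same `folder_link` value is repeated for every audio file in an archive, it helps to group rows by archive so that each `.zip` is downloaded only once. Below is a minimal sketch of that grouping; the rows and URLs are hypothetical stand-ins for records read from the `.feather` file, and `group_by_archive` is an illustrative helper, not part of any library.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping


def group_by_archive(rows: Iterable[Mapping[str, str]]) -> Dict[str, List[str]]:
    """Map each unique `folder_link` to the `audio_path` values it contains."""
    archives: Dict[str, List[str]] = defaultdict(list)
    for row in rows:
        archives[row["folder_link"]].append(row["audio_path"])
    return dict(archives)


# Hypothetical rows; real ones would come from the annotated `.feather` file.
rows = [
    {"folder_link": "https://example.com/et.zip", "audio_path": "et/a.wav"},
    {"folder_link": "https://example.com/et.zip", "audio_path": "et/b.wav"},
    {"folder_link": "https://example.com/fi.zip", "audio_path": "fi/c.wav"},
]

archives = group_by_archive(rows)
for link, paths in archives.items():
    # Each archive would be downloaded once, then all of its files located.
    print(link, len(paths))
```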
50 |
51 |
## Common Voice
52 |
53 |
[Link to the annotated `.feather` file](

This dataset cannot be downloaded via static links. To download it, follow the [link](, obtain access through the corresponding form, and download the archives for each available language. Note: the provided annotation is valid for the `Common Voice Corpus 16.1` version of the source dataset.

- The `audio_path` column contains the unique names of the `.mp3` files obtained after downloading the corresponding dataset.
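
Since `audio_path` holds only file names here, one way to match annotation rows to local files is to index the unpacked archives by file name. The sketch below is an assumption about a possible workflow, not a documented procedure; the `et/clips` directory layout and the file names are hypothetical.

```python
import os
import tempfile
from typing import Dict


def index_mp3_files(root: str) -> Dict[str, str]:
    """Map each `.mp3` file name found under `root` to its full path."""
    index: Dict[str, str] = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".mp3"):
                index[name] = os.path.join(dirpath, name)
    return index


# Demo on a temporary directory that mimics an unpacked archive.
with tempfile.TemporaryDirectory() as root:
    clip_dir = os.path.join(root, "et", "clips")  # hypothetical layout
    os.makedirs(clip_dir)
    open(os.path.join(clip_dir, "common_voice_et_1.mp3"), "wb").close()
    open(os.path.join(clip_dir, "notes.txt"), "wb").close()
    index = index_mp3_files(root)
    found = sorted(index)

print(found)  # ['common_voice_et_1.mp3']
```

Looking up a row's `audio_path` value in such an index then yields the local path of the annotated file.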
58 |
59 |
## MLS
60 |
61 |
[Link to the annotated `.feather` file](

- The `folder_link` column contains download links to the `.zip` archive for a specific language. Note: archive links are repeated, since each archive may contain many audio files.
- The `audio_path` column contains the path to a specific audio file after unpacking the corresponding archive from the `folder_link` column.
65 |
66 |
## License

This dataset is distributed under the [license]( `CC BY-NC-SA 4.0`.

## Citation
71 |
72 |
73 |
@misc{Silero VAD Dataset,
74 |
author = {Silero Team},
75 |
title = {Silero-VAD Dataset: a large public Internet-scale dataset for voice activity detection for 6000+ languages},
76 |
year = {2024},
77 |
publisher = {GitHub},
78 |
journal = {GitHub repository},
79 |
howpublished = {\url{}},
80 |
email = {}
81 |
82 |
83 |
84 |
[^1]: ``The number of unique ISO codes in this dataset does not match the actual number of languages represented, since some closely related languages may be encoded with the same ISO code.``
1 |
2 |
"cells": [
3 |
4 |
"cell_type": "markdown",
5 |
"metadata": {
6 |
"id": "bccAucKjnPHm"
7 |
8 |
"source": [
9 |
"### Dependencies and inputs"
10 |
11 |
12 |
13 |
"cell_type": "code",
14 |
"execution_count": null,
15 |
"metadata": {
16 |
"id": "cSih95WFmwgi"
17 |
18 |
"outputs": [],
19 |
"source": [
20 |
"!pip -q install pydub\n",
21 |
"from google.colab import output\n",
22 |
"from base64 import b64decode, b64encode\n",
23 |
"from io import BytesIO\n",
24 |
"import numpy as np\n",
25 |
"from pydub import AudioSegment\n",
26 |
"from IPython.display import HTML, display\n",
27 |
"import torch\n",
28 |
"import matplotlib.pyplot as plt\n",
29 |
"import moviepy.editor as mpe\n",
30 |
"from matplotlib.animation import FuncAnimation, FFMpegWriter\n",
31 |
"import matplotlib\n",
32 |
33 |
34 |
35 |
36 |
"model, _ = torch.hub.load(repo_or_dir='snakers4/silero-vad',\n",
37 |
" model='silero_vad',\n",
38 |
" force_reload=True)\n",
39 |
40 |
"def int2float(sound):\n",
41 |
" abs_max = np.abs(sound).max()\n",
42 |
" sound = sound.astype('float32')\n",
43 |
" if abs_max > 0:\n",
44 |
" sound *= 1/32768\n",
45 |
" sound = sound.squeeze()\n",
46 |
" return sound\n",
47 |
48 |
"AUDIO_HTML = \"\"\"\n",
49 |
50 |
"var my_div = document.createElement(\"DIV\");\n",
51 |
"var my_p = document.createElement(\"P\");\n",
52 |
"var my_btn = document.createElement(\"BUTTON\");\n",
53 |
"var t = document.createTextNode(\"Press to start recording\");\n",
54 |
55 |
56 |
57 |
58 |
59 |
60 |
"var base64data = 0;\n",
61 |
"var reader;\n",
62 |
"var recorder, gumStream;\n",
63 |
"var recordButton = my_btn;\n",
64 |
65 |
"var handleSuccess = function(stream) {\n",
66 |
" gumStream = stream;\n",
67 |
" var options = {\n",
68 |
" //bitsPerSecond: 8000, //chrome seems to ignore, always 48k\n",
69 |
" mimeType : 'audio/webm;codecs=opus'\n",
70 |
" //mimeType : 'audio/webm;codecs=pcm'\n",
71 |
" }; \n",
72 |
" //recorder = new MediaRecorder(stream, options);\n",
73 |
" recorder = new MediaRecorder(stream);\n",
74 |
" recorder.ondataavailable = function(e) { \n",
"      var url = URL.createObjectURL(e.data);\n",
76 |
" // var preview = document.createElement('audio');\n",
77 |
" // preview.controls = true;\n",
78 |
" // preview.src = url;\n",
79 |
" // document.body.appendChild(preview);\n",
80 |
81 |
" reader = new FileReader();\n",
"      reader.readAsDataURL(e.data); \n",
83 |
" reader.onloadend = function() {\n",
84 |
" base64data = reader.result;\n",
85 |
" //console.log(\"Inside FileReader:\" + base64data);\n",
86 |
" }\n",
87 |
" };\n",
88 |
" recorder.start();\n",
89 |
" };\n",
90 |
91 |
"recordButton.innerText = \"Recording... press to stop\";\n",
92 |
93 |
"navigator.mediaDevices.getUserMedia({audio: true}).then(handleSuccess);\n",
94 |
95 |
96 |
"function toggleRecording() {\n",
97 |
" if (recorder && recorder.state == \"recording\") {\n",
98 |
" recorder.stop();\n",
99 |
" gumStream.getAudioTracks()[0].stop();\n",
100 |
" recordButton.innerText = \"Saving recording...\"\n",
101 |
" }\n",
102 |
103 |
104 |
105 |
"function sleep(ms) {\n",
106 |
" return new Promise(resolve => setTimeout(resolve, ms));\n",
107 |
108 |
109 |
"var data = new Promise(resolve=>{\n",
110 |
"//recordButton.addEventListener(\"click\", toggleRecording);\n",
111 |
"recordButton.onclick = ()=>{\n",
112 |
113 |
114 |
"sleep(2000).then(() => {\n",
115 |
" // wait 2000ms for the data to be available...\n",
116 |
" // ideally this should use something like await...\n",
117 |
" //console.log(\"Inside data:\" + base64data)\n",
118 |
" resolve(base64data.toString())\n",
119 |
120 |
121 |
122 |
123 |
124 |
" \n",
125 |
126 |
127 |
128 |
"def record(sec=10):\n",
129 |
" display(HTML(AUDIO_HTML))\n",
130 |
" s = output.eval_js(\"data\")\n",
131 |
" b = b64decode(s.split(',')[1])\n",
132 |
" audio = AudioSegment.from_file(BytesIO(b))\n",
133 |
" audio.export('test.mp3', format='mp3')\n",
134 |
" audio = audio.set_channels(1)\n",
135 |
" audio = audio.set_frame_rate(16000)\n",
136 |
" audio_float = int2float(np.array(audio.get_array_of_samples()))\n",
137 |
" audio_tens = torch.tensor(audio_float )\n",
138 |
" return audio_tens\n",
139 |
140 |
"def make_animation(probs, audio_duration, interval=40):\n",
141 |
" fig = plt.figure(figsize=(16, 9))\n",
142 |
" ax = plt.axes(xlim=(0, audio_duration), ylim=(0, 1.02))\n",
143 |
" line, = ax.plot([], [], lw=2)\n",
144 |
" x = [i / 16000 * 512 for i in range(len(probs))]\n",
145 |
" plt.xlabel('Time, seconds', fontsize=16)\n",
146 |
" plt.ylabel('Speech Probability', fontsize=16)\n",
147 |
148 |
" def init():\n",
149 |
" plt.fill_between(x, probs, color='#064273')\n",
150 |
" line.set_data([], [])\n",
151 |
" line.set_color('#990000')\n",
152 |
" return line,\n",
153 |
154 |
" def animate(i):\n",
155 |
" x = i * interval / 1000 - 0.04\n",
156 |
" y = np.linspace(0, 1.02, 2)\n",
157 |
" \n",
158 |
" line.set_data(x, y)\n",
159 |
" line.set_color('#990000')\n",
160 |
" return line,\n",
161 |
162 |
" anim = FuncAnimation(fig, animate, init_func=init, interval=interval, save_count=audio_duration / (interval / 1000))\n",
163 |
164 |
" f = r\"animation.mp4\" \n",
165 |
" writervideo = FFMpegWriter(fps=1000/interval) \n",
"  anim.save(f, writer=writervideo)\n",
167 |
" plt.close('all')\n",
168 |
169 |
"def combine_audio(vidname, audname, outname, fps=25): \n",
170 |
" my_clip = mpe.VideoFileClip(vidname, verbose=False)\n",
171 |
" audio_background = mpe.AudioFileClip(audname)\n",
172 |
" final_clip = my_clip.set_audio(audio_background)\n",
173 |
" final_clip.write_videofile(outname,fps=fps,verbose=False)\n",
174 |
175 |
"def record_make_animation():\n",
176 |
" tensor = record()\n",
177 |
178 |
" print('Calculating probabilities...')\n",
179 |
" speech_probs = []\n",
180 |
" window_size_samples = 512\n",
181 |
" for i in range(0, len(tensor), window_size_samples):\n",
182 |
" if len(tensor[i: i+ window_size_samples]) < window_size_samples:\n",
183 |
" break\n",
184 |
" speech_prob = model(tensor[i: i+ window_size_samples], 16000).item()\n",
185 |
" speech_probs.append(speech_prob)\n",
186 |
" model.reset_states()\n",
187 |
" print('Making animation...')\n",
188 |
" make_animation(speech_probs, len(tensor) / 16000)\n",
189 |
190 |
" print('Merging your voice with animation...')\n",
191 |
" combine_audio('animation.mp4', 'test.mp3', 'merged.mp4')\n",
192 |
" print('Done!')\n",
193 |
" mp4 = open('merged.mp4','rb').read()\n",
194 |
" data_url = \"data:video/mp4;base64,\" + b64encode(mp4).decode()\n",
195 |
" display(HTML(\"\"\"\n",
196 |
" <video width=800 controls>\n",
197 |
" <source src=\"%s\" type=\"video/mp4\">\n",
198 |
" </video>\n",
199 |
" \"\"\" % data_url))"
200 |
201 |
202 |
203 |
"cell_type": "markdown",
204 |
"metadata": {
205 |
"id": "IFVs3GvTnpB1"
206 |
207 |
"source": [
208 |
"## Record example"
209 |
210 |
211 |
212 |
"cell_type": "code",
213 |
"execution_count": null,
214 |
"metadata": {
215 |
"id": "5EBjrTwiqAaQ"
216 |
217 |
"outputs": [],
218 |
"source": [
219 |
220 |
221 |
222 |
223 |
"metadata": {
224 |
"colab": {
225 |
"collapsed_sections": [
226 |
227 |
228 |
"name": "Untitled2.ipynb",
229 |
"provenance": []
230 |
231 |
"kernelspec": {
232 |
"display_name": "Python 3",
233 |
"name": "python3"
234 |
235 |
"language_info": {
236 |
"name": "python"
237 |
238 |
239 |
"nbformat": 4,
240 |
"nbformat_minor": 0
241 |
1 |
# Stream example in C++
2 |
3 |
Here's a simple example of running the VAD model in C++ with ONNX Runtime.
4 |
5 |
6 |
7 |
## Requirements
8 |
9 |
The code was tested in the environment below; feel free to try others.
10 |
11 |
- WSL2 + Debian-bullseye (docker)
12 |
- gcc 12.2.0
13 |
- onnxruntime-linux-x64-1.12.1
14 |
15 |
16 |
17 |
## Usage
18 |
19 |
1. Install gcc 12.2.0, or just pull the docker image with `docker pull gcc:12.2.0-bullseye`
20 |
21 |
2. Install onnxruntime-linux-x64-1.12.1
22 |
23 |
- Download lib onnxruntime:
24 |
25 |
26 |
27 |
- Unzip. Assume the path is `/root/onnxruntime-linux-x64-1.12.1`
28 |
29 |
3. Modify the wav path and the test configs in the `main` function
30 |
31 |
`wav::WavReader wav_reader("${path_to_your_wav_file}");`
32 |
33 |
The test configs include the sample rate, frame size in ms, threshold, and so on.
34 |
35 |
4. Build with gcc and run
36 |
37 |
38 |
# Build
39 |
g++ silero-vad-onnx.cpp -I /root/onnxruntime-linux-x64-1.12.1/include/ -L /root/onnxruntime-linux-x64-1.12.1/lib/ -lonnxruntime -Wl,-rpath,/root/onnxruntime-linux-x64-1.12.1/lib/ -o test
40 |
41 |
# Run
43 |
1 |
#include <iostream>
2 |
#include <vector>
3 |
#include <sstream>
4 |
#include <cstring>
5 |
#include <limits>
6 |
#include <chrono>
7 |
#include <memory>
8 |
#include <string>
9 |
#include <stdexcept>
12 |
#include "onnxruntime_cxx_api.h"
13 |
#include "wav.h"
14 |
#include <cstdio>
15 |
#include <cstdarg>
#if __cplusplus < 201703L
#include <memory>
#endif
19 |
20 |
//#define __DEBUG_SPEECH_PROB___
21 |
22 |
23 |
24 |
25 |
int start;
26 |
int end;
27 |
28 |
// default + parameterized constructor
29 |
timestamp_t(int start = -1, int end = -1)
30 |
: start(start), end(end)
31 |
32 |
33 |
34 |
// assignment operator modifies object, therefore non-const
35 |
timestamp_t& operator=(const timestamp_t& a)
36 |
37 |
start = a.start;
38 |
end = a.end;
39 |
return *this;
40 |
41 |
42 |
// equality comparison. doesn't modify object. therefore const.
43 |
bool operator==(const timestamp_t& a) const
44 |
45 |
return (start == a.start && end == a.end);
46 |
47 |
std::string c_str()
48 |
//return std::format("timestamp {:08d}, {:08d}", start, end);
50 |
return format("{start:%08d,end:%08d}", start, end);
51 |
52 |
53 |
std::string format(const char* fmt, ...)
{
    char buf[256];

    va_list args;
    va_start(args, fmt);
    const auto r = std::vsnprintf(buf, sizeof buf, fmt, args);
    va_end(args);

    if (r < 0)
        // conversion failed
        return {};

    const size_t len = r;
    if (len < sizeof buf)
        // we fit in the buffer
        return { buf, len };

#if __cplusplus >= 201703L
    // C++17: Create a string and write to its underlying array
    std::string s(len, '\0');
    va_start(args, fmt);
    std::vsnprintf(s.data(), len + 1, fmt, args);
    va_end(args);

    return s;
#else
    // C++11 or C++14: We need to allocate scratch memory
    auto vbuf = std::unique_ptr<char[]>(new char[len + 1]);
    va_start(args, fmt);
    std::vsnprintf(vbuf.get(), len + 1, fmt, args);
    va_end(args);

    return { vbuf.get(), len };
#endif
}
90 |
91 |
92 |
93 |
class VadIterator
94 |
95 |
96 |
// OnnxRuntime resources
97 |
Ort::Env env;
98 |
Ort::SessionOptions session_options;
99 |
std::shared_ptr<Ort::Session> session = nullptr;
100 |
Ort::AllocatorWithDefaultOptions allocator;
101 |
Ort::MemoryInfo memory_info = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeCPU);
102 |
103 |
104 |
void init_engine_threads(int inter_threads, int intra_threads)
105 |
106 |
// The method should be called in each thread/proc in multi-thread/proc work
107 |
108 |
109 |
110 |
111 |
112 |
void init_onnx_model(const std::wstring& model_path)
113 |
114 |
// Init threads = 1 for
115 |
init_engine_threads(1, 1);
116 |
// Load model
117 |
session = std::make_shared<Ort::Session>(env, model_path.c_str(), session_options);
118 |
119 |
120 |
void reset_states()
121 |
122 |
// Call reset before each audio start
std::memset(_h.data(), 0, _h.size() * sizeof(float));
std::memset(_c.data(), 0, _c.size() * sizeof(float));
125 |
triggered = false;
126 |
temp_end = 0;
127 |
current_sample = 0;
128 |
129 |
prev_end = next_start = 0;
130 |
131 |
132 |
current_speech = timestamp_t();
133 |
134 |
135 |
void predict(const std::vector<float> &data)
136 |
137 |
// Infer
138 |
// Create ort tensors
139 |
input.assign(data.begin(), data.end());
Ort::Value input_ort = Ort::Value::CreateTensor<float>(
memory_info, input.data(), input.size(), input_node_dims, 2);
Ort::Value sr_ort = Ort::Value::CreateTensor<int64_t>(
memory_info, sr.data(), sr.size(), sr_node_dims, 1);
Ort::Value h_ort = Ort::Value::CreateTensor<float>(
memory_info, _h.data(), _h.size(), hc_node_dims, 3);
Ort::Value c_ort = Ort::Value::CreateTensor<float>(
memory_info, _c.data(), _c.size(), hc_node_dims, 3);
148 |
149 |
// Clear and add inputs
150 |
151 |
152 |
153 |
154 |
155 |
156 |
// Infer
157 |
ort_outputs = session->Run(
Ort::RunOptions{nullptr},
input_node_names.data(), ort_inputs.data(), ort_inputs.size(),
output_node_names.data(), output_node_names.size());
161 |
162 |
// Output probability & update h,c recursively
163 |
float speech_prob = ort_outputs[0].GetTensorMutableData<float>()[0];
float *hn = ort_outputs[1].GetTensorMutableData<float>();
std::memcpy(_h.data(), hn, size_hc * sizeof(float));
float *cn = ort_outputs[2].GetTensorMutableData<float>();
std::memcpy(_c.data(), cn, size_hc * sizeof(float));
168 |
169 |
// Push forward sample index
170 |
current_sample += window_size_samples;
171 |
172 |
// Reset temp_end when > threshold
173 |
if ((speech_prob >= threshold))
174 |
175 |
176 |
float speech = current_sample - window_size_samples; // minus window_size_samples to get precise start time point.
177 |
printf("{ start: %.3f s (%.3f) %08d}\n", 1.0 * speech / sample_rate, speech_prob, current_sample- window_size_samples);
178 |
#endif //__DEBUG_SPEECH_PROB___
179 |
if (temp_end != 0)
180 |
181 |
temp_end = 0;
182 |
if (next_start < prev_end)
183 |
next_start = current_sample - window_size_samples;
184 |
185 |
if (triggered == false)
186 |
187 |
triggered = true;
188 |
189 |
current_speech.start = current_sample - window_size_samples;
190 |
191 |
192 |
193 |
194 |
if (
195 |
(triggered == true)
196 |
&& ((current_sample - current_speech.start) > max_speech_samples)
197 |
) {
198 |
if (prev_end > 0) {
199 |
current_speech.end = prev_end;
200 |
201 |
current_speech = timestamp_t();
202 |
203 |
// previously reached silence(< neg_thres) and is still not speech(< thres)
204 |
if (next_start < prev_end)
205 |
triggered = false;
206 |
207 |
current_speech.start = next_start;
208 |
209 |
prev_end = 0;
210 |
next_start = 0;
211 |
temp_end = 0;
212 |
213 |
214 |
215 |
current_speech.end = current_sample;
216 |
217 |
current_speech = timestamp_t();
218 |
prev_end = 0;
219 |
next_start = 0;
220 |
temp_end = 0;
221 |
triggered = false;
222 |
223 |
224 |
225 |
226 |
if ((speech_prob >= (threshold - 0.15)) && (speech_prob < threshold))
227 |
228 |
if (triggered) {
229 |
230 |
float speech = current_sample - window_size_samples; // minus window_size_samples to get precise start time point.
printf("{ speaking: %.3f s (%.3f) %08d}\n", 1.0 * speech / sample_rate, speech_prob, current_sample - window_size_samples);
232 |
#endif //__DEBUG_SPEECH_PROB___
233 |
234 |
else {
235 |
236 |
float speech = current_sample - window_size_samples; // minus window_size_samples to get precise start time point.
237 |
printf("{ silence: %.3f s (%.3f) %08d}\n", 1.0 * speech / sample_rate, speech_prob, current_sample - window_size_samples);
238 |
#endif //__DEBUG_SPEECH_PROB___
239 |
240 |
241 |
242 |
243 |
244 |
// 4) End
245 |
if ((speech_prob < (threshold - 0.15)))
246 |
247 |
248 |
float speech = current_sample - window_size_samples - speech_pad_samples; // minus window_size_samples to get precise start time point.
249 |
printf("{ end: %.3f s (%.3f) %08d}\n", 1.0 * speech / sample_rate, speech_prob, current_sample - window_size_samples);
250 |
#endif //__DEBUG_SPEECH_PROB___
251 |
if (triggered == true)
252 |
253 |
if (temp_end == 0)
254 |
255 |
temp_end = current_sample;
256 |
257 |
if (current_sample - temp_end > min_silence_samples_at_max_speech)
258 |
prev_end = temp_end;
// a. silence < min_silence_samples, continue speaking
260 |
if ((current_sample - temp_end) < min_silence_samples)
261 |
262 |
263 |
264 |
// b. silence >= min_silence_samples, end speaking
265 |
266 |
267 |
current_speech.end = temp_end;
268 |
if (current_speech.end - current_speech.start > min_speech_samples)
269 |
270 |
271 |
current_speech = timestamp_t();
272 |
prev_end = 0;
273 |
next_start = 0;
274 |
temp_end = 0;
275 |
triggered = false;
276 |
277 |
278 |
279 |
else {
280 |
// the first windows may already be in the end state.
281 |
282 |
283 |
284 |
285 |
286 |
void process(const std::vector<float>& input_wav)
287 |
288 |
289 |
290 |
audio_length_samples = input_wav.size();
291 |
292 |
for (int j = 0; j < audio_length_samples; j += window_size_samples)
293 |
294 |
if (j + window_size_samples > audio_length_samples)
295 |
296 |
std::vector<float> r{ &input_wav[0] + j, &input_wav[0] + j + window_size_samples };
297 |
298 |
299 |
300 |
if (current_speech.start >= 0) {
301 |
current_speech.end = audio_length_samples;
302 |
303 |
current_speech = timestamp_t();
304 |
prev_end = 0;
305 |
next_start = 0;
306 |
temp_end = 0;
triggered = false;
308 |
309 |
310 |
311 |
void process(const std::vector<float>& input_wav, std::vector<float>& output_wav)
312 |
313 |
314 |
collect_chunks(input_wav, output_wav);
315 |
316 |
317 |
void collect_chunks(const std::vector<float>& input_wav, std::vector<float>& output_wav)
318 |
319 |
320 |
for (int i = 0; i < speeches.size(); i++) {
321 |
322 |
std::cout << speeches[i].c_str() << std::endl;
323 |
#endif //#ifdef __DEBUG_SPEECH_PROB___
324 |
std::vector<float> slice(&input_wav[speeches[i].start], &input_wav[speeches[i].end]);
325 |
326 |
327 |
328 |
329 |
const std::vector<timestamp_t> get_speech_timestamps() const
330 |
331 |
return speeches;
332 |
333 |
334 |
void drop_chunks(const std::vector<float>& input_wav, std::vector<float>& output_wav)
335 |
336 |
337 |
int current_start = 0;
338 |
for (int i = 0; i < speeches.size(); i++) {
339 |
340 |
std::vector<float> slice(&input_wav[current_start],&input_wav[speeches[i].start]);
341 |
output_wav.insert(output_wav.end(), slice.begin(), slice.end());
342 |
current_start = speeches[i].end;
343 |
344 |
345 |
std::vector<float> slice(&input_wav[current_start], &input_wav[input_wav.size()]);
346 |
output_wav.insert(output_wav.end(), slice.begin(), slice.end());
347 |
348 |
349 |
350 |
// model config
351 |
int64_t window_size_samples; // Assign when init, support 256 512 768 for 8k; 512 1024 1536 for 16k.
352 |
int sample_rate; //Assign when init support 16000 or 8000
353 |
int sr_per_ms; // Assign when init, support 8 or 16
354 |
float threshold;
355 |
int min_silence_samples; // sr_per_ms * #ms
356 |
int min_silence_samples_at_max_speech; // sr_per_ms * #98
357 |
int min_speech_samples; // sr_per_ms * #ms
358 |
float max_speech_samples;
359 |
int speech_pad_samples; // usually a
360 |
int audio_length_samples;
361 |
362 |
// model states
363 |
bool triggered = false;
364 |
unsigned int temp_end = 0;
365 |
unsigned int current_sample = 0;
366 |
// MAX 4294967295 samples / 8sample per ms / 1000 / 60 = 8947 minutes
367 |
int prev_end = 0;
368 |
int next_start = 0;
369 |
370 |
//Output timestamp
371 |
std::vector<timestamp_t> speeches;
372 |
timestamp_t current_speech;
373 |
374 |
375 |
// Onnx model
376 |
// Inputs
377 |
std::vector<Ort::Value> ort_inputs;
378 |
379 |
std::vector<const char *> input_node_names = {"input", "sr", "h", "c"};
380 |
std::vector<float> input;
381 |
std::vector<int64_t> sr;
382 |
unsigned int size_hc = 2 * 1 * 64; // It's FIXED.
383 |
std::vector<float> _h;
384 |
std::vector<float> _c;
385 |
386 |
int64_t input_node_dims[2] = {};
387 |
const int64_t sr_node_dims[1] = {1};
388 |
const int64_t hc_node_dims[3] = {2, 1, 64};
389 |
390 |
// Outputs
391 |
std::vector<Ort::Value> ort_outputs;
392 |
std::vector<const char *> output_node_names = {"output", "hn", "cn"};
393 |
394 |
395 |
// Construction
396 |
VadIterator(const std::wstring ModelPath,
397 |
int Sample_rate = 16000, int windows_frame_size = 64,
398 |
float Threshold = 0.5, int min_silence_duration_ms = 0,
399 |
int speech_pad_ms = 64, int min_speech_duration_ms = 64,
400 |
float max_speech_duration_s = std::numeric_limits<float>::infinity())
401 |
402 |
403 |
threshold = Threshold;
404 |
sample_rate = Sample_rate;
405 |
sr_per_ms = sample_rate / 1000;
406 |
407 |
window_size_samples = windows_frame_size * sr_per_ms;
408 |
409 |
min_speech_samples = sr_per_ms * min_speech_duration_ms;
410 |
speech_pad_samples = sr_per_ms * speech_pad_ms;
411 |
412 |
max_speech_samples = (
413 |
sample_rate * max_speech_duration_s
414 |
- window_size_samples
415 |
- 2 * speech_pad_samples
416 |
417 |
418 |
min_silence_samples = sr_per_ms * min_silence_duration_ms;
419 |
min_silence_samples_at_max_speech = sr_per_ms * 98;
420 |
421 |
422 |
input_node_dims[0] = 1;
423 |
input_node_dims[1] = window_size_samples;
424 |
425 |
426 |
427 |
428 |
sr[0] = sample_rate;
429 |
430 |
431 |
432 |
int main()
433 |
434 |
std::vector<timestamp_t> stamps;
435 |
436 |
// Read wav
437 |
wav::WavReader wav_reader("recorder.wav"); //16000,1,32float
438 |
std::vector<float> input_wav(wav_reader.num_samples());
439 |
std::vector<float> output_wav;
440 |
441 |
for (int i = 0; i < wav_reader.num_samples(); i++)
442 |
443 |
input_wav[i] = static_cast<float>(*( + i));
444 |
445 |
446 |
447 |
448 |
// ===== Test configs =====
449 |
std::wstring path = L"silero_vad.onnx";
450 |
VadIterator vad(path);
451 |
452 |
// ==============================================
453 |
// ===== Example 1 of full function =====
454 |
// ==============================================
455 |
456 |
457 |
// 1.a get_speech_timestamps
458 |
stamps = vad.get_speech_timestamps();
459 |
for (int i = 0; i < stamps.size(); i++) {
460 |
461 |
std::cout << stamps[i].c_str() << std::endl;
462 |
463 |
464 |
// 1.b collect_chunks output wav
465 |
vad.collect_chunks(input_wav, output_wav);
466 |
467 |
// 1.c drop_chunks output wav
468 |
vad.drop_chunks(input_wav, output_wav);
469 |
470 |
// ==============================================
471 |
// ===== Example 2 of simple full function =====
472 |
// ==============================================
473 |
vad.process(input_wav, output_wav);
474 |
475 |
stamps = vad.get_speech_timestamps();
476 |
for (int i = 0; i < stamps.size(); i++) {
477 |
478 |
std::cout << stamps[i].c_str() << std::endl;
479 |
480 |
481 |
// ==============================================
482 |
// ===== Example 3 of full function =====
483 |
// ==============================================
484 |
for(int i = 0; i<2; i++)
485 |
vad.process(input_wav, output_wav);
486 |
1 |
// Copyright (c) 2016 Personal (Binbin Zhang)
2 |
3 |
// Licensed under the Apache License, Version 2.0 (the "License");
4 |
// you may not use this file except in compliance with the License.
5 |
// You may obtain a copy of the License at
6 |
7 |
8 |
9 |
// Unless required by applicable law or agreed to in writing, software
10 |
// distributed under the License is distributed on an "AS IS" BASIS,
11 |
12 |
// See the License for the specific language governing permissions and
13 |
// limitations under the License.
14 |
15 |
16 |
17 |
18 |
19 |
#include <assert.h>
20 |
#include <stdint.h>
21 |
#include <stdio.h>
22 |
#include <stdlib.h>
23 |
#include <string.h>
24 |
25 |
#include <iostream>  // std::cout is used by WavReader below
#include <string>
26 |
27 |
// #include "utils/log.h"
28 |
29 |
namespace wav {
30 |
31 |
struct WavHeader {
32 |
char riff[4]; // "riff"
33 |
unsigned int size;
34 |
char wav[4]; // "WAVE"
35 |
char fmt[4]; // "fmt "
36 |
unsigned int fmt_size;
37 |
uint16_t format;
38 |
uint16_t channels;
39 |
unsigned int sample_rate;
40 |
unsigned int bytes_per_second;
41 |
uint16_t block_size;
42 |
uint16_t bit;
43 |
char data[4]; // "data"
44 |
unsigned int data_size;
45 |
46 |
47 |
class WavReader {
48 |
49 |
WavReader() : data_(nullptr) {}
50 |
explicit WavReader(const std::string& filename) { Open(filename); }
51 |
52 |
bool Open(const std::string& filename) {
53 |
FILE* fp = fopen(filename.c_str(), "rb");  // open the file for binary reading
54 |
if (NULL == fp) {
55 |
std::cout << "Error in read " << filename;
56 |
return false;
57 |
58 |
59 |
WavHeader header;
60 |
fread(&header, 1, sizeof(header), fp);
61 |
if (header.fmt_size < 16) {
62 |
printf("WaveData: expect PCM format data "
63 |
"to have fmt chunk of at least size 16.\n");
64 |
return false;
65 |
} else if (header.fmt_size > 16) {
66 |
int offset = 44 - 8 + header.fmt_size - 16;
67 |
fseek(fp, offset, SEEK_SET);
68 |
fread(, 8, sizeof(char), fp);
69 |
70 |
// check "riff" "WAVE" "fmt " "data"
71 |
72 |
// Skip any sub-chunks between "fmt" and "data". Usually there will
73 |
// be a single "fact" sub chunk, but on Windows there can also be a
74 |
// "list" sub chunk.
75 |
while (0 != strncmp(, "data", 4)) {
76 |
// We will just ignore the data in these chunks.
77 |
fseek(fp, header.data_size, SEEK_CUR);
78 |
// read next sub chunk
79 |
fread(, 8, sizeof(char), fp);
80 |
81 |
82 |
if (header.data_size == 0) {
83 |
int offset = ftell(fp);
84 |
fseek(fp, 0, SEEK_END);
85 |
header.data_size = ftell(fp) - offset;
86 |
fseek(fp, offset, SEEK_SET);
87 |
88 |
89 |
num_channel_ = header.channels;
90 |
sample_rate_ = header.sample_rate;
91 |
bits_per_sample_ = header.bit;
92 |
int num_data = header.data_size / (bits_per_sample_ / 8);
93 |
data_ = new float[num_data]; // Create 1-dim array
94 |
num_samples_ = num_data / num_channel_;
95 |
96 |
std::cout << "num_channel_ :" << num_channel_ << std::endl;
97 |
std::cout << "sample_rate_ :" << sample_rate_ << std::endl;
98 |
std::cout << "bits_per_sample_:" << bits_per_sample_ << std::endl;
99 |
std::cout << "num_samples :" << num_data << std::endl;
100 |
std::cout << "num_data_size :" << header.data_size << std::endl;
101 |
102 |
switch (bits_per_sample_) {
103 |
case 8: {
104 |
char sample;
105 |
for (int i = 0; i < num_data; ++i) {
106 |
fread(&sample, 1, sizeof(char), fp);
107 |
data_[i] = static_cast<float>(sample) / 32768;
108 |
109 |
110 |
111 |
case 16: {
112 |
int16_t sample;
113 |
for (int i = 0; i < num_data; ++i) {
114 |
fread(&sample, 1, sizeof(int16_t), fp);
115 |
data_[i] = static_cast<float>(sample) / 32768;
116 |
117 |
118 |
119 |
case 32:
120 |
121 |
if (header.format == 1) //S32
122 |
123 |
int sample;
124 |
for (int i = 0; i < num_data; ++i) {
125 |
fread(&sample, 1, sizeof(int), fp);
126 |
data_[i] = static_cast<float>(sample) / 32768;
127 |
128 |
129 |
else if (header.format == 3) // IEEE-float
130 |
131 |
float sample;
132 |
for (int i = 0; i < num_data; ++i) {
133 |
fread(&sample, 1, sizeof(float), fp);
134 |
data_[i] = static_cast<float>(sample);
135 |
136 |
137 |
else {
138 |
printf("unsupported quantization bits\n");
139 |
140 |
141 |
142 |
143 |
printf("unsupported quantization bits\n");
144 |
145 |
146 |
147 |
148 |
return true;
149 |
150 |
151 |
int num_channel() const { return num_channel_; }
152 |
int sample_rate() const { return sample_rate_; }
153 |
int bits_per_sample() const { return bits_per_sample_; }
154 |
int num_samples() const { return num_samples_; }
155 |
156 |
~WavReader() {
157 |
delete[] data_;
158 |
159 |
160 |
const float* data() const { return data_; }
161 |
162 |
163 |
int num_channel_;
164 |
int sample_rate_;
165 |
int bits_per_sample_;
166 |
int num_samples_; // sample points per channel
167 |
float* data_;
168 |
169 |
170 |
class WavWriter {
171 |
172 |
WavWriter(const float* data, int num_samples, int num_channel,
173 |
int sample_rate, int bits_per_sample)
174 |
: data_(data),
175 |
176 |
177 |
178 |
bits_per_sample_(bits_per_sample) {}
179 |
180 |
void Write(const std::string& filename) {
181 |
FILE* fp = fopen(filename.c_str(), "wb");  // binary mode, so bytes are not translated on Windows
182 |
// init char 'riff' 'WAVE' 'fmt ' 'data'
183 |
WavHeader header;
184 |
char wav_header[44] = {0x52, 0x49, 0x46, 0x46, 0x00, 0x00, 0x00, 0x00, 0x57,
185 |
0x41, 0x56, 0x45, 0x66, 0x6d, 0x74, 0x20, 0x10, 0x00,
186 |
0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
187 |
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
188 |
0x64, 0x61, 0x74, 0x61, 0x00, 0x00, 0x00, 0x00};
189 |
memcpy(&header, wav_header, sizeof(header));
190 |
header.channels = num_channel_;
191 |
header.bit = bits_per_sample_;
192 |
header.sample_rate = sample_rate_;
193 |
header.data_size = num_samples_ * num_channel_ * (bits_per_sample_ / 8);
194 |
header.size = sizeof(header) - 8 + header.data_size;
195 |
header.bytes_per_second =
196 |
sample_rate_ * num_channel_ * (bits_per_sample_ / 8);
197 |
header.block_size = num_channel_ * (bits_per_sample_ / 8);
198 |
199 |
fwrite(&header, 1, sizeof(header), fp);
200 |
201 |
for (int i = 0; i < num_samples_; ++i) {
202 |
for (int j = 0; j < num_channel_; ++j) {
203 |
switch (bits_per_sample_) {
204 |
case 8: {
205 |
char sample = static_cast<char>(data_[i * num_channel_ + j]);
206 |
fwrite(&sample, 1, sizeof(sample), fp);
207 |
208 |
209 |
case 16: {
210 |
int16_t sample = static_cast<int16_t>(data_[i * num_channel_ + j]);
211 |
fwrite(&sample, 1, sizeof(sample), fp);
212 |
213 |
214 |
case 32: {
215 |
int sample = static_cast<int>(data_[i * num_channel_ + j]);
216 |
fwrite(&sample, 1, sizeof(sample), fp);
217 |
218 |
219 |
220 |
221 |
222 |
223 |
224 |
225 |
226 |
const float* data_;
227 |
int num_samples_; // total float points in data_
228 |
int num_channel_;
229 |
int sample_rate_;
230 |
int bits_per_sample_;
231 |
232 |
233 |
}  // namespace wav
234 |
235 |
#endif // FRONTEND_WAV_H_
1 |
## Golang Example
2 |
3 |
This sample program shows how to run speech detection with `silero-vad` from Golang (CGO + ONNX Runtime).
4 |
5 |
### Requirements
6 |
7 |
- Golang >= v1.21
8 |
- ONNX Runtime
9 |
10 |
### Usage
11 |
12 |
13 |
go run ./cmd/main.go test.wav
14 |
15 |
16 |
> **_Note_**
17 |
18 |
> Make sure you have the ONNX Runtime library and C headers installed in your path.
19 |
1 |
package main
2 |
3 |
import (
4 |
5 |
6 |
7 |
8 |
9 |
10 |
11 |
12 |
func main() {
13 |
sd, err := speech.NewDetector(speech.DetectorConfig{
14 |
ModelPath: "../../files/silero_vad.onnx",
15 |
SampleRate: 16000,
16 |
WindowSize: 1536,
17 |
Threshold: 0.5,
18 |
MinSilenceDurationMs: 0,
19 |
SpeechPadMs: 0,
20 |
21 |
if err != nil {
22 |
log.Fatalf("failed to create speech detector: %s", err)
23 |
24 |
25 |
f, err := os.Open(os.Args[1])
26 |
if err != nil {
27 |
log.Fatalf("failed to open sample audio file: %s", err)
28 |
29 |
defer f.Close()
30 |
31 |
dec := wav.NewDecoder(f)
32 |
33 |
if ok := dec.IsValidFile(); !ok {
34 |
log.Fatalf("invalid WAV file")
35 |
36 |
37 |
buf, err := dec.FullPCMBuffer()
38 |
if err != nil {
39 |
log.Fatalf("failed to get PCM buffer")
40 |
41 |
42 |
pcmBuf := buf.AsFloat32Buffer()
43 |
44 |
segments, err := sd.Detect(pcmBuf.Data)
45 |
if err != nil {
46 |
log.Fatalf("Detect failed: %s", err)
47 |
48 |
49 |
for _, s := range segments {
50 |
log.Printf("speech starts at %0.2fs", s.SpeechStartAt)
51 |
if s.SpeechEndAt > 0 {
52 |
log.Printf("speech ends at %0.2fs", s.SpeechEndAt)
53 |
54 |
55 |
56 |
err = sd.Destroy()
57 |
if err != nil {
58 |
log.Fatalf("failed to destroy detector: %s", err)
59 |
60 |
1 |
module silero
2 |
3 |
go 1.21.4
4 |
5 |
require (
6 |
+ v1.1.0
7 |
+ v0.1.0
8 |
9 |
10 |
require (
11 |
+ v1.0.0 // indirect
12 |
+ v1.0.0 // indirect
13 |
1 |
+ v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c=
2 |
+ v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
3 |
+ v1.0.0 h1:zS9vebldgbQqktK4H0lUqWrG8P0NxCJVqcj7ZpNnwd4=
4 |
+ v1.0.0/go.mod h1:6uAu0+H2lHkwdGsAY+j2wHPNPpPoeg5AaEFh9FlA+Zs=
5 |
+ v1.0.0 h1:d8iCGbDvox9BfLagY94fBynxSPHO80LmZCaOsmKxokA=
6 |
+ v1.0.0/go.mod h1:l3cQwc85y79NQFCRB7TiPoNiaijp6q8Z0Uv38rVG498=
7 |
+ v1.1.0 h1:jQgLtbqBzY7G+BM8fXF7AHUk1uHUviWS4X39d5rsL2g=
8 |
+ v1.1.0/go.mod h1:mpe9qfwbScEbkd8uybLuIpTgHyrISw/OTuvjUW2iGtE=
9 |
+ v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM=
10 |
+ v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
11 |
+ v0.1.0 h1:0nGZ6VT3LKOkBG/w+4kljIB6brxtgQn6YuNjTVYjOQ4=
12 |
+ v0.1.0/go.mod h1:B+2FXs/5fZ6pzl6unUZYhZqkYdOB+3saBVzjOzdZnUs=
13 |
+ v1.8.4 h1:CcVxjf3Q8PM0mHUKJCdn+eZZtm5yQwehR5yeSVQQcUk=
14 |
+ v1.8.4/go.mod h1:sz/lmYIOXD/1dqDmKjjqLyZ2RngseejIcXlSw2iwfAo=
15 |
+ v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA=
16 |
+ v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=
1 |
<project xmlns="" xmlns:xsi=""
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
11 |
12 |
13 |
14 |
15 |
16 |
17 |
18 |
19 |
20 |
21 |
22 |
23 |
24 |
25 |
26 |
27 |
28 |
29 |
30 |
1 |
package org.example;
2 |
3 |
import ai.onnxruntime.OrtException;
4 |
import javax.sound.sampled.*;
5 |
import java.util.Map;
6 |
7 |
public class App {
8 |
9 |
private static final String MODEL_PATH = "src/main/resources/silero_vad.onnx";
10 |
private static final int SAMPLE_RATE = 16000;
11 |
private static final float START_THRESHOLD = 0.6f;
12 |
private static final float END_THRESHOLD = 0.45f;
13 |
private static final int MIN_SILENCE_DURATION_MS = 600;
14 |
private static final int SPEECH_PAD_MS = 500;
15 |
private static final int WINDOW_SIZE_SAMPLES = 2048;
16 |
17 |
public static void main(String[] args) {
18 |
// Initialize the Voice Activity Detector
19 |
SlieroVadDetector vadDetector;
20 |
try {
21 |
22 |
} catch (OrtException e) {
23 |
System.err.println("Error initializing the VAD detector: " + e.getMessage());
24 |
25 |
26 |
27 |
// Set audio format
28 |
AudioFormat format = new AudioFormat(SAMPLE_RATE, 16, 1, true, false);
29 |
DataLine.Info info = new DataLine.Info(TargetDataLine.class, format);
30 |
31 |
// Get the target data line and open it with the specified format
32 |
TargetDataLine targetDataLine;
33 |
try {
34 |
targetDataLine = (TargetDataLine) AudioSystem.getLine(info);
35 |
36 |
37 |
} catch (LineUnavailableException e) {
38 |
System.err.println("Error opening target data line: " + e.getMessage());
39 |
40 |
41 |
42 |
// Main loop to continuously read data and apply Voice Activity Detection
43 |
while (targetDataLine.isOpen()) {
44 |
byte[] data = new byte[WINDOW_SIZE_SAMPLES];
45 |
46 |
int numBytesRead =, 0, data.length);
47 |
if (numBytesRead <= 0) {
48 |
System.err.println("Error reading data from target data line.");
49 |
50 |
51 |
52 |
// Apply the Voice Activity Detector to the data and get the result
53 |
Map<String, Double> detectResult;
54 |
try {
55 |
detectResult = vadDetector.apply(data, true);
56 |
} catch (Exception e) {
57 |
System.err.println("Error applying VAD detector: " + e.getMessage());
58 |
59 |
60 |
61 |
if (!detectResult.isEmpty()) {
62 |
63 |
64 |
65 |
66 |
// Close the target data line to release audio resources
67 |
68 |
69 |
1 |
package org.example;
2 |
3 |
import ai.onnxruntime.OrtException;
4 |
5 |
import java.math.BigDecimal;
6 |
import java.math.RoundingMode;
7 |
import java.util.Collections;
8 |
import java.util.HashMap;
9 |
import java.util.Map;
10 |
11 |
12 |
public class SlieroVadDetector {
13 |
// OnnxModel model used for speech processing
14 |
private final SlieroVadOnnxModel model;
15 |
// Threshold for speech start
16 |
private final float startThreshold;
17 |
// Threshold for speech end
18 |
private final float endThreshold;
19 |
// Sampling rate
20 |
private final int samplingRate;
21 |
// Minimum number of silence samples to determine the end threshold of speech
22 |
private final float minSilenceSamples;
23 |
// Additional number of samples for speech start or end to calculate speech start or end time
24 |
private final float speechPadSamples;
25 |
// Whether in the triggered state (i.e. whether speech is being detected)
26 |
private boolean triggered;
27 |
// Temporarily stored number of speech end samples
28 |
private int tempEnd;
29 |
// Number of samples currently being processed
30 |
private int currentSample;
31 |
32 |
33 |
public SlieroVadDetector(String modelPath,
34 |
float startThreshold,
35 |
float endThreshold,
36 |
int samplingRate,
37 |
int minSilenceDurationMs,
38 |
int speechPadMs) throws OrtException {
39 |
// Check if the sampling rate is 8000 or 16000, if not, throw an exception
40 |
if (samplingRate != 8000 && samplingRate != 16000) {
41 |
throw new IllegalArgumentException("does not support sampling rates other than [8000, 16000]");
42 |
43 |
44 |
// Initialize the parameters
45 |
this.model = new SlieroVadOnnxModel(modelPath);
46 |
this.startThreshold = startThreshold;
47 |
this.endThreshold = endThreshold;
48 |
this.samplingRate = samplingRate;
49 |
this.minSilenceSamples = samplingRate * minSilenceDurationMs / 1000f;
50 |
this.speechPadSamples = samplingRate * speechPadMs / 1000f;
51 |
// Reset the state
52 |
53 |
54 |
55 |
// Method to reset the state, including the model state, trigger state, temporary end time, and current sample count
56 |
public void reset() {
57 |
58 |
triggered = false;
59 |
tempEnd = 0;
60 |
currentSample = 0;
61 |
62 |
63 |
// apply method for processing the audio array, returning possible speech start or end times
64 |
public Map<String, Double> apply(byte[] data, boolean returnSeconds) {
65 |
66 |
// Convert the byte array to a float array
67 |
float[] audioData = new float[data.length / 2];
68 |
for (int i = 0; i < audioData.length; i++) {
69 |
audioData[i] = ((data[i * 2] & 0xff) | (data[i * 2 + 1] << 8)) / 32767.0f;
70 |
71 |
72 |
// Get the length of the audio array as the window size
73 |
int windowSizeSamples = audioData.length;
74 |
// Update the current sample count
75 |
currentSample += windowSizeSamples;
76 |
77 |
// Call the model to get the prediction probability of speech
78 |
float speechProb = 0;
79 |
try {
80 |
speechProb = float[][]{audioData}, samplingRate)[0];
81 |
} catch (OrtException e) {
82 |
throw new RuntimeException(e);
83 |
84 |
85 |
// If the speech probability reaches the start threshold while a temporary end time is pending, clear the temporary end time
86 |
// Speech resumed before the minimum silence duration elapsed, so the pending end is discarded
87 |
if (speechProb >= startThreshold && tempEnd != 0) {
88 |
tempEnd = 0;
89 |
90 |
91 |
// If the speech probability is greater than the threshold and not in the triggered state, set to triggered state and calculate the speech start time
92 |
if (speechProb >= startThreshold && !triggered) {
93 |
triggered = true;
94 |
int speechStart = (int) (currentSample - speechPadSamples);
95 |
speechStart = Math.max(speechStart, 0);
96 |
Map<String, Double> result = new HashMap<>();
97 |
// Decide whether to return the result in seconds or sample count based on the returnSeconds parameter
98 |
if (returnSeconds) {
99 |
double speechStartSeconds = speechStart / (double) samplingRate;
100 |
double roundedSpeechStart = BigDecimal.valueOf(speechStartSeconds).setScale(1, RoundingMode.HALF_UP).doubleValue();
101 |
result.put("start", roundedSpeechStart);
102 |
} else {
103 |
result.put("start", (double) speechStart);
104 |
105 |
106 |
return result;
107 |
108 |
109 |
// If the speech probability is less than a certain threshold and in the triggered state, calculate the speech end time
110 |
if (speechProb < endThreshold && triggered) {
111 |
// Initialize or update the temporary end time
112 |
if (tempEnd == 0) {
113 |
tempEnd = currentSample;
114 |
115 |
// If the number of silence samples between the current sample and the temporary end time is less than the minimum silence samples, return an empty map
116 |
// This indicates that it is not yet possible to determine whether the speech has ended
117 |
if (currentSample - tempEnd < minSilenceSamples) {
118 |
return Collections.emptyMap();
119 |
} else {
120 |
// Calculate the speech end time, reset the trigger state and temporary end time
121 |
int speechEnd = (int) (tempEnd + speechPadSamples);
122 |
tempEnd = 0;
123 |
triggered = false;
124 |
Map<String, Double> result = new HashMap<>();
125 |
126 |
if (returnSeconds) {
127 |
double speechEndSeconds = speechEnd / (double) samplingRate;
128 |
double roundedSpeechEnd = BigDecimal.valueOf(speechEndSeconds).setScale(1, RoundingMode.HALF_UP).doubleValue();
129 |
result.put("end", roundedSpeechEnd);
130 |
} else {
131 |
result.put("end", (double) speechEnd);
132 |
133 |
return result;
134 |
135 |
136 |
137 |
// If none of the above conditions are met, return an empty map by default
138 |
return Collections.emptyMap();
139 |
140 |
141 |
public void close() throws OrtException {
142 |
143 |
144 |
145 |
1 |
package org.example;
2 |
3 |
import ai.onnxruntime.OnnxTensor;
4 |
import ai.onnxruntime.OrtEnvironment;
5 |
import ai.onnxruntime.OrtException;
6 |
import ai.onnxruntime.OrtSession;
7 |
import java.util.Arrays;
8 |
import java.util.HashMap;
9 |
import java.util.List;
10 |
import java.util.Map;
11 |
12 |
public class SlieroVadOnnxModel {
13 |
// Define private variable OrtSession
14 |
private final OrtSession session;
15 |
private float[][][] h;
16 |
private float[][][] c;
17 |
// Define the last sample rate
18 |
private int lastSr = 0;
19 |
// Define the last batch size
20 |
private int lastBatchSize = 0;
21 |
// Define a list of supported sample rates
22 |
private static final List<Integer> SAMPLE_RATES = Arrays.asList(8000, 16000);
23 |
24 |
// Constructor
25 |
public SlieroVadOnnxModel(String modelPath) throws OrtException {
26 |
// Get the ONNX runtime environment
27 |
OrtEnvironment env = OrtEnvironment.getEnvironment();
28 |
// Create an ONNX session options object
29 |
OrtSession.SessionOptions opts = new OrtSession.SessionOptions();
30 |
// Set the InterOp thread count to 1, InterOp threads are used for parallel processing of different computation graph operations
31 |
32 |
// Set the IntraOp thread count to 1, IntraOp threads are used for parallel processing within a single operation
33 |
34 |
// Add the CPU execution provider; the boolean argument controls whether the arena memory allocator is used
35 |
36 |
// Create an ONNX session using the environment, model path, and options
37 |
session = env.createSession(modelPath, opts);
38 |
// Reset states
39 |
40 |
41 |
42 |
43 |
* Reset states
44 |
45 |
void resetStates() {
46 |
h = new float[2][1][64];
47 |
c = new float[2][1][64];
48 |
lastSr = 0;
49 |
lastBatchSize = 0;
50 |
51 |
52 |
public void close() throws OrtException {
53 |
54 |
55 |
56 |
57 |
* Define inner class ValidationResult
58 |
59 |
public static class ValidationResult {
60 |
public final float[][] x;
61 |
public final int sr;
62 |
63 |
// Constructor
64 |
public ValidationResult(float[][] x, int sr) {
65 |
this.x = x;
66 |
+ = sr;
67 |
68 |
69 |
70 |
71 |
* Function to validate input data
72 |
73 |
private ValidationResult validateInput(float[][] x, int sr) {
74 |
// Process the input data with dimension 1
75 |
if (x.length == 1) {
76 |
x = new float[][]{x[0]};
77 |
78 |
// Throw an exception when the input data dimension is greater than 2
79 |
if (x.length > 2) {
80 |
throw new IllegalArgumentException("Incorrect audio data dimension: " + x[0].length);
81 |
82 |
83 |
// If the sample rate is a multiple of 16000 (but not 16000 itself), downsample to 16000 by keeping every step-th sample
84 |
if (sr != 16000 && (sr % 16000 == 0)) {
85 |
int step = sr / 16000;
86 |
float[][] reducedX = new float[x.length][];
87 |
88 |
for (int i = 0; i < x.length; i++) {
89 |
float[] current = x[i];
90 |
float[] newArr = new float[(current.length + step - 1) / step];
91 |
92 |
for (int j = 0, index = 0; j < current.length; j += step, index++) {
93 |
newArr[index] = current[j];
94 |
95 |
96 |
reducedX[i] = newArr;
97 |
98 |
99 |
x = reducedX;
100 |
sr = 16000;
101 |
102 |
103 |
// If the sample rate is not in the list of supported sample rates, throw an exception
104 |
if (!SAMPLE_RATES.contains(sr)) {
105 |
throw new IllegalArgumentException("Only supports sample rates " + SAMPLE_RATES + " (or multiples of 16000)");
106 |
107 |
108 |
// If the input audio block is too short, throw an exception
109 |
if (((float) sr) / x[0].length > 31.25) {
110 |
throw new IllegalArgumentException("Input audio is too short");
111 |
112 |
113 |
// Return the validated result
114 |
return new ValidationResult(x, sr);
115 |
116 |
117 |
118 |
* Method to call the ONNX model
119 |
120 |
public float[] call(float[][] x, int sr) throws OrtException {
121 |
ValidationResult result = validateInput(x, sr);
122 |
x = result.x;
123 |
sr =;
124 |
125 |
int batchSize = x.length;
126 |
127 |
if (lastBatchSize == 0 || lastSr != sr || lastBatchSize != batchSize) {
128 |
129 |
130 |
131 |
OrtEnvironment env = OrtEnvironment.getEnvironment();
132 |
133 |
OnnxTensor inputTensor = null;
134 |
OnnxTensor hTensor = null;
135 |
OnnxTensor cTensor = null;
136 |
OnnxTensor srTensor = null;
137 |
OrtSession.Result ortOutputs = null;
138 |
139 |
try {
140 |
// Create input tensors
141 |
inputTensor = OnnxTensor.createTensor(env, x);
142 |
hTensor = OnnxTensor.createTensor(env, h);
143 |
cTensor = OnnxTensor.createTensor(env, c);
144 |
srTensor = OnnxTensor.createTensor(env, new long[]{sr});
145 |
146 |
Map<String, OnnxTensor> inputs = new HashMap<>();
147 |
inputs.put("input", inputTensor);
148 |
inputs.put("sr", srTensor);
149 |
inputs.put("h", hTensor);
150 |
inputs.put("c", cTensor);
151 |
152 |
// Call the ONNX model for calculation
153 |
ortOutputs =;
154 |
// Get the output results
155 |
float[][] output = (float[][]) ortOutputs.get(0).getValue();
156 |
h = (float[][][]) ortOutputs.get(1).getValue();
157 |
c = (float[][][]) ortOutputs.get(2).getValue();
158 |
159 |
lastSr = sr;
160 |
lastBatchSize = batchSize;
161 |
return output[0];
162 |
} finally {
163 |
if (inputTensor != null) {
164 |
165 |
166 |
if (hTensor != null) {
167 |
168 |
169 |
if (cTensor != null) {
170 |
171 |
172 |
if (srTensor != null) {
173 |
174 |
175 |
if (ortOutputs != null) {
176 |
177 |
178 |
179 |
180 |
@@ -0,0 +1,28 @@
1 |
2 |
In this example, an integration with the microphone and the webRTC VAD has been done. I used [this]( as a draft.
3 |
Here is a short video presenting the results:
4 |
5 |
6 |
7 |
# Requirements:
8 |
The libraries used for the following example are:
9 |
10 |
Python == 3.6.9
11 |
webrtcvad >= 2.0.10
12 |
torchaudio >= 0.8.1
13 |
torch >= 1.8.1
14 |
halo >= 0.0.31
15 |
Soundfile >= 0.13.3
16 |
17 |
Using pip3:
18 |
19 |
pip3 install webrtcvad
20 |
pip3 install torchaudio
21 |
pip3 install torch
22 |
pip3 install halo
23 |
pip3 install soundfile
24 |
25 |
Moreover, to keep the code simple, the default sample_rate is 16 kHz and no resampling is performed.
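
Because the rate is fixed, the size of each audio block follows from simple arithmetic. The sketch below is not part of the example script; `frame_sizes` and `BYTES_PER_SAMPLE` are hypothetical names, and it assumes 16-bit mono PCM, which is the `pyaudio.paInt16` format the script uses. The webRTC VAD accepts frames of 10, 20 or 30 ms, and the script skips frames shorter than 640 bytes, which would be consistent with 20 ms frames at 16 kHz.

```python
from typing import Tuple

# Assumption: 16-bit mono PCM, i.e. 2 bytes per sample.
BYTES_PER_SAMPLE = 2


def frame_sizes(sample_rate: int, frame_ms: int) -> Tuple[int, int]:
    """Return (samples, bytes) for one frame of the given duration."""
    if (sample_rate * frame_ms) % 1000 != 0:
        raise ValueError("frame duration does not map to a whole number of samples")
    samples = sample_rate * frame_ms // 1000
    return samples, samples * BYTES_PER_SAMPLE


if __name__ == "__main__":
    samples, n_bytes = frame_sizes(16000, 20)
    print(samples, n_bytes)  # 320 640
```

If a microphone only offers another rate, such as 44.1 kHz, the frame sizes change and the audio would have to be resampled first, which this example does not do.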
26 |
27 |
This example has been tested on `Ubuntu 18.04.3 LTS`.
28 |
1 |
import collections, queue
2 |
import numpy as np
3 |
import pyaudio
4 |
import webrtcvad
5 |
from halo import Halo
6 |
import torch
7 |
import torchaudio
8 |
9 |
class Audio(object):
10 |
"""Streams raw audio from microphone. Data is received in a separate thread, and stored in a buffer, to be read from."""
11 |
12 |
FORMAT = pyaudio.paInt16
13 |
# Network/VAD rate-space
14 |
15 |
16 |
17 |
18 |
def __init__(self, callback=None, device=None, input_rate=RATE_PROCESS):
19 |
def proxy_callback(in_data, frame_count, time_info, status):
20 |
#pylint: disable=unused-argument
21 |
22 |
return (None, pyaudio.paContinue)
23 |
if callback is None: callback = lambda in_data: self.buffer_queue.put(in_data)
24 |
self.buffer_queue = queue.Queue()
25 |
self.device = device
26 |
self.input_rate = input_rate
27 |
self.sample_rate = self.RATE_PROCESS
28 |
self.block_size = int(self.RATE_PROCESS / float(self.BLOCKS_PER_SECOND))
29 |
self.block_size_input = int(self.input_rate / float(self.BLOCKS_PER_SECOND))
30 |
+ = pyaudio.PyAudio()
31 |
32 |
kwargs = {
33 |
'format': self.FORMAT,
34 |
'channels': self.CHANNELS,
35 |
'rate': self.input_rate,
36 |
'input': True,
37 |
'frames_per_buffer': self.block_size_input,
38 |
'stream_callback': proxy_callback,
39 |
40 |
41 |
self.chunk = None
42 |
# if not default device
43 |
if self.device:
44 |
kwargs['input_device_index'] = self.device
45 |
46 |
+ =**kwargs)
47 |
48 |
49 |
def read(self):
50 |
"""Return a block of audio data, blocking if necessary."""
51 |
return self.buffer_queue.get()
52 |
53 |
def destroy(self):
54 |
55 |
56 |
57 |
58 |
frame_duration_ms = property(lambda self: 1000 * self.block_size // self.sample_rate)
59 |
60 |
61 |
class VADAudio(Audio):
62 |
"""Filter & segment audio with voice activity detection."""
63 |
64 |
def __init__(self, aggressiveness=3, device=None, input_rate=None):
65 |
super().__init__(device=device, input_rate=input_rate)
66 |
self.vad = webrtcvad.Vad(aggressiveness)
67 |
68 |
def frame_generator(self):
69 |
"""Generator that yields all audio frames from microphone."""
70 |
if self.input_rate == self.RATE_PROCESS:
71 |
while True:
72 |
73 |
74 |
raise Exception("Resampling required")
75 |
76 |
def vad_collector(self, padding_ms=300, ratio=0.75, frames=None):
77 |
"""Generator that yields a series of consecutive audio frames comprising each utterance, separated by yielding a single None.
78 |
Determines voice activity by ratio of frames in padding_ms. Uses a buffer to include padding_ms prior to being triggered.
79 |
Example: (frame, ..., frame, None, frame, ..., frame, None, ...)
80 |
|---utterance---| |---utterance---|
81 |
82 |
if frames is None: frames = self.frame_generator()
83 |
num_padding_frames = padding_ms // self.frame_duration_ms
84 |
ring_buffer = collections.deque(maxlen=num_padding_frames)
85 |
triggered = False
86 |
87 |
for frame in frames:
88 |
if len(frame) < 640:
89 |
90 |
91 |
is_speech = self.vad.is_speech(frame, self.sample_rate)
92 |
93 |
if not triggered:
94 |
ring_buffer.append((frame, is_speech))
95 |
num_voiced = len([f for f, speech in ring_buffer if speech])
96 |
if num_voiced > ratio * ring_buffer.maxlen:
97 |
triggered = True
98 |
for f, s in ring_buffer:
99 |
yield f
100 |
101 |
102 |
103 |
yield frame
104 |
ring_buffer.append((frame, is_speech))
105 |
num_unvoiced = len([f for f, speech in ring_buffer if not speech])
106 |
if num_unvoiced > ratio * ring_buffer.maxlen:
107 |
triggered = False
108 |
yield None
109 |
110 |
111 |
def main(ARGS):
112 |
# Start audio with VAD
113 |
vad_audio = VADAudio(aggressiveness=ARGS.webRTC_aggressiveness,
114 |
115 |
116 |
117 |
print("Listening (ctrl-C to exit)...")
118 |
frames = vad_audio.vad_collector()
119 |
120 |
# load silero VAD
121 |
122 |
model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
123 |
124 |
force_reload= ARGS.reload)
125 |
(get_speech_ts,_,_, _,_, _, _) = utils
126 |
127 |
128 |
# Stream from microphone to silero VAD, using webRTC VAD as a first pass
129 |
spinner = None
130 |
if not ARGS.nospinner:
131 |
spinner = Halo(spinner='line')
132 |
wav_data = bytearray()
133 |
for frame in frames:
134 |
if frame is not None:
135 |
if spinner: spinner.start()
136 |
137 |
138 |
139 |
if spinner: spinner.stop()
140 |
print("webRTC has detected a possible speech")
141 |
142 |
newsound= np.frombuffer(wav_data,np.int16)
143 |
144 |
time_stamps =get_speech_ts(audio_float32, model,num_steps=ARGS.num_steps,trig_sum=ARGS.trig_sum,neg_trig_sum=ARGS.neg_trig_sum,
145 |
146 |
147 |
148 |
149 |
print("silero VAD has detected a possible speech")
150 |
151 |
print("silero VAD has detected a noise")
152 |
153 |
wav_data = bytearray()
154 |
155 |
156 |
def Int2Float(sound):
157 |
_sound = np.copy(sound)  # work on a copy so the caller's buffer is not modified
158 |
abs_max = np.abs(_sound).max()
159 |
_sound = _sound.astype('float32')
160 |
if abs_max > 0:
161 |
_sound *= 1/abs_max
162 |
audio_float32 = torch.from_numpy(_sound.squeeze())
163 |
return audio_float32
164 |
165 |
if __name__ == '__main__':
166 |
167 |
168 |
import argparse
169 |
parser = argparse.ArgumentParser(description="Stream from microphone to webRTC and silero VAD")
170 |
171 |
parser.add_argument('-v', '--webRTC_aggressiveness', type=int, default=3,
172 |
help="Set aggressiveness of webRTC: an integer between 0 and 3, 0 being the least aggressive about filtering out non-speech, 3 the most aggressive. Default: 3")
173 |
parser.add_argument('--nospinner', action='store_true',
174 |
help="Disable spinner")
175 |
parser.add_argument('-d', '--device', type=int, default=None,
176 |
help="Device input index (Int) as listed by pyaudio.PyAudio.get_device_info_by_index(). If not provided, falls back to PyAudio.get_default_device().")
177 |
178 |
parser.add_argument('-name', '--silaro_model_name', type=str, default="silero_vad",
179 |
help="Select the name of the model: 'silero_vad', 'silero_vad_micro', 'silero_vad_micro_8k', 'silero_vad_mini' or 'silero_vad_mini_8k'")
180 |
parser.add_argument('--reload', action='store_true',help="download the last version of the silero vad")
181 |
182 |
parser.add_argument('-ts', '--trig_sum', type=float, default=0.25,
183 |
help="overlapping windows are used for each audio chunk, trig sum defines average probability among those windows for switching into triggered state (speech state)")
184 |
185 |
parser.add_argument('-nts', '--neg_trig_sum', type=float, default=0.07,
186 |
help="same as trig_sum, but for switching from triggered to non-triggered state (non-speech)")
187 |
188 |
parser.add_argument('-N', '--num_steps', type=int, default=8,
189 |
help="number of overlapping windows to split audio chunk into (we recommend 4 or 8)")
190 |
191 |
parser.add_argument('-nspw', '--num_samples_per_window', type=int, default=4000,
192 |
help="number of samples in each window, our models were trained using 4000 samples (250 ms) per window, so this is preferable value (lesser values reduce quality)")
193 |
194 |
parser.add_argument('-msps', '--min_speech_samples', type=int, default=10000,
195 |
help="minimum speech chunk duration in samples")
196 |
197 |
parser.add_argument('-msis', '--min_silence_samples', type=int, default=500,
198 |
help=" minimum silence duration in samples between to separate speech chunks")
199 |
ARGS = parser.parse_args()
200 |
201 |
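The `--trig_sum` and `--neg_trig_sum` options above describe a two-threshold (hysteresis) trigger: a higher threshold to enter the speech state and a lower one to leave it. The following is a minimal sketch of that idea only, not the script's actual implementation. The function name `hysteresis_segments` and the input list of per-window average probabilities are hypothetical and chosen for illustration.

```python
from typing import Iterable, List, Tuple


def hysteresis_segments(
    probs: Iterable[float],
    trig_sum: float = 0.25,
    neg_trig_sum: float = 0.07,
) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs of triggered runs, end exclusive.

    `probs` is assumed to hold one averaged speech probability per step.
    """
    segments: List[Tuple[int, int]] = []
    triggered = False
    start = 0
    count = 0
    for i, p in enumerate(probs):
        count = i + 1
        if not triggered and p >= trig_sum:
            # Probability rose above the upper threshold: speech starts.
            triggered = True
            start = i
        elif triggered and p < neg_trig_sum:
            # Probability fell below the lower threshold: speech ends.
            triggered = False
            segments.append((start, i))
    if triggered:
        # Input ended while still in the speech state.
        segments.append((start, count))
    return segments


probs = [0.01, 0.30, 0.10, 0.08, 0.05, 0.02, 0.40, 0.20]
print(hysteresis_segments(probs))  # [(1, 4), (6, 8)]
```

Because the exit threshold is lower than the entry threshold, values such as 0.10 and 0.08 keep an already started segment alive even though they would not have started one.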
{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Install Dependencies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# !pip install -q torchaudio\n",
    "SAMPLING_RATE = 16000\n",
    "import torch\n",
    "from pprint import pprint\n",
    "NUM_PROCESS=4 # set to the number of CPU cores in the machine\n",
    "# download wav files, make multiple copies\n",
    "for idx in range(NUM_COPIES):\n",
    "    torch.hub.download_url_to_file('', f\"en_example{idx}.wav\")\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load VAD model from torch hub"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',\n",
    "                              model='silero_vad',\n",
    "                              force_reload=True,\n",
    "                              onnx=False)\n",
    "\n",
    "(get_speech_timestamps,\n",
    " save_audio,\n",
    " read_audio,\n",
    " VADIterator,\n",
    " collect_chunks) = utils"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Define a vad process function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import multiprocessing\n",
    "\n",
    "vad_models = dict()\n",
    "\n",
    "def init_model(model):\n",
    "    # runs once per worker process: load a private model copy keyed by pid\n",
    "    pid = multiprocessing.current_process().pid\n",
    "    model, _ = torch.hub.load(repo_or_dir='snakers4/silero-vad',\n",
    "                              model='silero_vad',\n",
    "                              force_reload=False,\n",
    "                              onnx=False)\n",
    "    vad_models[pid] = model\n",
    "\n",
    "def vad_process(audio_file: str):\n",
    "    \n",
    "    pid = multiprocessing.current_process().pid\n",
    "    \n",
    "    with torch.no_grad():\n",
    "        wav = read_audio(audio_file, sampling_rate=SAMPLING_RATE)\n",
    "        return get_speech_timestamps(\n",
    "            wav,\n",
    "            vad_models[pid],\n",
    "            0.46,  # speech prob threshold\n",
    "            16000,  # sample rate\n",
    "            300,  # min speech duration in ms\n",
    "            20,  # max speech duration in seconds\n",
    "            600,  # min silence duration in ms\n",
    "            512,  # window size in samples\n",
    "            200,  # speech pad in ms\n",
    "        )"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Parallelization"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from concurrent.futures import ProcessPoolExecutor, as_completed\n",
    "\n",
    "futures = []\n",
    "\n",
    "with ProcessPoolExecutor(max_workers=NUM_PROCESS, initializer=init_model, initargs=(model,)) as ex:\n",
    "    for i in range(NUM_COPIES):\n",
    "        futures.append(ex.submit(vad_process, f\"en_example{i}.wav\"))\n",
    "\n",
    "for finished in as_completed(futures):\n",
    "    pprint(finished.result())"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "diarization",
   "language": "python",
   "name": "python3"
  },
# Pyaudio Streaming Example
This example notebook shows how microphone audio fetched by pyaudio can be processed with Silero-VAD.
It has been designed as a low-level example for binary real-time streaming using only the prediction of the model, processing the binary data and plotting the speech probabilities at the end to visualize it.
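
The "processing the binary data" step can be sketched as follows. This is a minimal illustration, not the notebook's own code: it assumes the stream delivers little-endian 16-bit PCM (for example, a stream opened with `pyaudio.paInt16`), and the helper name `int16_bytes_to_float` is hypothetical. It uses only the standard library so that it runs without a microphone or extra packages.

```python
import struct
from typing import List


def int16_bytes_to_float(data: bytes) -> List[float]:
    """Decode little-endian 16-bit PCM bytes into floats in [-1.0, 1.0)."""
    n_samples = len(data) // 2  # two bytes per int16 sample
    samples = struct.unpack(f"<{n_samples}h", data[: n_samples * 2])
    # Scale by the int16 range so the values fit the float range models expect.
    return [s / 32768.0 for s in samples]


# Hypothetical 4-sample buffer standing in for one chunk read from the stream.
chunk = struct.pack("<4h", 0, 16384, -16384, -32768)
print(int16_bytes_to_float(chunk))  # [0.0, 0.5, -0.5, -1.0]
```

In practice the resulting samples would be wrapped in a tensor before being passed to the model; that step is omitted here to keep the sketch dependency-free.
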
Currently, the notebook consists of two examples:
## Example Video for the Real-Time Visualization
"cells": [
"cell_type": "markdown",
"source": [
"id": "64cbe1eb",
"## Dependencies\n",
"The cell below lists all used dependencies and the used versions. Uncomment to install them from within the notebook."
"cell_type": "code",
"metadata": {},
"#!pip install numpy==1.20.2\n",
"#!pip install torchaudio==0.9.0\n",
"cell_type": "markdown",
"source": [
"cell_type": "code",
