Spaces:
Runtime error
Delete hub
Browse files
- hub/snakers4_silero-vad_master/.github/ISSUE_TEMPLATE/bug_report.md +0 -52
- hub/snakers4_silero-vad_master/.github/ISSUE_TEMPLATE/feature_request.md +0 -27
- hub/snakers4_silero-vad_master/.github/ISSUE_TEMPLATE/questions---help---support.md +0 -12
- hub/snakers4_silero-vad_master/CODE_OF_CONDUCT.md +0 -76
- hub/snakers4_silero-vad_master/LICENSE +0 -21
- hub/snakers4_silero-vad_master/README.md +0 -113
- hub/snakers4_silero-vad_master/examples/colab_record_example.ipynb +0 -241
- hub/snakers4_silero-vad_master/examples/microphone_and_webRTC_integration/README.md +0 -28
- hub/snakers4_silero-vad_master/examples/microphone_and_webRTC_integration/microphone_and_webRTC_integration.py +0 -201
- hub/snakers4_silero-vad_master/examples/pyaudio-streaming/README.md +0 -20
- hub/snakers4_silero-vad_master/examples/pyaudio-streaming/pyaudio-streaming-examples.ipynb +0 -331
- hub/snakers4_silero-vad_master/files/lang_dict_95.json +0 -1
- hub/snakers4_silero-vad_master/files/lang_group_dict_95.json +0 -1
- hub/snakers4_silero-vad_master/files/silero_logo.jpg +0 -0
- hub/snakers4_silero-vad_master/files/silero_vad.jit +0 -3
- hub/snakers4_silero-vad_master/files/silero_vad.onnx +0 -3
- hub/snakers4_silero-vad_master/hubconf.py +0 -105
- hub/snakers4_silero-vad_master/silero-vad.ipynb +0 -445
- hub/snakers4_silero-vad_master/utils_vad.py +0 -488
- hub/trusted_list +0 -0
hub/snakers4_silero-vad_master/.github/ISSUE_TEMPLATE/bug_report.md
DELETED
@@ -1,52 +0,0 @@
---
name: Bug report
about: Create a report to help us improve
title: Bug report - [X]
labels: bug
assignees: snakers4

---

## 🐛 Bug

<!-- A clear and concise description of what the bug is. -->

## To Reproduce

Steps to reproduce the behavior:

1.
2.
3.

<!-- If you have a code sample, error messages, stack traces, please provide it here as well -->

## Expected behavior

<!-- A clear and concise description of what you expected to happen. -->

## Environment

Please copy and paste the output from this
[environment collection script](https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py)
(or fill out the checklist below manually).

You can get the script and run it with:
```
wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
```

 - PyTorch Version (e.g., 1.0):
 - OS (e.g., Linux):
 - How you installed PyTorch (`conda`, `pip`, source):
 - Build command you used (if compiling from source):
 - Python version:
 - CUDA/cuDNN version:
 - GPU models and configuration:
 - Any other relevant information:

## Additional context

<!-- Add any other context about the problem here. -->
hub/snakers4_silero-vad_master/.github/ISSUE_TEMPLATE/feature_request.md
DELETED
@@ -1,27 +0,0 @@
---
name: Feature request
about: Suggest an idea for this project
title: Feature request - [X]
labels: enhancement
assignees: snakers4

---

## 🚀 Feature
<!-- A clear and concise description of the feature proposal -->

## Motivation

<!-- Please outline the motivation for the proposal. Is your feature request related to a problem? e.g., I'm always frustrated when [...]. If this is related to another GitHub issue, please link here too -->

## Pitch

<!-- A clear and concise description of what you want to happen. -->

## Alternatives

<!-- A clear and concise description of any alternative solutions or features you've considered, if any. -->

## Additional context

<!-- Add any other context or screenshots about the feature request here. -->
hub/snakers4_silero-vad_master/.github/ISSUE_TEMPLATE/questions---help---support.md
DELETED
@@ -1,12 +0,0 @@
---
name: Questions / Help / Support
about: Ask for help, support or ask a question
title: "❓ Questions / Help / Support"
labels: help wanted
assignees: snakers4

---

## ❓ Questions and Help

We have a [wiki](https://github.com/snakers4/silero-models/wiki) available for our users. Please make sure you have checked it out first.
hub/snakers4_silero-vad_master/CODE_OF_CONDUCT.md
DELETED
@@ -1,76 +0,0 @@
# Contributor Covenant Code of Conduct

## Our Pledge

In the interest of fostering an open and welcoming environment, we as
contributors and maintainers pledge to making participation in our project and
our community a harassment-free experience for everyone, regardless of age, body
size, disability, ethnicity, sex characteristics, gender identity and expression,
level of experience, education, socio-economic status, nationality, personal
appearance, race, religion, or sexual identity and orientation.

## Our Standards

Examples of behavior that contributes to creating a positive environment
include:

* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members

Examples of unacceptable behavior by participants include:

* The use of sexualized language or imagery and unwelcome sexual attention or
  advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic
  address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a
  professional setting

## Our Responsibilities

Project maintainers are responsible for clarifying the standards of acceptable
behavior and are expected to take appropriate and fair corrective action in
response to any instances of unacceptable behavior.

Project maintainers have the right and responsibility to remove, edit, or
reject comments, commits, code, wiki edits, issues, and other contributions
that are not aligned to this Code of Conduct, or to ban temporarily or
permanently any contributor for other behaviors that they deem inappropriate,
threatening, offensive, or harmful.

## Scope

This Code of Conduct applies both within project spaces and in public spaces
when an individual is representing the project or its community. Examples of
representing a project or community include using an official project e-mail
address, posting via an official social media account, or acting as an appointed
representative at an online or offline event. Representation of a project may be
further defined and clarified by project maintainers.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by contacting the project team at aveysov@gmail.com. All
complaints will be reviewed and investigated and will result in a response that
is deemed necessary and appropriate to the circumstances. The project team is
obligated to maintain confidentiality with regard to the reporter of an incident.
Further details of specific enforcement policies may be posted separately.

Project maintainers who do not follow or enforce the Code of Conduct in good
faith may face temporary or permanent repercussions as determined by other
members of the project's leadership.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html

[homepage]: https://www.contributor-covenant.org

For answers to common questions about this code of conduct, see
https://www.contributor-covenant.org/faq
hub/snakers4_silero-vad_master/LICENSE
DELETED
@@ -1,21 +0,0 @@
MIT License

Copyright (c) 2020-present Silero Team

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
hub/snakers4_silero-vad_master/README.md
DELETED
@@ -1,113 +0,0 @@
[![Mailing list : test](http://img.shields.io/badge/Email-gray.svg?style=for-the-badge&logo=gmail)](mailto:hello@silero.ai) [![Mailing list : test](http://img.shields.io/badge/Telegram-blue.svg?style=for-the-badge&logo=telegram)](https://t.me/silero_speech) [![License: CC BY-NC 4.0](https://img.shields.io/badge/License-MIT-lightgrey.svg?style=for-the-badge)](https://github.com/snakers4/silero-vad/blob/master/LICENSE)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snakers4/silero-vad/blob/master/silero-vad.ipynb)

![header](https://user-images.githubusercontent.com/12515440/89997349-b3523080-dc94-11ea-9906-ca2e8bc50535.png)

<br/>
<h1 align="center">Silero VAD</h1>
<br/>

**Silero VAD** - pre-trained enterprise-grade [Voice Activity Detector](https://en.wikipedia.org/wiki/Voice_activity_detection) (also see our [STT models](https://github.com/snakers4/silero-models)).

This repository also includes Number Detector and Language classifier [models](https://github.com/snakers4/silero-vad/wiki/Other-Models)

<br/>

<p align="center">
<img src="https://user-images.githubusercontent.com/36505480/198026365-8da383e0-5398-4a12-b7f8-22c2c0059512.png" />
</p>

<details>
<summary>Real Time Example</summary>

https://user-images.githubusercontent.com/36505480/144874384-95f80f6d-a4f1-42cc-9be7-004c891dd481.mp4

</details>

<br/>
<h2 align="center">Key Features</h2>
<br/>

- **Stellar accuracy**

  Silero VAD has [excellent results](https://github.com/snakers4/silero-vad/wiki/Quality-Metrics#vs-other-available-solutions) on speech detection tasks.

- **Fast**

  One audio chunk (30+ ms) [takes](https://github.com/snakers4/silero-vad/wiki/Performance-Metrics#silero-vad-performance-metrics) less than **1ms** to be processed on a single CPU thread. Using batching or GPU can also improve performance considerably. Under certain conditions ONNX may even run up to 4-5x faster.

- **Lightweight**

  JIT model is around one megabyte in size.

- **General**

  Silero VAD was trained on huge corpora that include over **100** languages and it performs well on audios from different domains with various background noise and quality levels.

- **Flexible sampling rate**

  Silero VAD [supports](https://github.com/snakers4/silero-vad/wiki/Quality-Metrics#sample-rate-comparison) **8000 Hz** and **16000 Hz** [sampling rates](https://en.wikipedia.org/wiki/Sampling_(signal_processing)#Sampling_rate).

- **Flexible chunk size**

  Model was trained on **30 ms**. Longer chunks are supported directly, others may work as well.

- **Highly Portable**

  Silero VAD reaps benefits from the rich ecosystems built around **PyTorch** and **ONNX** running everywhere where these runtimes are available.

- **No Strings Attached**

  Published under permissive license (MIT) Silero VAD has zero strings attached - no telemetry, no keys, no registration, no built-in expiration, no keys or vendor lock.

<br/>
<h2 align="center">Typical Use Cases</h2>
<br/>

- Voice activity detection for IOT / edge / mobile use cases
- Data cleaning and preparation, voice detection in general
- Telephony and call-center automation, voice bots
- Voice interfaces

<br/>
<h2 align="center">Links</h2>
<br/>


- [Examples and Dependencies](https://github.com/snakers4/silero-vad/wiki/Examples-and-Dependencies#dependencies)
- [Quality Metrics](https://github.com/snakers4/silero-vad/wiki/Quality-Metrics)
- [Performance Metrics](https://github.com/snakers4/silero-vad/wiki/Performance-Metrics)
- [Number Detector and Language classifier models](https://github.com/snakers4/silero-vad/wiki/Other-Models)
- [Versions and Available Models](https://github.com/snakers4/silero-vad/wiki/Version-history-and-Available-Models)
- [Further reading](https://github.com/snakers4/silero-models#further-reading)
- [FAQ](https://github.com/snakers4/silero-vad/wiki/FAQ)

<br/>
<h2 align="center">Get In Touch</h2>
<br/>

Try our models, create an [issue](https://github.com/snakers4/silero-vad/issues/new), start a [discussion](https://github.com/snakers4/silero-vad/discussions/new), join our telegram [chat](https://t.me/silero_speech), [email](mailto:hello@silero.ai) us, read our [news](https://t.me/silero_news).

Please see our [wiki](https://github.com/snakers4/silero-models/wiki) and [tiers](https://github.com/snakers4/silero-models/wiki/Licensing-and-Tiers) for relevant information and [email](mailto:hello@silero.ai) us directly.

**Citations**

```
@misc{Silero VAD,
  author = {Silero Team},
  title = {Silero VAD: pre-trained enterprise-grade Voice Activity Detector (VAD), Number Detector and Language Classifier},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/snakers4/silero-vad}},
  commit = {insert_some_commit_here},
  email = {hello@silero.ai}
}
```

<br/>
<h2 align="center">VAD-based Community Apps</h2>
<br/>

- Voice activity detection for the [browser](https://github.com/ricky0123/vad) using ONNX Runtime Web
hub/snakers4_silero-vad_master/examples/colab_record_example.ipynb
DELETED
@@ -1,241 +0,0 @@
# %% [markdown]
# ### Dependencies and inputs

# %%
!pip -q install pydub
from google.colab import output
from base64 import b64decode, b64encode
from io import BytesIO
import numpy as np
from pydub import AudioSegment
from IPython.display import HTML, display
import torch
import matplotlib.pyplot as plt
import moviepy.editor as mpe
from matplotlib.animation import FuncAnimation, FFMpegWriter
import matplotlib
matplotlib.use('Agg')

torch.set_num_threads(1)

model, _ = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                          model='silero_vad',
                          force_reload=True)

def int2float(sound):
    abs_max = np.abs(sound).max()
    sound = sound.astype('float32')
    if abs_max > 0:
        sound *= 1/abs_max
    sound = sound.squeeze()
    return sound

AUDIO_HTML = """
<script>
var my_div = document.createElement("DIV");
var my_p = document.createElement("P");
var my_btn = document.createElement("BUTTON");
var t = document.createTextNode("Press to start recording");

my_btn.appendChild(t);
//my_p.appendChild(my_btn);
my_div.appendChild(my_btn);
document.body.appendChild(my_div);

var base64data = 0;
var reader;
var recorder, gumStream;
var recordButton = my_btn;

var handleSuccess = function(stream) {
  gumStream = stream;
  var options = {
    //bitsPerSecond: 8000, //chrome seems to ignore, always 48k
    mimeType : 'audio/webm;codecs=opus'
    //mimeType : 'audio/webm;codecs=pcm'
  };
  //recorder = new MediaRecorder(stream, options);
  recorder = new MediaRecorder(stream);
  recorder.ondataavailable = function(e) {
    var url = URL.createObjectURL(e.data);
    // var preview = document.createElement('audio');
    // preview.controls = true;
    // preview.src = url;
    // document.body.appendChild(preview);

    reader = new FileReader();
    reader.readAsDataURL(e.data);
    reader.onloadend = function() {
      base64data = reader.result;
      //console.log("Inside FileReader:" + base64data);
    }
  };
  recorder.start();
  };

recordButton.innerText = "Recording... press to stop";

navigator.mediaDevices.getUserMedia({audio: true}).then(handleSuccess);


function toggleRecording() {
  if (recorder && recorder.state == "recording") {
    recorder.stop();
    gumStream.getAudioTracks()[0].stop();
    recordButton.innerText = "Saving recording..."
  }
}

// https://stackoverflow.com/a/951057
function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

var data = new Promise(resolve=>{
//recordButton.addEventListener("click", toggleRecording);
recordButton.onclick = ()=>{
toggleRecording()

sleep(2000).then(() => {
  // wait 2000ms for the data to be available...
  // ideally this should use something like await...
  //console.log("Inside data:" + base64data)
  resolve(base64data.toString())

});

}
});

</script>
"""

def record(sec=10):
    display(HTML(AUDIO_HTML))
    s = output.eval_js("data")
    b = b64decode(s.split(',')[1])
    audio = AudioSegment.from_file(BytesIO(b))
    audio.export('test.mp3', format='mp3')
    audio = audio.set_channels(1)
    audio = audio.set_frame_rate(16000)
    audio_float = int2float(np.array(audio.get_array_of_samples()))
    audio_tens = torch.tensor(audio_float)
    return audio_tens

def make_animation(probs, audio_duration, interval=40):
    fig = plt.figure(figsize=(16, 9))
    ax = plt.axes(xlim=(0, audio_duration), ylim=(0, 1.02))
    line, = ax.plot([], [], lw=2)
    x = [i / 16000 * 512 for i in range(len(probs))]
    plt.xlabel('Time, seconds', fontsize=16)
    plt.ylabel('Speech Probability', fontsize=16)

    def init():
        plt.fill_between(x, probs, color='#064273')
        line.set_data([], [])
        line.set_color('#990000')
        return line,

    def animate(i):
        x = i * interval / 1000 - 0.04
        y = np.linspace(0, 1.02, 2)

        line.set_data(x, y)
        line.set_color('#990000')
        return line,

    anim = FuncAnimation(fig, animate, init_func=init, interval=interval, save_count=audio_duration / (interval / 1000))

    f = r"animation.mp4"
    writervideo = FFMpegWriter(fps=1000/interval)
    anim.save(f, writer=writervideo)
    plt.close('all')

def combine_audio(vidname, audname, outname, fps=25):
    my_clip = mpe.VideoFileClip(vidname, verbose=False)
    audio_background = mpe.AudioFileClip(audname)
    final_clip = my_clip.set_audio(audio_background)
    final_clip.write_videofile(outname,fps=fps,verbose=False)

def record_make_animation():
    tensor = record()

    print('Calculating probabilities...')
    speech_probs = []
    window_size_samples = 512
    for i in range(0, len(tensor), window_size_samples):
        if len(tensor[i: i+ window_size_samples]) < window_size_samples:
            break
        speech_prob = model(tensor[i: i+ window_size_samples], 16000).item()
        speech_probs.append(speech_prob)
    model.reset_states()
    print('Making animation...')
    make_animation(speech_probs, len(tensor) / 16000)

    print('Merging your voice with animation...')
    combine_audio('animation.mp4', 'test.mp3', 'merged.mp4')
    print('Done!')
    mp4 = open('merged.mp4','rb').read()
    data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
    display(HTML("""
    <video width=800 controls>
      <source src="%s" type="video/mp4">
    </video>
    """ % data_url))

# %% [markdown]
# ## Record example

# %%
record_make_animation()
hub/snakers4_silero-vad_master/examples/microphone_and_webRTC_integration/README.md
DELETED
@@ -1,28 +0,0 @@

In this example, an integration with the microphone and the webRTC VAD has been done. I used [this](https://github.com/mozilla/DeepSpeech-examples/tree/r0.8/mic_vad_streaming) as a draft.
Here a short video to present the results:

https://user-images.githubusercontent.com/28188499/116685087-182ff100-a9b2-11eb-927d-ed9f621226ee.mp4

# Requirements:
The libraries used for the following example are:
```
Python == 3.6.9
webrtcvad >= 2.0.10
torchaudio >= 0.8.1
torch >= 1.8.1
halo >= 0.0.31
Soundfile >= 0.13.3
```
Using pip3:
```
pip3 install webrtcvad
pip3 install torchaudio
pip3 install torch
pip3 install halo
pip3 install soundfile
```
Moreover, to make the code easier, the default sample_rate is 16KHz without resampling.

This example has been tested on ``` ubuntu 18.04.3 LTS```
hub/snakers4_silero-vad_master/examples/microphone_and_webRTC_integration/microphone_and_webRTC_integration.py
DELETED
@@ -1,201 +0,0 @@
import collections, queue
import numpy as np
import pyaudio
import webrtcvad
from halo import Halo
import torch
import torchaudio

class Audio(object):
    """Streams raw audio from microphone. Data is received in a separate thread, and stored in a buffer, to be read from."""

    FORMAT = pyaudio.paInt16
    # Network/VAD rate-space
    RATE_PROCESS = 16000
    CHANNELS = 1
    BLOCKS_PER_SECOND = 50

    def __init__(self, callback=None, device=None, input_rate=RATE_PROCESS):
        def proxy_callback(in_data, frame_count, time_info, status):
            #pylint: disable=unused-argument
            callback(in_data)
            return (None, pyaudio.paContinue)
        if callback is None: callback = lambda in_data: self.buffer_queue.put(in_data)
        self.buffer_queue = queue.Queue()
        self.device = device
        self.input_rate = input_rate
        self.sample_rate = self.RATE_PROCESS
        self.block_size = int(self.RATE_PROCESS / float(self.BLOCKS_PER_SECOND))
        self.block_size_input = int(self.input_rate / float(self.BLOCKS_PER_SECOND))
        self.pa = pyaudio.PyAudio()

        kwargs = {
            'format': self.FORMAT,
            'channels': self.CHANNELS,
            'rate': self.input_rate,
            'input': True,
            'frames_per_buffer': self.block_size_input,
            'stream_callback': proxy_callback,
        }

        self.chunk = None
        # if not default device
        if self.device:
            kwargs['input_device_index'] = self.device

        self.stream = self.pa.open(**kwargs)
        self.stream.start_stream()

    def read(self):
        """Return a block of audio data, blocking if necessary."""
        return self.buffer_queue.get()

    def destroy(self):
        self.stream.stop_stream()
        self.stream.close()
        self.pa.terminate()

    frame_duration_ms = property(lambda self: 1000 * self.block_size // self.sample_rate)


class VADAudio(Audio):
    """Filter & segment audio with voice activity detection."""

    def __init__(self, aggressiveness=3, device=None, input_rate=None):
        super().__init__(device=device, input_rate=input_rate)
        self.vad = webrtcvad.Vad(aggressiveness)

    def frame_generator(self):
        """Generator that yields all audio frames from microphone."""
        if self.input_rate == self.RATE_PROCESS:
            while True:
                yield self.read()
        else:
            raise Exception("Resampling required")

    def vad_collector(self, padding_ms=300, ratio=0.75, frames=None):
        """Generator that yields series of consecutive audio frames comprising each utterence, separated by yielding a single None.
        Determines voice activity by ratio of frames in padding_ms. Uses a buffer to include padding_ms prior to being triggered.
        Example: (frame, ..., frame, None, frame, ..., frame, None, ...)
                  |---utterence---|        |---utterence---|
        """
        if frames is None: frames = self.frame_generator()
        num_padding_frames = padding_ms // self.frame_duration_ms
        ring_buffer = collections.deque(maxlen=num_padding_frames)
        triggered = False

        for frame in frames:
            if len(frame) < 640:
                return

            is_speech = self.vad.is_speech(frame, self.sample_rate)

            if not triggered:
                ring_buffer.append((frame, is_speech))
                num_voiced = len([f for f, speech in ring_buffer if speech])
                if num_voiced > ratio * ring_buffer.maxlen:
                    triggered = True
                    for f, s in ring_buffer:
                        yield f
                    ring_buffer.clear()

            else:
                yield frame
                ring_buffer.append((frame, is_speech))
                num_unvoiced = len([f for f, speech in ring_buffer if not speech])
                if num_unvoiced > ratio * ring_buffer.maxlen:
                    triggered = False
                    yield None
                    ring_buffer.clear()

def main(ARGS):
    # Start audio with VAD
    vad_audio = VADAudio(aggressiveness=ARGS.webRTC_aggressiveness,
                         device=ARGS.device,
                         input_rate=ARGS.rate)

    print("Listening (ctrl-C to exit)...")
    frames = vad_audio.vad_collector()

    # load silero VAD
    torchaudio.set_audio_backend("soundfile")
    model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                                  model=ARGS.silaro_model_name,
                                  force_reload= ARGS.reload)
    (get_speech_ts,_,_, _,_, _, _) = utils


    # Stream from microphone to DeepSpeech using VAD
    spinner = None
    if not ARGS.nospinner:
        spinner = Halo(spinner='line')
    wav_data = bytearray()
    for frame in frames:
        if frame is not None:
            if spinner: spinner.start()

            wav_data.extend(frame)
        else:
            if spinner: spinner.stop()
            print("webRTC has detected a possible speech")

            newsound= np.frombuffer(wav_data,np.int16)
            audio_float32=Int2Float(newsound)
            time_stamps =get_speech_ts(audio_float32, model,num_steps=ARGS.num_steps,trig_sum=ARGS.trig_sum,neg_trig_sum=ARGS.neg_trig_sum,
                                       num_samples_per_window=ARGS.num_samples_per_window,min_speech_samples=ARGS.min_speech_samples,
                                       min_silence_samples=ARGS.min_silence_samples)

            if(len(time_stamps)>0):
                print("silero VAD has detected a possible speech")
            else:
                print("silero VAD has detected a noise")
            print()
            wav_data = bytearray()


def Int2Float(sound):
    _sound = np.copy(sound)  #
    abs_max = np.abs(_sound).max()
    _sound = _sound.astype('float32')
    if abs_max > 0:
        _sound *= 1/abs_max
    audio_float32 = torch.from_numpy(_sound.squeeze())
    return audio_float32

if __name__ == '__main__':
    DEFAULT_SAMPLE_RATE = 16000

    import argparse
    parser = argparse.ArgumentParser(description="Stream from microphone to webRTC and silero VAD")

    parser.add_argument('-v', '--webRTC_aggressiveness', type=int, default=3,
                        help="Set aggressiveness of webRTC: an integer between 0 and 3, 0 being the least aggressive about filtering out non-speech, 3 the most aggressive. Default: 3")
    parser.add_argument('--nospinner', action='store_true',
                        help="Disable spinner")
    parser.add_argument('-d', '--device', type=int, default=None,
                        help="Device input index (Int) as listed by pyaudio.PyAudio.get_device_info_by_index(). If not provided, falls back to PyAudio.get_default_device().")

    parser.add_argument('-name', '--silaro_model_name', type=str, default="silero_vad",
                        help="select the name of the model. You can select between 'silero_vad',''silero_vad_micro','silero_vad_micro_8k','silero_vad_mini','silero_vad_mini_8k'")
    parser.add_argument('--reload', action='store_true', help="download the last version of the silero vad")

    parser.add_argument('-ts', '--trig_sum', type=float, default=0.25,
                        help="overlapping windows are used for each audio chunk, trig sum defines average probability among those windows for switching into triggered state (speech state)")

    parser.add_argument('-nts', '--neg_trig_sum', type=float, default=0.07,
                        help="same as trig_sum, but for switching from triggered to non-triggered state (non-speech)")

    parser.add_argument('-N', '--num_steps', type=int, default=8,
                        help="nubmer of overlapping windows to split audio chunk into (we recommend 4 or 8)")

    parser.add_argument('-nspw', '--num_samples_per_window', type=int, default=4000,
                        help="number of samples in each window, our models were trained using 4000 samples (250 ms) per window, so this is preferable value (lesser values reduce quality)")

    parser.add_argument('-msps', '--min_speech_samples', type=int, default=10000,
                        help="minimum speech chunk duration in samples")

    parser.add_argument('-msis', '--min_silence_samples', type=int, default=500,
                        help=" minimum silence duration in samples between to separate speech chunks")
    ARGS = parser.parse_args()
    ARGS.rate=DEFAULT_SAMPLE_RATE
    main(ARGS)
hub/snakers4_silero-vad_master/examples/pyaudio-streaming/README.md
DELETED
@@ -1,20 +0,0 @@
# Pyaudio Streaming Example

This example notebook shows how micophone audio fetched by pyaudio can be processed with Silero-VAD.

It has been designed as a low-level example for binary real-time streaming using only the prediction of the model, processing the binary data and plotting the speech probabilities at the end to visualize it.

Currently, the notebook consits of two examples:
- One that records audio of a predefined length from the microphone, process it with Silero-VAD, and plots it afterwards.
- The other one plots the speech probabilities in real-time (using jupyterplot) and records the audio until you press enter.

## Example Video for the Real-Time Visualization


https://user-images.githubusercontent.com/8079748/117580455-4622dd00-b0f8-11eb-858d-e6368ed4eada.mp4
hub/snakers4_silero-vad_master/examples/pyaudio-streaming/pyaudio-streaming-examples.ipynb
DELETED
@@ -1,331 +0,0 @@
# %% [markdown]
# # Pyaudio Microphone Streaming Examples
#
# A simple notebook that uses pyaudio to get the microphone audio and feeds this audio then to Silero VAD.
#
# I created it as an example on how binary data from a stream could be feed into Silero VAD.
#
# Has been tested on Ubuntu 21.04 (x86). After you installed the dependencies below, no additional setup is required.

# %% [markdown]
# ## Dependencies
# The cell below lists all used dependencies and the used versions. Uncomment to install them from within the notebook.

# %%
#!pip install numpy==1.20.2
#!pip install torch==1.9.0
#!pip install matplotlib==3.4.2
#!pip install torchaudio==0.9.0
#!pip install soundfile==0.10.3.post1
#!pip install pyaudio==0.2.11

# %% [markdown]
# ## Imports

# %%
import io
import numpy as np
import torch
torch.set_num_threads(1)
import torchaudio
import matplotlib
import matplotlib.pylab as plt
torchaudio.set_audio_backend("soundfile")
import pyaudio

# %%
model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                              model='silero_vad',
                              force_reload=True)

# %%
(get_speech_timestamps,
 save_audio,
 read_audio,
 VADIterator,
 collect_chunks) = utils

# %% [markdown]
# ### Helper Methods

# %%
# Taken from utils_vad.py
def validate(model,
             inputs: torch.Tensor):
    with torch.no_grad():
        outs = model(inputs)
    return outs

# Provided by Alexander Veysov
def int2float(sound):
    abs_max = np.abs(sound).max()
    sound = sound.astype('float32')
    if abs_max > 0:
        sound *= 1/abs_max
    sound = sound.squeeze()  # depends on the use case
    return sound

# %% [markdown]
# ## Pyaudio Set-up

# %%
FORMAT = pyaudio.paInt16
CHANNELS = 1
SAMPLE_RATE = 16000
CHUNK = int(SAMPLE_RATE / 10)

audio = pyaudio.PyAudio()

# %% [markdown]
# ## Simple Example
# The following example reads the audio as 250ms chunks from the microphone, converts them to a Pytorch Tensor, and gets the probabilities/confidences if the model thinks the frame is voiced.

# %%
num_samples = 1536

# %%
stream = audio.open(format=FORMAT,
                    channels=CHANNELS,
                    rate=SAMPLE_RATE,
                    input=True,
                    frames_per_buffer=CHUNK)
data = []
voiced_confidences = []

print("Started Recording")
for i in range(0, frames_to_record):

    audio_chunk = stream.read(num_samples)

    # in case you want to save the audio later
    data.append(audio_chunk)

    audio_int16 = np.frombuffer(audio_chunk, np.int16);

    audio_float32 = int2float(audio_int16)

    # get the confidences and add them to the list to plot them later
    new_confidence = model(torch.from_numpy(audio_float32), 16000).item()
    voiced_confidences.append(new_confidence)

print("Stopped the recording")

# plot the confidences for the speech
plt.figure(figsize=(20,6))
plt.plot(voiced_confidences)
plt.show()

# %% [markdown]
# ## Real Time Visualization
#
# As an enhancement to plot the speech probabilities in real time I added the implementation below.
# In contrast to the simeple one, it records the audio until to stop the recording by pressing enter.
# While looking into good ways to update matplotlib plots in real-time, I found a simple libarary that does the job. https://github.com/lvwerra/jupyterplot It has some limitations, but works for this use case really well.

# %%
#!pip install jupyterplot==0.0.3

# %%
from jupyterplot import ProgressPlot
import threading

continue_recording = True

def stop():
    input("Press Enter to stop the recording:")
    global continue_recording
    continue_recording = False

def start_recording():

    stream = audio.open(format=FORMAT,
                        channels=CHANNELS,
                        rate=SAMPLE_RATE,
                        input=True,
                        frames_per_buffer=CHUNK)

    data = []
    voiced_confidences = []

    global continue_recording
    continue_recording = True

    pp = ProgressPlot(plot_names=["Silero VAD"], line_names=["speech probabilities"], x_label="audio chunks")

    stop_listener = threading.Thread(target=stop)
    stop_listener.start()

    while continue_recording:

        audio_chunk = stream.read(num_samples)

        # in case you want to save the audio later
        data.append(audio_chunk)

        audio_int16 = np.frombuffer(audio_chunk, np.int16);

        audio_float32 = int2float(audio_int16)

        # get the confidences and add them to the list to plot them later
        new_confidence = model(torch.from_numpy(audio_float32), 16000).item()
        voiced_confidences.append(new_confidence)

        pp.update(new_confidence)


    pp.finalize()

# %%
start_recording()
hub/snakers4_silero-vad_master/files/lang_dict_95.json
DELETED
@@ -1 +0,0 @@
{"59": "mg, Malagasy", "76": "tk, Turkmen", "20": "lb, Luxembourgish, Letzeburgesch", "62": "or, Oriya", "30": "en, English", "26": "oc, Occitan", "69": "no, Norwegian", "77": "sr, Serbian", "90": "bs, Bosnian", "71": "el, Greek, Modern (1453\u2013)", "15": "az, Azerbaijani", "12": "lo, Lao", "85": "zh-HK, Chinese", "79": "cs, Czech", "43": "sv, Swedish", "37": "mn, Mongolian", "32": "fi, Finnish", "51": "tg, Tajik", "46": "am, Amharic", "17": "nn, Norwegian Nynorsk", "40": "ja, Japanese", "8": "it, Italian", "21": "ha, Hausa", "11": "as, Assamese", "29": "fa, Persian", "82": "bn, Bengali", "54": "mk, Macedonian", "31": "sw, Swahili", "45": "vi, Vietnamese", "41": "ur, Urdu", "74": "bo, Tibetan", "4": "hi, Hindi", "86": "mr, Marathi", "3": "fy-NL, Western Frisian", "65": "sk, Slovak", "2": "ln, Lingala", "92": "gl, Galician", "53": "sn, Shona", "87": "su, Sundanese", "35": "tt, Tatar", "93": "kn, Kannada", "6": "yo, Yoruba", "27": "ps, Pashto, Pushto", "34": "hy, Armenian", "25": "pa-IN, Punjabi, Panjabi", "23": "nl, Dutch, Flemish", "48": "th, Thai", "73": "mt, Maltese", "55": "ar, Arabic", "89": "ba, Bashkir", "78": "bg, Bulgarian", "42": "yi, Yiddish", "5": "ru, Russian", "84": "sv-SE, Swedish", "80": "tr, Turkish", "33": "sq, Albanian", "38": "kk, Kazakh", "50": "pl, Polish", "9": "hr, Croatian", "66": "ky, Kirghiz, Kyrgyz", "49": "hu, Hungarian", "10": "si, Sinhala, Sinhalese", "56": "la, Latin", "75": "de, German", "14": "ko, Korean", "22": "id, Indonesian", "47": "sl, Slovenian", "57": "be, Belarusian", "36": "ta, Tamil", "7": "da, Danish", "91": "sd, Sindhi", "28": "et, Estonian", "63": "pt, Portuguese", "60": "ne, Nepali", "94": "zh-TW, Chinese", "18": "zh-CN, Chinese", "88": "rw, Kinyarwanda", "19": "es, Spanish, Castilian", "39": "ht, Haitian, Haitian Creole", "64": "tl, Tagalog", "83": "ms, Malay", "70": "ro, Romanian, Moldavian, Moldovan", "68": "pa, Punjabi, Panjabi", "52": "uz, Uzbek", "58": "km, Central Khmer", "67": "my, Burmese", "0": "fr, French", "24": "af, Afrikaans", "16": "gu, Gujarati", "81": "so, Somali", "13": "uk, Ukrainian", "44": "ca, Catalan, Valencian", "72": "ml, Malayalam", "61": "te, Telugu", "1": "zh, Chinese"}
hub/snakers4_silero-vad_master/files/lang_group_dict_95.json
DELETED
@@ -1 +0,0 @@
{"0": ["Afrikaans", "Dutch, Flemish", "Western Frisian"], "1": ["Turkish", "Azerbaijani"], "2": ["Russian", "Slovak", "Ukrainian", "Czech", "Polish", "Belarusian"], "3": ["Bulgarian", "Macedonian", "Serbian", "Croatian", "Bosnian", "Slovenian"], "4": ["Norwegian Nynorsk", "Swedish", "Danish", "Norwegian"], "5": ["English"], "6": ["Finnish", "Estonian"], "7": ["Yiddish", "Luxembourgish, Letzeburgesch", "German"], "8": ["Spanish", "Occitan", "Portuguese", "Catalan, Valencian", "Galician", "Spanish, Castilian", "Italian"], "9": ["Maltese", "Arabic"], "10": ["Marathi"], "11": ["Hindi", "Urdu"], "12": ["Lao", "Thai"], "13": ["Malay", "Indonesian"], "14": ["Romanian, Moldavian, Moldovan"], "15": ["Tagalog"], "16": ["Tajik", "Persian"], "17": ["Kazakh", "Uzbek", "Kirghiz, Kyrgyz"], "18": ["Kinyarwanda"], "19": ["Tatar", "Bashkir"], "20": ["French"], "21": ["Chinese"], "22": ["Lingala"], "23": ["Yoruba"], "24": ["Sinhala, Sinhalese"], "25": ["Assamese"], "26": ["Korean"], "27": ["Gujarati"], "28": ["Hausa"], "29": ["Punjabi, Panjabi"], "30": ["Pashto, Pushto"], "31": ["Swahili"], "32": ["Albanian"], "33": ["Armenian"], "34": ["Mongolian"], "35": ["Tamil"], "36": ["Haitian, Haitian Creole"], "37": ["Japanese"], "38": ["Vietnamese"], "39": ["Amharic"], "40": ["Hungarian"], "41": ["Shona"], "42": ["Latin"], "43": ["Central Khmer"], "44": ["Malagasy"], "45": ["Nepali"], "46": ["Telugu"], "47": ["Oriya"], "48": ["Burmese"], "49": ["Greek, Modern (1453\u2013)"], "50": ["Malayalam"], "51": ["Tibetan"], "52": ["Turkmen"], "53": ["Somali"], "54": ["Bengali"], "55": ["Sundanese"], "56": ["Sindhi"], "57": ["Kannada"]}
hub/snakers4_silero-vad_master/files/silero_logo.jpg
DELETED
Binary file (23.9 kB)
hub/snakers4_silero-vad_master/files/silero_vad.jit
DELETED
@@ -1,3 +0,0 @@
version https://git-lfs.github.com/spec/v1
oid sha256:082e21870cf7722b0c7fa5228eaed579efb6870df81192b79bed3f7bac2f738a
size 1439299
hub/snakers4_silero-vad_master/files/silero_vad.onnx
DELETED
@@ -1,3 +0,0 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a35ebf52fd3ce5f1469b2a36158dba761bc47b973ea3382b3186ca15b1f5af28
size 1807522
hub/snakers4_silero-vad_master/hubconf.py
DELETED
@@ -1,105 +0,0 @@
dependencies = ['torch', 'torchaudio']
import torch
import os
import json
from utils_vad import (init_jit_model,
                       get_speech_timestamps,
                       get_number_ts,
                       get_language,
                       get_language_and_group,
                       save_audio,
                       read_audio,
                       VADIterator,
                       collect_chunks,
                       drop_chunks,
                       Validator,
                       OnnxWrapper)


def versiontuple(v):
    return tuple(map(int, (v.split('+')[0].split("."))))


def silero_vad(onnx=False, force_onnx_cpu=False):
    """Silero Voice Activity Detector
    Returns a model with a set of utils
    Please see https://github.com/snakers4/silero-vad for usage examples
    """

    if not onnx:
        installed_version = torch.__version__
        supported_version = '1.12.0'
        if versiontuple(installed_version) < versiontuple(supported_version):
            raise Exception(f'Please install torch {supported_version} or greater ({installed_version} installed)')

    model_dir = os.path.join(os.path.dirname(__file__), 'files')
    if onnx:
        model = OnnxWrapper(os.path.join(model_dir, 'silero_vad.onnx'))
    else:
        model = init_jit_model(os.path.join(model_dir, 'silero_vad.jit'))
    utils = (get_speech_timestamps,
             save_audio,
             read_audio,
             VADIterator,
             collect_chunks)

    return model, utils


def silero_number_detector(onnx=False, force_onnx_cpu=False):
    """Silero Number Detector
    Returns a model with a set of utils
    Please see https://github.com/snakers4/silero-vad for usage examples
    """
    if onnx:
        url = 'https://models.silero.ai/vad_models/number_detector.onnx'
    else:
        url = 'https://models.silero.ai/vad_models/number_detector.jit'
    model = Validator(url, force_onnx_cpu)
    utils = (get_number_ts,
             save_audio,
             read_audio,
             collect_chunks,
             drop_chunks)

    return model, utils


def silero_lang_detector(onnx=False, force_onnx_cpu=False):
    """Silero Language Classifier
    Returns a model with a set of utils
    Please see https://github.com/snakers4/silero-vad for usage examples
    """
    if onnx:
        url = 'https://models.silero.ai/vad_models/number_detector.onnx'
    else:
        url = 'https://models.silero.ai/vad_models/number_detector.jit'
    model = Validator(url, force_onnx_cpu)
    utils = (get_language,
             read_audio)

    return model, utils


def silero_lang_detector_95(onnx=False, force_onnx_cpu=False):
    """Silero Language Classifier (95 languages)
    Returns a model with a set of utils
    Please see https://github.com/snakers4/silero-vad for usage examples
    """

    if onnx:
        url = 'https://models.silero.ai/vad_models/lang_classifier_95.onnx'
    else:
        url = 'https://models.silero.ai/vad_models/lang_classifier_95.jit'
    model = Validator(url, force_onnx_cpu)

    model_dir = os.path.join(os.path.dirname(__file__), 'files')
    with open(os.path.join(model_dir, 'lang_dict_95.json'), 'r') as f:
        lang_dict = json.load(f)

    with open(os.path.join(model_dir, 'lang_group_dict_95.json'), 'r') as f:
        lang_group_dict = json.load(f)

    utils = (get_language_and_group, read_audio)

    return model, lang_dict, lang_group_dict, utils
|
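hubconf.py above is the torch.hub entrypoint file: silero_vad() checks the installed torch version, loads either the JIT or the ONNX model from files/, and returns the model plus a tuple of helpers. A minimal usage sketch against the upstream snakers4/silero-vad repo, assuming a local speech.wav file:

import torch

# Load the VAD model and helpers through the hub entrypoint defined above.
model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                              model='silero_vad',
                              onnx=False)
(get_speech_timestamps, save_audio, read_audio,
 VADIterator, collect_chunks) = utils

wav = read_audio('speech.wav', sampling_rate=16000)  # mono tensor, resampled to 16 kHz
timestamps = get_speech_timestamps(wav, model, sampling_rate=16000)
print(timestamps)  # e.g. [{'start': 4640, 'end': 33248}, ...] in samples

# Keep only the detected speech and write it back out.
save_audio('speech_only.wav', collect_chunks(timestamps, wav), sampling_rate=16000)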
hub/snakers4_silero-vad_master/silero-vad.ipynb
DELETED
@@ -1,445 +0,0 @@
|
|
1 |
-
{
|
2 |
-
"cells": [
|
3 |
-
{
|
4 |
-
"cell_type": "markdown",
|
5 |
-
"metadata": {
|
6 |
-
"id": "FpMplOCA2Fwp"
|
7 |
-
},
|
8 |
-
"source": [
|
9 |
-
"#VAD"
|
10 |
-
]
|
11 |
-
},
|
12 |
-
{
|
13 |
-
"cell_type": "markdown",
|
14 |
-
"metadata": {
|
15 |
-
"heading_collapsed": true,
|
16 |
-
"id": "62A6F_072Fwq"
|
17 |
-
},
|
18 |
-
"source": [
|
19 |
-
"## Install Dependencies"
|
20 |
-
]
|
21 |
-
},
|
22 |
-
{
|
23 |
-
"cell_type": "code",
|
24 |
-
"execution_count": null,
|
25 |
-
"metadata": {
|
26 |
-
"hidden": true,
|
27 |
-
"id": "5w5AkskZ2Fwr"
|
28 |
-
},
|
29 |
-
"outputs": [],
|
30 |
-
"source": [
|
31 |
-
"#@title Install and Import Dependencies\n",
|
32 |
-
"\n",
|
33 |
-
"# this assumes that you have a relevant version of PyTorch installed\n",
|
34 |
-
"!pip install -q torchaudio\n",
|
35 |
-
"\n",
|
36 |
-
"SAMPLING_RATE = 16000\n",
|
37 |
-
"\n",
|
38 |
-
"import torch\n",
|
39 |
-
"torch.set_num_threads(1)\n",
|
40 |
-
"\n",
|
41 |
-
"from IPython.display import Audio\n",
|
42 |
-
"from pprint import pprint\n",
|
43 |
-
"# download example\n",
|
44 |
-
"torch.hub.download_url_to_file('https://models.silero.ai/vad_models/en.wav', 'en_example.wav')"
|
45 |
-
]
|
46 |
-
},
|
47 |
-
{
|
48 |
-
"cell_type": "code",
|
49 |
-
"execution_count": null,
|
50 |
-
"metadata": {
|
51 |
-
"id": "pSifus5IilRp"
|
52 |
-
},
|
53 |
-
"outputs": [],
|
54 |
-
"source": [
|
55 |
-
"USE_ONNX = False # change this to True if you want to test onnx model\n",
|
56 |
-
"if USE_ONNX:\n",
|
57 |
-
" !pip install -q onnxruntime\n",
|
58 |
-
" \n",
|
59 |
-
"model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',\n",
|
60 |
-
" model='silero_vad',\n",
|
61 |
-
" force_reload=True,\n",
|
62 |
-
" onnx=USE_ONNX)\n",
|
63 |
-
"\n",
|
64 |
-
"(get_speech_timestamps,\n",
|
65 |
-
" save_audio,\n",
|
66 |
-
" read_audio,\n",
|
67 |
-
" VADIterator,\n",
|
68 |
-
" collect_chunks) = utils"
|
69 |
-
]
|
70 |
-
},
|
71 |
-
{
|
72 |
-
"cell_type": "markdown",
|
73 |
-
"metadata": {
|
74 |
-
"id": "fXbbaUO3jsrw"
|
75 |
-
},
|
76 |
-
"source": [
|
77 |
-
"## Full Audio"
|
78 |
-
]
|
79 |
-
},
|
80 |
-
{
|
81 |
-
"cell_type": "markdown",
|
82 |
-
"metadata": {
|
83 |
-
"id": "RAfJPb_a-Auj"
|
84 |
-
},
|
85 |
-
"source": [
|
86 |
-
"**Speech timestapms from full audio**"
|
87 |
-
]
|
88 |
-
},
|
89 |
-
{
|
90 |
-
"cell_type": "code",
|
91 |
-
"execution_count": null,
|
92 |
-
"metadata": {
|
93 |
-
"id": "aI_eydBPjsrx"
|
94 |
-
},
|
95 |
-
"outputs": [],
|
96 |
-
"source": [
|
97 |
-
"wav = read_audio('en_example.wav', sampling_rate=SAMPLING_RATE)\n",
|
98 |
-
"# get speech timestamps from full audio file\n",
|
99 |
-
"speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=SAMPLING_RATE)\n",
|
100 |
-
"pprint(speech_timestamps)"
|
101 |
-
]
|
102 |
-
},
|
103 |
-
{
|
104 |
-
"cell_type": "code",
|
105 |
-
"execution_count": null,
|
106 |
-
"metadata": {
|
107 |
-
"id": "OuEobLchjsry"
|
108 |
-
},
|
109 |
-
"outputs": [],
|
110 |
-
"source": [
|
111 |
-
"# merge all speech chunks to one audio\n",
|
112 |
-
"save_audio('only_speech.wav',\n",
|
113 |
-
" collect_chunks(speech_timestamps, wav), sampling_rate=SAMPLING_RATE) \n",
|
114 |
-
"Audio('only_speech.wav')"
|
115 |
-
]
|
116 |
-
},
|
117 |
-
{
|
118 |
-
"cell_type": "markdown",
|
119 |
-
"metadata": {
|
120 |
-
"id": "iDKQbVr8jsry"
|
121 |
-
},
|
122 |
-
"source": [
|
123 |
-
"## Stream imitation example"
|
124 |
-
]
|
125 |
-
},
|
126 |
-
{
|
127 |
-
"cell_type": "code",
|
128 |
-
"execution_count": null,
|
129 |
-
"metadata": {
|
130 |
-
"id": "q-lql_2Wjsry"
|
131 |
-
},
|
132 |
-
"outputs": [],
|
133 |
-
"source": [
|
134 |
-
"## using VADIterator class\n",
|
135 |
-
"\n",
|
136 |
-
"vad_iterator = VADIterator(model)\n",
|
137 |
-
"wav = read_audio(f'en_example.wav', sampling_rate=SAMPLING_RATE)\n",
|
138 |
-
"\n",
|
139 |
-
"window_size_samples = 1536 # number of samples in a single audio chunk\n",
|
140 |
-
"for i in range(0, len(wav), window_size_samples):\n",
|
141 |
-
" chunk = wav[i: i+ window_size_samples]\n",
|
142 |
-
" if len(chunk) < window_size_samples:\n",
|
143 |
-
" break\n",
|
144 |
-
" speech_dict = vad_iterator(chunk, return_seconds=True)\n",
|
145 |
-
" if speech_dict:\n",
|
146 |
-
" print(speech_dict, end=' ')\n",
|
147 |
-
"vad_iterator.reset_states() # reset model states after each audio"
|
148 |
-
]
|
149 |
-
},
|
150 |
-
{
|
151 |
-
"cell_type": "code",
|
152 |
-
"execution_count": null,
|
153 |
-
"metadata": {
|
154 |
-
"id": "BX3UgwwB2Fwv"
|
155 |
-
},
|
156 |
-
"outputs": [],
|
157 |
-
"source": [
|
158 |
-
"## just probabilities\n",
|
159 |
-
"\n",
|
160 |
-
"wav = read_audio('en_example.wav', sampling_rate=SAMPLING_RATE)\n",
|
161 |
-
"speech_probs = []\n",
|
162 |
-
"window_size_samples = 1536\n",
|
163 |
-
"for i in range(0, len(wav), window_size_samples):\n",
|
164 |
-
" chunk = wav[i: i+ window_size_samples]\n",
|
165 |
-
" if len(chunk) < window_size_samples:\n",
|
166 |
-
" break\n",
|
167 |
-
" speech_prob = model(chunk, SAMPLING_RATE).item()\n",
|
168 |
-
" speech_probs.append(speech_prob)\n",
|
169 |
-
"vad_iterator.reset_states() # reset model states after each audio\n",
|
170 |
-
"\n",
|
171 |
-
"print(speech_probs[:10]) # first 10 chunks predicts"
|
172 |
-
]
|
173 |
-
},
|
174 |
-
{
|
175 |
-
"cell_type": "markdown",
|
176 |
-
"metadata": {
|
177 |
-
"heading_collapsed": true,
|
178 |
-
"id": "36jY0niD2Fww"
|
179 |
-
},
|
180 |
-
"source": [
|
181 |
-
"# Number detector"
|
182 |
-
]
|
183 |
-
},
|
184 |
-
{
|
185 |
-
"cell_type": "markdown",
|
186 |
-
"metadata": {
|
187 |
-
"heading_collapsed": true,
|
188 |
-
"hidden": true,
|
189 |
-
"id": "scd1DlS42Fwx"
|
190 |
-
},
|
191 |
-
"source": [
|
192 |
-
"## Install Dependencies"
|
193 |
-
]
|
194 |
-
},
|
195 |
-
{
|
196 |
-
"cell_type": "code",
|
197 |
-
"execution_count": null,
|
198 |
-
"metadata": {
|
199 |
-
"hidden": true,
|
200 |
-
"id": "Kq5gQuYq2Fwx"
|
201 |
-
},
|
202 |
-
"outputs": [],
|
203 |
-
"source": [
|
204 |
-
"#@title Install and Import Dependencies\n",
|
205 |
-
"\n",
|
206 |
-
"# this assumes that you have a relevant version of PyTorch installed\n",
|
207 |
-
"!pip install -q torchaudio\n",
|
208 |
-
"\n",
|
209 |
-
"SAMPLING_RATE = 16000\n",
|
210 |
-
"\n",
|
211 |
-
"import torch\n",
|
212 |
-
"torch.set_num_threads(1)\n",
|
213 |
-
"\n",
|
214 |
-
"from IPython.display import Audio\n",
|
215 |
-
"from pprint import pprint\n",
|
216 |
-
"# download example\n",
|
217 |
-
"torch.hub.download_url_to_file('https://models.silero.ai/vad_models/en_num.wav', 'en_number_example.wav')"
|
218 |
-
]
|
219 |
-
},
|
220 |
-
{
|
221 |
-
"cell_type": "code",
|
222 |
-
"execution_count": null,
|
223 |
-
"metadata": {
|
224 |
-
"id": "dPwCFHmFycUF"
|
225 |
-
},
|
226 |
-
"outputs": [],
|
227 |
-
"source": [
|
228 |
-
"USE_ONNX = False # change this to True if you want to test onnx model\n",
|
229 |
-
"if USE_ONNX:\n",
|
230 |
-
" !pip install -q onnxruntime\n",
|
231 |
-
" \n",
|
232 |
-
"model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',\n",
|
233 |
-
" model='silero_number_detector',\n",
|
234 |
-
" force_reload=True,\n",
|
235 |
-
" onnx=USE_ONNX)\n",
|
236 |
-
"\n",
|
237 |
-
"(get_number_ts,\n",
|
238 |
-
" save_audio,\n",
|
239 |
-
" read_audio,\n",
|
240 |
-
" collect_chunks,\n",
|
241 |
-
" drop_chunks) = utils\n"
|
242 |
-
]
|
243 |
-
},
|
244 |
-
{
|
245 |
-
"cell_type": "markdown",
|
246 |
-
"metadata": {
|
247 |
-
"heading_collapsed": true,
|
248 |
-
"hidden": true,
|
249 |
-
"id": "qhPa30ij2Fwy"
|
250 |
-
},
|
251 |
-
"source": [
|
252 |
-
"## Full audio"
|
253 |
-
]
|
254 |
-
},
|
255 |
-
{
|
256 |
-
"cell_type": "code",
|
257 |
-
"execution_count": null,
|
258 |
-
"metadata": {
|
259 |
-
"hidden": true,
|
260 |
-
"id": "EXpau6xq2Fwy"
|
261 |
-
},
|
262 |
-
"outputs": [],
|
263 |
-
"source": [
|
264 |
-
"wav = read_audio('en_number_example.wav', sampling_rate=SAMPLING_RATE)\n",
|
265 |
-
"# get number timestamps from full audio file\n",
|
266 |
-
"number_timestamps = get_number_ts(wav, model)\n",
|
267 |
-
"pprint(number_timestamps)"
|
268 |
-
]
|
269 |
-
},
|
270 |
-
{
|
271 |
-
"cell_type": "code",
|
272 |
-
"execution_count": null,
|
273 |
-
"metadata": {
|
274 |
-
"hidden": true,
|
275 |
-
"id": "u-KfXRhZ2Fwy"
|
276 |
-
},
|
277 |
-
"outputs": [],
|
278 |
-
"source": [
|
279 |
-
"# convert ms in timestamps to samples\n",
|
280 |
-
"for timestamp in number_timestamps:\n",
|
281 |
-
" timestamp['start'] = int(timestamp['start'] * SAMPLING_RATE / 1000)\n",
|
282 |
-
" timestamp['end'] = int(timestamp['end'] * SAMPLING_RATE / 1000)"
|
283 |
-
]
|
284 |
-
},
|
285 |
-
{
|
286 |
-
"cell_type": "code",
|
287 |
-
"execution_count": null,
|
288 |
-
"metadata": {
|
289 |
-
"hidden": true,
|
290 |
-
"id": "iwYEC4aZ2Fwy"
|
291 |
-
},
|
292 |
-
"outputs": [],
|
293 |
-
"source": [
|
294 |
-
"# merge all number chunks to one audio\n",
|
295 |
-
"save_audio('only_numbers.wav',\n",
|
296 |
-
" collect_chunks(number_timestamps, wav), SAMPLING_RATE) \n",
|
297 |
-
"Audio('only_numbers.wav')"
|
298 |
-
]
|
299 |
-
},
|
300 |
-
{
|
301 |
-
"cell_type": "code",
|
302 |
-
"execution_count": null,
|
303 |
-
"metadata": {
|
304 |
-
"hidden": true,
|
305 |
-
"id": "fHaYejX12Fwy"
|
306 |
-
},
|
307 |
-
"outputs": [],
|
308 |
-
"source": [
|
309 |
-
"# drop all number chunks from audio\n",
|
310 |
-
"save_audio('no_numbers.wav',\n",
|
311 |
-
" drop_chunks(number_timestamps, wav), SAMPLING_RATE) \n",
|
312 |
-
"Audio('no_numbers.wav')"
|
313 |
-
]
|
314 |
-
},
|
315 |
-
{
|
316 |
-
"cell_type": "markdown",
|
317 |
-
"metadata": {
|
318 |
-
"heading_collapsed": true,
|
319 |
-
"id": "PnKtJKbq2Fwz"
|
320 |
-
},
|
321 |
-
"source": [
|
322 |
-
"# Language detector"
|
323 |
-
]
|
324 |
-
},
|
325 |
-
{
|
326 |
-
"cell_type": "markdown",
|
327 |
-
"metadata": {
|
328 |
-
"heading_collapsed": true,
|
329 |
-
"hidden": true,
|
330 |
-
"id": "F5cAmMbP2Fwz"
|
331 |
-
},
|
332 |
-
"source": [
|
333 |
-
"## Install Dependencies"
|
334 |
-
]
|
335 |
-
},
|
336 |
-
{
|
337 |
-
"cell_type": "code",
|
338 |
-
"execution_count": null,
|
339 |
-
"metadata": {
|
340 |
-
"hidden": true,
|
341 |
-
"id": "Zu9D0t6n2Fwz"
|
342 |
-
},
|
343 |
-
"outputs": [],
|
344 |
-
"source": [
|
345 |
-
"#@title Install and Import Dependencies\n",
|
346 |
-
"\n",
|
347 |
-
"# this assumes that you have a relevant version of PyTorch installed\n",
|
348 |
-
"!pip install -q torchaudio\n",
|
349 |
-
"\n",
|
350 |
-
"SAMPLING_RATE = 16000\n",
|
351 |
-
"\n",
|
352 |
-
"import torch\n",
|
353 |
-
"torch.set_num_threads(1)\n",
|
354 |
-
"\n",
|
355 |
-
"from IPython.display import Audio\n",
|
356 |
-
"from pprint import pprint\n",
|
357 |
-
"# download example\n",
|
358 |
-
"torch.hub.download_url_to_file('https://models.silero.ai/vad_models/en.wav', 'en_example.wav')"
|
359 |
-
]
|
360 |
-
},
|
361 |
-
{
|
362 |
-
"cell_type": "code",
|
363 |
-
"execution_count": null,
|
364 |
-
"metadata": {
|
365 |
-
"id": "JfRKDZiRztFe"
|
366 |
-
},
|
367 |
-
"outputs": [],
|
368 |
-
"source": [
|
369 |
-
"USE_ONNX = False # change this to True if you want to test onnx model\n",
|
370 |
-
"if USE_ONNX:\n",
|
371 |
-
" !pip install -q onnxruntime\n",
|
372 |
-
" \n",
|
373 |
-
"model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',\n",
|
374 |
-
" model='silero_lang_detector',\n",
|
375 |
-
" force_reload=True,\n",
|
376 |
-
" onnx=USE_ONNX)\n",
|
377 |
-
"\n",
|
378 |
-
"get_language, read_audio = utils"
|
379 |
-
]
|
380 |
-
},
|
381 |
-
{
|
382 |
-
"cell_type": "markdown",
|
383 |
-
"metadata": {
|
384 |
-
"heading_collapsed": true,
|
385 |
-
"hidden": true,
|
386 |
-
"id": "iC696eMX2Fwz"
|
387 |
-
},
|
388 |
-
"source": [
|
389 |
-
"## Full audio"
|
390 |
-
]
|
391 |
-
},
|
392 |
-
{
|
393 |
-
"cell_type": "code",
|
394 |
-
"execution_count": null,
|
395 |
-
"metadata": {
|
396 |
-
"hidden": true,
|
397 |
-
"id": "c8UYnYBF2Fw0"
|
398 |
-
},
|
399 |
-
"outputs": [],
|
400 |
-
"source": [
|
401 |
-
"wav = read_audio('en_example.wav', sampling_rate=SAMPLING_RATE)\n",
|
402 |
-
"lang = get_language(wav, model)\n",
|
403 |
-
"print(lang)"
|
404 |
-
]
|
405 |
-
}
|
406 |
-
],
|
407 |
-
"metadata": {
|
408 |
-
"colab": {
|
409 |
-
"name": "silero-vad.ipynb",
|
410 |
-
"provenance": []
|
411 |
-
},
|
412 |
-
"kernelspec": {
|
413 |
-
"display_name": "Python 3",
|
414 |
-
"language": "python",
|
415 |
-
"name": "python3"
|
416 |
-
},
|
417 |
-
"language_info": {
|
418 |
-
"codemirror_mode": {
|
419 |
-
"name": "ipython",
|
420 |
-
"version": 3
|
421 |
-
},
|
422 |
-
"file_extension": ".py",
|
423 |
-
"mimetype": "text/x-python",
|
424 |
-
"name": "python",
|
425 |
-
"nbconvert_exporter": "python",
|
426 |
-
"pygments_lexer": "ipython3",
|
427 |
-
"version": "3.8.8"
|
428 |
-
},
|
429 |
-
"toc": {
|
430 |
-
"base_numbering": 1,
|
431 |
-
"nav_menu": {},
|
432 |
-
"number_sections": true,
|
433 |
-
"sideBar": true,
|
434 |
-
"skip_h1_title": false,
|
435 |
-
"title_cell": "Table of Contents",
|
436 |
-
"title_sidebar": "Contents",
|
437 |
-
"toc_cell": false,
|
438 |
-
"toc_position": {},
|
439 |
-
"toc_section_display": true,
|
440 |
-
"toc_window_display": false
|
441 |
-
}
|
442 |
-
},
|
443 |
-
"nbformat": 4,
|
444 |
-
"nbformat_minor": 0
|
445 |
-
}
|
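One detail worth noting about the streaming cells in the notebook above: they break out of the loop when the last chunk is shorter than window_size_samples, so the tail of the audio is never scored, whereas get_speech_timestamps in utils_vad.py pads short chunks with zeros. A small sketch of the padded variant, written as a helper that takes the wav tensor and model from the earlier cells:

import torch

def chunk_probs(wav, model, sampling_rate=16000, window_size_samples=1536):
    """Per-chunk speech probabilities, padding the final short chunk with zeros."""
    probs = []
    for i in range(0, len(wav), window_size_samples):
        chunk = wav[i: i + window_size_samples]
        if len(chunk) < window_size_samples:
            # Pad the tail instead of dropping it, as utils_vad.py does.
            chunk = torch.nn.functional.pad(chunk, (0, window_size_samples - len(chunk)))
        probs.append(model(chunk, sampling_rate).item())
    model.reset_states()  # reset between independent audio files
    return probs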
hub/snakers4_silero-vad_master/utils_vad.py
DELETED
@@ -1,488 +0,0 @@
|
|
1 |
-
import torch
|
2 |
-
import torchaudio
|
3 |
-
from typing import List
|
4 |
-
import torch.nn.functional as F
|
5 |
-
import warnings
|
6 |
-
|
7 |
-
languages = ['ru', 'en', 'de', 'es']
|
8 |
-
|
9 |
-
|
10 |
-
class OnnxWrapper():
|
11 |
-
|
12 |
-
def __init__(self, path, force_onnx_cpu=False):
|
13 |
-
import numpy as np
|
14 |
-
global np
|
15 |
-
import onnxruntime
|
16 |
-
if force_onnx_cpu and 'CPUExecutionProvider' in onnxruntime.get_available_providers():
|
17 |
-
self.session = onnxruntime.InferenceSession(path, providers=['CPUExecutionProvider'])
|
18 |
-
else:
|
19 |
-
self.session = onnxruntime.InferenceSession(path)
|
20 |
-
self.session.intra_op_num_threads = 1
|
21 |
-
self.session.inter_op_num_threads = 1
|
22 |
-
|
23 |
-
self.reset_states()
|
24 |
-
self.sample_rates = [8000, 16000]
|
25 |
-
|
26 |
-
def _validate_input(self, x, sr: int):
|
27 |
-
if x.dim() == 1:
|
28 |
-
x = x.unsqueeze(0)
|
29 |
-
if x.dim() > 2:
|
30 |
-
raise ValueError(f"Too many dimensions for input audio chunk {x.dim()}")
|
31 |
-
|
32 |
-
if sr != 16000 and (sr % 16000 == 0):
|
33 |
-
step = sr // 16000
|
34 |
-
x = x[::step]
|
35 |
-
sr = 16000
|
36 |
-
|
37 |
-
if sr not in self.sample_rates:
|
38 |
-
raise ValueError(f"Supported sampling rates: {self.sample_rates} (or multiply of 16000)")
|
39 |
-
|
40 |
-
if sr / x.shape[1] > 31.25:
|
41 |
-
raise ValueError("Input audio chunk is too short")
|
42 |
-
|
43 |
-
return x, sr
|
44 |
-
|
45 |
-
def reset_states(self, batch_size=1):
|
46 |
-
self._h = np.zeros((2, batch_size, 64)).astype('float32')
|
47 |
-
self._c = np.zeros((2, batch_size, 64)).astype('float32')
|
48 |
-
self._last_sr = 0
|
49 |
-
self._last_batch_size = 0
|
50 |
-
|
51 |
-
def __call__(self, x, sr: int):
|
52 |
-
|
53 |
-
x, sr = self._validate_input(x, sr)
|
54 |
-
batch_size = x.shape[0]
|
55 |
-
|
56 |
-
if not self._last_batch_size:
|
57 |
-
self.reset_states(batch_size)
|
58 |
-
if (self._last_sr) and (self._last_sr != sr):
|
59 |
-
self.reset_states(batch_size)
|
60 |
-
if (self._last_batch_size) and (self._last_batch_size != batch_size):
|
61 |
-
self.reset_states(batch_size)
|
62 |
-
|
63 |
-
if sr in [8000, 16000]:
|
64 |
-
ort_inputs = {'input': x.numpy(), 'h': self._h, 'c': self._c, 'sr': np.array(sr)}
|
65 |
-
ort_outs = self.session.run(None, ort_inputs)
|
66 |
-
out, self._h, self._c = ort_outs
|
67 |
-
else:
|
68 |
-
raise ValueError()
|
69 |
-
|
70 |
-
self._last_sr = sr
|
71 |
-
self._last_batch_size = batch_size
|
72 |
-
|
73 |
-
out = torch.tensor(out)
|
74 |
-
return out
|
75 |
-
|
76 |
-
def audio_forward(self, x, sr: int, num_samples: int = 512):
|
77 |
-
outs = []
|
78 |
-
x, sr = self._validate_input(x, sr)
|
79 |
-
|
80 |
-
if x.shape[1] % num_samples:
|
81 |
-
pad_num = num_samples - (x.shape[1] % num_samples)
|
82 |
-
x = torch.nn.functional.pad(x, (0, pad_num), 'constant', value=0.0)
|
83 |
-
|
84 |
-
self.reset_states(x.shape[0])
|
85 |
-
for i in range(0, x.shape[1], num_samples):
|
86 |
-
wavs_batch = x[:, i:i+num_samples]
|
87 |
-
out_chunk = self.__call__(wavs_batch, sr)
|
88 |
-
outs.append(out_chunk)
|
89 |
-
|
90 |
-
stacked = torch.cat(outs, dim=1)
|
91 |
-
return stacked.cpu()
|
92 |
-
|
93 |
-
|
94 |
-
class Validator():
|
95 |
-
def __init__(self, url, force_onnx_cpu):
|
96 |
-
self.onnx = True if url.endswith('.onnx') else False
|
97 |
-
torch.hub.download_url_to_file(url, 'inf.model')
|
98 |
-
if self.onnx:
|
99 |
-
import onnxruntime
|
100 |
-
if force_onnx_cpu and 'CPUExecutionProvider' in onnxruntime.get_available_providers():
|
101 |
-
self.model = onnxruntime.InferenceSession('inf.model', providers=['CPUExecutionProvider'])
|
102 |
-
else:
|
103 |
-
self.model = onnxruntime.InferenceSession('inf.model')
|
104 |
-
else:
|
105 |
-
self.model = init_jit_model(model_path='inf.model')
|
106 |
-
|
107 |
-
def __call__(self, inputs: torch.Tensor):
|
108 |
-
with torch.no_grad():
|
109 |
-
if self.onnx:
|
110 |
-
ort_inputs = {'input': inputs.cpu().numpy()}
|
111 |
-
outs = self.model.run(None, ort_inputs)
|
112 |
-
outs = [torch.Tensor(x) for x in outs]
|
113 |
-
else:
|
114 |
-
outs = self.model(inputs)
|
115 |
-
|
116 |
-
return outs
|
117 |
-
|
118 |
-
|
119 |
-
def read_audio(path: str,
|
120 |
-
sampling_rate: int = 16000):
|
121 |
-
|
122 |
-
wav, sr = torchaudio.load(path)
|
123 |
-
|
124 |
-
if wav.size(0) > 1:
|
125 |
-
wav = wav.mean(dim=0, keepdim=True)
|
126 |
-
|
127 |
-
if sr != sampling_rate:
|
128 |
-
transform = torchaudio.transforms.Resample(orig_freq=sr,
|
129 |
-
new_freq=sampling_rate)
|
130 |
-
wav = transform(wav)
|
131 |
-
sr = sampling_rate
|
132 |
-
|
133 |
-
assert sr == sampling_rate
|
134 |
-
return wav.squeeze(0)
|
135 |
-
|
136 |
-
|
137 |
-
def save_audio(path: str,
|
138 |
-
tensor: torch.Tensor,
|
139 |
-
sampling_rate: int = 16000):
|
140 |
-
torchaudio.save(path, tensor.unsqueeze(0), sampling_rate)
|
141 |
-
|
142 |
-
|
143 |
-
def init_jit_model(model_path: str,
|
144 |
-
device=torch.device('cpu')):
|
145 |
-
torch.set_grad_enabled(False)
|
146 |
-
model = torch.jit.load(model_path, map_location=device)
|
147 |
-
model.eval()
|
148 |
-
return model
|
149 |
-
|
150 |
-
|
151 |
-
def make_visualization(probs, step):
|
152 |
-
import pandas as pd
|
153 |
-
pd.DataFrame({'probs': probs},
|
154 |
-
index=[x * step for x in range(len(probs))]).plot(figsize=(16, 8),
|
155 |
-
kind='area', ylim=[0, 1.05], xlim=[0, len(probs) * step],
|
156 |
-
xlabel='seconds',
|
157 |
-
ylabel='speech probability',
|
158 |
-
colormap='tab20')
|
159 |
-
|
160 |
-
|
161 |
-
def get_speech_timestamps(audio: torch.Tensor,
|
162 |
-
model,
|
163 |
-
threshold: float = 0.5,
|
164 |
-
sampling_rate: int = 16000,
|
165 |
-
min_speech_duration_ms: int = 250,
|
166 |
-
min_silence_duration_ms: int = 100,
|
167 |
-
window_size_samples: int = 512,
|
168 |
-
speech_pad_ms: int = 30,
|
169 |
-
return_seconds: bool = False,
|
170 |
-
visualize_probs: bool = False):
|
171 |
-
|
172 |
-
"""
|
173 |
-
This method is used for splitting long audios into speech chunks using silero VAD
|
174 |
-
|
175 |
-
Parameters
|
176 |
-
----------
|
177 |
-
audio: torch.Tensor, one dimensional
|
178 |
-
One dimensional float torch.Tensor, other types are casted to torch if possible
|
179 |
-
|
180 |
-
model: preloaded .jit silero VAD model
|
181 |
-
|
182 |
-
threshold: float (default - 0.5)
|
183 |
-
Speech threshold. Silero VAD outputs speech probabilities for each audio chunk, probabilities ABOVE this value are considered as SPEECH.
|
184 |
-
It is better to tune this parameter for each dataset separately, but "lazy" 0.5 is pretty good for most datasets.
|
185 |
-
|
186 |
-
sampling_rate: int (default - 16000)
|
187 |
-
Currently silero VAD models support 8000 and 16000 sample rates
|
188 |
-
|
189 |
-
min_speech_duration_ms: int (default - 250 milliseconds)
|
190 |
-
Final speech chunks shorter min_speech_duration_ms are thrown out
|
191 |
-
|
192 |
-
min_silence_duration_ms: int (default - 100 milliseconds)
|
193 |
-
In the end of each speech chunk wait for min_silence_duration_ms before separating it
|
194 |
-
|
195 |
-
window_size_samples: int (default - 1536 samples)
|
196 |
-
Audio chunks of window_size_samples size are fed to the silero VAD model.
|
197 |
-
WARNING! Silero VAD models were trained using 512, 1024, 1536 samples for 16000 sample rate and 256, 512, 768 samples for 8000 sample rate.
|
198 |
-
Values other than these may affect model perfomance!!
|
199 |
-
|
200 |
-
speech_pad_ms: int (default - 30 milliseconds)
|
201 |
-
Final speech chunks are padded by speech_pad_ms each side
|
202 |
-
|
203 |
-
return_seconds: bool (default - False)
|
204 |
-
whether return timestamps in seconds (default - samples)
|
205 |
-
|
206 |
-
visualize_probs: bool (default - False)
|
207 |
-
whether draw prob hist or not
|
208 |
-
|
209 |
-
Returns
|
210 |
-
----------
|
211 |
-
speeches: list of dicts
|
212 |
-
list containing ends and beginnings of speech chunks (samples or seconds based on return_seconds)
|
213 |
-
"""
|
214 |
-
|
215 |
-
if not torch.is_tensor(audio):
|
216 |
-
try:
|
217 |
-
audio = torch.Tensor(audio)
|
218 |
-
except:
|
219 |
-
raise TypeError("Audio cannot be casted to tensor. Cast it manually")
|
220 |
-
|
221 |
-
if len(audio.shape) > 1:
|
222 |
-
for i in range(len(audio.shape)): # trying to squeeze empty dimensions
|
223 |
-
audio = audio.squeeze(0)
|
224 |
-
if len(audio.shape) > 1:
|
225 |
-
raise ValueError("More than one dimension in audio. Are you trying to process audio with 2 channels?")
|
226 |
-
|
227 |
-
if sampling_rate > 16000 and (sampling_rate % 16000 == 0):
|
228 |
-
step = sampling_rate // 16000
|
229 |
-
sampling_rate = 16000
|
230 |
-
audio = audio[::step]
|
231 |
-
warnings.warn('Sampling rate is a multiply of 16000, casting to 16000 manually!')
|
232 |
-
else:
|
233 |
-
step = 1
|
234 |
-
|
235 |
-
if sampling_rate == 8000 and window_size_samples > 768:
|
236 |
-
warnings.warn('window_size_samples is too big for 8000 sampling_rate! Better set window_size_samples to 256, 512 or 768 for 8000 sample rate!')
|
237 |
-
if window_size_samples not in [256, 512, 768, 1024, 1536]:
|
238 |
-
warnings.warn('Unusual window_size_samples! Supported window_size_samples:\n - [512, 1024, 1536] for 16000 sampling_rate\n - [256, 512, 768] for 8000 sampling_rate')
|
239 |
-
|
240 |
-
model.reset_states()
|
241 |
-
min_speech_samples = sampling_rate * min_speech_duration_ms / 1000
|
242 |
-
min_silence_samples = sampling_rate * min_silence_duration_ms / 1000
|
243 |
-
speech_pad_samples = sampling_rate * speech_pad_ms / 1000
|
244 |
-
|
245 |
-
audio_length_samples = len(audio)
|
246 |
-
|
247 |
-
speech_probs = []
|
248 |
-
for current_start_sample in range(0, audio_length_samples, window_size_samples):
|
249 |
-
chunk = audio[current_start_sample: current_start_sample + window_size_samples]
|
250 |
-
if len(chunk) < window_size_samples:
|
251 |
-
chunk = torch.nn.functional.pad(chunk, (0, int(window_size_samples - len(chunk))))
|
252 |
-
speech_prob = model(chunk, sampling_rate).item()
|
253 |
-
speech_probs.append(speech_prob)
|
254 |
-
|
255 |
-
triggered = False
|
256 |
-
speeches = []
|
257 |
-
current_speech = {}
|
258 |
-
neg_threshold = threshold - 0.15
|
259 |
-
temp_end = 0
|
260 |
-
|
261 |
-
for i, speech_prob in enumerate(speech_probs):
|
262 |
-
if (speech_prob >= threshold) and temp_end:
|
263 |
-
temp_end = 0
|
264 |
-
|
265 |
-
if (speech_prob >= threshold) and not triggered:
|
266 |
-
triggered = True
|
267 |
-
current_speech['start'] = window_size_samples * i
|
268 |
-
continue
|
269 |
-
|
270 |
-
if (speech_prob < neg_threshold) and triggered:
|
271 |
-
if not temp_end:
|
272 |
-
temp_end = window_size_samples * i
|
273 |
-
if (window_size_samples * i) - temp_end < min_silence_samples:
|
274 |
-
continue
|
275 |
-
else:
|
276 |
-
current_speech['end'] = temp_end
|
277 |
-
if (current_speech['end'] - current_speech['start']) > min_speech_samples:
|
278 |
-
speeches.append(current_speech)
|
279 |
-
temp_end = 0
|
280 |
-
current_speech = {}
|
281 |
-
triggered = False
|
282 |
-
continue
|
283 |
-
|
284 |
-
if current_speech and (audio_length_samples - current_speech['start']) > min_speech_samples:
|
285 |
-
current_speech['end'] = audio_length_samples
|
286 |
-
speeches.append(current_speech)
|
287 |
-
|
288 |
-
for i, speech in enumerate(speeches):
|
289 |
-
if i == 0:
|
290 |
-
speech['start'] = int(max(0, speech['start'] - speech_pad_samples))
|
291 |
-
if i != len(speeches) - 1:
|
292 |
-
silence_duration = speeches[i+1]['start'] - speech['end']
|
293 |
-
if silence_duration < 2 * speech_pad_samples:
|
294 |
-
speech['end'] += int(silence_duration // 2)
|
295 |
-
speeches[i+1]['start'] = int(max(0, speeches[i+1]['start'] - silence_duration // 2))
|
296 |
-
else:
|
297 |
-
speech['end'] = int(min(audio_length_samples, speech['end'] + speech_pad_samples))
|
298 |
-
speeches[i+1]['start'] = int(max(0, speeches[i+1]['start'] - speech_pad_samples))
|
299 |
-
else:
|
300 |
-
speech['end'] = int(min(audio_length_samples, speech['end'] + speech_pad_samples))
|
301 |
-
|
302 |
-
if return_seconds:
|
303 |
-
for speech_dict in speeches:
|
304 |
-
speech_dict['start'] = round(speech_dict['start'] / sampling_rate, 1)
|
305 |
-
speech_dict['end'] = round(speech_dict['end'] / sampling_rate, 1)
|
306 |
-
elif step > 1:
|
307 |
-
for speech_dict in speeches:
|
308 |
-
speech_dict['start'] *= step
|
309 |
-
speech_dict['end'] *= step
|
310 |
-
|
311 |
-
if visualize_probs:
|
312 |
-
make_visualization(speech_probs, window_size_samples / sampling_rate)
|
313 |
-
|
314 |
-
return speeches
|
315 |
-
|
316 |
-
|
317 |
-
def get_number_ts(wav: torch.Tensor,
|
318 |
-
model,
|
319 |
-
model_stride=8,
|
320 |
-
hop_length=160,
|
321 |
-
sample_rate=16000):
|
322 |
-
wav = torch.unsqueeze(wav, dim=0)
|
323 |
-
perframe_logits = model(wav)[0]
|
324 |
-
perframe_preds = torch.argmax(torch.softmax(perframe_logits, dim=1), dim=1).squeeze() # (1, num_frames_strided)
|
325 |
-
extended_preds = []
|
326 |
-
for i in perframe_preds:
|
327 |
-
extended_preds.extend([i.item()] * model_stride)
|
328 |
-
# len(extended_preds) is *num_frames_real*; for each frame of audio we know if it has a number in it.
|
329 |
-
triggered = False
|
330 |
-
timings = []
|
331 |
-
cur_timing = {}
|
332 |
-
for i, pred in enumerate(extended_preds):
|
333 |
-
if pred == 1:
|
334 |
-
if not triggered:
|
335 |
-
cur_timing['start'] = int((i * hop_length) / (sample_rate / 1000))
|
336 |
-
triggered = True
|
337 |
-
elif pred == 0:
|
338 |
-
if triggered:
|
339 |
-
cur_timing['end'] = int((i * hop_length) / (sample_rate / 1000))
|
340 |
-
timings.append(cur_timing)
|
341 |
-
cur_timing = {}
|
342 |
-
triggered = False
|
343 |
-
if cur_timing:
|
344 |
-
cur_timing['end'] = int(len(wav) / (sample_rate / 1000))
|
345 |
-
timings.append(cur_timing)
|
346 |
-
return timings
|
347 |
-
|
348 |
-
|
349 |
-
def get_language(wav: torch.Tensor,
|
350 |
-
model):
|
351 |
-
wav = torch.unsqueeze(wav, dim=0)
|
352 |
-
lang_logits = model(wav)[2]
|
353 |
-
lang_pred = torch.argmax(torch.softmax(lang_logits, dim=1), dim=1).item() # from 0 to len(languages) - 1
|
354 |
-
assert lang_pred < len(languages)
|
355 |
-
return languages[lang_pred]
|
356 |
-
|
357 |
-
|
358 |
-
def get_language_and_group(wav: torch.Tensor,
|
359 |
-
model,
|
360 |
-
lang_dict: dict,
|
361 |
-
lang_group_dict: dict,
|
362 |
-
top_n=1):
|
363 |
-
wav = torch.unsqueeze(wav, dim=0)
|
364 |
-
lang_logits, lang_group_logits = model(wav)
|
365 |
-
|
366 |
-
softm = torch.softmax(lang_logits, dim=1).squeeze()
|
367 |
-
softm_group = torch.softmax(lang_group_logits, dim=1).squeeze()
|
368 |
-
|
369 |
-
srtd = torch.argsort(softm, descending=True)
|
370 |
-
srtd_group = torch.argsort(softm_group, descending=True)
|
371 |
-
|
372 |
-
outs = []
|
373 |
-
outs_group = []
|
374 |
-
for i in range(top_n):
|
375 |
-
prob = round(softm[srtd[i]].item(), 2)
|
376 |
-
prob_group = round(softm_group[srtd_group[i]].item(), 2)
|
377 |
-
outs.append((lang_dict[str(srtd[i].item())], prob))
|
378 |
-
outs_group.append((lang_group_dict[str(srtd_group[i].item())], prob_group))
|
379 |
-
|
380 |
-
return outs, outs_group
|
381 |
-
|
382 |
-
|
383 |
-
class VADIterator:
|
384 |
-
def __init__(self,
|
385 |
-
model,
|
386 |
-
threshold: float = 0.5,
|
387 |
-
sampling_rate: int = 16000,
|
388 |
-
min_silence_duration_ms: int = 100,
|
389 |
-
speech_pad_ms: int = 30
|
390 |
-
):
|
391 |
-
|
392 |
-
"""
|
393 |
-
Class for stream imitation
|
394 |
-
|
395 |
-
Parameters
|
396 |
-
----------
|
397 |
-
model: preloaded .jit silero VAD model
|
398 |
-
|
399 |
-
threshold: float (default - 0.5)
|
400 |
-
Speech threshold. Silero VAD outputs speech probabilities for each audio chunk, probabilities ABOVE this value are considered as SPEECH.
|
401 |
-
It is better to tune this parameter for each dataset separately, but "lazy" 0.5 is pretty good for most datasets.
|
402 |
-
|
403 |
-
sampling_rate: int (default - 16000)
|
404 |
-
Currently silero VAD models support 8000 and 16000 sample rates
|
405 |
-
|
406 |
-
min_silence_duration_ms: int (default - 100 milliseconds)
|
407 |
-
In the end of each speech chunk wait for min_silence_duration_ms before separating it
|
408 |
-
|
409 |
-
speech_pad_ms: int (default - 30 milliseconds)
|
410 |
-
Final speech chunks are padded by speech_pad_ms each side
|
411 |
-
"""
|
412 |
-
|
413 |
-
self.model = model
|
414 |
-
self.threshold = threshold
|
415 |
-
self.sampling_rate = sampling_rate
|
416 |
-
|
417 |
-
if sampling_rate not in [8000, 16000]:
|
418 |
-
raise ValueError('VADIterator does not support sampling rates other than [8000, 16000]')
|
419 |
-
|
420 |
-
self.min_silence_samples = sampling_rate * min_silence_duration_ms / 1000
|
421 |
-
self.speech_pad_samples = sampling_rate * speech_pad_ms / 1000
|
422 |
-
self.reset_states()
|
423 |
-
|
424 |
-
def reset_states(self):
|
425 |
-
|
426 |
-
self.model.reset_states()
|
427 |
-
self.triggered = False
|
428 |
-
self.temp_end = 0
|
429 |
-
self.current_sample = 0
|
430 |
-
|
431 |
-
def __call__(self, x, return_seconds=False):
|
432 |
-
"""
|
433 |
-
x: torch.Tensor
|
434 |
-
audio chunk (see examples in repo)
|
435 |
-
|
436 |
-
return_seconds: bool (default - False)
|
437 |
-
whether return timestamps in seconds (default - samples)
|
438 |
-
"""
|
439 |
-
|
440 |
-
if not torch.is_tensor(x):
|
441 |
-
try:
|
442 |
-
x = torch.Tensor(x)
|
443 |
-
except:
|
444 |
-
raise TypeError("Audio cannot be casted to tensor. Cast it manually")
|
445 |
-
|
446 |
-
window_size_samples = len(x[0]) if x.dim() == 2 else len(x)
|
447 |
-
self.current_sample += window_size_samples
|
448 |
-
|
449 |
-
speech_prob = self.model(x, self.sampling_rate).item()
|
450 |
-
|
451 |
-
if (speech_prob >= self.threshold) and self.temp_end:
|
452 |
-
self.temp_end = 0
|
453 |
-
|
454 |
-
if (speech_prob >= self.threshold) and not self.triggered:
|
455 |
-
self.triggered = True
|
456 |
-
speech_start = self.current_sample - self.speech_pad_samples
|
457 |
-
return {'start': int(speech_start) if not return_seconds else round(speech_start / self.sampling_rate, 1)}
|
458 |
-
|
459 |
-
if (speech_prob < self.threshold - 0.15) and self.triggered:
|
460 |
-
if not self.temp_end:
|
461 |
-
self.temp_end = self.current_sample
|
462 |
-
if self.current_sample - self.temp_end < self.min_silence_samples:
|
463 |
-
return None
|
464 |
-
else:
|
465 |
-
speech_end = self.temp_end + self.speech_pad_samples
|
466 |
-
self.temp_end = 0
|
467 |
-
self.triggered = False
|
468 |
-
return {'end': int(speech_end) if not return_seconds else round(speech_end / self.sampling_rate, 1)}
|
469 |
-
|
470 |
-
return None
|
471 |
-
|
472 |
-
|
473 |
-
def collect_chunks(tss: List[dict],
|
474 |
-
wav: torch.Tensor):
|
475 |
-
chunks = []
|
476 |
-
for i in tss:
|
477 |
-
chunks.append(wav[i['start']: i['end']])
|
478 |
-
return torch.cat(chunks)
|
479 |
-
|
480 |
-
|
481 |
-
def drop_chunks(tss: List[dict],
|
482 |
-
wav: torch.Tensor):
|
483 |
-
chunks = []
|
484 |
-
cur_start = 0
|
485 |
-
for i in tss:
|
486 |
-
chunks.append((wav[cur_start: i['start']]))
|
487 |
-
cur_start = i['end']
|
488 |
-
return torch.cat(chunks)
|
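At its core, get_speech_timestamps above is a hysteresis state machine: a segment opens when the per-chunk probability reaches threshold and only closes once the probability has stayed below threshold - 0.15 for the configured silence duration. A stripped-down sketch of just that triggering logic over a plain list of chunk probabilities (boundaries here are chunk indices, and the sample conversion, minimum-duration filtering, and padding steps of the real function are omitted):

def simple_segments(probs, threshold=0.5, min_silence_chunks=2):
    """Hysteresis segmentation over per-chunk speech probabilities."""
    neg_threshold = threshold - 0.15
    segments, current, silence = [], None, 0
    for i, p in enumerate(probs):
        if p >= threshold:
            if current is None:
                current = {'start': i}              # speech onset
            silence = 0
        elif current is not None and p < neg_threshold:
            silence += 1
            if silence >= min_silence_chunks:
                current['end'] = i - silence + 1    # close at the start of the silence run
                segments.append(current)
                current, silence = None, 0
    if current is not None:
        current['end'] = len(probs)
        segments.append(current)
    return segments

# simple_segments([0.1, 0.9, 0.95, 0.2, 0.1, 0.1, 0.8, 0.9])
# -> [{'start': 1, 'end': 3}, {'start': 6, 'end': 8}]

The full implementation then converts these chunk indices into sample positions, drops segments shorter than min_speech_duration_ms, and pads each segment by speech_pad_ms on both sides.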
hub/trusted_list
DELETED
File without changes
|