Tags: Text-to-Speech, Safetensors, GGUF, qwen2, audio, speech, speech-language-models, conversational
KeythSullivan, JohBohU committed
Commit b279ce5 · 0 parents

Duplicate from neuphonic/neutts-air

Co-authored-by: Johanna Ulin <JohBohU@users.noreply.huggingface.co>

.gitattributes ADDED
@@ -0,0 +1,44 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
+ tokenizer_config.json filter=lfs diff=lfs merge=lfs -text
+ new_tokenizer.json filter=lfs diff=lfs merge=lfs -text
+ neutts-BF16.gguf filter=lfs diff=lfs merge=lfs -text
+ neutts-Q8-0.gguf filter=lfs diff=lfs merge=lfs -text
+ neutts-Q4_0.gguf filter=lfs diff=lfs merge=lfs -text
+ neutss-air-BF16.gguf filter=lfs diff=lfs merge=lfs -text
+ neutts-air-Q4-0.gguf filter=lfs diff=lfs merge=lfs -text
+ neutts-air-Q8-0.gguf filter=lfs diff=lfs merge=lfs -text
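
The `.gitattributes` rules above track most binary formats by extension, but the GGUF files are listed by exact name and there is no `*.gguf` wildcard. The sketch below illustrates what that implies for newly added files. It uses Python's `fnmatch.fnmatchcase` as a rough stand-in for gitattributes matching, which is an assumption: git's own pattern rules differ in places (for example for `saved_model/**/*`). Only a subset of the patterns is included, and `is_lfs_tracked` is a hypothetical helper name.

```python
from fnmatch import fnmatchcase

# Subset of the patterns from the .gitattributes diff (extension wildcards
# plus the six GGUF files that are listed by exact name).
LFS_PATTERNS = [
    "*.safetensors",
    "*.bin",
    "*.onnx",
    "tokenizer.json",
    "tokenizer_config.json",
    "neutts-BF16.gguf",
    "neutts-Q8-0.gguf",
    "neutts-Q4_0.gguf",
    "neutss-air-BF16.gguf",
    "neutts-air-Q4-0.gguf",
    "neutts-air-Q8-0.gguf",
]


def is_lfs_tracked(path: str) -> bool:
    """Approximate check: does the file name match any listed pattern?"""
    name = path.rsplit("/", 1)[-1]  # patterns without a slash match the basename
    return any(fnmatchcase(name, pattern) for pattern in LFS_PATTERNS)


if __name__ == "__main__":
    for candidate in ["model.safetensors", "neutss-air-BF16.gguf",
                      "neutts-air-BF16.gguf", "config.json"]:
        print(candidate, is_lfs_tracked(candidate))
```

Under these assumptions, a GGUF file added under any name other than the six listed ones would be committed as a regular Git object unless a matching rule is added first.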
README.md ADDED
@@ -0,0 +1,191 @@
+ ---
+ license: apache-2.0
+ pipeline_tag: text-to-speech
+ tags:
+ - audio
+ - speech
+ - speech-language-models
+ datasets:
+ - amphion/Emilia-Dataset
+ - neuphonic/emilia-yodas-english-neucodec
+ ---
+
+ # NeuTTS Air ☁️
+
+ [![NeuTTSAir_Intro](neutts-air.png)](https://www.youtube.com/watch?v=YAB3hCtu5wE)
+
+ [🚀 Spaces Demo](https://huggingface.co/spaces/neuphonic/neutts-air), [🔧 GitHub](https://github.com/neuphonic/neutts-air)
+
+ [Q8 GGUF version](https://huggingface.co/neuphonic/neutts-air-q8-gguf), [Q4 GGUF version](https://huggingface.co/neuphonic/neutts-air-q4-gguf)
+
+ *Created by [Neuphonic](http://neuphonic.com/) - building faster, smaller, on-device voice AI*
+
+ State-of-the-art Voice AI has been locked behind web APIs for too long. NeuTTS Air is the world’s first super-realistic, on-device, TTS speech language model with instant voice cloning. Built off a 0.5B LLM backbone, NeuTTS Air brings natural-sounding speech, real-time performance, built-in security and speaker cloning to your local device - unlocking a new category of embedded voice agents, assistants, toys, and compliance-safe apps.
+
+ ## Key Features
+
+ - 🗣 Best-in-class realism for its size - produces natural, ultra-realistic voices that sound human
+ - 📱 Optimised for on-device deployment - provided in GGUF format, ready to run on phones, laptops, or even Raspberry Pis
+ - 👫 Instant voice cloning - create your own speaker with as little as 3 seconds of audio
+ - 🚄 Simple LM + codec architecture built off a 0.5B backbone - the sweet spot between speed, size, and quality for real-world applications
+
+
+ > [!CAUTION]
+ > Websites like neutts.com are popping up and they're not affiliated with Neuphonic, our GitHub or this repo.
+ >
+ > We are on neuphonic.com only. Please be careful out there! 🙏
+
+
+ ## Model Details
+
+ NeuTTS Air is built off Qwen 0.5B - a lightweight yet capable language model optimised for text understanding and generation - as well as a powerful combination of technologies designed for efficiency and quality:
+
+ - **Audio Codec**: [NeuCodec](https://huggingface.co/neuphonic/neucodec) - our proprietary neural audio codec that achieves exceptional audio quality at low bitrates using a single codebook
+ - **Format**: Available in GGUF format for efficient on-device inference
+ - **Responsibility**: Watermarked outputs
+ - **Inference Speed**: Real-time generation on mid-range devices
+ - **Power Consumption**: Optimised for mobile and embedded devices
+
+ ## Get Started with NeuTTS
+
+ 1. **Install System Dependencies (required): `espeak-ng`**
+
+ > [!NOTE]
+ > With `brew` on macOS Ventura and later, `apt` on Ubuntu 25 or Debian 13, and `choco`/`winget` on Windows, the commands below install the latest version of `espeak-ng`. On a different or older operating system, you may need to build from source: https://github.com/espeak-ng/espeak-ng/blob/master/docs/building.md
+
+ Please refer to the following link for instructions on how to install `espeak-ng`:
+
+ https://github.com/espeak-ng/espeak-ng/blob/master/docs/guide.md
+
+ ```bash
+ # macOS
+ brew install espeak-ng
+
+ # Ubuntu/Debian
+ sudo apt install espeak-ng
+
+ # Windows install
+ # via chocolatey (https://community.chocolatey.org/packages?page=1&prerelease=False&moderatorQueue=False&tags=espeak)
+ choco install espeak-ng
+ # via winget
+ winget install -e --id eSpeak-NG.eSpeak-NG
+ # via msi (add it to PATH or follow the "Windows users who installed via msi" step below)
+ # find the msi at https://github.com/espeak-ng/espeak-ng/releases
+ ```
+
+ Windows users who installed via msi, or whose install is not on `PATH`, need to run the following (see https://github.com/bootphon/phonemizer/issues/163):
+ ```pwsh
+ $env:PHONEMIZER_ESPEAK_LIBRARY = "c:\Program Files\eSpeak NG\libespeak-ng.dll"
+ $env:PHONEMIZER_ESPEAK_PATH = "c:\Program Files\eSpeak NG"
+ setx PHONEMIZER_ESPEAK_LIBRARY "c:\Program Files\eSpeak NG\libespeak-ng.dll"
+ setx PHONEMIZER_ESPEAK_PATH "c:\Program Files\eSpeak NG"
+ ```
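
The two `PHONEMIZER_ESPEAK_*` variables from the PowerShell block above can also be set from Python for the current process only. This is a minimal sketch, assuming the default msi install directory quoted in the README and assuming the variables only need to be present before the phonemizer backend is first loaded. `espeak_env` and `configure_espeak` are hypothetical helper names, not part of `neutts` or `phonemizer`.

```python
import os
from pathlib import PureWindowsPath

# Default msi install location quoted in the README; adjust if yours differs.
ESPEAK_DIR = PureWindowsPath(r"c:\Program Files\eSpeak NG")


def espeak_env(install_dir: PureWindowsPath = ESPEAK_DIR) -> dict:
    """Build the two environment variables from the install directory."""
    return {
        "PHONEMIZER_ESPEAK_LIBRARY": str(install_dir / "libespeak-ng.dll"),
        "PHONEMIZER_ESPEAK_PATH": str(install_dir),
    }


def configure_espeak(install_dir: PureWindowsPath = ESPEAK_DIR) -> None:
    """Set the variables for this process without overriding existing values."""
    for name, value in espeak_env(install_dir).items():
        os.environ.setdefault(name, value)


if __name__ == "__main__":
    configure_espeak()  # call this before importing anything that loads espeak
    print(os.environ["PHONEMIZER_ESPEAK_LIBRARY"])
```

Unlike `setx`, this does not persist across sessions, which can be convenient for scripts that should not modify the user's environment.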
+
+ 2. **Install NeuTTS**
+ ```bash
+ pip install neutts
+ ```
+
+ Or for a local editable install, clone the [neutts repository](https://github.com/neuphonic/neutts) and run in the base folder:
+ ```bash
+ pip install -e .
+ ```
+
+ Alternatively, to install all dependencies, including `onnxruntime` and `llama-cpp-python` (equivalent to steps 3 and 4 below):
+
+ ```bash
+ pip install "neutts[all]"
+ ```
+
+ or for an editable install:
+
+ ```bash
+ pip install -e ".[all]"
+ ```
+
+ 3. **(Optional) Install `llama-cpp-python` to use `.gguf` models.**
+
+ ```bash
+ pip install "neutts[llama]"
+ ```
+
+ Note that this installs `llama-cpp-python` without GPU support. To install with GPU support (e.g., CUDA, MPS) please refer to:
+ https://pypi.org/project/llama-cpp-python/
+
+ 4. **(Optional) Install `onnxruntime` to use the `.onnx` decoder.**
+ ```bash
+ pip install "neutts[onnx]"
+ ```
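
Since steps 3 and 4 above are optional, a script may want to check which backends are actually installed before choosing a `.gguf` backbone or the `.onnx` decoder. A minimal sketch, assuming only the standard import names of the two packages (`llama_cpp` for `llama-cpp-python`, `onnxruntime` for `onnxruntime`). `missing_backends` is a hypothetical helper, not a `neutts` function.

```python
from importlib.util import find_spec

# Import name -> install command taken from the README steps above.
OPTIONAL_BACKENDS = {
    "llama_cpp": 'pip install "neutts[llama]"',
    "onnxruntime": 'pip install "neutts[onnx]"',
}


def missing_backends() -> dict:
    """Return the optional backends that cannot be imported, with install hints."""
    return {
        module: hint
        for module, hint in OPTIONAL_BACKENDS.items()
        if find_spec(module) is None  # None means the module is not installed
    }


if __name__ == "__main__":
    for module, hint in missing_backends().items():
        print(f"{module} not found; install it with: {hint}")
```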
+
+
+
+ ## **Basic Example**
+
+ Run the basic example script to synthesize speech:
+
+ ```bash
+ python -m examples.basic_example \
+   --input_text "My name is Dave, and um, I'm from London" \
+   --ref_audio samples/dave.wav \
+   --ref_text samples/dave.txt
+ ```
+
+ To specify a particular model repo for the backbone or codec, add the `--backbone` argument. Available backbones are listed in the [NeuTTS-Air Hugging Face collection](https://huggingface.co/collections/neuphonic/neutts-air-68cc14b7033b4c56197ef350).
+
+ Several examples are available, including a Jupyter notebook in the `examples` folder.
+
+ ### **Simple One-Code Block Usage**
+
+ ```python
+ from pathlib import Path
+
+ import soundfile as sf
+ from neutts import NeuTTS
+
+ # Load the Q4 GGUF backbone and the NeuCodec codec, both on CPU.
+ tts = NeuTTS(
+     backbone_repo="neuphonic/neutts-air-q4-gguf",
+     backbone_device="cpu",
+     codec_repo="neuphonic/neucodec",
+     codec_device="cpu",
+ )
+ input_text = "My name is Dave, and um, I'm from London."
+
+ ref_text_path = "samples/dave.txt"
+ ref_audio_path = "samples/dave.wav"
+
+ # Read the reference text and encode the reference audio into reference codes.
+ ref_text = Path(ref_text_path).read_text().strip()
+ ref_codes = tts.encode_reference(ref_audio_path)
+
+ # Synthesise input_text in the reference voice and save it at 24 kHz.
+ wav = tts.infer(input_text, ref_codes, ref_text)
+ sf.write("test.wav", wav, 24000)
+ ```
+
+ ## Tips
+
+ NeuTTS Air requires two inputs:
+
+ 1. A reference audio sample (`.wav` file)
+ 2. A text string
+
+ The model then synthesises the text as speech in the style of the reference audio. This is what enables NeuTTS Air’s instant voice cloning capability.
+
+ ### Example Reference Files
+
+ You can find some ready-to-use samples in the `samples` folder:
+
+ - `samples/dave.wav`
+ - `samples/jo.wav`
+
+ ### Guidelines for Best Results
+
+ For optimal performance, reference audio samples should be:
+
+ 1. **Mono channel**
+ 2. **16–44 kHz sample rate**
+ 3. **3–15 seconds in length**
+ 4. **Saved as a `.wav` file**
+ 5. **Clean** - minimal to no background noise
+ 6. **Natural, continuous speech** - like a monologue or conversation, with few pauses, so the model can capture tone effectively
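
The first four guidelines above (mono, sample rate, length, `.wav`) can be checked mechanically before a clip is passed to the model. A minimal sketch using only the standard-library `wave` module, which reads uncompressed PCM WAV files only. The upper bound of 44,100 Hz is an assumption that reads "44 kHz" as the common 44.1 kHz rate, and `check_reference_wav` is a hypothetical helper, not part of `neutts`. Noise level and speech naturalness are not checked.

```python
import wave

MIN_RATE_HZ, MAX_RATE_HZ = 16_000, 44_100  # assumption: "44 kHz" means 44.1 kHz
MIN_SECONDS, MAX_SECONDS = 3.0, 15.0


def check_reference_wav(path) -> list:
    """Return a list of guideline violations; an empty list means none found."""
    path = str(path)
    if not path.lower().endswith(".wav"):
        return ["not a .wav file"]

    # Read only the header fields needed for the checks.
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        seconds = wav.getnframes() / rate

    problems = []
    if channels != 1:
        problems.append(f"{channels} channels, expected mono")
    if not MIN_RATE_HZ <= rate <= MAX_RATE_HZ:
        problems.append(f"sample rate {rate} Hz outside 16-44 kHz")
    if not MIN_SECONDS <= seconds <= MAX_SECONDS:
        problems.append(f"duration {seconds:.1f} s outside 3-15 s")
    return problems


if __name__ == "__main__":
    import sys

    for wav_path in sys.argv[1:]:
        print(wav_path, check_reference_wav(wav_path) or "ok")
```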
+
+ ## **Responsibility**
+
+ Every audio file generated by NeuTTS Air is watermarked with the [**Perth (Perceptual Threshold) Watermarker**](https://github.com/resemble-ai/perth).
+
+ ## **Disclaimer**
+
+ Don't use this model to do bad things… please.
config.json ADDED
@@ -0,0 +1,28 @@
+ {
+   "architectures": [
+     "Qwen2ForCausalLM"
+   ],
+   "attention_dropout": 0.0,
+   "bos_token_id": 151643,
+   "eos_token_id": 151645,
+   "hidden_act": "silu",
+   "hidden_size": 896,
+   "initializer_range": 0.02,
+   "intermediate_size": 4864,
+   "max_position_embeddings": 32768,
+   "max_window_layers": 21,
+   "model_type": "qwen2",
+   "num_attention_heads": 14,
+   "num_hidden_layers": 24,
+   "num_key_value_heads": 2,
+   "rms_norm_eps": 1e-06,
+   "rope_scaling": null,
+   "rope_theta": 1000000.0,
+   "sliding_window": 32768,
+   "tie_word_embeddings": true,
+   "torch_dtype": "bfloat16",
+   "transformers_version": "4.50.3",
+   "use_cache": true,
+   "use_sliding_window": false,
+   "vocab_size": 217652
+ }
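
A few derived quantities help when reading the `config.json` above. The sketch below assumes the usual conventions for this kind of model: the per-head dimension is `hidden_size / num_attention_heads`, grouped-query attention shares each key/value head across an equal number of query heads, and the KV cache is kept in bfloat16 (2 bytes per value). These are assumptions about how the numbers are used, not statements taken from the repository.

```python
# Values copied from the config.json diff above.
config = {
    "hidden_size": 896,
    "num_attention_heads": 14,
    "num_key_value_heads": 2,
    "num_hidden_layers": 24,
    "vocab_size": 217652,
}
BYTES_PER_BF16 = 2  # assumption: cache stored in bfloat16

# Width of each attention head.
head_dim = config["hidden_size"] // config["num_attention_heads"]

# Number of query heads that share one key/value head (grouped-query attention).
queries_per_kv_head = config["num_attention_heads"] // config["num_key_value_heads"]

# Keys and values (factor 2) for every layer and KV head, per generated token.
kv_cache_bytes_per_token = (
    2
    * config["num_hidden_layers"]
    * config["num_key_value_heads"]
    * head_dim
    * BYTES_PER_BF16
)

# Size of the token embedding matrix.
embedding_params = config["vocab_size"] * config["hidden_size"]

if __name__ == "__main__":
    print("head_dim:", head_dim)
    print("query heads per KV head:", queries_per_kv_head)
    print("KV cache bytes per token:", kv_cache_bytes_per_token)
    print("embedding parameters:", embedding_params)
```

The embedding matrix is large relative to a 0.5B backbone because `vocab_size` is 217,652.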
generation_config.json ADDED
@@ -0,0 +1,14 @@
+ {
+   "bos_token_id": 151643,
+   "do_sample": true,
+   "eos_token_id": [
+     151645,
+     151643
+   ],
+   "pad_token_id": 151643,
+   "repetition_penalty": 1.1,
+   "temperature": 0.7,
+   "top_k": 20,
+   "top_p": 0.8,
+   "transformers_version": "4.50.3"
+ }
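
To make the sampling settings in `generation_config.json` concrete, the sketch below applies temperature scaling, top-k filtering and top-p (nucleus) filtering to a list of logits, in that order, and returns the renormalised distribution a sampler would draw from. It is a simplified pure-Python illustration of the general technique, not the `transformers` implementation. `filtered_distribution` is a hypothetical name, and the repetition penalty is omitted.

```python
import math


def filtered_distribution(logits, temperature=0.7, top_k=20, top_p=0.8):
    """Return {token_index: probability} after temperature, top-k and top-p."""
    # Temperature below 1 sharpens the distribution.
    scaled = [x / temperature for x in logits]

    # Top-k: keep the k highest-scoring tokens, best first.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    order = order[:top_k]

    # Softmax over the survivors (subtract the max for numerical stability).
    peak = scaled[order[0]]
    weights = [math.exp(scaled[i] - peak) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for index, prob in zip(order, probs):
        kept.append((index, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    # Renormalise so the kept probabilities sum to 1.
    norm = sum(prob for _, prob in kept)
    return {index: prob / norm for index, prob in kept}


if __name__ == "__main__":
    print(filtered_distribution([2.0, 1.0, 0.0, -1.0]))
```

With `do_sample` set to `true`, the next token is drawn from a distribution of this kind and is not simply the highest-scoring token.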
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:85c7db53fbe8d62be9bc29a0743661adcb0067552488f185b5f2eb2f1ee4179f
+ size 1495893752
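
The three lines above are a Git LFS pointer, not the weights themselves: they record the hash algorithm and digest (`oid`) and the byte size of the real file. A minimal sketch of parsing such a pointer and checking a downloaded file against it, assuming the simple `key value` layout shown in this diff. `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names.

```python
import hashlib

# Pointer text copied from the model.safetensors diff above.
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:85c7db53fbe8d62be9bc29a0743661adcb0067552488f185b5f2eb2f1ee4179f
size 1495893752
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split each 'key value' line and unpack the oid into algorithm and digest."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line.strip())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path: str, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Hash the file in chunks and compare size and digest with the pointer."""
    hasher = hashlib.new(pointer["algo"])
    size = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
            size += len(chunk)
    return size == pointer["size"] and hasher.hexdigest() == pointer["digest"]


if __name__ == "__main__":
    pointer = parse_lfs_pointer(POINTER)
    print(pointer["algo"], pointer["size"])
```

The same two helpers apply unchanged to the other LFS pointers in this commit, such as the GGUF file and the tokenizer files.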
neutss-air-BF16.gguf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3f6d562b881e64feb785a2b0a422eeadea326289fd5614990f9809ae37acd0d7
+ size 1503776000
neutts-air.png ADDED
special_tokens_map.json ADDED
@@ -0,0 +1,31 @@
+ {
+   "additional_special_tokens": [
+     "<|im_start|>",
+     "<|im_end|>",
+     "<|object_ref_start|>",
+     "<|object_ref_end|>",
+     "<|box_start|>",
+     "<|box_end|>",
+     "<|quad_start|>",
+     "<|quad_end|>",
+     "<|vision_start|>",
+     "<|vision_end|>",
+     "<|vision_pad|>",
+     "<|image_pad|>",
+     "<|video_pad|>"
+   ],
+   "eos_token": {
+     "content": "<|im_end|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<|im_end|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:74c466530bd698626a5b6a424d204711c58dfff0a6b3dd8b4dbac1e1e8c9aa87
+ size 24140239
tokenizer_config.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:364126212a294d794d83036954b0154b925c329411da93e68cdd1addeb4a5bea
+ size 12065831
vocab.json ADDED
The diff for this file is too large to render. See raw diff