hexgrad's activity
Feedback appreciated, both positive and negative. Non-English languages haven't been validated by the model creator(s), so if you're a native speaker, criticize away!
Kokoro TTS can now speak Chinese, Korean, and French, in addition to English and Japanese.
WAV converted to MP4 using FFmpeg, since audio attachments aren't allowed in Posts. You may have to unmute the video.
The voice quality actually sounds close to ElevenLabs.
I might've mentioned this elsewhere, but if you plug Kokoro outputs for named ElevenLabs voices into https://elevenlabs.io/ai-speech-classifier you should get very reliable positives (98% confidence that the audio was generated by ElevenLabs).
By ear, I think Kokoro is indeed close to ElevenLabs, especially on certain voices. For Nicole, they are indistinguishable to me. Michael is pretty close; Adam is still somewhat weak.
But StyleTTS is usually not very emotional.
I agree. Kokoro also has 2 specific issues in this area: (1) little to no emotional audio seen during training, and (2) even if there were, the stock voices are style vectors averaged over 10-100 samples, which produces an average/neutral style anyway.
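As a toy illustration of point (2): averaging style vectors is just a mean over per-sample embeddings, so whatever emotion individual clips carry gets smoothed out. Everything below (names, dimensions) is hypothetical, not Kokoro's actual code:

import numpy as np

# Hypothetical illustration only; the dimension is made up.
STYLE_DIM = 256

def make_stock_voice(per_sample_styles: np.ndarray) -> np.ndarray:
    # Mean over N per-sample style vectors, shape [N, STYLE_DIM].
    # Averaging washes out clip-to-clip emotional variation, which is
    # why the resulting stock voice sounds neutral.
    return per_sample_styles.mean(axis=0)

# e.g. 50 reference clips -> one averaged, fairly neutral voice vector
styles = np.random.randn(50, STYLE_DIM).astype(np.float32)
voice = make_stock_voice(styles)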
self.brag():
Kokoro finally got 300 votes in Pendrokar/TTS-Spaces-Arena after @Pendrokar was kind enough to add it 3 weeks ago. Even allowing for the small sample size of votes, I think it is safe to say that hexgrad/Kokoro-TTS is currently a top-3 model among the contenders in that Arena. This is notable because:
- At 82M params, Kokoro is one of the smaller models in the Arena
- MeloTTS has 52M params
- F5 TTS has 330M params
- XTTSv2 has 467M params
I used ffmpeg to make the video:
ffmpeg -i input.wav -r 25 -filter_complex "[0:a]compand,showwaves=size=400x400:colors=#ffd700:draw=full:mode=line,format=yuv420p[vout]" -map "[vout]" -map 0:a -c:v libx264 -c:a aac output.mp4
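For anyone adapting the command: compand evens out the audio's dynamic range, showwaves renders the waveform itself (a 400x400 canvas, gold line, with draw=full drawing every sample), and format=yuv420p keeps the H.264 output playable in most players; the two -map flags then pair the rendered video stream with the untouched original audio.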
It's expressive, punches way above its weight class, and supports voice cloning. Go check it out!
(Unmute the audio sample below after hitting play)
What tool are you using to generate that video?
No voice cloning yet, but an 80M model I trained makes this:
If the voice sounds familiar, it is, and the classifier seems to agree.
At 500M parameters, it's efficient enough to run on basic hardware but powerful enough for professional use.
This could transform how we produce audio content for news: think instant translated interviews that keep the original voices, or scaled-up audio article production!
Demo and Model on the Hub: OuteAI/OuteTTS-0.2-500M h/t @reach-vb
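If you want to try it locally, here is a minimal sketch based on the outetts package's README at the time; treat the exact class names and arguments as a snapshot that may have changed since:

import outetts

# Snapshot of the OuteTTS 0.2 README usage; may have drifted since.
model_config = outetts.HFModelConfig_v1(
    model_path="OuteAI/OuteTTS-0.2-500M",
    language="en",
)
interface = outetts.InterfaceHF(model_version="0.2", cfg=model_config)

output = interface.generate(
    text="Speech synthesis is the artificial production of human speech.",
    temperature=0.1,
    repetition_penalty=1.1,
)
output.save("output.wav")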
This is conjecture, but it's possible the voice sample for XTTS is in-distribution, i.e. seen during training, and if so you'd expect it to perform better than F5 given the same reference. No knock on XTTS, btw; Kokoro is equally guilty of this: the voice used in the Arena is also in-distribution.
It would not be surprising to me if voice cloning is simply "looking up" the most similar training speaker, or an interpolation of training speakers. François Chollet has discussed this phenomenon many times wrt LLMs, and I highly recommend listening to his talks.
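To make the "lookup" hypothesis concrete, here is a toy sketch: embed the reference, then return a similarity-weighted blend of the nearest training speakers. Every name and dimension here is hypothetical; it illustrates the hypothesis, not any model's actual code:

import numpy as np

def cosine_sim(query: np.ndarray, bank: np.ndarray) -> np.ndarray:
    # Cosine similarity between one query vector and a bank of vectors.
    return (bank @ query) / (
        np.linalg.norm(bank, axis=1) * np.linalg.norm(query) + 1e-8
    )

def clone_as_lookup(ref_embedding: np.ndarray,
                    speaker_bank: np.ndarray,
                    k: int = 3) -> np.ndarray:
    # "Cloning as retrieval": blend the k training speakers nearest to
    # the reference, weighted by similarity. If the reference is
    # in-distribution, its own speaker dominates the blend.
    sims = cosine_sim(ref_embedding, speaker_bank)
    top = np.argsort(sims)[-k:]
    weights = sims[top] / sims[top].sum()
    return (weights[:, None] * speaker_bank[top]).sum(axis=0)

# Toy data: 1000 training-speaker embeddings, 192-dim (both arbitrary).
bank = np.random.randn(1000, 192).astype(np.float32)
ref = bank[42] + 0.05 * np.random.randn(192)
cloned_style = clone_as_lookup(ref, bank)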
https://hf.co/spaces/hexgrad/Kokoro-TTS/discussions/3#6744bdea8c689a7071742134
Read more and listen to before/after audio samples at https://hf.co/blog/hexgrad/kokoro-short-burst-upgrade
(Probably would have made that Article a Post instead, if audio could be embedded into Posts.)