Muchi is a finetuned speech-text foundation model and full-duplex spoken dialogue framework, based on the original Moshi model.
How to use:
You can try this model with the original Moshi web UI: start the server and point it at this model.
https://github.com/kyutai-labs/moshi
python -m moshi.server [--gradio-tunnel] [--hf-repo DavidBrowne17/Muchi]
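If you prefer to fetch the checkpoint yourself before launching the server, here is a minimal sketch using the standard huggingface_hub API (the server can also download the repo on its own when given --hf-repo as above):

```python
# Optional: pre-download the Muchi checkpoint files locally.
# snapshot_download is a standard huggingface_hub call; the moshi server can
# also fetch the repo itself via --hf-repo, so this step is not required.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="DavidBrowne17/Muchi")
print(f"Checkpoint files downloaded to: {local_dir}")
```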
Model Details
PyTorch version: weights provided in bf16 precision.
Model type: Multimodal speech-text foundation model
Language(s) (NLP): English
License: Apache 2.0
Model Description
Muchi is a refined version of the Moshi model, designed for smoother, more adaptable dialogue generation. Building on Moshi's speech-to-speech generation foundation, Muchi enhances conversational coherence and reduces latency. Like Moshi, it uses the residual vector quantizer of a neural audio codec to produce speech tokens, and it models its own speech and the user's speech as parallel token streams. This framework supports dynamic conversational flow without rigid speaker turns.
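For intuition, here is a minimal sketch of that parallel-stream layout, using the stream counts reported for the base Moshi model (one text stream, 8 RVQ codebooks for the model's own speech, 8 for the user's, at roughly 12.5 frames per second); the constants and tensor shapes below are illustrative assumptions, not Muchi's actual API:

```python
import torch

# Illustrative layout of the parallel streams over a short window of time.
# Numbers follow the base Moshi design (assumption): 1 text stream, 8 RVQ
# codebooks for the model's own speech, 8 for the user's speech, ~12.5 frames/s.
N_TEXT, N_SELF, N_USER = 1, 8, 8
FRAME_RATE_HZ = 12.5          # one step covers roughly 80 ms of audio
seconds, batch = 2.0, 1
steps = int(seconds * FRAME_RATE_HZ)

# [batch, streams, time]: every stream advances in lockstep, so the model can
# speak and listen at the same time without explicit turn-taking.
tokens = torch.zeros(batch, N_TEXT + N_SELF + N_USER, steps, dtype=torch.long)
print(tokens.shape)  # torch.Size([1, 17, 25])
```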
Muchi also implements the "Inner Monologue" method, predicting time-aligned text tokens before generating speech tokens. This approach improves linguistic quality, supports streaming speech recognition, and strengthens text-to-speech output. Muchi achieves a practical latency of approximately 200 ms, enabling near real-time interaction.
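As a toy illustration of that ordering (the predict_text and predict_audio callables below are hypothetical stand-ins, not the real decoding loop):

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class FrameOutput:
    text_token: int          # time-aligned "inner monologue" token
    audio_tokens: List[int]  # one token per RVQ codebook for this frame

def generate_frame(predict_text: Callable, predict_audio: Callable, history) -> FrameOutput:
    # 1) Predict the text token for this frame first ...
    text = predict_text(history)
    # 2) ... then the speech tokens, conditioned on that text prediction.
    audio = predict_audio(history, text)
    return FrameOutput(text, audio)
```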
Key Enhancements in Muchi:
Reduced latency and smoother conversational flow.
Enhanced adaptability in dialogue dynamics.
Improved speech synthesis quality.
Uses
Direct Use
Muchi can be deployed as a conversational agent for:
Casual conversation.
Basic factual responses and advice.
Roleplay scenarios.
Low-latency interactive tasks.
Downstream Use
Components like the audio codec can be repurposed for training speech models or enhancing text-to-speech systems (see the sketch after this list).
The finetuned architecture allows for domain-specific adaptations with additional training.
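A hedged sketch of reusing the Mimi audio codec on its own, following the loading helpers documented in the upstream moshi package (the function and constant names come from that package and should be checked against the current release):

```python
import torch
from huggingface_hub import hf_hub_download
from moshi.models import loaders

# Download and load the Mimi codec weights (names per the upstream moshi repo).
mimi_weight = hf_hub_download(loaders.DEFAULT_REPO, loaders.MIMI_NAME)
mimi = loaders.get_mimi(mimi_weight, device="cpu")
mimi.set_num_codebooks(8)  # Moshi/Muchi use 8 of the codec's codebooks

wav = torch.zeros(1, 1, 24000)      # 1 s of silence at 24 kHz, mono, [B, C, T]
with torch.no_grad():
    codes = mimi.encode(wav)        # discrete RVQ tokens, [B, K, T_frames]
    recon = mimi.decode(codes)      # waveform reconstructed from those tokens
print(codes.shape, recon.shape)
```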
Out-of-Scope Use
Muchi is not intended for:
Impersonating individuals.
Malicious applications.
Professional advice or critical decision-making.
Bias, Risks, and Limitations
Muchi inherits safeguards from Moshi but may still exhibit biases due to the nature of its training data. While toxicity has been minimized, there are risks of over-representation from certain data domains. The model is trained to produce a consistent voice and is not designed for impersonation. Further testing is necessary to evaluate long-term sociotechnical impacts.
Model tree for DavidBrowne17/Muchi
Base model: kyutai/moshika-pytorch-bf16