Mimi
Overview
The Mimi model was proposed in Moshi: a speech-text foundation model for real-time dialogue by Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave and Neil Zeghidour. Mimi is a high-fidelity audio codec model developed by the Kyutai team that combines semantic and acoustic information into audio tokens running at 12.5 Hz with a bitrate of 1.1 kbps. In other words, it maps an audio waveform into sequences of discrete “audio tokens”, drawn from several parallel “codebooks”.
The abstract from the paper is the following:
We introduce Moshi, a speech-text foundation model and full-duplex spoken dialogue framework. Current systems for spoken dialogue rely on pipelines of independent components, namely voice activity detection, speech recognition, textual dialogue and text-to-speech. Such frameworks cannot emulate the experience of real conversations. First, their complexity induces a latency of several seconds between interactions. Second, text being the intermediate modality for dialogue, non-linguistic information that modifies meaning— such as emotion or non-speech sounds— is lost in the interaction. Finally, they rely on a segmentation into speaker turns, which does not take into account overlapping speech, interruptions and interjections. Moshi solves these independent issues altogether by casting spoken dialogue as speech-to-speech generation. Starting from a text language model backbone, Moshi generates speech as tokens from the residual quantizer of a neural audio codec, while modeling separately its own speech and that of the user into parallel streams. This allows for the removal of explicit speaker turns, and the modeling of arbitrary conversational dynamics. We moreover extend the hierarchical semantic-to-acoustic token generation of previous work to first predict time-aligned text tokens as a prefix to audio tokens. Not only this “Inner Monologue” method significantly improves the linguistic quality of generated speech, but we also illustrate how it can provide streaming speech recognition and text-to-speech. Our resulting model is the first real-time full-duplex spoken large language model, with a theoretical latency of 160ms, 200ms in practice, and is available at github.com/kyutai-labs/moshi.
Its architecture is based on Encodec, with several major differences:
- it uses a much lower frame rate;
- it uses additional transformers for encoding and decoding, for better latent contextualization;
- it uses a different quantization scheme: one codebook is dedicated to semantic projection.
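These design choices surface directly in MimiConfig (documented below). As a minimal sketch, assuming the default configuration mirrors the kyutai/mimi architecture and exposes the documented arguments as attributes, the relevant knobs can be inspected like this:
>>> from transformers import MimiConfig
>>> config = MimiConfig()  # defaults follow the "kyutai/mimi" architecture
>>> # per the defaults documented below: 12.5 Hz frame rate, 32 codebooks,
>>> # one of which is reserved for the semantic projection
>>> print(config.frame_rate, config.num_quantizers, config.num_semantic_quantizers)
12.5 32 1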
Usage example
Here is a quick example of how to encode and decode an audio using this model:
>>> from datasets import load_dataset, Audio
>>> from transformers import MimiModel, AutoFeatureExtractor
>>> librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> # load model and feature extractor
>>> model = MimiModel.from_pretrained("kyutai/mimi")
>>> feature_extractor = AutoFeatureExtractor.from_pretrained("kyutai/mimi")
>>> # load audio sample
>>> librispeech_dummy = librispeech_dummy.cast_column("audio", Audio(sampling_rate=feature_extractor.sampling_rate))
>>> audio_sample = librispeech_dummy[-1]["audio"]["array"]
>>> inputs = feature_extractor(raw_audio=audio_sample, sampling_rate=feature_extractor.sampling_rate, return_tensors="pt")
>>> encoder_outputs = model.encode(inputs["input_values"], inputs["padding_mask"])
>>> audio_values = model.decode(encoder_outputs.audio_codes, inputs["padding_mask"])[0]
>>> # or the equivalent with a forward pass
>>> audio_values = model(inputs["input_values"], inputs["padding_mask"]).audio_values
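The discrete codes themselves are also accessible on the encoder output, and encode accepts a num_quantizers argument (documented below) to keep only the first codebooks for a lower-bitrate representation. A hedged sketch reusing the variables from the snippet above (the choice of 8 codebooks is illustrative):
>>> # audio_codes has shape (batch_size, num_quantizers, codes_length)
>>> audio_codes = encoder_outputs.audio_codes
>>> # keeping fewer codebooks lowers the bitrate at the cost of fidelity
>>> low_bitrate_outputs = model.encode(
...     inputs["input_values"], inputs["padding_mask"], num_quantizers=8
... )
>>> low_bitrate_audio = model.decode(low_bitrate_outputs.audio_codes, inputs["padding_mask"])[0]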
This model was contributed by Yoach Lacombe (ylacombe). The original code can be found here.
MimiConfig
class transformers.MimiConfig
< source >( sampling_rate = 24000 frame_rate = 12.5 audio_channels = 1 hidden_size = 512 num_filters = 64 num_residual_layers = 1 upsampling_ratios = None kernel_size = 7 last_kernel_size = 3 residual_kernel_size = 3 dilation_growth_rate = 2 use_causal_conv = True pad_mode = 'constant' compress = 2 trim_right_ratio = 1.0 codebook_size = 2048 codebook_dim = 256 num_quantizers = 32 use_conv_shortcut = False vector_quantization_hidden_dimension = 256 num_semantic_quantizers = 1 upsample_groups = 512 num_hidden_layers = 8 intermediate_size = 2048 num_attention_heads = 8 num_key_value_heads = 8 head_dim = None hidden_act = 'gelu' max_position_embeddings = 8000 initializer_range = 0.02 norm_eps = 1e-05 use_cache = False rope_theta = 10000.0 sliding_window = 250 attention_dropout = 0.0 layer_scale_initial_scale = 0.01 attention_bias = False **kwargs )
Parameters
- sampling_rate (`int`, optional, defaults to 24000) — The sampling rate at which the audio waveform should be digitized, expressed in hertz (Hz).
- frame_rate (`float`, optional, defaults to 12.5) — Frame rate of the model.
- audio_channels (`int`, optional, defaults to 1) — Number of channels in the audio data. Either 1 for mono or 2 for stereo.
- hidden_size (`int`, optional, defaults to 512) — Intermediate representation dimension.
- num_filters (`int`, optional, defaults to 64) — Number of convolution kernels of the first `MimiConv1d` downsampling layer.
- num_residual_layers (`int`, optional, defaults to 1) — Number of residual layers.
- upsampling_ratios (`Sequence[int]`, optional) — Kernel size and stride ratios. The encoder uses downsampling ratios instead of upsampling ratios, hence it will use the ratios in the reverse order to the ones specified here, which must match the decoder order. If not specified, defaults to `[8, 6, 5, 4]`.
- kernel_size (`int`, optional, defaults to 7) — Kernel size for the initial convolution.
- last_kernel_size (`int`, optional, defaults to 3) — Kernel size for the last convolution layer.
- residual_kernel_size (`int`, optional, defaults to 3) — Kernel size for the residual layers.
- dilation_growth_rate (`int`, optional, defaults to 2) — How much to increase the dilation with each layer.
- use_causal_conv (`bool`, optional, defaults to `True`) — Whether to use fully causal convolution.
- pad_mode (`str`, optional, defaults to `"constant"`) — Padding mode for the convolutions.
- compress (`int`, optional, defaults to 2) — Reduced dimensionality in residual branches.
- trim_right_ratio (`float`, optional, defaults to 1.0) — Ratio for trimming at the right of the transposed convolution under the `use_causal_conv = True` setup. If equal to 1.0, all the trimming is done at the right.
- codebook_size (`int`, optional, defaults to 2048) — Number of discrete codes in each codebook.
- codebook_dim (`int`, optional, defaults to 256) — Dimension of the unquantized codebook vectors. If not defined, uses `hidden_size`.
- num_quantizers (`int`, optional, defaults to 32) — Number of quantizer channels, or codebooks, in the quantizer.
- use_conv_shortcut (`bool`, optional, defaults to `False`) — Whether to use a convolutional layer as the ‘skip’ connection in the `MimiResnetBlock` block. If `False`, an identity function is used, giving a generic residual connection.
- vector_quantization_hidden_dimension (`int`, optional, defaults to 256) — Intermediate representation dimension in the residual vector quantization space.
- num_semantic_quantizers (`int`, optional, defaults to 1) — Number of semantic quantizer channels, or codebooks, in the semantic quantizer. Must be lower than `num_quantizers`.
- upsample_groups (`int`, optional, defaults to 512) — If `frame_rate != encodec_frame_rate`, indicates the number of groups used in the upsampling operation to go from one rate to another.
- num_hidden_layers (`int`, optional, defaults to 8) — Number of hidden layers in the Transformer models.
- intermediate_size (`int`, optional, defaults to 2048) — Dimension of the MLP representations.
- num_attention_heads (`int`, optional, defaults to 8) — Number of attention heads for each attention layer in the Transformer encoder.
- num_key_value_heads (`int`, optional, defaults to 8) — Number of key/value heads used to implement Grouped Query Attention (GQA). If `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA); if `num_key_value_heads=1`, the model will use Multi Query Attention (MQA); otherwise GQA is used. When converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed by mean-pooling all the original heads within that group. For more details, check out this paper. If not specified, will default to 8.
- head_dim (`int`, optional, defaults to `hidden_size // num_attention_heads`) — The attention head dimension.
- hidden_act (`str` or `function`, optional, defaults to `"gelu"`) — The non-linear activation function (function or string) in the decoder.
- max_position_embeddings (`int`, optional, defaults to 8000) — The maximum sequence length that this model might ever be used with. Mimi’s sliding window attention allows sequences of up to 8000 tokens.
- initializer_range (`float`, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
- norm_eps (`float`, optional, defaults to 1e-05) — The epsilon used by the LayerNorm normalization layers.
- use_cache (`bool`, optional, defaults to `False`) — Whether or not the model should return the last key/value attentions (not used by all models). Only relevant if `config.is_decoder=True`.
- rope_theta (`float`, optional, defaults to 10000.0) — The base period of the RoPE embeddings.
- sliding_window (`int`, optional, defaults to 250) — Sliding window attention window size. If not specified, will default to 250.
- attention_dropout (`float`, optional, defaults to 0.0) — The dropout ratio for the attention probabilities.
- layer_scale_initial_scale (`float`, optional, defaults to 0.01) — Initial scale of the residual rescaling operation done in the Transformer models.
- attention_bias (`bool`, optional, defaults to `False`) — Whether to use a bias in the query, key, value and output projection layers during self-attention.
This is the configuration class to store the configuration of a MimiModel. It is used to instantiate a Mimi model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the kyutai/mimi architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
Example:
>>> from transformers import MimiModel, MimiConfig
>>> # Initializing a "kyutai/mimi" style configuration
>>> configuration = MimiConfig()
>>> # Initializing a model (with random weights) from the "kyutai/mimi" style configuration
>>> model = MimiModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
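Any of the arguments listed above can be overridden to describe a custom architecture. A minimal sketch with illustrative values (this does not correspond to a released checkpoint):
>>> from transformers import MimiConfig, MimiModel
>>> # illustrative values only, chosen for the sake of the example
>>> custom_configuration = MimiConfig(num_quantizers=16, sliding_window=125)
>>> custom_model = MimiModel(custom_configuration)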
MimiModel
class transformers.MimiModel
< source >( config: MimiConfig )
Parameters
- config (MimiConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The Mimi neural audio codec model. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
decode
< source >( audio_codes: Tensor padding_mask: typing.Optional[torch.Tensor] = None decoder_past_key_values: typing.Union[transformers.cache_utils.Cache, typing.List[torch.FloatTensor], NoneType] = None return_dict: typing.Optional[bool] = None )
Parameters
- audio_codes (`torch.LongTensor` of shape `(batch_size, num_quantizers, codes_length)`, optional) — Discrete code embeddings computed using `model.encode`.
- padding_mask (`torch.Tensor` of shape `(batch_size, channels, sequence_length)`) — Indicates which inputs are to be ignored due to padding, where elements are either 1 for not masked or 0 for masked.
- decoder_past_key_values (`Cache`, optional) — Pre-computed hidden-states (key and values in the self-attention blocks) that can be used to speed up sequential decoding of the decoder transformer. This typically consists of the `past_key_values` returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`. The model will output the same cache format that is fed as input. If `past_key_values` are used, the user can optionally input only the last `audio_values` or `audio_codes` (those that don’t have their past key value states given to this model).
- return_dict (`bool`, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Decodes the given frames into an output audio waveform.
Note that the output might be slightly longer than the input. In that case, any extra steps at the end can be trimmed.
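A hedged sketch of that trimming, reusing model, encoder_outputs and inputs from the usage example above and assuming the decoded waveform is at least as long as the original input:
>>> audio_values = model.decode(encoder_outputs.audio_codes, inputs["padding_mask"])[0]
>>> # the decoder may emit a few extra samples at the end; trim back to the input length
>>> audio_values = audio_values[..., : inputs["input_values"].shape[-1]]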
encode
< source >( input_values: Tensor padding_mask: Tensor = None num_quantizers: typing.Optional[float] = None encoder_past_key_values: typing.Union[transformers.cache_utils.Cache, typing.List[torch.FloatTensor], NoneType] = None return_dict: typing.Optional[bool] = None )
Parameters
- input_values (`torch.Tensor` of shape `(batch_size, channels, sequence_length)`) — Float values of the input audio waveform.
- padding_mask (`torch.Tensor` of shape `(batch_size, channels, sequence_length)`) — Indicates which inputs are to be ignored due to padding, where elements are either 1 for not masked or 0 for masked.
- num_quantizers (`int`, optional) — Number of quantizers (i.e. codebooks) to use. By default, all quantizers are used.
- encoder_past_key_values (`Cache`, optional) — Pre-computed hidden-states (key and values in the self-attention blocks) that can be used to speed up sequential decoding of the encoder transformer. This typically consists of the `past_key_values` returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`. The model will output the same cache format that is fed as input. If `past_key_values` are used, the user can optionally input only the last `audio_values` or `audio_codes` (those that don’t have their past key value states given to this model).
- return_dict (`bool`, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Encodes the input audio waveform into discrete codes.
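Since the returned codes are indices into codebooks of size codebook_size (2048 by default), a quick sanity check can make the discrete nature of the output explicit. A hedged sketch reusing model and inputs from the usage example above:
>>> encoder_outputs = model.encode(inputs["input_values"], inputs["padding_mask"])
>>> codes = encoder_outputs.audio_codes
>>> # every code is an integer index into a codebook of size `config.codebook_size`
>>> bool(((codes >= 0) & (codes < model.config.codebook_size)).all())
True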
forward
< source >( input_values: Tensor padding_mask: typing.Optional[torch.Tensor] = None num_quantizers: typing.Optional[int] = None audio_codes: typing.Optional[torch.Tensor] = None encoder_past_key_values: typing.Union[transformers.cache_utils.Cache, typing.List[torch.FloatTensor], NoneType] = None decoder_past_key_values: typing.Union[transformers.cache_utils.Cache, typing.List[torch.FloatTensor], NoneType] = None return_dict: typing.Optional[bool] = None ) → transformers.models.mimi.modeling_mimi.MimiOutput or tuple(torch.FloatTensor)
Parameters
- input_values (`torch.FloatTensor` of shape `(batch_size, channels, sequence_length)`, optional) — Raw audio input converted to float.
- padding_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, optional) — Indicates which inputs are to be ignored due to padding, where elements are either 1 for not masked or 0 for masked.
- num_quantizers (`int`, optional) — Number of quantizers (i.e. codebooks) to use. By default, all quantizers are used.
- audio_codes (`torch.LongTensor` of shape `(batch_size, num_quantizers, codes_length)`, optional) — Discrete code embeddings computed using `model.encode`.
- encoder_past_key_values (`Cache`, optional) — Pre-computed hidden-states (key and values in the self-attention blocks) that can be used to speed up sequential decoding of the encoder transformer. This typically consists of the `past_key_values` returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`. The model will output the same cache format that is fed as input. If `past_key_values` are used, the user can optionally input only the last `audio_values` or `audio_codes` (those that don’t have their past key value states given to this model).
- decoder_past_key_values (`Cache`, optional) — Pre-computed hidden-states (key and values in the self-attention blocks) that can be used to speed up sequential decoding of the decoder transformer. This typically consists of the `past_key_values` returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`. The model will output the same cache format that is fed as input. If `past_key_values` are used, the user can optionally input only the last `audio_values` or `audio_codes` (those that don’t have their past key value states given to this model).
- return_dict (`bool`, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Returns
`transformers.models.mimi.modeling_mimi.MimiOutput` or `tuple(torch.FloatTensor)`
A `transformers.models.mimi.modeling_mimi.MimiOutput` or a tuple of `torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various elements depending on the configuration (MimiConfig) and inputs.
- audio_codes (`torch.LongTensor` of shape `(batch_size, num_quantizers, codes_length)`, optional) — Discrete code embeddings computed using `model.encode`.
- audio_values (`torch.FloatTensor` of shape `(batch_size, sequence_length)`, optional) — Decoded audio values, obtained using the decoder part of Mimi.
- encoder_past_key_values (`Cache`, optional) — Pre-computed hidden-states (key and values in the self-attention blocks) that can be used to speed up sequential decoding of the encoder transformer. This typically consists of the `past_key_values` returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`. The model will output the same cache format that is fed as input. If `past_key_values` are used, the user can optionally input only the last `audio_values` or `audio_codes` (those that don’t have their past key value states given to this model).
- decoder_past_key_values (`Cache`, optional) — Pre-computed hidden-states (key and values in the self-attention blocks) that can be used to speed up sequential decoding of the decoder transformer. This typically consists of the `past_key_values` returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`. The model will output the same cache format that is fed as input. If `past_key_values` are used, the user can optionally input only the last `audio_values` or `audio_codes` (those that don’t have their past key value states given to this model).
The MimiModel forward method overrides the `__call__` special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Examples:
>>> from datasets import load_dataset
>>> from transformers import AutoFeatureExtractor, MimiModel
>>> dataset = load_dataset("hf-internal-testing/ashraq-esc50-1-dog-example")
>>> audio_sample = dataset["train"]["audio"][0]["array"]
>>> model_id = "kyutai/mimi"
>>> model = MimiModel.from_pretrained(model_id)
>>> feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
>>> inputs = feature_extractor(raw_audio=audio_sample, return_tensors="pt")
>>> outputs = model(**inputs)
>>> audio_codes = outputs.audio_codes
>>> audio_values = outputs.audio_values
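If the codes have already been computed, the audio_codes argument described above suggests they can be fed to the forward pass directly instead of re-encoding the waveform. A hedged sketch reusing the variables from the example (assuming the encoder step is skipped when audio_codes is provided):
>>> # assumption: forward decodes the precomputed codes rather than re-encoding input_values
>>> audio_values = model(inputs["input_values"], audio_codes=audio_codes).audio_values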