Bark
Overview
Bark is a transformer-based text-to-speech model proposed by Suno AI in suno-ai/bark.
Bark is made of 4 main models:
- BarkSemanticModel (also referred to as the ‘text’ model): a causal autoregressive transformer model that takes tokenized text as input and predicts semantic text tokens that capture the meaning of the text.
- BarkCoarseModel (also referred to as the ‘coarse acoustics’ model): a causal autoregressive transformer that takes as input the results of the BarkSemanticModel. It aims at predicting the first two audio codebooks necessary for EnCodec.
- BarkFineModel (the ‘fine acoustics’ model), this time a non-causal autoencoder transformer, which iteratively predicts the last codebooks based on the sum of the previous codebook embeddings.
- Having predicted all the codebook channels from the EncodecModel, Bark uses it to decode the output audio array.
Note that each of the first three modules can support conditional speaker embeddings to condition the output sound according to a specific predefined voice.
Optimizing Bark
Bark can be optimized with just a few extra lines of code, which significantly reduce its memory footprint and speed up inference.
Using half-precision
You can speed up inference and save 50% of the memory footprint simply by loading the model in half-precision.
from transformers import BarkModel
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
model = BarkModel.from_pretrained("suno/bark-small", torch_dtype=torch.float16).to(device)
Using 🤗 Better Transformer
Better Transformer is an 🤗 Optimum feature that performs kernel fusion under the hood. You can gain 20% to 30% in speed with zero performance degradation. It only requires one line of code to export the model to 🤗 Better Transformer:
model = model.to_bettertransformer()
Note that 🤗 Optimum must be installed before using this feature. Find out how to install it here.
Using CPU offload
As mentioned above, Bark is made up of 4 sub-models, which are called sequentially during audio generation. In other words, while one sub-model is in use, the other sub-models are idle.
If you're using a CUDA device, a simple solution to benefit from an 80% reduction in memory footprint is to offload the idle sub-models from the GPU. This operation is called CPU offload. You can use it with one line of code:
model.enable_cpu_offload()
Note that 🤗 Accelerate must be installed before using this feature. Find out how to install it here.
Combining optimization techniques
You can combine optimization techniques, and use CPU offload, half-precision and 🤗 Better Transformer all at once.
from transformers import BarkModel
from optimum.bettertransformer import BetterTransformer
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
# load in fp16
model = BarkModel.from_pretrained("suno/bark-small", torch_dtype=torch.float16).to(device)
# convert to bettertransformer
model = BetterTransformer.transform(model, keep_original_model=False)
# enable CPU offload
model.enable_cpu_offload()
Find out more on inference optimization techniques here.
Tips
Suno offers a library of voice presets in a number of languages here. These presets are also uploaded in the hub here or here.
>>> from transformers import AutoProcessor, BarkModel
>>> processor = AutoProcessor.from_pretrained("suno/bark")
>>> model = BarkModel.from_pretrained("suno/bark")
>>> voice_preset = "v2/en_speaker_6"
>>> inputs = processor("Hello, my dog is cute", voice_preset=voice_preset)
>>> audio_array = model.generate(**inputs)
>>> audio_array = audio_array.cpu().numpy().squeeze()
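To listen to the result interactively, one option (a minimal sketch, assuming a Jupyter/IPython environment) is to wrap the array with IPython.display.Audio, reusing the sample rate stored in the model's generation config:

>>> from IPython.display import Audio

>>> # the sampling rate Bark generates at is stored in the generation config
>>> sample_rate = model.generation_config.sample_rate
>>> Audio(audio_array, rate=sample_rate)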
Bark can generate highly realistic multilingual speech as well as other audio - including music, background noise and simple sound effects.
>>> # Multilingual speech - simplified Chinese
>>> inputs = processor("惊人的!我会说中文")
>>> # Multilingual speech - French - let's use a voice_preset as well
>>> inputs = processor("Incroyable! Je peux générer du son.", voice_preset="fr_speaker_5")
>>> # Bark can also generate music. You can help it out by adding music notes around your lyrics.
>>> inputs = processor("♪ Hello, my dog is cute ♪")
>>> audio_array = model.generate(**inputs)
>>> audio_array = audio_array.cpu().numpy().squeeze()
The model can also produce nonverbal communications like laughing, sighing and crying.
>>> # Adding non-speech cues to the input text
>>> inputs = processor("Hello uh ... [clears throat], my dog is cute [laughter]")
>>> audio_array = model.generate(**inputs)
>>> audio_array = audio_array.cpu().numpy().squeeze()
To save the audio, simply take the sample rate from the model config and some scipy utility:
>>> from scipy.io.wavfile import write as write_wav
>>> # save audio to disk, but first take the sample rate from the model config
>>> sample_rate = model.generation_config.sample_rate
>>> write_wav("bark_generation.wav", sample_rate, audio_array)
This model was contributed by Yoach Lacombe (ylacombe) and Sanchit Gandhi (sanchit-gandhi). The original code can be found here.
BarkConfig
class transformers.BarkConfig
< source >( semantic_config: typing.Dict = None coarse_acoustics_config: typing.Dict = None fine_acoustics_config: typing.Dict = None codec_config: typing.Dict = None initializer_range = 0.02 **kwargs )
Parameters
- semantic_config (BarkSemanticConfig, optional) — Configuration of the underlying semantic sub-model.
- coarse_acoustics_config (BarkCoarseConfig, optional) — Configuration of the underlying coarse acoustics sub-model.
- fine_acoustics_config (BarkFineConfig, optional) — Configuration of the underlying fine acoustics sub-model.
- codec_config (AutoConfig, optional) — Configuration of the underlying codec sub-model.
This is the configuration class to store the configuration of a BarkModel. It is used to instantiate a Bark model according to the specified sub-models configurations, defining the model architecture.
Instantiating a configuration with the defaults will yield a similar configuration to that of the Bark suno/bark architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
Example:
>>> from transformers import (
...     BarkSemanticConfig,
...     BarkCoarseConfig,
...     BarkFineConfig,
...     BarkModel,
...     BarkConfig,
...     AutoConfig,
... )
from_sub_model_configs
< source >( semantic_config: BarkSemanticConfig coarse_acoustics_config: BarkCoarseConfig fine_acoustics_config: BarkFineConfig codec_config: PretrainedConfig **kwargs ) → BarkConfig
Instantiate a BarkConfig (or a derived class) from bark sub-models configuration.
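As an illustration, here is a minimal sketch of building a BarkConfig from freshly initialized sub-model configurations and instantiating a model from it; the facebook/encodec_24khz checkpoint is assumed here for the codec configuration:

>>> from transformers import (
...     BarkSemanticConfig,
...     BarkCoarseConfig,
...     BarkFineConfig,
...     BarkConfig,
...     BarkModel,
...     AutoConfig,
... )

>>> # default (randomly initialized) sub-model configurations
>>> semantic_config = BarkSemanticConfig()
>>> coarse_acoustics_config = BarkCoarseConfig()
>>> fine_acoustics_config = BarkFineConfig()
>>> codec_config = AutoConfig.from_pretrained("facebook/encodec_24khz")

>>> # combine them into a full Bark configuration and build the model
>>> config = BarkConfig.from_sub_model_configs(
...     semantic_config, coarse_acoustics_config, fine_acoustics_config, codec_config
... )
>>> model = BarkModel(config)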
BarkProcessor
class transformers.BarkProcessor
< source >( tokenizer speaker_embeddings = None )
Parameters
- tokenizer (PreTrainedTokenizer) — An instance of PreTrainedTokenizer.
- speaker_embeddings (Dict[Dict[str]], optional) — Optional nested speaker embeddings dictionary. The first level contains voice preset names (e.g. "en_speaker_4"). The second level contains "semantic_prompt", "coarse_prompt" and "fine_prompt" embeddings. The values correspond to the path of the corresponding np.ndarray. See here for a list of voice_preset_names.
Constructs a Bark processor which wraps a text tokenizer and optional Bark voice presets into a single processor.
__call__
< source >( text = None voice_preset = None return_tensors = 'pt' max_length = 256 add_special_tokens = False return_attention_mask = True return_token_type_ids = False **kwargs ) → Tuple(BatchEncoding, BatchFeature)
Parameters
- text (str, List[str], List[List[str]]) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as a list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
- voice_preset (str, Dict[np.ndarray]) — The voice preset, i.e. the speaker embeddings. It can either be a valid voice_preset name, e.g. "en_speaker_1", or directly a dictionary of np.ndarray embeddings for each submodel of Bark. Or it can be a valid file name of a local .npz single voice preset.
- return_tensors (str or TensorType, optional) — If set, will return tensors of a particular framework. Acceptable values are:
  - 'pt': Return PyTorch torch.Tensor objects.
  - 'np': Return NumPy np.ndarray objects.
Returns
Tuple(BatchEncoding, BatchFeature) — A tuple composed of a BatchEncoding, i.e. the output of the tokenizer, and a BatchFeature, i.e. the voice preset with the right tensors type.
Main method to prepare one or several sequence(s) for the model. This method forwards the text and kwargs arguments to the AutoTokenizer's __call__() to encode the text. The method also proposes a voice preset, which is a dictionary of arrays that conditions Bark's output. kwargs arguments are forwarded to the tokenizer and to the cached_file method if voice_preset is a valid filename.
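For illustration, a short sketch (reusing the suno/bark-small checkpoint and the v2/en_speaker_6 preset shown elsewhere on this page) of passing the processor output straight to BarkModel.generate:

>>> from transformers import BarkProcessor, BarkModel

>>> processor = BarkProcessor.from_pretrained("suno/bark-small")
>>> model = BarkModel.from_pretrained("suno/bark-small")

>>> # encode the text and attach a voice preset by name
>>> inputs = processor("Hello, my dog is cute", voice_preset="v2/en_speaker_6")
>>> audio_array = model.generate(**inputs)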
from_pretrained
< source >( pretrained_processor_name_or_path speaker_embeddings_dict_path = 'speaker_embeddings_path.json' **kwargs )
Parameters
- pretrained_model_name_or_path (str or os.PathLike) — This can be either:
  - a string, the model id of a pretrained BarkProcessor hosted inside a model repo on huggingface.co.
  - a path to a directory containing a processor saved using the save_pretrained() method, e.g., ./my_model_directory/.
- speaker_embeddings_dict_path (str, optional, defaults to "speaker_embeddings_path.json") — The name of the .json file containing the speaker_embeddings dictionary located in pretrained_model_name_or_path. If None, no speaker_embeddings is loaded.
- **kwargs — Additional keyword arguments passed along to ~tokenization_utils_base.PreTrainedTokenizer.from_pretrained.
Instantiate a Bark processor associated with a pretrained model.
save_pretrained
< source >( save_directory speaker_embeddings_dict_path = 'speaker_embeddings_path.json' speaker_embeddings_directory = 'speaker_embeddings' push_to_hub: bool = False **kwargs )
Parameters
- save_directory (str or os.PathLike) — Directory where the tokenizer files and the speaker embeddings will be saved (directory will be created if it does not exist).
- speaker_embeddings_dict_path (str, optional, defaults to "speaker_embeddings_path.json") — The name of the .json file that will contain the speaker_embeddings nested path dictionary, if it exists, and that will be located in pretrained_model_name_or_path/speaker_embeddings_directory.
- speaker_embeddings_directory (str, optional, defaults to "speaker_embeddings/") — The name of the folder in which the speaker_embeddings arrays will be saved.
- push_to_hub (bool, optional, defaults to False) — Whether or not to push your model to the Hugging Face model hub after saving it. You can specify the repository you want to push to with repo_id (will default to the name of save_directory in your namespace).
- kwargs — Additional keyword arguments passed along to the push_to_hub() method.
Saves the attributes of this processor (tokenizer…) in the specified directory so that it can be reloaded using the from_pretrained() method.
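A small sketch of the save/reload round trip; "bark_processor" is just an example directory name:

>>> from transformers import BarkProcessor

>>> processor = BarkProcessor.from_pretrained("suno/bark-small")

>>> # save the tokenizer files and speaker embeddings locally
>>> processor.save_pretrained("bark_processor")

>>> # reload the processor from the local directory
>>> processor = BarkProcessor.from_pretrained("bark_processor")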
BarkModel
class transformers.BarkModel
< source >( config )
Parameters
- config (BarkConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The full Bark model, a text-to-speech model composed of 4 sub-models:
- BarkSemanticModel (also referred to as the ‘text’ model): a causal auto-regressive transformer model that takes as input tokenized text, and predicts semantic text tokens that capture the meaning of the text.
- BarkCoarseModel (also referred to as the ‘coarse acoustics’ model), also a causal autoregressive transformer, that takes as input the results of the last model. It aims at regressing the first two audio codebooks necessary for EnCodec.
- BarkFineModel (the ‘fine acoustics’ model), this time a non-causal autoencoder transformer, which iteratively predicts the last codebooks based on the sum of the previous codebook embeddings.
- Having predicted all the codebook channels from the EncodecModel, Bark uses it to decode the output audio array.
It should be noted that each of the first three modules can support conditional speaker embeddings to condition the output sound according to a specific predefined voice.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
generate
< source >( input_ids: typing.Optional[torch.Tensor] = None history_prompt: typing.Optional[typing.Dict[str, torch.Tensor]] = None return_output_lengths: typing.Optional[bool] = None **kwargs ) → By default
Parameters
- input_ids (Optional[torch.Tensor] of shape (batch_size, seq_len), optional) — Input ids. Will be truncated up to 256 tokens. Note that the output audios will be as long as the longest generation among the batch.
- history_prompt (Optional[Dict[str, torch.Tensor]], optional) — Optional Bark speaker prompt. Note that for now, this model takes only one speaker prompt per batch.
- kwargs (optional) — Remaining dictionary of keyword arguments. Keyword arguments are of two types:
  - Without a prefix, they will be entered as **kwargs for the generate method of each sub-model.
  - With a semantic_, coarse_, fine_ prefix, they will be input for the generate method of the semantic, coarse and fine sub-models respectively. It has priority over the keywords without a prefix. This means you can, for example, specify a generation strategy for all sub-models except one.
- return_output_lengths (bool, optional) — Whether or not to return the waveform lengths. Useful when batching.
Returns
By default:
- audio_waveform (torch.Tensor of shape (batch_size, seq_len)): Generated audio waveform.
When return_output_lengths=True, returns a tuple made of:
- audio_waveform (torch.Tensor of shape (batch_size, seq_len)): Generated audio waveform.
- output_lengths (torch.Tensor of shape (batch_size)): The length of each waveform in the batch.
Generates audio from an input prompt and an additional optional Bark speaker prompt.
Example:
>>> from transformers import AutoProcessor, BarkModel
>>> processor = AutoProcessor.from_pretrained("suno/bark-small")
>>> model = BarkModel.from_pretrained("suno/bark-small")
>>> # To add a voice preset, you can pass `voice_preset` to `BarkProcessor.__call__(...)`
>>> voice_preset = "v2/en_speaker_6"
>>> inputs = processor("Hello, my dog is cute, I need him in my life", voice_preset=voice_preset)
>>> audio_array = model.generate(**inputs, semantic_max_new_tokens=100)
>>> audio_array = audio_array.cpu().numpy().squeeze()
enable_cpu_offload
< source >( gpu_id: typing.Optional[int] = 0 )
Offloads all sub-models to CPU using accelerate, reducing memory usage with a low impact on performance. This method moves one whole sub-model at a time to the GPU when it is used, and the sub-model remains on the GPU until the next sub-model runs.
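A minimal usage sketch (assuming 🤗 Accelerate is installed and a CUDA device is available); gpu_id selects the device the active sub-model is moved to:

>>> from transformers import BarkModel

>>> model = BarkModel.from_pretrained("suno/bark-small")
>>> # offload idle sub-models to CPU, keeping only the active one on GPU 0
>>> model.enable_cpu_offload(gpu_id=0)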
BarkSemanticModel
class transformers.BarkSemanticModel
< source >( config )
Parameters
- config (BarkSemanticConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
Bark semantic (or text) model. It shares the same architecture as the coarse model. It is a GPT-2 like autoregressive model with a language modeling head on top. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
forward
< source >( input_ids: typing.Optional[torch.Tensor] = None past_key_values: typing.Optional[typing.Tuple[torch.FloatTensor]] = None attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.Tensor] = None head_mask: typing.Optional[torch.Tensor] = None labels: typing.Optional[torch.LongTensor] = None input_embeds: typing.Optional[torch.Tensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None )
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. What are input IDs?
- past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding. If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don't have their past key value states given to this model) of shape (batch_size, 1) instead of all input_ids of shape (batch_size, sequence_length).
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
- position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1].
- head_mask (torch.Tensor of shape (encoder_layers, encoder_attention_heads), optional) — Mask to nullify selected heads of the attention modules in the encoder. Mask values selected in [0, 1]:
  - 1 indicates the head is not masked,
  - 0 indicates the head is masked.
- input_embeds (torch.FloatTensor of shape (batch_size, input_sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. Here, due to Bark particularities, if past_key_values is used, input_embeds will be ignored and you have to use input_ids. If past_key_values is not used and use_cache is set to True, input_embeds is used in priority instead of input_ids.
- use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
- output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
- output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
- return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
The BarkCausalModel forward method overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
BarkCoarseModel
class transformers.BarkCoarseModel
< source >( config )
Parameters
- config (BarkCoarseConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
Bark coarse acoustics model. It shares the same architecture as the semantic (or text) model. It is a GPT-2 like autoregressive model with a language modeling head on top. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
forward
< source >( input_ids: typing.Optional[torch.Tensor] = None past_key_values: typing.Optional[typing.Tuple[torch.FloatTensor]] = None attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.Tensor] = None head_mask: typing.Optional[torch.Tensor] = None labels: typing.Optional[torch.LongTensor] = None input_embeds: typing.Optional[torch.Tensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None )
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. What are input IDs?
- past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding. If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don't have their past key value states given to this model) of shape (batch_size, 1) instead of all input_ids of shape (batch_size, sequence_length).
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
- position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1].
- head_mask (torch.Tensor of shape (encoder_layers, encoder_attention_heads), optional) — Mask to nullify selected heads of the attention modules in the encoder. Mask values selected in [0, 1]:
  - 1 indicates the head is not masked,
  - 0 indicates the head is masked.
- input_embeds (torch.FloatTensor of shape (batch_size, input_sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. Here, due to Bark particularities, if past_key_values is used, input_embeds will be ignored and you have to use input_ids. If past_key_values is not used and use_cache is set to True, input_embeds is used in priority instead of input_ids.
- use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
- output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
- output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
- return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
The BarkCausalModel forward method overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
BarkFineModel
class transformers.BarkFineModel
< source >( config )
Parameters
- config (BarkFineConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
Bark fine acoustics model. It is a non-causal GPT-like model with config.n_codes_total embedding layers and language modeling heads, one for each codebook.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
forward
< source >( codebook_idx: int input_ids: typing.Optional[torch.Tensor] = None attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.Tensor] = None head_mask: typing.Optional[torch.Tensor] = None labels: typing.Optional[torch.LongTensor] = None input_embeds: typing.Optional[torch.Tensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None )
Parameters
- codebook_idx (int) — Index of the codebook that will be predicted.
- input_ids (torch.LongTensor of shape (batch_size, sequence_length, number_of_codebooks)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it. Initially, indices of the first two codebooks are obtained from the coarse sub-model. The rest is predicted recursively by attending the previously predicted channels. The model predicts on windows of length 1024.
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
- position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1].
- head_mask (torch.Tensor of shape (encoder_layers, encoder_attention_heads), optional) — Mask to nullify selected heads of the attention modules in the encoder. Mask values selected in [0, 1]:
  - 1 indicates the head is not masked,
  - 0 indicates the head is masked.
- labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — NOT IMPLEMENTED YET.
- input_embeds (torch.FloatTensor of shape (batch_size, input_sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. If past_key_values is used, optionally only the last input_embeds have to be input (see past_key_values). This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
- output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
- output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
- return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
The BarkFineModel forward method overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
BarkCausalModel
forward
< source >( input_ids: typing.Optional[torch.Tensor] = None past_key_values: typing.Optional[typing.Tuple[torch.FloatTensor]] = None attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.Tensor] = None head_mask: typing.Optional[torch.Tensor] = None labels: typing.Optional[torch.LongTensor] = None input_embeds: typing.Optional[torch.Tensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None )
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. What are input IDs?
- past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding. If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don't have their past key value states given to this model) of shape (batch_size, 1) instead of all input_ids of shape (batch_size, sequence_length).
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
- position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1].
- head_mask (torch.Tensor of shape (encoder_layers, encoder_attention_heads), optional) — Mask to nullify selected heads of the attention modules in the encoder. Mask values selected in [0, 1]:
  - 1 indicates the head is not masked,
  - 0 indicates the head is masked.
- input_embeds (torch.FloatTensor of shape (batch_size, input_sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. Here, due to Bark particularities, if past_key_values is used, input_embeds will be ignored and you have to use input_ids. If past_key_values is not used and use_cache is set to True, input_embeds is used in priority instead of input_ids.
- use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
- output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
- output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
- return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
The BarkCausalModel forward method overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
BarkCoarseConfig
class transformers.BarkCoarseConfig
< source >( block_size = 1024 input_vocab_size = 10048 output_vocab_size = 10048 num_layers = 12 num_heads = 12 hidden_size = 768 dropout = 0.0 bias = True initializer_range = 0.02 use_cache = True **kwargs )
Parameters
- block_size (int, optional, defaults to 1024) — The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
- input_vocab_size (int, optional, defaults to 10_048) — Vocabulary size of a Bark sub-model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling BarkCoarseModel. Defaults to 10_048 but should be carefully thought with regards to the chosen sub-model.
- output_vocab_size (int, optional, defaults to 10_048) — Output vocabulary size of a Bark sub-model. Defines the number of different tokens that can be represented by the output_ids when passing forward a BarkCoarseModel. Defaults to 10_048 but should be carefully thought with regards to the chosen sub-model.
- num_layers (int, optional, defaults to 12) — Number of hidden layers in the given sub-model.
- num_heads (int, optional, defaults to 12) — Number of attention heads for each attention layer in the Transformer architecture.
- hidden_size (int, optional, defaults to 768) — Dimensionality of the “intermediate” (often named feed-forward) layer in the architecture.
- dropout (float, optional, defaults to 0.0) — The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
- bias (bool, optional, defaults to True) — Whether or not to use bias in the linear layers and layer norm layers.
- initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
- use_cache (bool, optional, defaults to True) — Whether or not the model should return the last key/values attentions (not used by all models).
This is the configuration class to store the configuration of a BarkCoarseModel. It is used to instantiate the model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the Bark suno/bark architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
Example:
>>> from transformers import BarkCoarseConfig, BarkCoarseModel
>>> # Initializing a Bark sub-module style configuration
>>> configuration = BarkCoarseConfig()
>>> # Initializing a model (with random weights) from the suno/bark style configuration
>>> model = BarkCoarseModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
BarkFineConfig
class transformers.BarkFineConfig
< source >( tie_word_embeddings = True n_codes_total = 8 n_codes_given = 1 **kwargs )
Parameters
- block_size (int, optional, defaults to 1024) — The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
- input_vocab_size (int, optional, defaults to 10_048) — Vocabulary size of a Bark sub-model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling BarkFineModel. Defaults to 10_048 but should be carefully thought with regards to the chosen sub-model.
- output_vocab_size (int, optional, defaults to 10_048) — Output vocabulary size of a Bark sub-model. Defines the number of different tokens that can be represented by the output_ids when passing forward a BarkFineModel. Defaults to 10_048 but should be carefully thought with regards to the chosen sub-model.
- num_layers (int, optional, defaults to 12) — Number of hidden layers in the given sub-model.
- num_heads (int, optional, defaults to 12) — Number of attention heads for each attention layer in the Transformer architecture.
- hidden_size (int, optional, defaults to 768) — Dimensionality of the “intermediate” (often named feed-forward) layer in the architecture.
- dropout (float, optional, defaults to 0.0) — The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
- bias (bool, optional, defaults to True) — Whether or not to use bias in the linear layers and layer norm layers.
- initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
- use_cache (bool, optional, defaults to True) — Whether or not the model should return the last key/values attentions (not used by all models).
- n_codes_total (int, optional, defaults to 8) — The total number of audio codebooks predicted. Used in the fine acoustics sub-model.
- n_codes_given (int, optional, defaults to 1) — The number of audio codebooks predicted in the coarse acoustics sub-model. Used in the acoustics sub-models.
This is the configuration class to store the configuration of a BarkFineModel. It is used to instantiate the model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the Bark suno/bark architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
Example:
>>> from transformers import BarkFineConfig, BarkFineModel
>>> # Initializing a Bark sub-module style configuration
>>> configuration = BarkFineConfig()
>>> # Initializing a model (with random weights) from the suno/bark style configuration
>>> model = BarkFineModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
BarkSemanticConfig
class transformers.BarkSemanticConfig
< source >( block_size = 1024 input_vocab_size = 10048 output_vocab_size = 10048 num_layers = 12 num_heads = 12 hidden_size = 768 dropout = 0.0 bias = True initializer_range = 0.02 use_cache = True **kwargs )
Parameters
- block_size (int, optional, defaults to 1024) — The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
- input_vocab_size (int, optional, defaults to 10_048) — Vocabulary size of a Bark sub-model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling BarkSemanticModel. Defaults to 10_048 but should be carefully thought with regards to the chosen sub-model.
- output_vocab_size (int, optional, defaults to 10_048) — Output vocabulary size of a Bark sub-model. Defines the number of different tokens that can be represented by the output_ids when passing forward a BarkSemanticModel. Defaults to 10_048 but should be carefully thought with regards to the chosen sub-model.
- num_layers (int, optional, defaults to 12) — Number of hidden layers in the given sub-model.
- num_heads (int, optional, defaults to 12) — Number of attention heads for each attention layer in the Transformer architecture.
- hidden_size (int, optional, defaults to 768) — Dimensionality of the “intermediate” (often named feed-forward) layer in the architecture.
- dropout (float, optional, defaults to 0.0) — The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
- bias (bool, optional, defaults to True) — Whether or not to use bias in the linear layers and layer norm layers.
- initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
- use_cache (bool, optional, defaults to True) — Whether or not the model should return the last key/values attentions (not used by all models).
This is the configuration class to store the configuration of a BarkSemanticModel. It is used to instantiate the model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the Bark suno/bark architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
Example:
>>> from transformers import BarkSemanticConfig, BarkSemanticModel
>>> # Initializing a Bark sub-module style configuration
>>> configuration = BarkSemanticConfig()
>>> # Initializing a model (with random weights) from the suno/bark style configuration
>>> model = BarkSemanticModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config