Bark is a transformer-based text-to-speech model proposed by Suno AI in suno-ai/bark.
Bark is made of 4 main models:

- BarkSemanticModel (also referred to as the 'text' model): a causal autoregressive transformer that takes tokenized text as input and predicts semantic text tokens capturing the meaning of the text.
- BarkCoarseModel (the 'coarse acoustics' model): a causal autoregressive transformer that takes the output of BarkSemanticModel as input and predicts the first two audio codebooks necessary for EnCodec.
- BarkFineModel (the 'fine acoustics' model): a non-causal autoencoder transformer that iteratively predicts the remaining codebooks based on the sum of the previous codebook embeddings.
- EncodecModel: once all the codebook channels have been predicted, Bark uses it to decode the output audio array.

It should be noted that each of the first three modules can support conditional speaker embeddings to condition the output sound according to a specific predefined voice.
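The composition can be inspected directly on a loaded model. The sketch below is illustrative only; the attribute names (`semantic`, `coarse_acoustics`, `fine_acoustics`, `codec_model`) follow the current transformers implementation and may change between versions:

```python
from transformers import BarkModel

model = BarkModel.from_pretrained("suno/bark-small")

# Attribute names are assumptions based on the current transformers implementation.
print(type(model.semantic).__name__)          # BarkSemanticModel: text -> semantic tokens
print(type(model.coarse_acoustics).__name__)  # BarkCoarseModel: semantic tokens -> coarse codebooks
print(type(model.fine_acoustics).__name__)    # BarkFineModel: coarse -> fine codebooks
print(type(model.codec_model).__name__)       # EncodecModel: codebooks -> waveform
```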
This model was contributed by Yoach Lacombe (ylacombe) and Sanchit Gandhi (sanchit-gandhi). The original code can be found here.
Optimizing Bark

Bark can be optimized with just a few extra lines of code, which significantly reduces its memory footprint and accelerates inference.
Using half-precision

You can speed up inference and reduce memory footprint by 50% simply by loading the model in half-precision.

```python
from transformers import BarkModel
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = BarkModel.from_pretrained("suno/bark-small", torch_dtype=torch.float16).to(device)
```
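If your hardware supports it, `torch.bfloat16` is a drop-in alternative to `float16`. This variant is an assumption on our part rather than part of the original snippet; bfloat16 is generally more robust to numerical overflow:

```python
# bfloat16 variant (assumes an accelerator with bfloat16 support, e.g. Ampere or newer)
model = BarkModel.from_pretrained("suno/bark-small", torch_dtype=torch.bfloat16).to(device)
```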
Using CPU offload

As mentioned above, Bark is made up of 4 sub-models, which are called up sequentially during audio generation. In other words, while one sub-model is in use, the other sub-models are idle.
If you’re using a CUDA device, a simple solution to benefit from an 80% reduction in memory footprint is to offload the submodels from GPU to CPU when they’re idle. This operation is called CPU offloading. You can use it with one line of code as follows:
```python
model.enable_cpu_offload()
```
Note that 🤗 Accelerate must be installed before using this feature. Here’s how to install it.
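If 🤗 Accelerate is not already in your environment, the usual install command is:

```bash
pip install -U accelerate
```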
Using Better Transformer

Better Transformer is an 🤗 Optimum feature that performs kernel fusion under the hood. You can gain 20% to 30% in speed with zero performance degradation. It only requires one line of code to export the model to 🤗 Better Transformer:

```python
model = model.to_bettertransformer()
```
Note that 🤗 Optimum must be installed before using this feature. Here’s how to install it.
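Likewise, 🤗 Optimum can typically be installed with:

```bash
pip install -U optimum
```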
Using Flash Attention 2

Flash Attention 2 is an even faster, optimized version of the previous optimization.

Installation

First, check whether your hardware is compatible with Flash Attention 2. The latest list of compatible hardware can be found in the official documentation. If your hardware is not compatible with Flash Attention 2, you can still benefit from attention kernel optimizations through the Better Transformer support covered above.
Next, install the latest version of Flash Attention 2:

```bash
pip install -U flash-attn --no-build-isolation
```

Usage
To load a model using Flash Attention 2, pass the `attn_implementation="flash_attention_2"` flag to `.from_pretrained`. We'll also load the model in half-precision (e.g. `torch.float16`), since it results in almost no degradation to audio quality but significantly lower memory usage and faster inference:

```python
model = BarkModel.from_pretrained("suno/bark-small", torch_dtype=torch.float16, attn_implementation="flash_attention_2").to(device)
```

Performance comparison
The following diagram shows the latency for the native attention implementation (no optimization) against Better Transformer and Flash Attention 2. In all cases, we generate 400 semantic tokens on a 40GB A100 GPU with PyTorch 2.1. Flash Attention 2 is also consistently faster than Better Transformer, and its performance improves even more as batch sizes increase.
To put this into perspective, on an NVIDIA A100 and when generating 400 semantic tokens with a batch size of 16, you can get 17 times the throughput and still be 2 seconds faster than generating sentences one by one with the native model implementation. In other words, all the samples will be generated 17 times faster.
At batch size 8, on an NVIDIA A100, Flash Attention 2 is also 10% faster than Better Transformer, and at batch size 16, 25%.
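To take advantage of batching in practice, you can pass a list of sentences to the processor and generate them together. The snippet below is a minimal sketch: it assumes the processor pads the batch for you, and uses the `return_output_lengths` flag documented under `BarkModel.generate` further down this page to recover each waveform's unpadded length:

```python
from transformers import AutoProcessor, BarkModel
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained("suno/bark-small")
model = BarkModel.from_pretrained("suno/bark-small", torch_dtype=torch.float16).to(device)

# Batch several prompts together; the output tensor is padded to the longest generation.
sentences = ["Hello, my dog is cute", "Batching makes Bark much faster"]
inputs = processor(sentences, voice_preset="v2/en_speaker_6").to(device)

# `return_output_lengths=True` also returns the unpadded length of each waveform.
audio_arrays, output_lengths = model.generate(**inputs, return_output_lengths=True)
```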
Combining optimization techniques

You can combine optimization techniques, and use CPU offload, half-precision and Flash Attention 2 (or 🤗 Better Transformer) all at once.

```python
from transformers import BarkModel
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# load in fp16 and use Flash Attention 2
model = BarkModel.from_pretrained("suno/bark-small", torch_dtype=torch.float16, attn_implementation="flash_attention_2").to(device)

# enable CPU offload
model.enable_cpu_offload()
```
Find out more on inference optimization techniques here.
Usage tips

Suno offers a library of voice presets in a number of languages here. These presets are also uploaded on the Hub here or here.
>>> from transformers import AutoProcessor, BarkModel >>> processor = AutoProcessor.from_pretrained("suno/bark") >>> model = BarkModel.from_pretrained("suno/bark") >>> voice_preset = "v2/en_speaker_6" >>> inputs = processor("Hello, my dog is cute", voice_preset=voice_preset) >>> audio_array = model.generate(**inputs) >>> audio_array = audio_array.cpu().numpy().squeeze()
Bark can generate highly realistic, multilingual speech as well as other audio - including music, background noise and simple sound effects.
```python
>>> # Multilingual speech - simplified Chinese
>>> inputs = processor("惊人的！我会说中文")

>>> # Multilingual speech - French - let's use a voice preset as well
>>> inputs = processor("Incroyable! Je peux générer du son.", voice_preset="fr_speaker_5")

>>> # Bark can also generate music. You can help it out by adding music notes around your lyrics.
>>> inputs = processor("♪ Hello, my dog is cute ♪")

>>> audio_array = model.generate(**inputs)
>>> audio_array = audio_array.cpu().numpy().squeeze()
```
The model can also produce nonverbal communications like laughing, sighing and crying.
```python
>>> # Adding non-speech cues to the input text
>>> inputs = processor("Hello uh ... [clears throat], my dog is cute [laughter]")

>>> audio_array = model.generate(**inputs)
>>> audio_array = audio_array.cpu().numpy().squeeze()
```
To save the audio, simply take the sample rate from the model config and use a scipy utility:

```python
>>> from scipy.io.wavfile import write as write_wav

>>> # save audio to disk, but first take the sample rate from the model config
>>> sample_rate = model.generation_config.sample_rate
>>> write_wav("bark_generation.wav", sample_rate, audio_array)
```

BarkConfig

class transformers.BarkConfig

```
( semantic_config: typing.Optional[dict] = None coarse_acoustics_config: typing.Optional[dict] = None fine_acoustics_config: typing.Optional[dict] = None codec_config: typing.Optional[dict] = None initializer_range = 0.02 **kwargs )
```
Parameters

- semantic_config (`dict`, optional) — Configuration for the underlying semantic sub-model.
- coarse_acoustics_config (`dict`, optional) — Configuration for the underlying coarse acoustics sub-model.
- fine_acoustics_config (`dict`, optional) — Configuration for the underlying fine acoustics sub-model.
- codec_config (`dict`, optional) — Configuration for the underlying codec sub-model.
- initializer_range (`float`, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
This is the configuration class to store the configuration of a BarkModel. It is used to instantiate a Bark model according to the specified sub-model configurations, defining the model architecture.
Instantiating a configuration with the defaults will yield a similar configuration to that of the Bark suno/bark architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
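For illustration, here is one way to build a BarkConfig from default sub-model configurations, using the from_sub_model_configs helper documented below (a sketch; the EnCodec checkpoint name is just an example):

```python
>>> from transformers import (
...     BarkSemanticConfig,
...     BarkCoarseConfig,
...     BarkFineConfig,
...     BarkConfig,
...     BarkModel,
...     AutoConfig,
... )

>>> # Initialize the four sub-model configurations
>>> semantic_config = BarkSemanticConfig()
>>> coarse_acoustics_config = BarkCoarseConfig()
>>> fine_acoustics_config = BarkFineConfig()
>>> codec_config = AutoConfig.from_pretrained("facebook/encodec_24khz")

>>> # Combine them into a full Bark configuration
>>> configuration = BarkConfig.from_sub_model_configs(
...     semantic_config, coarse_acoustics_config, fine_acoustics_config, codec_config
... )

>>> # Initialize a model (with random weights) from that configuration
>>> model = BarkModel(configuration)
```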
from_sub_model_configs

```
( semantic_config: BarkSemanticConfig coarse_acoustics_config: BarkCoarseConfig fine_acoustics_config: BarkFineConfig codec_config: PretrainedConfig **kwargs ) → BarkConfig
```

Returns: BarkConfig — An instance of a configuration object.

Instantiate a BarkConfig (or a derived class) from Bark sub-model configurations.
BarkProcessor

class transformers.BarkProcessor

```
( tokenizer speaker_embeddings = None )
```
Parameters

- tokenizer (PreTrainedTokenizer) — An instance of PreTrainedTokenizer.
- speaker_embeddings (`dict[dict[str]]`, optional) — Optional nested speaker embeddings dictionary. The first level contains voice preset names (e.g. `"en_speaker_4"`). The second level contains `"semantic_prompt"`, `"coarse_prompt"` and `"fine_prompt"` embeddings. The values correspond to the path of the corresponding `np.ndarray`. See here for a list of voice_preset_names.

Constructs a Bark processor which wraps a text tokenizer and optional Bark voice presets into a single processor.
__call__

```
( text = None voice_preset = None return_tensors = 'pt' max_length = 256 add_special_tokens = False return_attention_mask = True return_token_type_ids = False **kwargs ) → Tuple(BatchEncoding, BatchFeature)
```
Parameters

- text (`str`, `list[str]`, `list[list[str]]`) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as a list of strings (pretokenized), you must set `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
- voice_preset (`str`, `dict[np.ndarray]`) — The voice preset, i.e. the speaker embeddings. It can either be a valid voice preset name, e.g. `"en_speaker_1"`, or directly a dictionary of `np.ndarray` embeddings for each sub-model of Bark. Or it can be a valid file name of a local `.npz` single voice preset.
- return_tensors (`str` or TensorType, optional) — If set, will return tensors of a particular framework. Acceptable values are:
  - `'pt'`: Return PyTorch `torch.Tensor` objects.
  - `'np'`: Return NumPy `np.ndarray` objects.

Returns: A tuple composed of a BatchEncoding, i.e. the output of the tokenizer, and a BatchFeature, i.e. the voice preset with the right tensor type.
Main method to prepare one or several sequence(s) for the model. This method forwards the `text` and `kwargs` arguments to the AutoTokenizer's `__call__()` to encode the text. The method also proposes a voice preset, which is a dictionary of arrays that conditions Bark's output. `kwargs` arguments are forwarded to the tokenizer and to the `cached_file` method if `voice_preset` is a valid filename.
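For example, besides a named preset, voice_preset can point at a local `.npz` file, as described above (the file path below is a hypothetical placeholder):

```python
from transformers import BarkProcessor

processor = BarkProcessor.from_pretrained("suno/bark-small")

# A named preset from the built-in voice preset library...
inputs = processor("Hello, my dog is cute", voice_preset="v2/en_speaker_6")

# ...or a local single-voice preset file (hypothetical path, shown for illustration).
inputs = processor("Hello, my dog is cute", voice_preset="my_voice_preset.npz")
```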
from_pretrained

```
( pretrained_processor_name_or_path speaker_embeddings_dict_path = 'speaker_embeddings_path.json' **kwargs )
```
Parameters

- pretrained_processor_name_or_path (`str` or `os.PathLike`) — This can be either:
  - a string, the model id of a pretrained BarkProcessor hosted on huggingface.co;
  - a path to a directory containing a processor saved using save_pretrained(), e.g. `./my_model_directory/`.
- speaker_embeddings_dict_path (`str`, optional, defaults to `"speaker_embeddings_path.json"`) — The name of the `.json` file containing the speaker_embeddings dictionary located in `pretrained_model_name_or_path`. If `None`, no speaker_embeddings is loaded.
- **kwargs — Additional keyword arguments passed along to `~tokenization_utils_base.PreTrainedTokenizer.from_pretrained`.

Instantiate a Bark processor associated with a pretrained model.
save_pretrained

```
( save_directory speaker_embeddings_dict_path = 'speaker_embeddings_path.json' speaker_embeddings_directory = 'speaker_embeddings' push_to_hub: bool = False **kwargs )
```
Parameters

- save_directory (`str` or `os.PathLike`) — Directory where the tokenizer files and the speaker embeddings will be saved (directory will be created if it does not exist).
- speaker_embeddings_dict_path (`str`, optional, defaults to `"speaker_embeddings_path.json"`) — The name of the `.json` file that will contain the speaker_embeddings nested path dictionary, if it exists, and that will be located in `pretrained_model_name_or_path/speaker_embeddings_directory`.
- speaker_embeddings_directory (`str`, optional, defaults to `"speaker_embeddings/"`) — The name of the folder in which the speaker_embeddings arrays will be saved.
- push_to_hub (`bool`, optional, defaults to `False`) — Whether or not to push your model to the Hugging Face model hub after saving it. You can specify the repository you want to push to with `repo_id` (will default to the name of `save_directory` in your namespace).

Saves the attributes of this processor (tokenizer, …) in the specified directory so that it can be reloaded using the from_pretrained() method.
BarkModel

class transformers.BarkModel

```
( config )
```
Parameters

- config (BarkConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The full Bark model, a text-to-speech model composed of 4 sub-models:

- BarkSemanticModel (also referred to as the 'text' model): a causal autoregressive transformer that takes tokenized text as input and predicts semantic text tokens capturing the meaning of the text.
- BarkCoarseModel (the 'coarse acoustics' model): a causal autoregressive transformer that takes the output of BarkSemanticModel as input and predicts the first two audio codebooks necessary for encodec.
- BarkFineModel (the 'fine acoustics' model): a non-causal autoencoder transformer that iteratively predicts the remaining codebooks based on the sum of the previous codebook embeddings.
- EncodecModel: having predicted all the codebook channels, Bark uses it to decode the output audio array.

It should be noted that each of the first three modules can support conditional speaker embeddings to condition the output sound according to a specific predefined voice.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
generate

```
( input_ids: typing.Optional[torch.Tensor] = None history_prompt: typing.Optional[dict[str, torch.Tensor]] = None return_output_lengths: typing.Optional[bool] = None **kwargs ) → torch.Tensor, or tuple(torch.Tensor, torch.Tensor) when return_output_lengths=True
```
Parameters

- input_ids (`Optional[torch.Tensor]` of shape `(batch_size, seq_len)`, optional) — Input ids. Will be truncated up to 256 tokens. Note that the output audios will be as long as the longest generation among the batch.
- history_prompt (`Optional[dict[str, torch.Tensor]]`, optional) — Optional Bark speaker prompt. Note that for now, this model takes only one speaker prompt per batch.
- kwargs (optional) — Remaining dictionary of keyword arguments. Keyword arguments are of two types:
  - Without a prefix, they will be entered as `**kwargs` for the `generate` method of each sub-model.
  - With a `semantic_`, `coarse_` or `fine_` prefix, they will be input to the `generate` method of the semantic, coarse and fine sub-models respectively. A prefixed keyword has priority over the same keyword without a prefix. This means you can, for example, specify a generation strategy for all sub-models except one.
- return_output_lengths (`bool`, optional) — Whether or not to return the waveform lengths. Useful when batching.

Returns: By default, a `torch.Tensor` of shape `(batch_size, seq_len)`: the generated audio waveform. When `return_output_lengths=True`, returns a tuple made of:

- audio waveform (`torch.Tensor` of shape `(batch_size, seq_len)`): the generated audio waveform.
- output lengths (`torch.Tensor` of shape `(batch_size)`): the length of each waveform in the batch.

Generates audio from an input prompt and an additional optional Bark speaker prompt.
Example:
```python
>>> from transformers import AutoProcessor, BarkModel

>>> processor = AutoProcessor.from_pretrained("suno/bark-small")
>>> model = BarkModel.from_pretrained("suno/bark-small")

>>> # choose a voice preset
>>> voice_preset = "v2/en_speaker_6"

>>> inputs = processor("Hello, my dog is cute, I need him in my life", voice_preset=voice_preset)

>>> audio_array = model.generate(**inputs, semantic_max_new_tokens=100)
>>> audio_array = audio_array.cpu().numpy().squeeze()
```

enable_cpu_offload

```
( accelerator_id: typing.Optional[int] = 0 **kwargs )
```
Parameters

- accelerator_id (`int`, optional, defaults to 0) — Accelerator id on which the sub-models will be loaded and offloaded. This argument is deprecated.
- kwargs (`dict`, optional) — Additional keyword arguments: `gpu_id`, the accelerator id on which the sub-models will be loaded and offloaded.

Offloads all sub-models to CPU using accelerate, reducing memory usage with a low impact on performance. This method moves one whole sub-model at a time to the accelerator when it is used, and the sub-model remains on the accelerator until the next sub-model runs.
BarkSemanticModel

class transformers.BarkSemanticModel

```
( config )
```
Parameters

- config (BarkSemanticConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

Bark semantic (or text) model. It shares the same architecture as the coarse model. It is a GPT-2 like autoregressive model with a language modeling head on top.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
forward

```
( input_ids: typing.Optional[torch.Tensor] = None past_key_values: typing.Optional[tuple[torch.FloatTensor]] = None attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.Tensor] = None head_mask: typing.Optional[torch.Tensor] = None labels: typing.Optional[torch.LongTensor] = None input_embeds: typing.Optional[torch.Tensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)
```
Parameters

- input_ids (`torch.Tensor` of shape `(batch_size, sequence_length)`, optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- past_key_values (`tuple[torch.FloatTensor]`, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in the `past_key_values` returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`. Two formats are allowed:
  - a Cache instance (see the kv cache guide);
  - a tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape `(batch_size, num_heads, sequence_length, embed_size_per_head)`. This is also known as the legacy cache format.

  The model will output the same cache format that is fed as input. If no `past_key_values` are passed, the legacy cache format will be returned. If `past_key_values` are used, the user can optionally input only the last `input_ids` (those that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all `input_ids` of shape `(batch_size, sequence_length)`.
- attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: 1 for tokens that are not masked, 0 for tokens that are masked.
- position_ids (`torch.Tensor` of shape `(batch_size, sequence_length)`, optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range `[0, config.n_positions - 1]`.
- head_mask (`torch.Tensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in `[0, 1]`: 1 indicates the head is not masked, 0 indicates the head is masked.
- labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, optional) — Labels for computing the masked language modeling loss. Indices should either be in `[0, ..., config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored (masked); the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
- input_embeds (`torch.FloatTensor` of shape `(batch_size, input_sequence_length, hidden_size)`, optional) — Optionally, instead of passing `input_ids`, you can choose to directly pass an embedded representation. Here, due to Bark particularities, if `past_key_values` is used, `input_embeds` will be ignored and you have to use `input_ids`. If `past_key_values` is not used and `use_cache` is set to `True`, `input_embeds` is used in priority instead of `input_ids`.
- use_cache (`bool`, optional) — If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see `past_key_values`).
- output_attentions (`bool`, optional) — Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned tensors for more detail.
- output_hidden_states (`bool`, optional) — Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for more detail.
- return_dict (`bool`, optional) — Whether or not to return a ModelOutput instead of a plain tuple.

Returns: A transformers.modeling_outputs.CausalLMOutputWithPast or a tuple of `torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various elements depending on the configuration (BarkConfig) and inputs.

- loss (`torch.FloatTensor` of shape `(1,)`, optional, returned when `labels` is provided) — Language modeling loss (for next-token prediction).
- logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- past_key_values (`Cache`, optional, returned when `use_cache=True` is passed or when `config.use_cache=True`) — A Cache instance; for more details, see the kv cache guide. Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see the `past_key_values` input) to speed up sequential decoding.
- hidden_states (`tuple(torch.FloatTensor)`, optional, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) — Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (`tuple(torch.FloatTensor)`, optional, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) — Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`. Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The BarkCausalModel forward method overrides the `__call__` special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
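To make the input/output contract above concrete, here is a minimal forward-pass sketch on a randomly initialized semantic model (shapes are illustrative; BarkSemanticConfig defaults are documented later on this page):

```python
import torch
from transformers import BarkSemanticConfig, BarkSemanticModel

# Randomly initialized model; use BarkModel.from_pretrained(...) for real weights.
config = BarkSemanticConfig()
model = BarkSemanticModel(config)

# A dummy batch of 16 token ids drawn from the input vocabulary.
input_ids = torch.randint(0, config.input_vocab_size, (1, 16))

outputs = model(input_ids=input_ids, use_cache=True)
print(outputs.logits.shape)  # torch.Size([1, 16, config.output_vocab_size])
```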
BarkCoarseModel

class transformers.BarkCoarseModel

```
( config )
```
Parameters

- config (BarkCoarseConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

Bark coarse acoustics model. It shares the same architecture as the semantic (or text) model. It is a GPT-2 like autoregressive model with a language modeling head on top.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
forward

```
( input_ids: typing.Optional[torch.Tensor] = None past_key_values: typing.Optional[tuple[torch.FloatTensor]] = None attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.Tensor] = None head_mask: typing.Optional[torch.Tensor] = None labels: typing.Optional[torch.LongTensor] = None input_embeds: typing.Optional[torch.Tensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)
```
Parameters

- input_ids (`torch.Tensor` of shape `(batch_size, sequence_length)`, optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- past_key_values (`tuple[torch.FloatTensor]`, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in the `past_key_values` returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`. Two formats are allowed:
  - a Cache instance (see the kv cache guide);
  - a tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape `(batch_size, num_heads, sequence_length, embed_size_per_head)`. This is also known as the legacy cache format.

  The model will output the same cache format that is fed as input. If no `past_key_values` are passed, the legacy cache format will be returned. If `past_key_values` are used, the user can optionally input only the last `input_ids` (those that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all `input_ids` of shape `(batch_size, sequence_length)`.
- attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: 1 for tokens that are not masked, 0 for tokens that are masked.
- position_ids (`torch.Tensor` of shape `(batch_size, sequence_length)`, optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range `[0, config.n_positions - 1]`.
- head_mask (`torch.Tensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in `[0, 1]`: 1 indicates the head is not masked, 0 indicates the head is masked.
- labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, optional) — Labels for computing the masked language modeling loss. Indices should either be in `[0, ..., config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored (masked); the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
- input_embeds (`torch.FloatTensor` of shape `(batch_size, input_sequence_length, hidden_size)`, optional) — Optionally, instead of passing `input_ids`, you can choose to directly pass an embedded representation. Here, due to Bark particularities, if `past_key_values` is used, `input_embeds` will be ignored and you have to use `input_ids`. If `past_key_values` is not used and `use_cache` is set to `True`, `input_embeds` is used in priority instead of `input_ids`.
- use_cache (`bool`, optional) — If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see `past_key_values`).
- output_attentions (`bool`, optional) — Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned tensors for more detail.
- output_hidden_states (`bool`, optional) — Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for more detail.
- return_dict (`bool`, optional) — Whether or not to return a ModelOutput instead of a plain tuple.

Returns: A transformers.modeling_outputs.CausalLMOutputWithPast or a tuple of `torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various elements depending on the configuration (BarkConfig) and inputs.

- loss (`torch.FloatTensor` of shape `(1,)`, optional, returned when `labels` is provided) — Language modeling loss (for next-token prediction).
- logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- past_key_values (`Cache`, optional, returned when `use_cache=True` is passed or when `config.use_cache=True`) — A Cache instance; for more details, see the kv cache guide. Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see the `past_key_values` input) to speed up sequential decoding.
- hidden_states (`tuple(torch.FloatTensor)`, optional, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) — Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (`tuple(torch.FloatTensor)`, optional, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) — Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`. Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The BarkCausalModel forward method overrides the `__call__` special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
BarkFineModel

class transformers.BarkFineModel

```
( config )
```
Parameters

- config (BarkFineConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

Bark fine acoustics model. It is a non-causal GPT-like model with `config.n_codes_total` embedding layers and language modeling heads, one for each codebook.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
forward

```
( codebook_idx: int input_ids: typing.Optional[torch.Tensor] = None attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.Tensor] = None head_mask: typing.Optional[torch.Tensor] = None labels: typing.Optional[torch.LongTensor] = None input_embeds: typing.Optional[torch.Tensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.modeling_outputs.MaskedLMOutput or tuple(torch.FloatTensor)
```
Parameters

- codebook_idx (`int`) — Index of the codebook that will be predicted.
- input_ids (`torch.Tensor` of shape `(batch_size, sequence_length)`, optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: 1 for tokens that are not masked, 0 for tokens that are masked.
- position_ids (`torch.Tensor` of shape `(batch_size, sequence_length)`, optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range `[0, config.n_positions - 1]`.
- head_mask (`torch.Tensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in `[0, 1]`: 1 indicates the head is not masked, 0 indicates the head is masked.
- labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, optional) — NOT IMPLEMENTED YET.
- input_embeds (`torch.FloatTensor` of shape `(batch_size, input_sequence_length, hidden_size)`, optional) — Optionally, instead of passing `input_ids`, you can choose to directly pass an embedded representation. If `past_key_values` is used, optionally only the last `input_embeds` have to be input (see `past_key_values`). This is useful if you want more control over how to convert `input_ids` indices into associated vectors than the model's internal embedding lookup matrix.
- output_attentions (`bool`, optional) — Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned tensors for more detail.
- output_hidden_states (`bool`, optional) — Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for more detail.
- return_dict (`bool`, optional) — Whether or not to return a ModelOutput instead of a plain tuple.

Returns: A transformers.modeling_outputs.MaskedLMOutput or a tuple of `torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various elements depending on the configuration (BarkConfig) and inputs.

- loss (`torch.FloatTensor` of shape `(1,)`, optional, returned when `labels` is provided) — Masked language modeling (MLM) loss.
- logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- hidden_states (`tuple(torch.FloatTensor)`, optional, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) — Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (`tuple(torch.FloatTensor)`, optional, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) — Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`. Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The BarkFineModel forward method overrides the `__call__` special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
BarkCausalModel

class transformers.BarkCausalModel

```
( config )
```
forward

```
( input_ids: typing.Optional[torch.Tensor] = None past_key_values: typing.Optional[tuple[torch.FloatTensor]] = None attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.Tensor] = None head_mask: typing.Optional[torch.Tensor] = None labels: typing.Optional[torch.LongTensor] = None input_embeds: typing.Optional[torch.Tensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)
```
Parameters

- input_ids (`torch.Tensor` of shape `(batch_size, sequence_length)`, optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- past_key_values (`tuple[torch.FloatTensor]`, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in the `past_key_values` returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`. Two formats are allowed:
  - a Cache instance (see the kv cache guide);
  - a tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape `(batch_size, num_heads, sequence_length, embed_size_per_head)`. This is also known as the legacy cache format.

  The model will output the same cache format that is fed as input. If no `past_key_values` are passed, the legacy cache format will be returned. If `past_key_values` are used, the user can optionally input only the last `input_ids` (those that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all `input_ids` of shape `(batch_size, sequence_length)`.
- attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, optional) — Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: 1 for tokens that are not masked, 0 for tokens that are masked.
- position_ids (`torch.Tensor` of shape `(batch_size, sequence_length)`, optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range `[0, config.n_positions - 1]`.
- head_mask (`torch.Tensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in `[0, 1]`: 1 indicates the head is not masked, 0 indicates the head is masked.
- labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, optional) — Labels for computing the masked language modeling loss. Indices should either be in `[0, ..., config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored (masked); the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
- input_embeds (`torch.FloatTensor` of shape `(batch_size, input_sequence_length, hidden_size)`, optional) — Optionally, instead of passing `input_ids`, you can choose to directly pass an embedded representation. Here, due to Bark particularities, if `past_key_values` is used, `input_embeds` will be ignored and you have to use `input_ids`. If `past_key_values` is not used and `use_cache` is set to `True`, `input_embeds` is used in priority instead of `input_ids`.
- use_cache (`bool`, optional) — If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see `past_key_values`).
- output_attentions (`bool`, optional) — Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned tensors for more detail.
- output_hidden_states (`bool`, optional) — Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for more detail.
- return_dict (`bool`, optional) — Whether or not to return a ModelOutput instead of a plain tuple.

Returns: A transformers.modeling_outputs.CausalLMOutputWithPast or a tuple of `torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various elements depending on the configuration (BarkConfig) and inputs.

- loss (`torch.FloatTensor` of shape `(1,)`, optional, returned when `labels` is provided) — Language modeling loss (for next-token prediction).
- logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- past_key_values (`Cache`, optional, returned when `use_cache=True` is passed or when `config.use_cache=True`) — A Cache instance; for more details, see the kv cache guide. Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see the `past_key_values` input) to speed up sequential decoding.
- hidden_states (`tuple(torch.FloatTensor)`, optional, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) — Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (`tuple(torch.FloatTensor)`, optional, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) — Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`. Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The BarkCausalModel forward method overrides the `__call__` special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
BarkCoarseConfig

class transformers.BarkCoarseConfig

```
( block_size = 1024 input_vocab_size = 10048 output_vocab_size = 10048 num_layers = 12 num_heads = 12 hidden_size = 768 dropout = 0.0 bias = True initializer_range = 0.02 use_cache = True **kwargs )
```
Parameters

- block_size (`int`, optional, defaults to 1024) — The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
- input_vocab_size (`int`, optional, defaults to 10_048) — Vocabulary size of a Bark sub-model. Defines the number of different tokens that can be represented by the `inputs_ids` passed when calling BarkCoarseModel. Defaults to 10_048 but should be chosen carefully with regard to the chosen sub-model.
- output_vocab_size (`int`, optional, defaults to 10_048) — Output vocabulary size of a Bark sub-model. Defines the number of different tokens that can be represented by the `output_ids` when passing forward a BarkCoarseModel. Defaults to 10_048 but should be chosen carefully with regard to the chosen sub-model.
- num_layers (`int`, optional, defaults to 12) — Number of hidden layers in the given sub-model.
- num_heads (`int`, optional, defaults to 12) — Number of attention heads for each attention layer in the Transformer architecture.
- hidden_size (`int`, optional, defaults to 768) — Dimensionality of the "intermediate" (often named feed-forward) layer in the architecture.
- dropout (`float`, optional, defaults to 0.0) — The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
- bias (`bool`, optional, defaults to `True`) — Whether or not to use bias in the linear layers and layer norm layers.
- initializer_range (`float`, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
- use_cache (`bool`, optional, defaults to `True`) — Whether or not the model should return the last key/values attentions (not used by all models).

This is the configuration class to store the configuration of a BarkCoarseModel. It is used to instantiate the model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the Bark suno/bark architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
Example:
```python
>>> from transformers import BarkCoarseConfig, BarkCoarseModel

>>> # Initializing a Bark sub-module style configuration
>>> configuration = BarkCoarseConfig()

>>> # Initializing a model (with random weights) from the suno/bark style configuration
>>> model = BarkCoarseModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
```

BarkFineConfig

class transformers.BarkFineConfig

```
( tie_word_embeddings = True n_codes_total = 8 n_codes_given = 1 **kwargs )
```
Parameters

- block_size (`int`, optional, defaults to 1024) — The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
- input_vocab_size (`int`, optional, defaults to 10_048) — Vocabulary size of a Bark sub-model. Defines the number of different tokens that can be represented by the `inputs_ids` passed when calling BarkFineModel. Defaults to 10_048 but should be chosen carefully with regard to the chosen sub-model.
- output_vocab_size (`int`, optional, defaults to 10_048) — Output vocabulary size of a Bark sub-model. Defines the number of different tokens that can be represented by the `output_ids` when passing forward a BarkFineModel. Defaults to 10_048 but should be chosen carefully with regard to the chosen sub-model.
- num_layers (`int`, optional, defaults to 12) — Number of hidden layers in the given sub-model.
- num_heads (`int`, optional, defaults to 12) — Number of attention heads for each attention layer in the Transformer architecture.
- hidden_size (`int`, optional, defaults to 768) — Dimensionality of the "intermediate" (often named feed-forward) layer in the architecture.
- dropout (`float`, optional, defaults to 0.0) — The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
- bias (`bool`, optional, defaults to `True`) — Whether or not to use bias in the linear layers and layer norm layers.
- initializer_range (`float`, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
- use_cache (`bool`, optional, defaults to `True`) — Whether or not the model should return the last key/values attentions (not used by all models).
- n_codes_total (`int`, optional, defaults to 8) — The total number of audio codebooks predicted. Used in the fine acoustics sub-model.
- n_codes_given (`int`, optional, defaults to 1) — The number of audio codebooks predicted in the coarse acoustics sub-model. Used in the acoustics sub-models.

This is the configuration class to store the configuration of a BarkFineModel. It is used to instantiate the model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the Bark suno/bark architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
Example:
```python
>>> from transformers import BarkFineConfig, BarkFineModel

>>> # Initializing a Bark sub-module style configuration
>>> configuration = BarkFineConfig()

>>> # Initializing a model (with random weights) from the suno/bark style configuration
>>> model = BarkFineModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
```

BarkSemanticConfig

class transformers.BarkSemanticConfig

```
( block_size = 1024 input_vocab_size = 10048 output_vocab_size = 10048 num_layers = 12 num_heads = 12 hidden_size = 768 dropout = 0.0 bias = True initializer_range = 0.02 use_cache = True **kwargs )
```
Parameters

- block_size (`int`, optional, defaults to 1024) — The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
- input_vocab_size (`int`, optional, defaults to 10_048) — Vocabulary size of a Bark sub-model. Defines the number of different tokens that can be represented by the `inputs_ids` passed when calling BarkSemanticModel. Defaults to 10_048 but should be chosen carefully with regard to the chosen sub-model.
- output_vocab_size (`int`, optional, defaults to 10_048) — Output vocabulary size of a Bark sub-model. Defines the number of different tokens that can be represented by the `output_ids` when passing forward a BarkSemanticModel. Defaults to 10_048 but should be chosen carefully with regard to the chosen sub-model.
- num_layers (`int`, optional, defaults to 12) — Number of hidden layers in the given sub-model.
- num_heads (`int`, optional, defaults to 12) — Number of attention heads for each attention layer in the Transformer architecture.
- hidden_size (`int`, optional, defaults to 768) — Dimensionality of the "intermediate" (often named feed-forward) layer in the architecture.
- dropout (`float`, optional, defaults to 0.0) — The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
- bias (`bool`, optional, defaults to `True`) — Whether or not to use bias in the linear layers and layer norm layers.
- initializer_range (`float`, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
- use_cache (`bool`, optional, defaults to `True`) — Whether or not the model should return the last key/values attentions (not used by all models).

This is the configuration class to store the configuration of a BarkSemanticModel. It is used to instantiate the model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the Bark suno/bark architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
Example:
```python
>>> from transformers import BarkSemanticConfig, BarkSemanticModel

>>> # Initializing a Bark sub-module style configuration
>>> configuration = BarkSemanticConfig()

>>> # Initializing a model (with random weights) from the suno/bark style configuration
>>> model = BarkSemanticModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
```