Showing content from https://huggingface.co/docs/transformers/model_doc/bark below:

Bark

Bark Overview

Bark is a transformer-based text-to-speech model proposed by Suno AI in suno-ai/bark.

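Bark is made of 4 main models:

- BarkSemanticModel (the semantic, or "text", model): a GPT-2-like causal autoregressive transformer that takes tokenized text as input and predicts semantic tokens.
- BarkCoarseModel (the "coarse acoustics" model): a causal autoregressive transformer with the same architecture, which takes the output of BarkSemanticModel as input and predicts the first audio codebooks.
- BarkFineModel (the "fine acoustics" model): a non-causal transformer that predicts the remaining codebooks, with one embedding layer and one language modeling head per codebook.
- A codec model (configured through codec_config), which decodes the predicted codebooks into the output audio array.

It should be noted that each of the first three modules can support conditional speaker embeddings to condition the output sound according to a specific predefined voice.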

This model was contributed by Yoach Lacombe (ylacombe) and Sanchit Gandhi (sanchit-gandhi). The original code can be found here.

Optimizing Bark

Bark can be optimized with just a few extra lines of code, which significantly reduces its memory footprint and accelerates inference.

Using half-precision

You can speed up inference and reduce memory footprint by 50% simply by loading the model in half-precision.

from transformers import BarkModel
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = BarkModel.from_pretrained("suno/bark-small", torch_dtype=torch.float16).to(device)
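
The 50% figure follows from the storage size of each weight: float32 takes 4 bytes per value and float16 takes 2. Here is a rough sketch of that arithmetic. The parameter count is a hypothetical value chosen for illustration, not Bark's real size, and activations and buffers are ignored.

```python
# Bytes needed to store one value of each dtype.
BYTES_PER_VALUE = {"float32": 4, "float16": 2}


def weight_memory_mib(num_parameters: int, dtype: str) -> float:
    """Memory taken by the weights alone, in MiB (activations are ignored)."""
    return num_parameters * BYTES_PER_VALUE[dtype] / 2**20


# Hypothetical parameter count, chosen for illustration only.
num_parameters = 100_000_000

full = weight_memory_mib(num_parameters, "float32")
half = weight_memory_mib(num_parameters, "float16")
reduction = 1 - half / full

print(f"float32: {full:.1f} MiB, float16: {half:.1f} MiB, saved: {reduction:.0%}")
```

The ratio does not depend on the parameter count, which is why the saving applies to any checkpoint loaded this way.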

Using CPU offload

As mentioned above, Bark is made up of 4 sub-models, which are called up sequentially during audio generation. In other words, while one sub-model is in use, the other sub-models are idle.

If you’re using a CUDA device, a simple solution to benefit from an 80% reduction in memory footprint is to offload the submodels from GPU to CPU when they’re idle. This operation is called CPU offloading. You can use it with one line of code as follows:

model.enable_cpu_offload()

Note that 🤗 Accelerate must be installed before using this feature. Here’s how to install it.

Using Better Transformer

Better Transformer is an 🤗 Optimum feature that performs kernel fusion under the hood. You can gain 20% to 30% in speed with zero performance degradation. It only requires one line of code to export the model to 🤗 Better Transformer:

model = model.to_bettertransformer()

Note that 🤗 Optimum must be installed before using this feature. Here’s how to install it.

Using Flash Attention 2

Flash Attention 2 is an even faster, more optimized alternative to the Better Transformer optimization described above.

Installation

First, check whether your hardware is compatible with Flash Attention 2. The latest list of compatible hardware can be found in the official documentation. If your hardware is not compatible with Flash Attention 2, you can still benefit from attention kernel optimisations through Better Transformer support covered above.

Next, install the latest version of Flash Attention 2:

pip install -U flash-attn --no-build-isolation

Usage

To load a model using Flash Attention 2, we can pass the attn_implementation="flash_attention_2" flag to .from_pretrained. We’ll also load the model in half-precision (e.g. torch.float16), since it results in almost no degradation to audio quality but significantly lower memory usage and faster inference:

model = BarkModel.from_pretrained("suno/bark-small", torch_dtype=torch.float16, attn_implementation="flash_attention_2").to(device)

Performance comparison

The following diagram shows the latency for the native attention implementation (no optimisation) against Better Transformer and Flash Attention 2. In all cases, we generate 400 semantic tokens on a 40GB A100 GPU with PyTorch 2.1. Flash Attention 2 is also consistently faster than Better Transformer, and its performance improves even more as batch sizes increase.

To put this into perspective, on an NVIDIA A100 and when generating 400 semantic tokens with a batch size of 16, you can get 17 times the throughput and still be 2 seconds faster than generating sentences one by one with the native model implementation. In other words, all the samples will be generated 17 times faster.

At batch size 8, on an NVIDIA A100, Flash Attention 2 is also 10% faster than Better Transformer, and at batch size 16, 25%.
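
Taken at face value, the two figures in the batch-size-16 comparison pin down the latencies involved. The sketch below is back-of-the-envelope arithmetic under that reading, not a measurement, and the variable names are illustrative.

```python
from fractions import Fraction

batch_size = 16
throughput_gain = 17   # "17 times the throughput"
seconds_saved = 2      # the batch finishes 2 seconds before one native generation

# throughput_gain = (batch_size / t_batch) / (1 / t_single)
#   => t_batch = (batch_size / throughput_gain) * t_single
# seconds_saved = t_single - t_batch
#   => t_single = seconds_saved / (1 - batch_size / throughput_gain)
ratio = Fraction(batch_size, throughput_gain)
t_single = seconds_saved / (1 - ratio)
t_batch = ratio * t_single

print(f"implied native latency for one sample: {float(t_single):.0f} s")
print(f"implied latency for a batch of {batch_size}: {float(t_batch):.0f} s")
```

Under this reading, one native generation takes about 34 seconds and a batch of 16 takes about 32 seconds.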

Combining optimization techniques

You can combine optimization techniques, and use CPU offload, half-precision and Flash Attention 2 (or 🤗 Better Transformer) all at once.

from transformers import BarkModel
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load in half-precision and use Flash Attention 2
model = BarkModel.from_pretrained("suno/bark-small", torch_dtype=torch.float16, attn_implementation="flash_attention_2").to(device)

# Enable CPU offload
model.enable_cpu_offload()

Find out more on inference optimization techniques here.
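
Which combination applies depends on the environment. The following is a minimal sketch of one way to choose, assuming that Flash Attention 2 and CPU offload are only worth enabling on a CUDA device and, as a conservative choice of this sketch, that half-precision is kept for CUDA as well. The helper names plan_bark_optimizations and load_bark are hypothetical; the calls inside load_bark are the ones shown in the sections above.

```python
def plan_bark_optimizations(has_cuda: bool, has_flash_attn: bool, has_optimum: bool) -> dict:
    """Pick the optimizations described above for a given environment."""
    if has_cuda and has_flash_attn:
        attention = "flash_attention_2"
    elif has_optimum:
        attention = "bettertransformer"
    else:
        attention = "native"
    return {
        "half_precision": has_cuda,  # assumption of this sketch: fp16 on CUDA only
        "attention": attention,
        "cpu_offload": has_cuda,     # offloading moves sub-models between GPU and CPU
    }


def load_bark(plan: dict, checkpoint: str = "suno/bark-small"):
    """Load Bark according to `plan`. Imports are local so the planner runs without PyTorch."""
    import torch
    from transformers import BarkModel

    device = "cuda" if torch.cuda.is_available() else "cpu"
    kwargs = {}
    if plan["half_precision"]:
        kwargs["torch_dtype"] = torch.float16
    if plan["attention"] == "flash_attention_2":
        kwargs["attn_implementation"] = "flash_attention_2"

    model = BarkModel.from_pretrained(checkpoint, **kwargs).to(device)
    if plan["attention"] == "bettertransformer":
        model = model.to_bettertransformer()
    if plan["cpu_offload"]:
        model.enable_cpu_offload()
    return model


print(plan_bark_optimizations(has_cuda=True, has_flash_attn=False, has_optimum=True))
```

Keeping the decision in a plain function makes it easy to log or test the chosen setup before any weights are downloaded.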

Usage tips

Suno offers a library of voice presets in a number of languages here. These presets are also available on the Hub here or here.

>>> from transformers import AutoProcessor, BarkModel

>>> processor = AutoProcessor.from_pretrained("suno/bark")
>>> model = BarkModel.from_pretrained("suno/bark")

>>> voice_preset = "v2/en_speaker_6"

>>> inputs = processor("Hello, my dog is cute", voice_preset=voice_preset)

>>> audio_array = model.generate(**inputs)
>>> audio_array = audio_array.cpu().numpy().squeeze()

Bark can generate highly realistic, multilingual speech as well as other audio - including music, background noise and simple sound effects.

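>>> # Multilingual speech: simplified Chinese ("Amazing! I can speak Chinese")
>>> inputs = processor("惊人的!我会说中文")

>>> # Multilingual speech: French, this time with a voice preset ("Incredible! I can generate sound.")
>>> inputs = processor("Incroyable! Je peux générer du son.", voice_preset="fr_speaker_5")

>>> # Music: wrap the lyrics in music notes
>>> inputs = processor("♪ Hello, my dog is cute ♪")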

>>> audio_array = model.generate(**inputs)
>>> audio_array = audio_array.cpu().numpy().squeeze()

The model can also produce nonverbal communications like laughing, sighing and crying.

>>> # Add non-speech cues to the input text
>>> inputs = processor("Hello uh ... [clears throat], my dog is cute [laughter]")

>>> audio_array = model.generate(**inputs)
>>> audio_array = audio_array.cpu().numpy().squeeze()
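
The prompts above follow two simple text conventions: music notes around lyrics, and bracketed cues for non-speech sounds. Here is a small sketch of helpers that build such prompts. The function names as_song and with_cue are hypothetical and not part of the library.

```python
def as_song(lyrics: str) -> str:
    """Wrap lyrics in music notes, as in the music example above."""
    return f"♪ {lyrics.strip()} ♪"


def with_cue(text: str, cue: str) -> str:
    """Append a bracketed non-speech cue such as "laughter" to the text."""
    return f"{text.strip()} [{cue}]"


print(as_song("Hello, my dog is cute"))
print(with_cue("Hello uh ... [clears throat], my dog is cute", "laughter"))
```

The resulting strings can be passed to the processor exactly like the hand-written prompts.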

To save the audio, take the sample rate from the model's generation config and use a SciPy utility:

>>> from scipy.io.wavfile import write as write_wav

>>> # Take the sample rate from the generation config, then save the audio to disk
>>> sample_rate = model.generation_config.sample_rate
>>> write_wav("bark_generation.wav", sample_rate, audio_array)
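
SciPy's write function stores the array in the data type it is given, so a float array becomes a floating-point WAV file. If a player expects 16-bit PCM instead, converting first is one option. Below is a sketch using only the standard library, with a synthetic tone standing in for audio_array. The 24 kHz sample rate and the helper names are assumptions of this sketch; in practice, read the rate from model.generation_config.sample_rate.

```python
import math
import os
import struct
import tempfile
import wave

SAMPLE_RATE = 24_000  # assumed value for this sketch


def float_to_int16(samples):
    """Clip floats to [-1, 1] and scale them to 16-bit PCM integers."""
    return [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]


def write_wav_int16(path, sample_rate, samples):
    """Write mono 16-bit PCM audio using only the standard library."""
    pcm = float_to_int16(samples)
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))


# Stand-in for `audio_array`: half a second of a 440 Hz tone.
audio = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE // 2)]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tone.wav")
    write_wav_int16(path, SAMPLE_RATE, audio)
    # Read the file back to check the header.
    with wave.open(path, "rb") as wav_file:
        frame_rate, n_frames = wav_file.getframerate(), wav_file.getnframes()

print(frame_rate, n_frames)
```

With a real generation, the list passed to write_wav_int16 would be audio_array.tolist().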

BarkConfig

class transformers.BarkConfig

( semantic_config: typing.Optional[dict] = None coarse_acoustics_config: typing.Optional[dict] = None fine_acoustics_config: typing.Optional[dict] = None codec_config: typing.Optional[dict] = None initializer_range = 0.02 **kwargs )

Parameters

This is the configuration class to store the configuration of a BarkModel. It is used to instantiate a Bark model according to the specified sub-model configurations, defining the model architecture.

Instantiating a configuration with the defaults will yield a similar configuration to that of the Bark suno/bark architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

from_sub_model_configs

( semantic_config: BarkSemanticConfig coarse_acoustics_config: BarkCoarseConfig fine_acoustics_config: BarkFineConfig codec_config: PretrainedConfig **kwargs ) BarkConfig

Returns a BarkConfig: an instance of a configuration object.

Instantiate a BarkConfig (or a derived class) from Bark sub-model configurations.

BarkProcessor

class transformers.BarkProcessor

( tokenizer speaker_embeddings = None )

Parameters

Constructs a Bark processor which wraps a text tokenizer and optional Bark voice presets into a single processor.

__call__

( text = None voice_preset = None return_tensors = 'pt' max_length = 256 add_special_tokens = False return_attention_mask = True return_token_type_ids = False **kwargs ) Tuple(BatchEncoding, BatchFeature)

Parameters

Returns a tuple composed of a BatchEncoding, i.e. the output of the tokenizer, and a BatchFeature, i.e. the voice preset with the right tensor type.

Main method to prepare one or several sequence(s) for the model. This method forwards the text and kwargs arguments to the AutoTokenizer’s __call__() to encode the text. The method also accepts a voice preset, which is a dictionary of arrays that conditions Bark’s output. kwargs arguments are forwarded to the tokenizer and, if voice_preset is a valid filename, to the cached_file method.
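
Since a voice preset is a dictionary of arrays, it can help to check its layout before passing it in. The sketch below assumes the preset holds the keys semantic_prompt, coarse_prompt and fine_prompt with 1, 2 and 2 dimensions respectively. That layout and the helper names are assumptions of this sketch and should be checked against the library. Nested lists stand in for arrays so that the example needs no dependencies.

```python
# Assumed layout of a voice preset: key -> expected number of array dimensions.
EXPECTED_NDIM = {"semantic_prompt": 1, "coarse_prompt": 2, "fine_prompt": 2}


def ndim(value) -> int:
    """Number of nesting levels of a (rectangular) nested list."""
    depth = 0
    while isinstance(value, list):
        depth += 1
        value = value[0] if value else None
    return depth


def validate_voice_preset(preset: dict) -> None:
    """Raise ValueError if a key is missing or an array has the wrong rank."""
    for key, expected in EXPECTED_NDIM.items():
        if key not in preset:
            raise ValueError(f"voice preset is missing the key {key!r}")
        if ndim(preset[key]) != expected:
            raise ValueError(f"{key!r} should have {expected} dimension(s)")


# Toy preset with made-up values.
preset = {
    "semantic_prompt": [10, 11, 12],
    "coarse_prompt": [[1, 2, 3], [4, 5, 6]],
    "fine_prompt": [[1, 2, 3]] * 8,
}
validate_voice_preset(preset)  # passes silently
```

A failed check raises a ValueError that names the offending key, which is easier to act on than a shape error deep inside generation.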

from_pretrained

( pretrained_processor_name_or_path speaker_embeddings_dict_path = 'speaker_embeddings_path.json' **kwargs )

Parameters

Instantiate a Bark processor associated with a pretrained model.

save_pretrained

( save_directory speaker_embeddings_dict_path = 'speaker_embeddings_path.json' speaker_embeddings_directory = 'speaker_embeddings' push_to_hub: bool = False **kwargs )

Parameters

Saves the attributes of this processor (tokenizer…) in the specified directory so that it can be reloaded using the from_pretrained() method.

BarkModel

class transformers.BarkModel

( config )

Parameters

The full Bark model, a text-to-speech model composed of 4 sub-models:

- BarkSemanticModel, the semantic (or text) model
- BarkCoarseModel, the coarse acoustics model
- BarkFineModel, the fine acoustics model
- the codec model set through codec_config

It should be noted that each of the first three modules can support conditional speaker embeddings to condition the output sound according to a specific predefined voice.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
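
The four sub-models run one after the other, each consuming the output of the previous one. The toy sketch below shows only that data flow. Every stage is a stand-in that returns placeholder values, the sequence lengths are not realistic, and the assumption that the coarse stage yields the first two codebooks is stated in the code. The count of 8 codebooks is the default n_codes_total of BarkFineConfig.

```python
N_COARSE_CODEBOOKS = 2  # assumption: the coarse model predicts the first two codebooks
N_CODES_TOTAL = 8       # default `n_codes_total` in BarkFineConfig


def semantic_stage(text_tokens):
    # Stand-in: one semantic token per text token.
    return [0 for _ in text_tokens]


def coarse_stage(semantic_tokens):
    # Stand-in: the first codebooks, one row per codebook.
    return [[0] * len(semantic_tokens) for _ in range(N_COARSE_CODEBOOKS)]


def fine_stage(coarse_codes):
    # Stand-in: keep the coarse rows and fill in the remaining codebooks.
    length = len(coarse_codes[0])
    missing = N_CODES_TOTAL - len(coarse_codes)
    return coarse_codes + [[0] * length for _ in range(missing)]


def codec_stage(all_codes):
    # Stand-in: decode the full set of codebooks into a flat waveform.
    return [0.0] * len(all_codes[0])


def generate(text_tokens):
    """Run the four stages in order; each one consumes the previous output."""
    semantic = semantic_stage(text_tokens)
    coarse = coarse_stage(semantic)
    fine = fine_stage(coarse)
    return codec_stage(fine)


print(len(fine_stage(coarse_stage(semantic_stage([1, 2, 3])))))  # number of codebooks
```

Because only one stage is active at a time, the idle sub-models do not need to sit on the accelerator, which is what enable_cpu_offload exploits.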

generate

( input_ids: typing.Optional[torch.Tensor] = None history_prompt: typing.Optional[dict[str, torch.Tensor]] = None return_output_lengths: typing.Optional[bool] = None **kwargs ) By default

Parameters

Generates audio from an input prompt and an additional optional Bark speaker prompt.

Example:

>>> from transformers import AutoProcessor, BarkModel

>>> processor = AutoProcessor.from_pretrained("suno/bark-small")
>>> model = BarkModel.from_pretrained("suno/bark-small")

>>> # Pass a voice preset to the processor to condition the voice
>>> voice_preset = "v2/en_speaker_6"

>>> inputs = processor("Hello, my dog is cute, I need him in my life", voice_preset=voice_preset)

>>> audio_array = model.generate(**inputs, semantic_max_new_tokens=100)
>>> audio_array = audio_array.cpu().numpy().squeeze()

enable_cpu_offload

( accelerator_id: typing.Optional[int] = 0 **kwargs )

Parameters

Offloads all sub-models to CPU using accelerate, reducing memory usage with a low impact on performance. This method moves one whole sub-model at a time to the accelerator when it is used, and the sub-model remains on the accelerator until the next sub-model runs.
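
The behaviour described above can be pictured with a small simulation: at most one sub-model sits on the accelerator, and it is swapped out only when a different sub-model is about to run. This is a plain-Python model of the bookkeeping, not how accelerate implements it. The class name and the sub-model sizes are hypothetical.

```python
class OffloadSimulator:
    """Tracks which sub-model sits on the accelerator and the peak memory used."""

    def __init__(self, sizes_mib: dict):
        self.sizes_mib = sizes_mib
        self.on_accelerator = None  # at most one sub-model at a time
        self.peak_mib = 0
        self.log = []

    def run(self, name: str) -> None:
        if self.on_accelerator != name:
            # Evict the previous sub-model only now, when another one is needed.
            if self.on_accelerator is not None:
                self.log.append(f"{self.on_accelerator} -> cpu")
            self.log.append(f"{name} -> accelerator")
            self.on_accelerator = name
        self.peak_mib = max(self.peak_mib, self.sizes_mib[name])


# Hypothetical sizes in MiB, for illustration only.
sizes = {"semantic": 300, "coarse": 300, "fine": 300, "codec": 60}

sim = OffloadSimulator(sizes)
for stage in ["semantic", "coarse", "fine", "codec"]:
    sim.run(stage)

print(sim.log)
print(f"peak with offload: {sim.peak_mib} MiB, without: {sum(sizes.values())} MiB")
```

With offloading, the peak is the size of the largest sub-model rather than the sum of all of them.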

BarkSemanticModel

class transformers.BarkSemanticModel

( config )

Parameters

Bark semantic (or text) model. It shares the same architecture as the coarse model. It is a GPT-2-like autoregressive model with a language modeling head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

( input_ids: typing.Optional[torch.Tensor] = None past_key_values: typing.Optional[tuple[torch.FloatTensor]] = None attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.Tensor] = None head_mask: typing.Optional[torch.Tensor] = None labels: typing.Optional[torch.LongTensor] = None input_embeds: typing.Optional[torch.Tensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)

Parameters

A transformers.modeling_outputs.CausalLMOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (BarkConfig) and inputs.

The BarkCausalModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
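
The difference between calling the instance and calling forward directly can be shown with a simplified stand-in for torch.nn.Module. The class TinyModule and its hook lists are hypothetical, and real PyTorch hooks have different signatures. The point is only that __call__ wraps forward with extra steps.

```python
class TinyModule:
    """Simplified stand-in for torch.nn.Module: __call__ wraps forward with hooks."""

    def __init__(self):
        self.pre_hooks = []
        self.post_hooks = []

    def forward(self, x):
        return x * 2

    def __call__(self, x):
        for hook in self.pre_hooks:
            x = hook(x)
        out = self.forward(x)
        for hook in self.post_hooks:
            out = hook(out)
        return out


module = TinyModule()
module.pre_hooks.append(lambda x: x + 1)    # pre-processing step
module.post_hooks.append(lambda out: -out)  # post-processing step

print(module(3))          # hooks run: -((3 + 1) * 2)
print(module.forward(3))  # hooks skipped: 3 * 2
```

Calling module(3) applies both hooks, while module.forward(3) silently bypasses them.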

BarkCoarseModel

class transformers.BarkCoarseModel

( config )

Parameters

Bark coarse acoustics model. It shares the same architecture as the semantic (or text) model. It is a GPT-2-like autoregressive model with a language modeling head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

( input_ids: typing.Optional[torch.Tensor] = None past_key_values: typing.Optional[tuple[torch.FloatTensor]] = None attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.Tensor] = None head_mask: typing.Optional[torch.Tensor] = None labels: typing.Optional[torch.LongTensor] = None input_embeds: typing.Optional[torch.Tensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)

Parameters

A transformers.modeling_outputs.CausalLMOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (BarkConfig) and inputs.

The BarkCausalModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

BarkFineModel

class transformers.BarkFineModel

( config )

Parameters

Bark fine acoustics model. It is a non-causal GPT-like model with config.n_codes_total embedding layers and language modeling heads, one for each codebook.
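
The practical difference between the causal semantic and coarse models and the non-causal fine model lies in which positions each token may attend to. Here is a minimal sketch of the two mask shapes, written with plain lists. It illustrates the concept and is not the library's implementation.

```python
def attention_mask(seq_len: int, causal: bool) -> list:
    """mask[i][j] is 1 when position i may attend to position j."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]


causal = attention_mask(4, causal=True)       # as in the semantic and coarse models
non_causal = attention_mask(4, causal=False)  # as in the fine model

for row in causal:
    print(row)
```

In the causal case each position sees only itself and earlier positions, whereas the non-causal mask lets every position see the whole sequence.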

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

( codebook_idx: int input_ids: typing.Optional[torch.Tensor] = None attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.Tensor] = None head_mask: typing.Optional[torch.Tensor] = None labels: typing.Optional[torch.LongTensor] = None input_embeds: typing.Optional[torch.Tensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) transformers.modeling_outputs.MaskedLMOutput or tuple(torch.FloatTensor)

Parameters

A transformers.modeling_outputs.MaskedLMOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (BarkConfig) and inputs.

The BarkFineModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

BarkCausalModel

class transformers.BarkCausalModel

( config )

forward

( input_ids: typing.Optional[torch.Tensor] = None past_key_values: typing.Optional[tuple[torch.FloatTensor]] = None attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.Tensor] = None head_mask: typing.Optional[torch.Tensor] = None labels: typing.Optional[torch.LongTensor] = None input_embeds: typing.Optional[torch.Tensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)

Parameters

A transformers.modeling_outputs.CausalLMOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (BarkConfig) and inputs.

The BarkCausalModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

BarkCoarseConfig

class transformers.BarkCoarseConfig

( block_size = 1024 input_vocab_size = 10048 output_vocab_size = 10048 num_layers = 12 num_heads = 12 hidden_size = 768 dropout = 0.0 bias = True initializer_range = 0.02 use_cache = True **kwargs )

Parameters

This is the configuration class to store the configuration of a BarkCoarseModel. It is used to instantiate the model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the Bark suno/bark architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
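
The default values in the signature above give a feel for the size of the resulting model. The sketch below applies the usual GPT-2-style rule of thumb of about 12 × hidden_size² weights per layer, plus embeddings and an output head. It ignores biases and layer norms and assumes a 4x MLP expansion and an output head that is not tied to the input embeddings, so the result is an estimate rather than an exact count.

```python
def estimate_parameters(num_layers, hidden_size, input_vocab_size, output_vocab_size, block_size):
    """Rough GPT-2-style count: biases and layer norms are ignored."""
    per_layer = 12 * hidden_size**2                # attention (4 d^2) + MLP (8 d^2)
    token_embeddings = input_vocab_size * hidden_size
    position_embeddings = block_size * hidden_size
    output_head = output_vocab_size * hidden_size  # assumed not tied to the embeddings
    return num_layers * per_layer + token_embeddings + position_embeddings + output_head


# Default values taken from the BarkCoarseConfig signature.
defaults = dict(
    num_layers=12,
    hidden_size=768,
    input_vocab_size=10_048,
    output_vocab_size=10_048,
    block_size=1024,
)
total = estimate_parameters(**defaults)

print(f"{total:,} parameters, about {total * 2 / 2**20:.0f} MiB in float16")
```

The transformer layers dominate the total, so doubling hidden_size would roughly quadruple the estimate.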

Example:

>>> from transformers import BarkCoarseConfig, BarkCoarseModel

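>>> # Initialize a configuration with the default values
>>> configuration = BarkCoarseConfig()

>>> # Initialize a model (with random weights) from that configuration
>>> model = BarkCoarseModel(configuration)

>>> # Access the model configuration
>>> configuration = model.config

BarkFineConfig

class transformers.BarkFineConfig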

( tie_word_embeddings = True n_codes_total = 8 n_codes_given = 1 **kwargs )

Parameters

This is the configuration class to store the configuration of a BarkFineModel. It is used to instantiate the model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the Bark suno/bark architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Example:

>>> from transformers import BarkFineConfig, BarkFineModel

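>>> # Initialize a configuration with the default values
>>> configuration = BarkFineConfig()

>>> # Initialize a model (with random weights) from that configuration
>>> model = BarkFineModel(configuration)

>>> # Access the model configuration
>>> configuration = model.config

BarkSemanticConfig

class transformers.BarkSemanticConfig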

( block_size = 1024 input_vocab_size = 10048 output_vocab_size = 10048 num_layers = 12 num_heads = 12 hidden_size = 768 dropout = 0.0 bias = True initializer_range = 0.02 use_cache = True **kwargs )

Parameters

This is the configuration class to store the configuration of a BarkSemanticModel. It is used to instantiate the model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the Bark suno/bark architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Example:

>>> from transformers import BarkSemanticConfig, BarkSemanticModel

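>>> # Initialize a configuration with the default values
>>> configuration = BarkSemanticConfig()

>>> # Initialize a model (with random weights) from that configuration
>>> model = BarkSemanticModel(configuration)

>>> # Access the model configuration
>>> configuration = model.config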