A RetroSearch Logo

Home - News ( United States | United Kingdom | Italy | Germany ) - Football scores

Search Query:

Showing content from https://huggingface.co/docs/transformers/v4.51.3/en/model_doc/dac below:

Website Navigation


DAC

DAC Overview

The DAC model was proposed in Descript Audio Codec: High-Fidelity Audio Compression with Improved RVQGAN by Rithesh Kumar, Prem Seetharaman, Alejandro Luebs, Ishaan Kumar, Kundan Kumar.

The Descript Audio Codec (DAC) model is a powerful tool for compressing audio data, making it highly efficient for storage and transmission. By compressing 44.1 KHz audio into tokens at just 8kbps bandwidth, the DAC model enables high-quality audio processing while significantly reducing the data footprint. This is particularly useful in scenarios where bandwidth is limited or storage space is at a premium, such as in streaming applications, remote conferencing, and archiving large audio datasets.

The abstract from the paper is the following:

Language models have been successfully used to model natural signals, such as images, speech, and music. A key component of these models is a high quality neural compression model that can compress high-dimensional natural signals into lower dimensional discrete tokens. To that end, we introduce a high-fidelity universal neural audio compression algorithm that achieves ~90x compression of 44.1 KHz audio into tokens at just 8kbps bandwidth. We achieve this by combining advances in high-fidelity audio generation with better vector quantization techniques from the image domain, along with improved adversarial and reconstruction losses. We compress all domains (speech, environment, music, etc.) with a single universal model, making it widely applicable to generative modeling of all audio. We compare with competing audio compression algorithms, and find our method outperforms them significantly. We provide thorough ablations for every design choice, as well as open-source code and trained model weights. We hope our work can lay the foundation for the next generation of high-fidelity audio modeling.

This model was contributed by Kamil Akesbi. The original code can be found here.

Model structure

The Descript Audio Codec (DAC) model is structured into three distinct stages:

  1. Encoder Model: This stage compresses the input audio, reducing its size while retaining essential information.
  2. Residual Vector Quantizer (RVQ) Model: Working in tandem with the encoder, this model quantizes the latent codes of the audio, refining the compression and ensuring high-quality reconstruction.
  3. Decoder Model: This final stage reconstructs the audio from its compressed form, restoring it to a state that closely resembles the original input.
Usage example

Here is a quick example of how to encode and decode an audio using this model:

>>> from datasets import load_dataset, Audio
>>> from transformers import DacModel, AutoProcessor
>>> librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")

>>> model = DacModel.from_pretrained("descript/dac_16khz")
>>> processor = AutoProcessor.from_pretrained("descript/dac_16khz")
>>> librispeech_dummy = librispeech_dummy.cast_column("audio", Audio(sampling_rate=processor.sampling_rate))
>>> audio_sample = librispeech_dummy[-1]["audio"]["array"]
>>> inputs = processor(raw_audio=audio_sample, sampling_rate=processor.sampling_rate, return_tensors="pt")

>>> encoder_outputs = model.encode(inputs["input_values"])
>>> 
>>> audio_codes = encoder_outputs.audio_codes
>>> 
>>> audio_values = model.decode(encoder_outputs.quantized_representation)
>>> 
>>> audio_values = model(inputs["input_values"]).audio_values
DacConfig class transformers.DacConfig < source >

( encoder_hidden_size = 64 downsampling_ratios = [2, 4, 8, 8] decoder_hidden_size = 1536 n_codebooks = 9 codebook_size = 1024 codebook_dim = 8 quantizer_dropout = 0 commitment_loss_weight = 0.25 codebook_loss_weight = 1.0 sampling_rate = 16000 **kwargs )

Parameters

This is the configuration class to store the configuration of an DacModel. It is used to instantiate a Dac model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the descript/dac_16khz architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Example:

>>> from transformers import DacModel, DacConfig

>>> 
>>> configuration = DacConfig()

>>> 
>>> model = DacModel(configuration)

>>> 
>>> configuration = model.config
DacFeatureExtractor

( feature_size: int = 1 sampling_rate: int = 16000 padding_value: float = 0.0 hop_length: int = 512 **kwargs )

Parameters

Constructs an Dac feature extractor.

This feature extractor inherits from SequenceFeatureExtractor which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

( raw_audio: typing.Union[numpy.ndarray, typing.List[float], typing.List[numpy.ndarray], typing.List[typing.List[float]]] padding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy, NoneType] = None truncation: typing.Optional[bool] = False max_length: typing.Optional[int] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None sampling_rate: typing.Optional[int] = None )

Parameters

Main method to featurize and prepare for the model one or several sequence(s).

DacModel class transformers.DacModel < source >

( config: DacConfig )

Parameters

The DAC (Descript Audio Codec) model. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

decode < source >

( quantized_representation: typing.Optional[torch.Tensor] = None audio_codes: typing.Optional[torch.Tensor] = None return_dict: typing.Optional[bool] = None ) transformers.models.dac.modeling_dac.DacDecoderOutput or tuple(torch.FloatTensor)

Parameters

Returns

transformers.models.dac.modeling_dac.DacDecoderOutput or tuple(torch.FloatTensor)

A transformers.models.dac.modeling_dac.DacDecoderOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (DacConfig) and inputs.

Decode given latent codes and return audio data

encode < source >

( input_values: Tensor n_quantizers: typing.Optional[int] = None return_dict: typing.Optional[bool] = None ) transformers.models.dac.modeling_dac.DacEncoderOutput or tuple(torch.FloatTensor)

Parameters

Returns

transformers.models.dac.modeling_dac.DacEncoderOutput or tuple(torch.FloatTensor)

A transformers.models.dac.modeling_dac.DacEncoderOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (DacConfig) and inputs.

Encode given audio data and return quantized latent codes

forward < source >

( input_values: Tensor n_quantizers: typing.Optional[int] = None return_dict: typing.Optional[bool] = None ) transformers.models.dac.modeling_dac.DacOutput or tuple(torch.FloatTensor)

Parameters

Returns

transformers.models.dac.modeling_dac.DacOutput or tuple(torch.FloatTensor)

A transformers.models.dac.modeling_dac.DacOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (DacConfig) and inputs.

The DacModel forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Examples:

>>> from datasets import load_dataset, Audio
>>> from transformers import DacModel, AutoProcessor
>>> librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")

>>> model = DacModel.from_pretrained("descript/dac_16khz")
>>> processor = AutoProcessor.from_pretrained("descript/dac_16khz")
>>> librispeech_dummy = librispeech_dummy.cast_column("audio", Audio(sampling_rate=processor.sampling_rate))
>>> audio_sample = librispeech_dummy[-1]["audio"]["array"]
>>> inputs = processor(raw_audio=audio_sample, sampling_rate=processor.sampling_rate, return_tensors="pt")

>>> encoder_outputs = model.encode(inputs["input_values"])
>>> 
>>> audio_codes = encoder_outputs.audio_codes
>>> 
>>> audio_values = model.decode(encoder_outputs.quantized_representation)
>>> 
>>> audio_values = model(inputs["input_values"]).audio_values
< > Update on GitHub

RetroSearch is an open source project built by @garambo | Open a GitHub Issue

Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo

HTML: 3.2 | Encoding: UTF-8 | Version: 0.7.3