Kyutai Speech-To-Text

Kyutai Speech-To-Text Overview

Kyutai STT is a speech-to-text model architecture based on the Mimi codec, which encodes audio into discrete tokens in a streaming fashion, and a Moshi-like autoregressive decoder. Kyutai’s lab has released two model checkpoints:

- kyutai/stt-1b-en_fr: a 1B-parameter model that transcribes English and French
- kyutai/stt-2.6b-en: a 2.6B-parameter, English-only model (used in the examples below)
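The two-part architecture is reflected in the model configuration, which nests a codec configuration inside the decoder configuration. A minimal sketch for inspecting that split (attribute names are taken from the KyutaiSpeechToTextConfig signature documented below; the sketch only prints values and asserts nothing):

from transformers import KyutaiSpeechToTextConfig

config = KyutaiSpeechToTextConfig()

# Decoder (Moshi-like autoregressive transformer) hyperparameters, per the signature below
print(config.hidden_size, config.num_hidden_layers, config.num_codebooks)

# Nested configuration for the Mimi audio codec (the codec_config parameter below)
print(type(config.codec_config).__name__)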

Usage Tips

Inference
import torch
from datasets import load_dataset, Audio
from transformers import KyutaiSpeechToTextProcessor, KyutaiSpeechToTextForConditionalGeneration

torch_device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "kyutai/stt-2.6b-en"

# Load the processor (feature extractor + tokenizer) and the model
processor = KyutaiSpeechToTextProcessor.from_pretrained(model_id)
model = KyutaiSpeechToTextForConditionalGeneration.from_pretrained(model_id, device_map=torch_device)

# Load a small ASR dataset and resample it to the model's 24 kHz sampling rate
ds = load_dataset(
    "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
)
ds = ds.cast_column("audio", Audio(sampling_rate=24000))

# Prepare a single audio example and move it to the target device
inputs = processor(
    ds[0]["audio"]["array"],
)
inputs = inputs.to(torch_device)

# Generate token ids and decode them to text
output_tokens = model.generate(**inputs)

print(processor.batch_decode(output_tokens, skip_special_tokens=True))
Batched Inference
import torch
from datasets import load_dataset, Audio
from transformers import KyutaiSpeechToTextProcessor, KyutaiSpeechToTextForConditionalGeneration

torch_device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "kyutai/stt-2.6b-en"

processor = KyutaiSpeechToTextProcessor.from_pretrained(model_id)
model = KyutaiSpeechToTextForConditionalGeneration.from_pretrained(model_id, device_map=torch_device)

ds = load_dataset(
    "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
)
ds = ds.cast_column("audio", Audio(sampling_rate=24000))

# Batch four examples together; padding aligns audio clips of different lengths
audio_arrays = [ds[i]["audio"]["array"] for i in range(4)]
inputs = processor(audio_arrays, return_tensors="pt", padding=True)
inputs = inputs.to(torch_device)

output_tokens = model.generate(**inputs)

# Decode each sequence in the batch separately
decoded_outputs = processor.batch_decode(output_tokens, skip_special_tokens=True)
for output in decoded_outputs:
    print(output)

This model was contributed by Eustache Le Bihan. The original code can be found here.

KyutaiSpeechToTextConfig

class transformers.KyutaiSpeechToTextConfig

( codebook_vocab_size = 2049, vocab_size = 4001, hidden_size = 2048, num_hidden_layers = 48, num_attention_heads = 32, num_key_value_heads = None, max_position_embeddings = 750, rope_theta = 100000.0, hidden_act = 'silu', head_dim = None, initializer_range = 0.02, use_cache = True, sliding_window = 375, attention_dropout = 0.0, ffn_dim = 11264, rms_norm_eps = 1e-08, num_codebooks = 32, audio_bos_token_id = 2048, audio_pad_token_id = 69569, tie_word_embeddings = False, pad_token_id = 3, bos_token_id = 48000, codec_config = None, **kwargs )


This is the configuration class to store the configuration of a KyutaiSpeechToTextForConditionalGeneration. It is used to instantiate a Kyutai Speech-to-Text model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the 2.6b-en model, e.g. kyutai/stt-2.6b-en.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Example:

>>> from transformers import KyutaiSpeechToTextConfig, KyutaiSpeechToTextForConditionalGeneration

>>> # Initializing a KyutaiSpeechToTextConfig with default (2.6b-en style) values
>>> configuration = KyutaiSpeechToTextConfig()

>>> # Initializing a model (with random weights) from the configuration
>>> model = KyutaiSpeechToTextForConditionalGeneration(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
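
Individual hyperparameters from the signature above can also be overridden at construction time. A hedged sketch (the values are illustrative, and a model built this way is randomly initialized, not pretrained):

>>> # Hypothetical smaller decoder for quick experiments
>>> configuration = KyutaiSpeechToTextConfig(num_hidden_layers=4, num_attention_heads=8)
>>> model = KyutaiSpeechToTextForConditionalGeneration(configuration)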
KyutaiSpeechToTextProcessor

class transformers.KyutaiSpeechToTextProcessor

( *args **kwargs )

Constructs a Moshi ASR processor which wraps EncodecFeatureExtractor and PreTrainedTokenizerFast into a single processor that inherits both the audio feature extraction and tokenizer functionalities. See __call__() for more information.

__call__

( audio: typing.Union[numpy.ndarray, ForwardRef('torch.Tensor'), list[numpy.ndarray], tuple[numpy.ndarray], list['torch.Tensor'], tuple['torch.Tensor'], NoneType] = None, **kwargs: typing_extensions.Unpack[transformers.models.stt.processing_kyutai_speech_to_text.KyutaiSpeechToTextProcessorKwargs] ) → BatchFeature


Returns a BatchFeature containing the processed audio inputs ready to be passed to the model.

Main method to prepare audio to be fed as input to the model. This method forwards the audio arguments to KyutaiSpeechToTextFeatureExtractor’s __call__(). Please refer to the docstring of the above method for more information.
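
Since __call__ returns a BatchFeature, the prepared tensors can be inspected like a dictionary. A minimal, hedged sketch (the silence waveform is an illustrative stand-in for real audio, and the field names are printed rather than assumed, since this page does not list them):

import numpy as np
from transformers import KyutaiSpeechToTextProcessor

processor = KyutaiSpeechToTextProcessor.from_pretrained("kyutai/stt-2.6b-en")

# One second of silence as a stand-in for a real 24 kHz mono waveform
audio_array = np.zeros(24000, dtype=np.float32)

inputs = processor(audio_array)
# BatchFeature behaves like a dict of model-ready tensors
print(list(inputs.keys()))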

KyutaiSpeechToTextFeatureExtractor

class transformers.KyutaiSpeechToTextFeatureExtractor

( feature_size: int = 1, sampling_rate: int = 24000, padding_value: float = 0.0, chunk_length_s: typing.Optional[float] = None, overlap: typing.Optional[float] = None, audio_delay_seconds: typing.Optional[float] = 0.0, audio_silence_prefix_seconds: typing.Optional[float] = 0.0, **kwargs )


Constructs a KyutaiSpeechToText feature extractor.

This feature extractor inherits from SequenceFeatureExtractor which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.
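
In practice the feature extractor is usually reached through the processor rather than constructed directly. A short sketch, assuming the standard ProcessorMixin feature_extractor attribute (an assumption, not stated on this page):

from transformers import KyutaiSpeechToTextProcessor

processor = KyutaiSpeechToTextProcessor.from_pretrained("kyutai/stt-2.6b-en")

# Assumption: the feature extractor is exposed under the conventional attribute name
feature_extractor = processor.feature_extractor
print(feature_extractor.sampling_rate)  # 24000, matching the Audio cast in the examples above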

KyutaiSpeechToTextForConditionalGeneration

class transformers.KyutaiSpeechToTextForConditionalGeneration

( config )


The Kyutai Speech To Text Model for token generation conditioned on other modalities (e.g. audio-to-text generation).

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.)

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

( input_ids: typing.Optional[torch.LongTensor] = None, attention_mask: typing.Optional[torch.Tensor] = None, position_ids: typing.Optional[torch.LongTensor] = None, past_key_values: typing.Optional[transformers.cache_utils.Cache] = None, inputs_embeds: typing.Optional[torch.FloatTensor] = None, labels: typing.Optional[torch.LongTensor] = None, use_cache: typing.Optional[bool] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, cache_position: typing.Optional[torch.LongTensor] = None, logits_to_keep: typing.Union[int, torch.Tensor] = 0, **kwargs: typing_extensions.Unpack[transformers.models.stt.modeling_kyutai_speech_to_text.KwargsForCausalLM] ) → transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)


A transformers.modeling_outputs.CausalLMOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (KyutaiSpeechToTextConfig) and inputs.

The KyutaiSpeechToTextForConditionalGeneration forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example:

>>> import torch
>>> from datasets import load_dataset, Audio
>>> from transformers import KyutaiSpeechToTextProcessor, KyutaiSpeechToTextForConditionalGeneration

>>> torch_device = "cuda" if torch.cuda.is_available() else "cpu"
>>> model_id = "kyutai/stt-2.6b-en"

>>> processor = KyutaiSpeechToTextProcessor.from_pretrained(model_id)
>>> model = KyutaiSpeechToTextForConditionalGeneration.from_pretrained(model_id, device_map=torch_device)

>>> ds = load_dataset(
...     "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
... )

>>> ds = ds.cast_column("audio", Audio(sampling_rate=24000))
>>> inputs = processor(
...     ds[0]["audio"]["array"],
... )
>>> inputs = inputs.to(torch_device)

>>> output_tokens = model.generate(**inputs)
>>> print(processor.batch_decode(output_tokens, skip_special_tokens=True))

generate

This method forwards all its arguments to GenerationMixin’s generate(). Please refer to the docstring of this method for more information.
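
Because the arguments are forwarded to GenerationMixin’s generate(), standard generation parameters can be passed through. A hedged sketch reusing model, processor, and inputs from the example above (max_new_tokens is illustrative, not a recommended value):

>>> # Standard GenerationMixin arguments are forwarded unchanged
>>> output_tokens = model.generate(**inputs, max_new_tokens=128)
>>> print(processor.batch_decode(output_tokens, skip_special_tokens=True))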

KyutaiSpeechToTextModel

class transformers.KyutaiSpeechToTextModel

( config )


The bare Kyutai Speech To Text Model outputting raw hidden-states without any specific head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.)

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

( input_ids: typing.Optional[torch.LongTensor] = None, attention_mask: typing.Optional[torch.Tensor] = None, position_ids: typing.Optional[torch.LongTensor] = None, past_key_values: typing.Union[list[torch.FloatTensor], transformers.cache_utils.Cache, NoneType] = None, inputs_embeds: typing.Optional[torch.FloatTensor] = None, use_cache: typing.Optional[bool] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None, cache_position: typing.Optional[torch.LongTensor] = None ) → transformers.modeling_outputs.BaseModelOutputWithPast or tuple(torch.FloatTensor)


A transformers.modeling_outputs.BaseModelOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (KyutaiSpeechToTextConfig) and inputs.

The KyutaiSpeechToTextModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
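
As with other architectures in the library, the bare model typically serves as the backbone of KyutaiSpeechToTextForConditionalGeneration rather than being called on its own. A sketch under that assumption (the backbone attribute name is a library convention, not confirmed by this page):

from transformers import KyutaiSpeechToTextForConditionalGeneration

model = KyutaiSpeechToTextForConditionalGeneration.from_pretrained("kyutai/stt-2.6b-en")

# Assumption: the bare KyutaiSpeechToTextModel is reachable via the conventional
# `model` attribute of the conditional-generation wrapper
backbone = model.model
print(type(backbone).__name__)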
