A RetroSearch Logo

Home - News ( United States | United Kingdom | Italy | Germany ) - Football scores

Search Query:

Showing content from https://huggingface.co/docs/transformers/v4.51.3/en/model_doc/seamless_m4t_v2 below:

Website Navigation


SeamlessM4T-v2

SeamlessM4T-v2 Overview

The SeamlessM4T-v2 model was proposed in Seamless: Multilingual Expressive and Streaming Speech Translation by the Seamless Communication team from Meta AI.

SeamlessM4T-v2 is a collection of models designed to provide high quality translation, allowing people from different linguistic communities to communicate effortlessly through speech and text. It is an improvement on the previous version. For more details on the differences between v1 and v2, refer to section Difference with SeamlessM4T-v1.

SeamlessM4T-v2 enables multiple tasks without relying on separate models:

SeamlessM4Tv2Model can perform all the above tasks, but each task also has its own dedicated sub-model.

The abstract from the paper is the following:

Recent advancements in automatic speech translation have dramatically expanded language coverage, improved multimodal capabilities, and enabled a wide range of tasks and functionalities. That said, large-scale automatic speech translation systems today lack key features that help machine-mediated communication feel seamless when compared to human-to-human dialogue. In this work, we introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion. First, we contribute an improved version of the massively multilingual and multimodal SeamlessM4T model—SeamlessM4T v2. This newer model, incorporating an updated UnitY2 framework, was trained on more low-resource language data. The expanded version of SeamlessAlign adds 114,800 hours of automatically aligned data for a total of 76 languages. SeamlessM4T v2 provides the foundation on which our two newest models, SeamlessExpressive and SeamlessStreaming, are initiated. SeamlessExpressive enables translation that preserves vocal styles and prosody. Compared to previous efforts in expressive speech research, our work addresses certain underexplored aspects of prosody, such as speech rate and pauses, while also preserving the style of one’s voice. As for SeamlessStreaming, our model leverages the Efficient Monotonic Multihead Attention (EMMA) mechanism to generate low-latency target translations without waiting for complete source utterances. As the first of its kind, SeamlessStreaming enables simultaneous speech-to-speech/text translation for multiple source and target languages. To understand the performance of these models, we combined novel and modified versions of existing automatic metrics to evaluate prosody, latency, and robustness. For human evaluations, we adapted existing protocols tailored for measuring the most relevant attributes in the preservation of meaning, naturalness, and expressivity. To ensure that our models can be used safely and responsibly, we implemented the first known red-teaming effort for multimodal machine translation, a system for the detection and mitigation of added toxicity, a systematic evaluation of gender bias, and an inaudible localized watermarking mechanism designed to dampen the impact of deepfakes. Consequently, we bring major components from SeamlessExpressive and SeamlessStreaming together to form Seamless, the first publicly available system that unlocks expressive cross-lingual communication in real-time. In sum, Seamless gives us a pivotal look at the technical foundation needed to turn the Universal Speech Translator from a science fiction concept into a real-world technology. Finally, contributions in this work—including models, code, and a watermark detector—are publicly released and accessible at the link below.

Usage

In the following example, we’ll load an Arabic audio sample and an English text sample and convert them into Russian speech and French text.

First, load the processor and a checkpoint of the model:

>>> from transformers import AutoProcessor, SeamlessM4Tv2Model

>>> processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
>>> model = SeamlessM4Tv2Model.from_pretrained("facebook/seamless-m4t-v2-large")

You can seamlessly use this model on text or on audio, to generated either translated text or translated audio.

Here is how to use the processor to process text and audio:

>>> 
>>> from datasets import load_dataset
>>> dataset = load_dataset("arabic_speech_corpus", split="test", streaming=True, trust_remote_code=True)
>>> audio_sample = next(iter(dataset))["audio"]

>>> 
>>> audio_inputs = processor(audios=audio_sample["array"], return_tensors="pt")

>>> 
>>> text_inputs = processor(text = "Hello, my dog is cute", src_lang="eng", return_tensors="pt")
Speech

SeamlessM4Tv2Model can seamlessly generate text or speech with few or no changes. Let’s target Russian voice translation:

>>> audio_array_from_text = model.generate(**text_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()
>>> audio_array_from_audio = model.generate(**audio_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()

With basically the same code, I’ve translated English text and Arabic speech to Russian speech samples.

Text

Similarly, you can generate translated text from audio files or from text with the same model. You only have to pass generate_speech=False to SeamlessM4Tv2Model.generate(). This time, let’s translate to French.

>>> 
>>> output_tokens = model.generate(**audio_inputs, tgt_lang="fra", generate_speech=False)
>>> translated_text_from_audio = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)

>>> 
>>> output_tokens = model.generate(**text_inputs, tgt_lang="fra", generate_speech=False)
>>> translated_text_from_text = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
Tips 1. Use dedicated models

SeamlessM4Tv2Model is transformers top level model to generate speech and text, but you can also use dedicated models that perform the task without additional components, thus reducing the memory footprint. For example, you can replace the audio-to-audio generation snippet with the model dedicated to the S2ST task, the rest is exactly the same code:

>>> from transformers import SeamlessM4Tv2ForSpeechToSpeech
>>> model = SeamlessM4Tv2ForSpeechToSpeech.from_pretrained("facebook/seamless-m4t-v2-large")

Or you can replace the text-to-text generation snippet with the model dedicated to the T2TT task, you only have to remove generate_speech=False.

>>> from transformers import SeamlessM4Tv2ForTextToText
>>> model = SeamlessM4Tv2ForTextToText.from_pretrained("facebook/seamless-m4t-v2-large")

Feel free to try out SeamlessM4Tv2ForSpeechToText and SeamlessM4Tv2ForTextToSpeech as well.

2. Change the speaker identity

You have the possibility to change the speaker used for speech synthesis with the speaker_id argument. Some speaker_id works better than other for some languages!

3. Change the generation strategy

You can use different generation strategies for text generation, e.g .generate(input_ids=input_ids, text_num_beams=4, text_do_sample=True) which will perform multinomial beam-search decoding on the text model. Note that speech generation only supports greedy - by default - or multinomial sampling, which can be used with e.g. .generate(..., speech_do_sample=True, speech_temperature=0.6).

4. Generate speech and text at the same time

Use return_intermediate_token_ids=True with SeamlessM4Tv2Model to return both speech and text !

Model architecture

SeamlessM4T-v2 features a versatile architecture that smoothly handles the sequential generation of text and speech. This setup comprises two sequence-to-sequence (seq2seq) models. The first model translates the input modality into translated text, while the second model generates speech tokens, known as “unit tokens,” from the translated text.

Each modality has its own dedicated encoder with a unique architecture. Additionally, for speech output, a vocoder inspired by the HiFi-GAN architecture is placed on top of the second seq2seq model.

Difference with SeamlessM4T-v1

The architecture of this new version differs from the first in a few aspects:

Improvements on the second-pass model

The second seq2seq model, named text-to-unit model, is now non-auto regressive, meaning that it computes units in a single forward pass. This achievement is made possible by:

Difference in the speech encoder

The speech encoder, which is used during the first-pass generation process to predict the translated text, differs mainly from the previous speech encoder through these mechanisms:

Generation process

Here’s how the generation process works:

This model was contributed by ylacombe. The original code can be found here.

SeamlessM4Tv2Model class transformers.SeamlessM4Tv2Model < source >

( config current_modality = 'text' )

Parameters

The original SeamlessM4Tv2 Model transformer which can be used for every tasks available (S2ST, S2TT, T2TT, T2ST). This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

generate < source >

( input_ids: typing.Optional[torch.Tensor] = None input_features: typing.Optional[torch.Tensor] = None return_intermediate_token_ids: typing.Optional[bool] = None tgt_lang: typing.Optional[str] = None speaker_id: typing.Optional[int] = 0 generate_speech: typing.Optional[bool] = True **kwargs ) Union[SeamlessM4Tv2GenerationOutput, Tuple[Tensor], ModelOutput]

Parameters

Returns

Union[SeamlessM4Tv2GenerationOutput, Tuple[Tensor], ModelOutput]

Generates translated token ids and/or translated audio waveforms.

This method successively calls the .generate function of two different sub-models. You can specify keyword arguments at two different levels: general arguments that will be passed to both models, or prefixed arguments that will be passed to one of them.

For example, calling .generate(input_ids=input_ids, num_beams=4, speech_do_sample=True) will successively perform beam-search decoding on the text model, and multinomial beam-search sampling on the speech model.

For an overview of generation strategies and code examples, check out the following guide.

SeamlessM4Tv2ForTextToSpeech class transformers.SeamlessM4Tv2ForTextToSpeech < source >

( config: SeamlessM4Tv2Config )

Parameters

The text-to-speech SeamlessM4Tv2 Model transformer which can be used for T2ST. This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

generate < source >

( input_ids: typing.Optional[torch.Tensor] = None return_intermediate_token_ids: typing.Optional[bool] = None tgt_lang: typing.Optional[str] = None speaker_id: typing.Optional[int] = 0 **kwargs ) Union[SeamlessM4Tv2GenerationOutput, Tuple[Tensor]]

Parameters

Returns

Union[SeamlessM4Tv2GenerationOutput, Tuple[Tensor]]

Generates translated audio waveforms.

This method successively calls the .generate function of two different sub-models. You can specify keyword arguments at two different levels: general arguments that will be passed to both models, or prefixed arguments that will be passed to one of them.

For example, calling .generate(input_ids, num_beams=4, speech_do_sample=True) will successively perform beam-search decoding on the text model, and multinomial beam-search sampling on the speech model.

For an overview of generation strategies and code examples, check out the following guide.

SeamlessM4Tv2ForSpeechToSpeech class transformers.SeamlessM4Tv2ForSpeechToSpeech < source >

( config )

Parameters

The speech-to-speech SeamlessM4Tv2 Model transformer which can be used for S2ST. This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

generate < source >

( input_features: typing.Optional[torch.Tensor] = None return_intermediate_token_ids: typing.Optional[bool] = None tgt_lang: typing.Optional[str] = None speaker_id: typing.Optional[int] = 0 **kwargs ) Union[SeamlessM4Tv2GenerationOutput, Tuple[Tensor]]

Parameters

Returns

Union[SeamlessM4Tv2GenerationOutput, Tuple[Tensor]]

Generates translated audio waveforms.

This method successively calls the .generate function of two different sub-models. You can specify keyword arguments at two different levels: general arguments that will be passed to both models, or prefixed arguments that will be passed to one of them.

For example, calling .generate(input_features, num_beams=4, speech_do_sample=True) will successively perform beam-search decoding on the text model, and multinomial beam-search sampling on the speech model.

For an overview of generation strategies and code examples, check out the following guide.

SeamlessM4Tv2ForTextToText class transformers.SeamlessM4Tv2ForTextToText < source >

( config: SeamlessM4Tv2Config )

Parameters

The text-to-text SeamlessM4Tv2 Model transformer which can be used for T2TT. This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

forward < source >

( input_ids: typing.Optional[torch.LongTensor] = None attention_mask: typing.Optional[torch.Tensor] = None decoder_input_ids: typing.Optional[torch.LongTensor] = None decoder_attention_mask: typing.Optional[torch.LongTensor] = None encoder_outputs: typing.Optional[typing.Tuple[typing.Tuple[torch.FloatTensor]]] = None past_key_values: typing.Optional[typing.Tuple[typing.Tuple[torch.FloatTensor]]] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None decoder_inputs_embeds: typing.Optional[torch.FloatTensor] = None labels: typing.Optional[torch.LongTensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None **kwargs )

Parameters

The SeamlessM4Tv2ForTextToText forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

generate < source >

( input_ids = None tgt_lang = None generation_config = None logits_processor = None stopping_criteria = None prefix_allowed_tokens_fn = None synced_gpus = False **kwargs ) ModelOutput or torch.LongTensor

Parameters

Returns

ModelOutput or torch.LongTensor

A ModelOutput (if return_dict_in_generate=True or when config.return_dict_in_generate=True) or a torch.FloatTensor. The possible ModelOutput types are:

Generates sequences of token ids.

Most generation-controlling parameters are set in generation_config which, if not passed, will be set to the model’s default generation configuration. You can override any generation_config by passing the corresponding parameters to generate(), e.g. .generate(inputs, num_beams=4, do_sample=True).

For an overview of generation strategies and code examples, check out the following guide.

SeamlessM4Tv2ForSpeechToText class transformers.SeamlessM4Tv2ForSpeechToText < source >

( config: SeamlessM4Tv2Config )

Parameters

The speech-to-text SeamlessM4Tv2 Model transformer which can be used for S2TT. This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

forward < source >

( input_features: typing.Optional[torch.LongTensor] = None attention_mask: typing.Optional[torch.Tensor] = None decoder_input_ids: typing.Optional[torch.LongTensor] = None decoder_attention_mask: typing.Optional[torch.LongTensor] = None encoder_outputs: typing.Optional[typing.Tuple[typing.Tuple[torch.FloatTensor]]] = None past_key_values: typing.Optional[typing.Tuple[typing.Tuple[torch.FloatTensor]]] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None decoder_inputs_embeds: typing.Optional[torch.FloatTensor] = None labels: typing.Optional[torch.LongTensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None **kwargs )

Parameters

The SeamlessM4Tv2ForSpeechToText forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

generate < source >

( input_features = None tgt_lang = None generation_config = None logits_processor = None stopping_criteria = None prefix_allowed_tokens_fn = None synced_gpus = False **kwargs ) ModelOutput or torch.LongTensor

Parameters

Returns

ModelOutput or torch.LongTensor

A ModelOutput (if return_dict_in_generate=True or when config.return_dict_in_generate=True) or a torch.FloatTensor. The possible ModelOutput types are:

Generates sequences of token ids.

Most generation-controlling parameters are set in generation_config which, if not passed, will be set to the model’s default generation configuration. You can override any generation_config by passing the corresponding parameters to generate(), e.g. .generate(inputs, num_beams=4, do_sample=True).

For an overview of generation strategies and code examples, check out the following guide.

SeamlessM4Tv2Config class transformers.SeamlessM4Tv2Config < source >

( vocab_size = 256102 t2u_vocab_size = 10082 char_vocab_size = 10943 hidden_size = 1024 initializer_range = 0.02 layer_norm_eps = 1e-05 use_cache = True max_position_embeddings = 4096 is_encoder_decoder = True encoder_layerdrop = 0.05 decoder_layerdrop = 0.05 activation_function = 'relu' dropout = 0.1 attention_dropout = 0.1 activation_dropout = 0.0 scale_embedding = True encoder_layers = 24 encoder_ffn_dim = 8192 encoder_attention_heads = 16 decoder_layers = 24 decoder_ffn_dim = 8192 decoder_attention_heads = 16 decoder_start_token_id = 3 max_new_tokens = 256 pad_token_id = 0 bos_token_id = 2 eos_token_id = 3 speech_encoder_layers = 24 speech_encoder_attention_heads = 16 speech_encoder_intermediate_size = 4096 speech_encoder_hidden_act = 'swish' speech_encoder_dropout = 0.0 add_adapter = True speech_encoder_layerdrop = 0.1 feature_projection_input_dim = 160 adaptor_kernel_size = 8 adaptor_stride = 8 adaptor_dropout = 0.1 num_adapter_layers = 1 position_embeddings_type = 'relative_key' conv_depthwise_kernel_size = 31 left_max_position_embeddings = 64 right_max_position_embeddings = 8 speech_encoder_chunk_size = 20000 speech_encoder_left_chunk_num = 128 t2u_bos_token_id = 0 t2u_pad_token_id = 1 t2u_eos_token_id = 2 t2u_encoder_layers = 6 t2u_encoder_ffn_dim = 8192 t2u_encoder_attention_heads = 16 t2u_decoder_layers = 6 t2u_decoder_ffn_dim = 8192 t2u_decoder_attention_heads = 16 t2u_max_position_embeddings = 4096 t2u_variance_predictor_embed_dim = 1024 t2u_variance_predictor_hidden_dim = 256 t2u_variance_predictor_kernel_size = 3 t2u_variance_pred_dropout = 0.5 sampling_rate = 16000 upsample_initial_channel = 512 upsample_rates = [5, 4, 4, 2, 2] upsample_kernel_sizes = [11, 8, 8, 4, 4] resblock_kernel_sizes = [3, 7, 11] resblock_dilation_sizes = [[1, 3, 5], [1, 3, 5], [1, 3, 5]] leaky_relu_slope = 0.1 unit_hifi_gan_vocab_size = 10000 unit_embed_dim = 1280 lang_embed_dim = 256 spkr_embed_dim = 256 vocoder_num_langs = 36 vocoder_num_spkrs = 200 variance_predictor_kernel_size = 3 var_pred_dropout = 0.5 vocoder_offset = 4 **kwargs )

Parameters

Parameters shared across sub-models

Text encoder and text decoder specific parameters

Speech encoder specific parameters

Text-To-Unit (t2u) model specific parameters

This is the configuration class to store the configuration of a ~SeamlessM4Tv2Model. It is used to instantiate an SeamlessM4Tv2 model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the SeamlessM4Tv2 "" architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

>>> from transformers import SeamlessM4Tv2Model, SeamlessM4Tv2Config

>>> 
>>> configuration = SeamlessM4Tv2Config()

>>> 
>>> model = SeamlessM4Tv2Model(configuration)

>>> 
>>> configuration = model.config
< > Update on GitHub

RetroSearch is an open source project built by @garambo | Open a GitHub Issue

Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo

HTML: 3.2 | Encoding: UTF-8 | Version: 0.7.3