Helium was proposed in Announcing Helium-1 Preview by the Kyutai Team.
Helium-1 preview is a lightweight language model with 2B parameters, targeting edge and mobile devices. It supports the following languages: English, French, German, Italian, Portuguese, Spanish.
The model was evaluated on MMLU, TriviaQA, NaturalQuestions, ARC Easy & Challenge, Open Book QA, Common Sense QA, Physical Interaction QA, Social Interaction QA, HellaSwag, WinoGrande, Multilingual Knowledge QA, FLORES 200.
Metrics: We report accuracy on MMLU, ARC, OBQA, CSQA, PIQA, SIQA, HellaSwag, and WinoGrande. We report exact match on TriviaQA, NQ, and MKQA. We report BLEU on FLORES.
English Results

| Benchmark | Helium-1 Preview | HF SmolLM2 (1.7B) | Gemma-2 (2.6B) | Llama-3.2 (3B) | Qwen2.5 (1.5B) |
|---|---|---|---|---|---|
| MMLU | 51.2 | 50.4 | 53.1 | 56.6 | 61.0 |
| NQ | 17.3 | 15.1 | 17.7 | 22.0 | 13.1 |
| TQA | 47.9 | 45.4 | 49.9 | 53.6 | 35.9 |
| ARC E | 80.9 | 81.8 | 81.1 | 84.6 | 89.7 |
| ARC C | 62.7 | 64.7 | 66.0 | 69.0 | 77.2 |
| OBQA | 63.8 | 61.4 | 64.6 | 68.4 | 73.8 |
| CSQA | 65.6 | 59.0 | 64.4 | 65.4 | 72.4 |
| PIQA | 77.4 | 77.7 | 79.8 | 78.9 | 76.0 |
| SIQA | 64.4 | 57.5 | 61.9 | 63.8 | 68.7 |
| HS | 69.7 | 73.2 | 74.7 | 76.9 | 67.5 |
| WG | 66.5 | 65.6 | 71.2 | 72.0 | 64.8 |
| Average | 60.7 | 59.3 | 62.2 | 64.7 | 63.6 |

Multilingual Results

| Language | Benchmark | Helium-1 Preview | HF SmolLM2 (1.7B) | Gemma-2 (2.6B) | Llama-3.2 (3B) | Qwen2.5 (1.5B) |
|---|---|---|---|---|---|---|
| German | MMLU | 45.6 | 35.3 | 45.0 | 47.5 | 49.5 |
| German | ARC C | 56.7 | 38.4 | 54.7 | 58.3 | 60.2 |
| German | HS | 53.5 | 33.9 | 53.4 | 53.7 | 42.8 |
| German | MKQA | 16.1 | 7.1 | 18.9 | 20.2 | 10.4 |
| Spanish | MMLU | 46.5 | 38.9 | 46.2 | 49.6 | 52.8 |
| Spanish | ARC C | 58.3 | 43.2 | 58.8 | 60.0 | 68.1 |
| Spanish | HS | 58.6 | 40.8 | 60.5 | 61.1 | 51.4 |
| Spanish | MKQA | 16.0 | 7.9 | 18.5 | 20.6 | 10.6 |

Technical Specifications

Model Architecture and Objective

| Hyperparameter | Value |
|---|---|
| Layers | 24 |
| Heads | 20 |
| Model dimension | 2560 |
| MLP dimension | 7040 |
| Context size | 4096 |
| Theta RoPE | 100,000 |

Tips:
- Helium can be found on the Hugging Face Hub.
- In the following, we demonstrate how to use helium-1-preview for inference.
```python
>>> from transformers import AutoModelForCausalLM, AutoTokenizer

>>> device = "cuda"
>>> model = AutoModelForCausalLM.from_pretrained("kyutai/helium-1-preview-2b", device_map="auto")
>>> tokenizer = AutoTokenizer.from_pretrained("kyutai/helium-1-preview-2b")

>>> prompt = "Give me a short introduction to large language model."
>>> model_inputs = tokenizer(prompt, return_tensors="pt").to(device)

>>> generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512, do_sample=True)
>>> generated_ids = [
...     output_ids[len(input_ids):]
...     for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
... ]
>>> response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

HeliumConfig

class transformers.HeliumConfig
```python
HeliumConfig(
    vocab_size=48000,
    hidden_size=2560,
    intermediate_size=7040,
    num_hidden_layers=24,
    num_attention_heads=20,
    num_key_value_heads=20,
    head_dim=128,
    hidden_act="silu",
    attention_dropout=0.0,
    max_position_embeddings=4096,
    initializer_range=0.02,
    rms_norm_eps=1e-08,
    use_cache=True,
    tie_word_embeddings=False,
    rope_theta=100000.0,
    pad_token_id=3,
    eos_token_id=2,
    bos_token_id=1,
    attention_bias=False,
    mlp_bias=False,
    **kwargs,
)
```
Parameters

- vocab_size (int, optional, defaults to 48000) — Vocabulary size of the Helium model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling HeliumModel.
- hidden_size (int, optional, defaults to 2560) — Dimension of the hidden representations.
- intermediate_size (int, optional, defaults to 7040) — Dimension of the MLP representations.
- num_hidden_layers (int, optional, defaults to 24) — Number of hidden layers in the Transformer decoder.
- num_attention_heads (int, optional, defaults to 20) — Number of attention heads for each attention layer in the Transformer decoder.
- num_key_value_heads (int, optional, defaults to 20) — The number of key_value heads that should be used to implement Grouped Query Attention (GQA). If num_key_value_heads=num_attention_heads, the model will use Multi Head Attention (MHA); if num_key_value_heads=1, the model will use Multi Query Attention (MQA); otherwise GQA is used. When converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed by mean-pooling all the original heads within that group (see the illustrative sketch below). For more details, check out this paper. If it is not specified, it will default to num_attention_heads.
- head_dim (int, optional, defaults to 128) — The attention head dimension.
- hidden_act (str or function, optional, defaults to "silu") — The legacy activation function. It is overwritten by the hidden_activation.
- attention_dropout (float, optional, defaults to 0.0) — The dropout ratio for the attention probabilities.
- max_position_embeddings (int, optional, defaults to 4096) — The maximum sequence length that this model might ever be used with.
- initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
- rms_norm_eps (float, optional, defaults to 1e-08) — The epsilon used by the rms normalization layers.
- use_cache (bool, optional, defaults to True) — Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.
- tie_word_embeddings (bool, optional, defaults to False) — Whether to tie weight embeddings.
- rope_theta (float, optional, defaults to 100000.0) — The base period of the RoPE embeddings.
- pad_token_id (int, optional, defaults to 3) — Padding token id.
- eos_token_id (int | list, optional, defaults to 2) — End of stream token id.
- bos_token_id (int, optional, defaults to 1) — Beginning of stream token id.
- attention_bias (bool, optional, defaults to False) — Whether to use a bias in the query, key, value and output projection layers during self-attention.
- mlp_bias (bool, optional, defaults to False) — Whether to use a bias in the up_proj, down_proj and gate_proj layers of the MLP.

This is the configuration class to store the configuration of a HeliumModel. It is used to instantiate a Helium model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the Helium 2b model, e.g. kyutai/helium-2b.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
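The mean-pooling conversion mentioned for num_key_value_heads can be illustrated with a short sketch. This is illustrative code only, not a library utility; the target number of key/value heads and the random weights are assumptions chosen for the example.

```python
import torch

# Hypothetical Helium-like projection sizes (matching the configuration above).
num_attention_heads = 20
num_key_value_heads = 4      # assumed GQA target, not a Helium default
head_dim = 128
hidden_size = 2560

# A multi-head k_proj weight: (num_attention_heads * head_dim, hidden_size).
k_proj = torch.randn(num_attention_heads * head_dim, hidden_size)

# Mean-pool the original heads within each group to build one key head per group.
group_size = num_attention_heads // num_key_value_heads
k_heads = k_proj.view(num_attention_heads, head_dim, hidden_size)
k_gqa = k_heads.view(num_key_value_heads, group_size, head_dim, hidden_size).mean(dim=1)
k_proj_gqa = k_gqa.reshape(num_key_value_heads * head_dim, hidden_size)  # (512, 2560)
```

The same pooling would be applied to the value projection; the query projection keeps all num_attention_heads heads.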
```python
>>> from transformers import HeliumModel, HeliumConfig

>>> # Initializing a Helium 2b style configuration
>>> configuration = HeliumConfig()

>>> # Initializing a model from the Helium 2b style configuration
>>> model = HeliumModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
```

HeliumModel

class transformers.HeliumModel
( config: HeliumConfig )
Parameters

- config (HeliumConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare Helium Model outputting raw hidden-states without any specific head on top. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
Transformer decoder consisting of config.num_hidden_layers layers. Each layer is a HeliumDecoderLayer.

forward

```python
forward(
    input_ids: typing.Optional[torch.LongTensor] = None,
    attention_mask: typing.Optional[torch.Tensor] = None,
    position_ids: typing.Optional[torch.LongTensor] = None,
    past_key_values: typing.Optional[transformers.cache_utils.Cache] = None,
    inputs_embeds: typing.Optional[torch.FloatTensor] = None,
    use_cache: typing.Optional[bool] = None,
    output_attentions: typing.Optional[bool] = None,
    output_hidden_states: typing.Optional[bool] = None,
    cache_position: typing.Optional[torch.LongTensor] = None,
    **flash_attn_kwargs: typing_extensions.Unpack[transformers.modeling_flash_attention_utils.FlashAttentionKwargs],
)
```
Parameters

- input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details. If past_key_values is used, optionally only the last input_ids have to be input (see past_key_values). If you want to change padding behavior, you should read modeling_opt._prepare_decoder_attention_mask and modify to your needs. See diagram 1 in the paper for more information on the default strategy.
- position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].
- past_key_values (Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True. It is a Cache instance. For more details, see our kv cache guide. If past_key_values are used, the user can optionally input only the last input_ids (those that don't have their past key value states given to this model) of shape (batch_size, 1) instead of all input_ids of shape (batch_size, sequence_length).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
- use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
- output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
- output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
- return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
- cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrarily to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.

The HeliumModel forward method overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
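A minimal sketch of running the bare HeliumModel to obtain hidden states. It assumes the kyutai/helium-1-preview-2b checkpoint from the usage section above; loading that checkpoint into the bare backbone simply discards the language-modeling head.

```python
>>> from transformers import AutoTokenizer, HeliumModel
>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("kyutai/helium-1-preview-2b")
>>> model = HeliumModel.from_pretrained("kyutai/helium-1-preview-2b")

>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> # Final-layer hidden states of shape (batch_size, sequence_length, hidden_size)
>>> last_hidden_state = outputs.last_hidden_state
```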
HeliumForCausalLM

class transformers.HeliumForCausalLM

( config: HeliumConfig )
forward

```python
forward(
    input_ids: typing.Optional[torch.LongTensor] = None,
    attention_mask: typing.Optional[torch.Tensor] = None,
    position_ids: typing.Optional[torch.LongTensor] = None,
    past_key_values: typing.Optional[transformers.cache_utils.Cache] = None,
    inputs_embeds: typing.Optional[torch.FloatTensor] = None,
    labels: typing.Optional[torch.LongTensor] = None,
    use_cache: typing.Optional[bool] = None,
    output_attentions: typing.Optional[bool] = None,
    output_hidden_states: typing.Optional[bool] = None,
    cache_position: typing.Optional[torch.LongTensor] = None,
    logits_to_keep: typing.Union[int, torch.Tensor] = 0,
    **kwargs: typing_extensions.Unpack[transformers.models.helium.modeling_helium.KwargsForCausalLM],
)
```

→ transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)
Parameters

- input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details. If past_key_values is used, optionally only the last input_ids have to be input (see past_key_values). If you want to change padding behavior, you should read modeling_opt._prepare_decoder_attention_mask and modify to your needs. See diagram 1 in the paper for more information on the default strategy.
- position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].
- past_key_values (Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True. It is a Cache instance. For more details, see our kv cache guide. If past_key_values are used, the user can optionally input only the last input_ids (those that don't have their past key value states given to this model) of shape (batch_size, 1) instead of all input_ids of shape (batch_size, sequence_length).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
- use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
- output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
- output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
- return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
- cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrarily to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
- labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the masked language modeling loss. Indices should either be in [0, ..., config.vocab_size] or -100 (see input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].
- logits_to_keep (int or torch.Tensor, optional) — If an int, compute logits for the last logits_to_keep tokens. If 0, calculate logits for all input_ids (special case). Only last token logits are needed for generation, and calculating them only for that token can save memory, which becomes quite significant for long sequences or a large vocabulary size. If a torch.Tensor, it must be 1D, corresponding to the indices to keep in the sequence length dimension. This is useful when using packed tensor format (a single dimension for batch and sequence length).

Returns

A transformers.modeling_outputs.CausalLMOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (HeliumConfig) and inputs.
- loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss (for next-token prediction).
- logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The HeliumForCausalLM forward method overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
Example:
```python
>>> from transformers import AutoTokenizer, HeliumForCausalLM

>>> model = HeliumForCausalLM.from_pretrained("kyutai/helium-1-preview-2b")
>>> tokenizer = AutoTokenizer.from_pretrained("kyutai/helium-1-preview-2b")

>>> prompt = "What is your favorite condiment?"
>>> inputs = tokenizer(prompt, return_tensors="pt")

>>> # Generate
>>> generate_ids = model.generate(inputs.input_ids, max_length=30)
>>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
"What is your favorite condiment?"
```

HeliumForSequenceClassification

class transformers.HeliumForSequenceClassification
( config: HeliumConfig )
Parameters

- config (HeliumConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The Helium Model transformer with a sequence classification head on top (linear layer).
HeliumForSequenceClassification uses the last token in order to do the classification, as other causal models (e.g. GPT-2) do.
Since it does classification on the last token, it needs to know the position of the last token. If a pad_token_id is defined in the configuration, it finds the last token that is not a padding token in each row. If no pad_token_id is defined, it simply takes the last value in each row of the batch. Since it cannot guess the padding tokens when inputs_embeds are passed instead of input_ids, it does the same (takes the last value in each row of the batch).
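The pad-aware selection described above can be illustrated with a short, self-contained sketch. This is illustrative code, not the library implementation; the example ids are made up, and 3 is Helium's default pad_token_id.

```python
import torch

pad_token_id = 3  # Helium's default padding token id
input_ids = torch.tensor([[7, 8, 9, 3, 3],
                          [5, 6, 3, 3, 3]])

# Running count of non-pad tokens; argmax returns the first index of the maximum,
# i.e. the position of the last non-padding token in each row.
non_pad = (input_ids != pad_token_id).int()
last_non_pad = non_pad.cumsum(dim=-1).argmax(dim=-1)
print(last_non_pad)  # tensor([2, 1]) -> classification logits are read at these positions
```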
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward

```python
forward(
    input_ids: typing.Optional[torch.LongTensor] = None,
    attention_mask: typing.Optional[torch.Tensor] = None,
    position_ids: typing.Optional[torch.LongTensor] = None,
    past_key_values: typing.Optional[transformers.cache_utils.Cache] = None,
    inputs_embeds: typing.Optional[torch.FloatTensor] = None,
    labels: typing.Optional[torch.LongTensor] = None,
    use_cache: typing.Optional[bool] = None,
    output_attentions: typing.Optional[bool] = None,
    output_hidden_states: typing.Optional[bool] = None,
)
```
Parameters

- input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details. If past_key_values is used, optionally only the last input_ids have to be input (see past_key_values). If you want to change padding behavior, you should read modeling_opt._prepare_decoder_attention_mask and modify to your needs. See diagram 1 in the paper for more information on the default strategy.
- position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].
- past_key_values (Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True. It is a Cache instance. For more details, see our kv cache guide. If past_key_values are used, the user can optionally input only the last input_ids (those that don't have their past key value states given to this model) of shape (batch_size, 1) instead of all input_ids of shape (batch_size, sequence_length).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
- use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
- output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
- output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
- return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
- cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrarily to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
- labels (torch.LongTensor of shape (batch_size,), optional) — Labels for computing the sequence classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1 a regression loss is computed (Mean-Square loss); if config.num_labels > 1 a classification loss is computed (Cross-Entropy).

The HeliumForSequenceClassification forward method overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
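A minimal sketch of single-label classification with this head. The checkpoint name and num_labels are illustrative assumptions; the base checkpoint ships no fine-tuned classification head, so the score layer would be randomly initialized and the prediction is only meant to show the API.

```python
>>> from transformers import AutoTokenizer, HeliumForSequenceClassification
>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("kyutai/helium-1-preview-2b")
>>> model = HeliumForSequenceClassification.from_pretrained("kyutai/helium-1-preview-2b", num_labels=2)

>>> inputs = tokenizer("The movie was great!", return_tensors="pt")
>>> with torch.no_grad():
...     logits = model(**inputs).logits

>>> # Class with the highest score, read at the last non-padding token
>>> predicted_class_id = logits.argmax(dim=-1).item()
```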
HeliumForTokenClassification

class transformers.HeliumForTokenClassification

( config: HeliumConfig )

Parameters

- config (HeliumConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The Helium Model transformer with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward

```python
forward(
    input_ids: typing.Optional[torch.LongTensor] = None,
    attention_mask: typing.Optional[torch.Tensor] = None,
    position_ids: typing.Optional[torch.LongTensor] = None,
    past_key_values: typing.Optional[transformers.cache_utils.Cache] = None,
    inputs_embeds: typing.Optional[torch.FloatTensor] = None,
    labels: typing.Optional[torch.LongTensor] = None,
    use_cache: typing.Optional[bool] = None,
    output_attentions: typing.Optional[bool] = None,
    output_hidden_states: typing.Optional[bool] = None,
)
```

→ transformers.modeling_outputs.TokenClassifierOutput or tuple(torch.FloatTensor)
Parameters

- input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details. If past_key_values is used, optionally only the last input_ids have to be input (see past_key_values). If you want to change padding behavior, you should read modeling_opt._prepare_decoder_attention_mask and modify to your needs. See diagram 1 in the paper for more information on the default strategy.
- position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].
- past_key_values (Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True. It is a Cache instance. For more details, see our kv cache guide. If past_key_values are used, the user can optionally input only the last input_ids (those that don't have their past key value states given to this model) of shape (batch_size, 1) instead of all input_ids of shape (batch_size, sequence_length).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
- use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
- output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
- output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
- return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
- cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrarily to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
- labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the token classification loss. Indices should be in [0, ..., config.num_labels - 1].

Returns

A transformers.modeling_outputs.TokenClassifierOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (HeliumConfig) and inputs.
- loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification loss.
- logits (torch.FloatTensor of shape (batch_size, sequence_length, config.num_labels)) — Classification scores (before SoftMax).
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The HeliumForTokenClassification forward method overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
Example:
```python
>>> from transformers import AutoTokenizer, HeliumForTokenClassification
>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("kyutai/helium-1-preview-2b")
>>> model = HeliumForTokenClassification.from_pretrained("kyutai/helium-1-preview-2b")

>>> inputs = tokenizer(
...     "HuggingFace is a company based in Paris and New York", add_special_tokens=False, return_tensors="pt"
... )

>>> with torch.no_grad():
...     logits = model(**inputs).logits

>>> predicted_token_class_ids = logits.argmax(-1)

>>> # Note that tokens are classified rather than input words, which means that
>>> # there might be more predicted token classes than words.
>>> # Multiple token classes might account for the same word.
>>> predicted_tokens_classes = [model.config.id2label[t.item()] for t in predicted_token_class_ids[0]]

>>> labels = predicted_token_class_ids
>>> loss = model(**inputs, labels=labels).loss
```