

Granite Vision

Overview

The Granite Vision model is a variant of LLaVA-NeXT that pairs a Granite language model with a SigLIP visual encoder. It uses multiple concatenated vision hidden states as its image features, similar to VipLlava, and a larger set of image grid pinpoints than the original LLaVA-NeXT models to support additional aspect ratios.
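
Both design choices are visible on the checkpoint's configuration. A minimal sketch for inspecting them, assuming the checkpoint loads as a LlavaNextConfig (field names follow that class):

from transformers import LlavaNextConfig

config = LlavaNextConfig.from_pretrained("ibm-granite/granite-vision-3.1-2b-preview")
print(config.vision_feature_layer)   # the layers whose hidden states are concatenated
print(config.image_grid_pinpoints)   # the expanded set of supported grid resolutions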

Tips:

The model is loaded and used through the LLaVA-NeXT classes, so the usage and tips from LLaVA-NeXT apply here as well. Conversations rendered through the processor's chat template follow this format:

"<|user|>\nWhat’s shown in this image?\n<|assistant|>\nThis image shows a red stop sign.<|end_of_text|><|user|>\nDescribe the image in more details.\n<|assistant|>\n"

Sample inference:

from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

model_path = "ibm-granite/granite-vision-3.1-2b-preview"
processor = LlavaNextProcessor.from_pretrained(model_path)
model = LlavaNextForConditionalGeneration.from_pretrained(model_path).to("cuda")

url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": url},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    },
]

# The chat template fetches the image, renders the prompt,
# and tokenizes everything in a single call
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to("cuda")

output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
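
On memory-constrained GPUs the model can also be loaded in half precision. A minimal sketch; fp16 is an assumption here, pick whichever dtype the target hardware supports:

import torch
from transformers import LlavaNextForConditionalGeneration

model = LlavaNextForConditionalGeneration.from_pretrained(
    "ibm-granite/granite-vision-3.1-2b-preview",
    torch_dtype=torch.float16,  # assumption: fp16 inference is acceptable here
).to("cuda")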

This model was contributed by Alexander Brooks.

LlavaNextConfig

class transformers.LlavaNextConfig

( vision_config = None text_config = None image_token_index = 32000 projector_hidden_act = 'gelu' vision_feature_select_strategy = 'default' vision_feature_layer = -2 image_grid_pinpoints = None tie_word_embeddings = False image_seq_length = 576 multimodal_projector_bias = True **kwargs )


This is the configuration class to store the configuration of a LlavaNextForConditionalGeneration. It is used to instantiate a LLaVA-NeXT model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the llava-hf/llava-v1.6-mistral-7b-hf model.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Example:

>>> from transformers import LlavaNextForConditionalGeneration, LlavaNextConfig, CLIPVisionConfig, LlamaConfig

>>> # Initializing a CLIP-vision config
>>> vision_config = CLIPVisionConfig()

>>> # Initializing a Llama config
>>> text_config = LlamaConfig()

>>> # Initializing a Llava-Next llava-hf/llava-v1.6-mistral-7b-hf style configuration
>>> configuration = LlavaNextConfig(vision_config, text_config)

>>> # Initializing a model from the llava-hf/llava-v1.6-mistral-7b-hf style configuration
>>> model = LlavaNextForConditionalGeneration(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
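
Because Granite Vision concatenates several vision hidden states, vision_feature_layer also accepts a list of layer indices. A minimal sketch; the indices below are illustrative, not the checkpoint's actual values:

from transformers import LlavaNextConfig

# Hidden states from each listed layer are concatenated to form the image features
configuration = LlavaNextConfig(vision_feature_layer=[-24, -20, -12, -1])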
LlavaNextImageProcessor

class transformers.LlavaNextImageProcessor

( do_resize: bool = True size: typing.Dict[str, int] = None image_grid_pinpoints: typing.List = None resample: Resampling = <Resampling.BICUBIC: 3> do_center_crop: bool = True crop_size: typing.Dict[str, int] = None do_rescale: bool = True rescale_factor: typing.Union[int, float] = 0.00392156862745098 do_normalize: bool = True image_mean: typing.Union[float, typing.List[float], NoneType] = None image_std: typing.Union[float, typing.List[float], NoneType] = None do_pad: typing.Optional[bool] = True do_convert_rgb: bool = True **kwargs )


Constructs a LLaVA-NeXT image processor. Based on CLIPImageProcessor, it incorporates additional techniques for processing high-resolution images, as explained in the LLaVA paper.

preprocess

( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] do_resize: typing.Optional[bool] = None size: typing.Dict[str, int] = None image_grid_pinpoints: typing.List = None resample: Resampling = None do_center_crop: typing.Optional[bool] = None crop_size: typing.Optional[int] = None do_rescale: typing.Optional[bool] = None rescale_factor: typing.Optional[float] = None do_normalize: typing.Optional[bool] = None image_mean: typing.Union[float, typing.List[float], NoneType] = None image_std: typing.Union[float, typing.List[float], NoneType] = None do_pad: typing.Optional[bool] = None do_convert_rgb: typing.Optional[bool] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None data_format: typing.Optional[transformers.image_utils.ChannelDimension] = <ChannelDimension.FIRST: 'channels_first'> input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None )

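As a sketch of what preprocessing produces for a high-resolution image (the checkpoint name is borrowed from the example further down; output keys follow LlavaNextImageProcessor):

from PIL import Image
import requests
from transformers import LlavaNextImageProcessor

image_processor = LlavaNextImageProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")

url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = Image.open(requests.get(url, stream=True).raw)

outputs = image_processor(images=image, return_tensors="pt")
# The image is split into a base image plus grid patches, stacked along dim 1
print(outputs["pixel_values"].shape)  # (batch, num_patches, channels, height, width)
print(outputs["image_sizes"])         # original (height, width), used to pick a grid pinpoint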

LlavaNextProcessor

class transformers.LlavaNextProcessor

( image_processor = None tokenizer = None patch_size = None vision_feature_select_strategy = None chat_template = None image_token = '<image>' num_additional_image_tokens = 0 **kwargs )


Constructs a LLaVA-NeXT processor which wraps a LLaVA-NeXT image processor and a LLaMA tokenizer into a single processor.

LlavaNextProcessor offers all the functionalities of LlavaNextImageProcessor and LlamaTokenizerFast. See __call__() and decode() for more information.

batch_decode

This method forwards all its arguments to LlamaTokenizerFast’s batch_decode(). Please refer to the docstring of this method for more information.

decode

This method forwards all its arguments to LlamaTokenizerFast’s decode(). Please refer to the docstring of this method for more information.

LlavaNextForConditionalGeneration

class transformers.LlavaNextForConditionalGeneration

( config: LlavaNextConfig )


The LLaVA-NeXT model, which consists of a vision backbone and a language model. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

forward

( input_ids: typing.Optional[torch.LongTensor] = None pixel_values: typing.Optional[torch.FloatTensor] = None image_sizes: typing.Optional[torch.LongTensor] = None attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.LongTensor] = None past_key_values: typing.Optional[typing.List[torch.FloatTensor]] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None vision_feature_layer: typing.Union[int, typing.List[int], NoneType] = None vision_feature_select_strategy: typing.Optional[str] = None labels: typing.Optional[torch.LongTensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None cache_position: typing.Optional[torch.LongTensor] = None logits_to_keep: typing.Union[int, torch.Tensor] = 0 **lm_kwargs ) transformers.models.llava_next.modeling_llava_next.LlavaNextCausalLMOutputWithPast or tuple(torch.FloatTensor)


Returns

transformers.models.llava_next.modeling_llava_next.LlavaNextCausalLMOutputWithPast or tuple(torch.FloatTensor)

A transformers.models.llava_next.modeling_llava_next.LlavaNextCausalLMOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (LlavaNextConfig) and inputs.
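
In practice this means a call with the default settings returns an object exposing the usual causal-LM fields. A minimal sketch, assuming inputs prepared by the processor as in the examples on this page:

outputs = model(**inputs)                     # LlavaNextCausalLMOutputWithPast
next_token_logits = outputs.logits[:, -1, :]  # scores for the next token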

The LlavaNextForConditionalGeneration forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example:

>>> from PIL import Image
>>> import requests
>>> from transformers import AutoProcessor, LlavaNextForConditionalGeneration

>>> model = LlavaNextForConditionalGeneration.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
>>> processor = AutoProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")

>>> prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"
>>> url = "https://www.ilankelman.org/stopsigns/australia.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> inputs = processor(images=image, text=prompt, return_tensors="pt")

>>> # autoregressively complete prompt
>>> generate_ids = model.generate(**inputs, max_length=30)
>>> processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
"[INST]  \nWhat is shown in this image? [/INST] The image appears to be a radar chart, which is a type of multi-dimensional plot (...)"