The ViTMatte model was proposed in Boosting Image Matting with Pretrained Plain Vision Transformers by Jingfeng Yao, Xinggang Wang, Shusheng Yang, Baoyuan Wang. ViTMatte leverages plain Vision Transformers for the task of image matting, which is the process of accurately estimating the foreground object in images and videos.
The abstract from the paper is the following:
Recently, plain vision Transformers (ViTs) have shown impressive performance on various computer vision tasks, thanks to their strong modeling capacity and large-scale pretraining. However, they have not yet conquered the problem of image matting. We hypothesize that image matting could also be boosted by ViTs and present a new efficient and robust ViT-based matting system, named ViTMatte. Our method utilizes (i) a hybrid attention mechanism combined with a convolution neck to help ViTs achieve an excellent performance-computation trade-off in matting tasks. (ii) Additionally, we introduce the detail capture module, which just consists of simple lightweight convolutions to complement the detailed information required by matting. To the best of our knowledge, ViTMatte is the first work to unleash the potential of ViT on image matting with concise adaptation. It inherits many superior properties from ViT to matting, including various pretraining strategies, concise architecture design, and flexible inference strategies. We evaluate ViTMatte on Composition-1k and Distinctions-646, the most commonly used benchmark for image matting, our method achieves state-of-the-art performance and outperforms prior matting works by a large margin.
This model was contributed by nielsr. The original code can be found here.
ViTMatte high-level overview. Taken from the original paper.

Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with ViTMatte.
The model expects both the image and trimap (concatenated) as input. Use ViTMatteImageProcessor
for this purpose.
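As a quick illustration (a minimal sketch using the hustvl/vitmatte-small-composition-1k checkpoint with dummy inputs; the random image and all-zero trimap are placeholders), the processor stacks the RGB image and the single-channel trimap into a 4-channel pixel_values tensor:

>>> import numpy as np
>>> from PIL import Image
>>> from transformers import VitMatteImageProcessor

>>> processor = VitMatteImageProcessor.from_pretrained("hustvl/vitmatte-small-composition-1k")

>>> # dummy 640x480 RGB image and single-channel grayscale trimap:
>>> # 0 = definite background, 128 = unknown region, 255 = definite foreground
>>> image = Image.fromarray(np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8))
>>> trimap = Image.fromarray(np.zeros((480, 640), dtype=np.uint8))  # "L" (grayscale) mode

>>> inputs = processor(images=image, trimaps=trimap, return_tensors="pt")
>>> print(inputs.pixel_values.shape)  # 3 RGB channels + 1 trimap channel
torch.Size([1, 4, 480, 640])

Since 480 and 640 are already divisible by 32, no extra padding is added in this case.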
VitMatteConfig class transformers.VitMatteConfig < source >( backbone_config: PretrainedConfig = None backbone = None use_pretrained_backbone = False use_timm_backbone = False backbone_kwargs = None hidden_size: int = 384 batch_norm_eps: float = 1e-05 initializer_range: float = 0.02 convstream_hidden_sizes: typing.List[int] = [48, 96, 192] fusion_hidden_sizes: typing.List[int] = [256, 128, 64, 32] **kwargs )
Parameters

backbone_config (PretrainedConfig or dict, optional, defaults to VitDetConfig()) - The configuration of the backbone model.
backbone (str, optional) - Name of backbone to use when backbone_config is None. If use_pretrained_backbone is True, this will load the corresponding pretrained weights from the timm or transformers library. If use_pretrained_backbone is False, this loads the backbone's config and uses that to initialize the backbone with random weights.
use_pretrained_backbone (bool, optional, defaults to False) - Whether to use pretrained weights for the backbone.
use_timm_backbone (bool, optional, defaults to False) - Whether to load backbone from the timm library. If False, the backbone is loaded from the transformers library.
backbone_kwargs (dict, optional) - Keyword arguments to be passed to AutoBackbone when loading from a checkpoint, e.g. {'out_indices': (0, 1, 2, 3)}. Cannot be specified if backbone_config is set.
hidden_size (int, optional, defaults to 384) - The number of input channels of the decoder.
batch_norm_eps (float, optional, defaults to 1e-05) - The epsilon used by the batch norm layers.
initializer_range (float, optional, defaults to 0.02) - The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
convstream_hidden_sizes (List[int], optional, defaults to [48, 96, 192]) - The output channels of the ConvStream module.
fusion_hidden_sizes (List[int], optional, defaults to [256, 128, 64, 32]) - The output channels of the Fusion blocks.

This is the configuration class to store the configuration of VitMatteForImageMatting. It is used to instantiate a ViTMatte model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the ViTMatte hustvl/vitmatte-small-composition-1k architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
Example:
>>> from transformers import VitMatteConfig, VitMatteForImageMatting

>>> # Initializing a ViTMatte configuration
>>> configuration = VitMatteConfig()

>>> # Initializing a model (with random weights) from the configuration
>>> model = VitMatteForImageMatting(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
to_dict < source >( )

Serializes this instance to a Python dictionary. Overrides the default to_dict(). Returns: Dict[str, any]: Dictionary of all the attributes that make up this configuration instance.
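A minimal sketch of the serialization described above, using only the default configuration:

>>> from transformers import VitMatteConfig

>>> config = VitMatteConfig()
>>> config_dict = config.to_dict()
>>> # a few of the attributes documented above, with their default values
>>> print(config_dict["hidden_size"], config_dict["convstream_hidden_sizes"])
384 [48, 96, 192]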
VitMatteImageProcessor class transformers.VitMatteImageProcessor < source >( do_rescale: bool = True rescale_factor: typing.Union[int, float] = 0.00392156862745098 do_normalize: bool = True image_mean: typing.Union[float, typing.List[float], NoneType] = None image_std: typing.Union[float, typing.List[float], NoneType] = None do_pad: bool = True size_divisibility: int = 32 **kwargs )
Parameters

do_rescale (bool, optional, defaults to True) - Whether to rescale the image by the specified scale rescale_factor. Can be overridden by the do_rescale parameter in the preprocess method.
rescale_factor (int or float, optional, defaults to 1/255) - Scale factor to use if rescaling the image. Can be overridden by the rescale_factor parameter in the preprocess method.
do_normalize (bool, optional, defaults to True) - Whether to normalize the image. Can be overridden by the do_normalize parameter in the preprocess method.
image_mean (float or List[float], optional, defaults to IMAGENET_STANDARD_MEAN) - Mean to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the image_mean parameter in the preprocess method.
image_std (float or List[float], optional, defaults to IMAGENET_STANDARD_STD) - Standard deviation to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the image_std parameter in the preprocess method.
do_pad (bool, optional, defaults to True) - Whether to pad the image to make the width and height divisible by size_divisibility. Can be overridden by the do_pad parameter in the preprocess method.
size_divisibility (int, optional, defaults to 32) - The width and height of the image will be padded to be divisible by this number.

Constructs a ViTMatte image processor.
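For example, the defaults listed above can be overridden at construction time; the normalization statistics below are illustrative values, not the ones used by the released checkpoint:

>>> from transformers import VitMatteImageProcessor

>>> # disable padding and use custom normalization statistics
>>> image_processor = VitMatteImageProcessor(
...     do_pad=False,
...     image_mean=[0.5, 0.5, 0.5],
...     image_std=[0.5, 0.5, 0.5],
... )
>>> print(image_processor.do_pad, image_processor.size_divisibility)
False 32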
preprocess < source >( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] trimaps: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] do_rescale: typing.Optional[bool] = None rescale_factor: typing.Optional[float] = None do_normalize: typing.Optional[bool] = None image_mean: typing.Union[float, typing.List[float], NoneType] = None image_std: typing.Union[float, typing.List[float], NoneType] = None do_pad: typing.Optional[bool] = None size_divisibility: typing.Optional[int] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None data_format: typing.Union[str, transformers.image_utils.ChannelDimension] = <ChannelDimension.FIRST: 'channels_first'> input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None )
Parameters

images (ImageInput) - Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.
trimaps (ImageInput) - Trimap to preprocess.
do_rescale (bool, optional, defaults to self.do_rescale) - Whether to rescale the image values between [0 - 1].
rescale_factor (float, optional, defaults to self.rescale_factor) - Rescale factor to rescale the image by if do_rescale is set to True.
do_normalize (bool, optional, defaults to self.do_normalize) - Whether to normalize the image.
image_mean (float or List[float], optional, defaults to self.image_mean) - Image mean to use if do_normalize is set to True.
image_std (float or List[float], optional, defaults to self.image_std) - Image standard deviation to use if do_normalize is set to True.
do_pad (bool, optional, defaults to self.do_pad) - Whether to pad the image.
size_divisibility (int, optional, defaults to self.size_divisibility) - The size divisibility to pad the image to if do_pad is set to True.
return_tensors (str or TensorType, optional) - The type of tensors to return. Can be one of:
Unset: Return a list of np.ndarray.
TensorType.TENSORFLOW or 'tf': Return a batch of type tf.Tensor.
TensorType.PYTORCH or 'pt': Return a batch of type torch.Tensor.
TensorType.NUMPY or 'np': Return a batch of type np.ndarray.
TensorType.JAX or 'jax': Return a batch of type jax.numpy.ndarray.
data_format (ChannelDimension or str, optional, defaults to ChannelDimension.FIRST) - The channel dimension format for the output image. Can be one of:
"channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
"channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
input_data_format (ChannelDimension or str, optional) - The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
"channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
"channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
"none" or ChannelDimension.NONE: image in (height, width) format.

Preprocess an image or batch of images.
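To make the do_pad / size_divisibility behaviour concrete, here is a small sketch with dummy numpy inputs whose height (500) is not a multiple of 32, so the output is padded up to 512:

>>> import numpy as np
>>> from transformers import VitMatteImageProcessor

>>> processor = VitMatteImageProcessor.from_pretrained("hustvl/vitmatte-small-composition-1k")

>>> # dummy inputs: HWC uint8 image and a 2D uint8 trimap
>>> image = np.random.randint(0, 256, (500, 640, 3), dtype=np.uint8)
>>> trimap = np.random.randint(0, 256, (500, 640), dtype=np.uint8)

>>> inputs = processor.preprocess(images=image, trimaps=trimap, return_tensors="pt")
>>> print(inputs.pixel_values.shape)  # height padded from 500 to the next multiple of 32
torch.Size([1, 4, 512, 640])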
VitMatteForImageMatting class transformers.VitMatteForImageMatting < source >( config )
Parameters

config (VitMatteConfig) - Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

ViTMatte framework leveraging any vision backbone for image matting, e.g. on Composition-1k or Distinctions-646.
forward < source >( pixel_values: typing.Optional[torch.Tensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None labels: typing.Optional[torch.Tensor] = None return_dict: typing.Optional[bool] = None ) β transformers.models.vitmatte.modeling_vitmatte.ImageMattingOutput
or tuple(torch.FloatTensor)
Parameters

pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)) - Pixel values. Padding will be ignored by default should you provide it. Pixel values can be obtained using AutoImageProcessor. See VitMatteImageProcessor.__call__() for details.
output_attentions (bool, optional) - Whether or not to return the attentions tensors of all attention layers in case the backbone has them. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) - Whether or not to return the hidden states of all layers of the backbone. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) - Whether or not to return a ModelOutput instead of a plain tuple.
labels (torch.LongTensor of shape (batch_size, height, width), optional) - Ground truth image matting for computing the loss.

Returns

transformers.models.vitmatte.modeling_vitmatte.ImageMattingOutput or tuple(torch.FloatTensor)

A transformers.models.vitmatte.modeling_vitmatte.ImageMattingOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (VitMatteConfig) and inputs.

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) - Loss.
alphas (torch.FloatTensor of shape (batch_size, num_channels, height, width)) - Estimated alpha values.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) - Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each stage) of shape (batch_size, sequence_length, hidden_size). Hidden-states (also called feature maps) of the model at the output of each stage.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) - Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, patch_size, sequence_length). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The VitMatteForImageMatting forward method, overrides the __call__
special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
Examples:
>>> from transformers import VitMatteImageProcessor, VitMatteForImageMatting
>>> import torch
>>> from PIL import Image
>>> from huggingface_hub import hf_hub_download

>>> processor = VitMatteImageProcessor.from_pretrained("hustvl/vitmatte-small-composition-1k")
>>> model = VitMatteForImageMatting.from_pretrained("hustvl/vitmatte-small-composition-1k")

>>> filepath = hf_hub_download(
...     repo_id="hf-internal-testing/image-matting-fixtures", filename="image.png", repo_type="dataset"
... )
>>> image = Image.open(filepath).convert("RGB")
>>> filepath = hf_hub_download(
...     repo_id="hf-internal-testing/image-matting-fixtures", filename="trimap.png", repo_type="dataset"
... )
>>> trimap = Image.open(filepath).convert("L")

>>> # prepare image + trimap for the model
>>> inputs = processor(images=image, trimaps=trimap, return_tensors="pt")

>>> with torch.no_grad():
...     alphas = model(**inputs).alphas

>>> print(alphas.shape)
torch.Size([1, 1, 640, 960])
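The predicted alpha matte can then be used for standard alpha compositing. The following follow-up sketch is not part of the official example; the plain white background and the crop back to the input size are assumptions, and it continues from the variables defined above:

>>> import numpy as np

>>> # crop any padding back to the original image size and drop the batch/channel dims
>>> alpha = alphas[0, 0, : image.height, : image.width].numpy()

>>> # composite the foreground onto a white background: out = alpha * fg + (1 - alpha) * bg
>>> foreground = np.array(image).astype(np.float32) / 255.0
>>> background = np.ones_like(foreground)
>>> composite = alpha[..., None] * foreground + (1.0 - alpha[..., None]) * background
>>> composite_image = Image.fromarray((composite * 255).astype(np.uint8))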