This is a recently introduced model so the API hasn’t been tested extensively. There may be some bugs or slight breaking changes to fix it in the future. If you see something strange, file a Github Issue.
The GLPN model was proposed in Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth by Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim. GLPN combines SegFormer’s hierarchical mix-Transformer with a lightweight decoder for monocular depth estimation. The proposed decoder shows better performance than the previously proposed decoders, with considerably less computational complexity.
The abstract from the paper is the following:
Depth estimation from a single image is an important task that can be applied to various fields in computer vision, and has grown rapidly with the development of convolutional neural networks. In this paper, we propose a novel structure and training strategy for monocular depth estimation to further improve the prediction accuracy of the network. We deploy a hierarchical transformer encoder to capture and convey the global context, and design a lightweight yet powerful decoder to generate an estimated depth map while considering local connectivity. By constructing connected paths between multi-scale local features and the global decoding stream with our proposed selective feature fusion module, the network can integrate both representations and recover fine details. In addition, the proposed decoder shows better performance than the previously proposed decoders, with considerably less computational complexity. Furthermore, we improve the depth-specific augmentation method by utilizing an important observation in depth estimation to enhance the model. Our network achieves state-of-the-art performance over the challenging depth dataset NYU Depth V2. Extensive experiments have been conducted to validate and show the effectiveness of the proposed approach. Finally, our model shows better generalisation ability and robustness than other comparative models.
Summary of the approach. Taken from the original paper.This model was contributed by nielsr. The original code can be found here.
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with GLPN.
( num_channels = 3 num_encoder_blocks = 4 depths = [2, 2, 2, 2] sr_ratios = [8, 4, 2, 1] hidden_sizes = [32, 64, 160, 256] patch_sizes = [7, 3, 3, 3] strides = [4, 2, 2, 2] num_attention_heads = [1, 2, 5, 8] mlp_ratios = [4, 4, 4, 4] hidden_act = 'gelu' hidden_dropout_prob = 0.0 attention_probs_dropout_prob = 0.0 initializer_range = 0.02 drop_path_rate = 0.1 layer_norm_eps = 1e-06 decoder_hidden_size = 64 max_depth = 10 head_in_index = -1 **kwargs )
Parameters
int
, optional, defaults to 3) —
The number of input channels. int
, optional, defaults to 4) —
The number of encoder blocks (i.e. stages in the Mix Transformer encoder). List[int]
, optional, defaults to [2, 2, 2, 2]
) —
The number of layers in each encoder block. List[int]
, optional, defaults to [8, 4, 2, 1]
) —
Sequence reduction ratios in each encoder block. List[int]
, optional, defaults to [32, 64, 160, 256]
) —
Dimension of each of the encoder blocks. List[int]
, optional, defaults to [7, 3, 3, 3]
) —
Patch size before each encoder block. List[int]
, optional, defaults to [4, 2, 2, 2]
) —
Stride before each encoder block. List[int]
, optional, defaults to [1, 2, 5, 8]
) —
Number of attention heads for each attention layer in each block of the Transformer encoder. List[int]
, optional, defaults to [4, 4, 4, 4]
) —
Ratio of the size of the hidden layer compared to the size of the input layer of the Mix FFNs in the
encoder blocks. str
or function
, optional, defaults to "gelu"
) —
The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu"
,
"relu"
, "selu"
and "gelu_new"
are supported. float
, optional, defaults to 0.0) —
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. float
, optional, defaults to 0.0) —
The dropout ratio for the attention probabilities. float
, optional, defaults to 0.02) —
The standard deviation of the truncated_normal_initializer for initializing all weight matrices. float
, optional, defaults to 0.1) —
The dropout probability for stochastic depth, used in the blocks of the Transformer encoder. float
, optional, defaults to 1e-06) —
The epsilon used by the layer normalization layers. int
, optional, defaults to 64) —
The dimension of the decoder. int
, optional, defaults to 10) —
The maximum depth of the decoder. int
, optional, defaults to -1) —
The index of the features to use in the head. This is the configuration class to store the configuration of a GLPNModel. It is used to instantiate an GLPN model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the GLPN vinvino02/glpn-kitti architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
Example:
>>> from transformers import GLPNModel, GLPNConfig
>>> # Initializing a GLPN vinvino02/glpn-kitti style configuration
>>> configuration = GLPNConfig()
>>> # Initializing a model from the vinvino02/glpn-kitti style configuration
>>> model = GLPNModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
Preprocess an image or a batch of images.
( do_resize: bool = True size_divisor: int = 32 resample = <Resampling.BILINEAR: 2> do_rescale: bool = True **kwargs )
Parameters
bool
, optional, defaults to True
) —
Whether to resize the image’s (height, width) dimensions, rounding them down to the closest multiple of
size_divisor
. Can be overridden by do_resize
in preprocess
. int
, optional, defaults to 32) —
When do_resize
is True
, images are resized so their height and width are rounded down to the closest
multiple of size_divisor
. Can be overridden by size_divisor
in preprocess
. PIL.Image
resampling filter, optional, defaults to Resampling.BILINEAR
) —
Resampling filter to use if resizing the image. Can be overridden by resample
in preprocess
. bool
, optional, defaults to True
) —
Whether or not to apply the scaling factor (to make pixel values floats between 0. and 1.). Can be
overridden by do_rescale
in preprocess
. Constructs a GLPN image processor.
( images: Union do_resize: Optional = None size_divisor: Optional = None resample = None do_rescale: Optional = None return_tensors: Union = None data_format: ChannelDimension = <ChannelDimension.FIRST: 'channels_first'> input_data_format: Union = None **kwargs )
Parameters
PIL.Image.Image
or TensorType
or List[np.ndarray]
or List[TensorType]
) —
Images to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
passing in images with pixel values between 0 and 1, set do_normalize=False
. bool
, optional, defaults to self.do_resize
) —
Whether to resize the input such that the (height, width) dimensions are a multiple of size_divisor
. int
, optional, defaults to self.size_divisor
) —
When do_resize
is True
, images are resized so their height and width are rounded down to the
closest multiple of size_divisor
. PIL.Image
resampling filter, optional, defaults to self.resample
) —
PIL.Image
resampling filter to use if resizing the image e.g. PILImageResampling.BILINEAR
. Only has
an effect if do_resize
is set to True
. bool
, optional, defaults to self.do_rescale
) —
Whether or not to apply the scaling factor (to make pixel values floats between 0. and 1.). str
or TensorType
, optional) —
The type of tensors to return. Can be one of:None
: Return a list of np.ndarray
.TensorType.TENSORFLOW
or 'tf'
: Return a batch of type tf.Tensor
.TensorType.PYTORCH
or 'pt'
: Return a batch of type torch.Tensor
.TensorType.NUMPY
or 'np'
: Return a batch of type np.ndarray
.TensorType.JAX
or 'jax'
: Return a batch of type jax.numpy.ndarray
.ChannelDimension
or str
, optional, defaults to ChannelDimension.FIRST
) —
The channel dimension format for the output image. Can be one of:ChannelDimension.FIRST
: image in (num_channels, height, width) format.ChannelDimension.LAST
: image in (height, width, num_channels) format.ChannelDimension
or str
, optional) —
The channel dimension format for the input image. If unset, the channel dimension format is inferred
from the input image. Can be one of:"channels_first"
or ChannelDimension.FIRST
: image in (num_channels, height, width) format."channels_last"
or ChannelDimension.LAST
: image in (height, width, num_channels) format."none"
or ChannelDimension.NONE
: image in (height, width) format.Preprocess the given images.
( config )
Parameters
The bare GLPN encoder (Mix-Transformer) outputting raw hidden-states without any specific head on top. This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
( pixel_values: FloatTensor output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None ) → transformers.modeling_outputs.BaseModelOutput or tuple(torch.FloatTensor)
Parameters
torch.FloatTensor
of shape (batch_size, num_channels, height, width)
) —
Pixel values. Padding will be ignored by default should you provide it. Pixel values can be obtained using
AutoImageProcessor. See GLPNImageProcessor.call() for details. bool
, optional) —
Whether or not to return the attentions tensors of all attention layers. See attentions
under returned
tensors for more detail. bool
, optional) —
Whether or not to return the hidden states of all layers. See hidden_states
under returned tensors for
more detail. bool
, optional) —
Whether or not to return a ModelOutput instead of a plain tuple. Returns
transformers.modeling_outputs.BaseModelOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.BaseModelOutput or a tuple of
torch.FloatTensor
(if return_dict=False
is passed or when config.return_dict=False
) comprising various
elements depending on the configuration (GLPNConfig) and inputs.
last_hidden_state (torch.FloatTensor
of shape (batch_size, sequence_length, hidden_size)
) — Sequence of hidden-states at the output of the last layer of the model.
hidden_states (tuple(torch.FloatTensor)
, optional, returned when output_hidden_states=True
is passed or when config.output_hidden_states=True
) — Tuple of torch.FloatTensor
(one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape (batch_size, sequence_length, hidden_size)
.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor)
, optional, returned when output_attentions=True
is passed or when config.output_attentions=True
) — Tuple of torch.FloatTensor
(one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length)
.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The GLPNModel forward method, overrides the __call__
special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
Example:
>>> from transformers import AutoImageProcessor, GLPNModel
>>> import torch
>>> from datasets import load_dataset
>>> dataset = load_dataset("huggingface/cats-image")
>>> image = dataset["test"]["image"][0]
>>> image_processor = AutoImageProcessor.from_pretrained("vinvino02/glpn-kitti")
>>> model = GLPNModel.from_pretrained("vinvino02/glpn-kitti")
>>> inputs = image_processor(image, return_tensors="pt")
>>> with torch.no_grad():
... outputs = model(**inputs)
>>> last_hidden_states = outputs.last_hidden_state
>>> list(last_hidden_states.shape)
[1, 512, 15, 20]
( config )
Parameters
GLPN Model transformer with a lightweight depth estimation head on top e.g. for KITTI, NYUv2. This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
( pixel_values: FloatTensor labels: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None ) → transformers.modeling_outputs.DepthEstimatorOutput or tuple(torch.FloatTensor)
Parameters
torch.FloatTensor
of shape (batch_size, num_channels, height, width)
) —
Pixel values. Padding will be ignored by default should you provide it. Pixel values can be obtained using
AutoImageProcessor. See GLPNImageProcessor.call() for details. bool
, optional) —
Whether or not to return the attentions tensors of all attention layers. See attentions
under returned
tensors for more detail. bool
, optional) —
Whether or not to return the hidden states of all layers. See hidden_states
under returned tensors for
more detail. bool
, optional) —
Whether or not to return a ModelOutput instead of a plain tuple. torch.FloatTensor
of shape (batch_size, height, width)
, optional) —
Ground truth depth estimation maps for computing the loss. Returns
transformers.modeling_outputs.DepthEstimatorOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.DepthEstimatorOutput or a tuple of
torch.FloatTensor
(if return_dict=False
is passed or when config.return_dict=False
) comprising various
elements depending on the configuration (GLPNConfig) and inputs.
loss (torch.FloatTensor
of shape (1,)
, optional, returned when labels
is provided) — Classification (or regression if config.num_labels==1) loss.
predicted_depth (torch.FloatTensor
of shape (batch_size, height, width)
) — Predicted depth for each pixel.
hidden_states (tuple(torch.FloatTensor)
, optional, returned when output_hidden_states=True
is passed or when config.output_hidden_states=True
) — Tuple of torch.FloatTensor
(one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape (batch_size, num_channels, height, width)
.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor)
, optional, returned when output_attentions=True
is passed or when config.output_attentions=True
) — Tuple of torch.FloatTensor
(one for each layer) of shape (batch_size, num_heads, patch_size, sequence_length)
.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The GLPNForDepthEstimation forward method, overrides the __call__
special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
Examples:
>>> from transformers import AutoImageProcessor, GLPNForDepthEstimation
>>> import torch
>>> import numpy as np
>>> from PIL import Image
>>> import requests
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> image_processor = AutoImageProcessor.from_pretrained("vinvino02/glpn-kitti")
>>> model = GLPNForDepthEstimation.from_pretrained("vinvino02/glpn-kitti")
>>> # prepare image for the model
>>> inputs = image_processor(images=image, return_tensors="pt")
>>> with torch.no_grad():
... outputs = model(**inputs)
... predicted_depth = outputs.predicted_depth
>>> # interpolate to original size
>>> prediction = torch.nn.functional.interpolate(
... predicted_depth.unsqueeze(1),
... size=image.size[::-1],
... mode="bicubic",
... align_corners=False,
... )
>>> # visualize the prediction
>>> output = prediction.squeeze().cpu().numpy()
>>> formatted = (output * 255 / np.max(output)).astype("uint8")
>>> depth = Image.fromarray(formatted)