The Fuyu model was created by ADEPT, and authored by Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, Sağnak Taşırlar.

The authors introduced Fuyu-8B, a decoder-only multimodal model based on the classic Transformer architecture, with query and key normalization. A linear encoder is added to create multimodal embeddings from image inputs.

Image tokens are treated like text tokens, and a special image-newline character tells the model where each row of image patches ends. Image positional embeddings are removed, which avoids the need for separate training phases at different image resolutions. With 8 billion parameters and a CC-BY-NC license, Fuyu-8B is notable for handling both text and images, for its 16K context size, and for its overall performance.
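To make the image-newline idea concrete, here is a minimal sketch that counts how many sequence positions an image would occupy. It assumes the image is padded up to a multiple of the patch size and that exactly one newline token follows each row of patches; the helper name `count_image_tokens` is hypothetical and not part of the library.

```python
import math


def count_image_tokens(height: int, width: int, patch_size: int = 30) -> dict:
    """Count patch and newline tokens for an image cut into square patches.

    Assumption for this sketch: the image is padded up to a multiple of
    patch_size, and one image-newline token closes every row of patches.
    """
    rows = math.ceil(height / patch_size)
    cols = math.ceil(width / patch_size)
    return {
        "rows": rows,
        "cols": cols,
        "patch_tokens": rows * cols,
        "newline_tokens": rows,
        "total_image_tokens": rows * cols + rows,
    }


full = count_image_tokens(1080, 1920)
small = count_image_tokens(300, 300)
print(full["total_image_tokens"])   # 36 rows * 64 cols + 36 newlines = 2340
print(small["total_image_tokens"])  # 10 rows * 10 cols + 10 newlines = 110
```

Because the row boundaries are carried by tokens in the sequence, the same counting rule applies to any resolution without a separate positional scheme for images.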

Inputs must be passed through a dedicated processor to get the correct format. A processor requires an image_processor and a tokenizer, which FuyuProcessor wraps into a single object.

FuyuConfig

class transformers.FuyuConfig

( vocab_size = 262144 hidden_size = 4096 intermediate_size = 16384 num_hidden_layers = 36 num_attention_heads = 64 hidden_act = 'relu2' max_position_embeddings = 16384 image_size = 300 patch_size = 30 num_channels = 3 initializer_range = 0.02 layer_norm_eps = 1e-05 use_cache = True tie_word_embeddings = False rope_theta = 25000.0 rope_scaling = None qk_layernorm = True hidden_dropout = 0.0 attention_dropout = 0.0 partial_rotary_factor = 0.5 pad_token_id = None bos_token_id = 1 eos_token_id = 2 text_config = None **kwargs )

Parameters

vocab_size (int, optional, defaults to 262144) — Vocabulary size of the Fuyu model. Defines the number of different tokens that can be represented by the input_ids passed when calling FuyuForCausalLM.
hidden_size (int, optional, defaults to 4096) — Dimension of the hidden representations.
intermediate_size (int, optional, defaults to 16384) — Dimension of the MLP representations.
num_hidden_layers (int, optional, defaults to 36) — Number of hidden layers in the Transformer decoder.
num_attention_heads (int, optional, defaults to 64) — Number of attention heads for each attention layer in the Transformer decoder.
hidden_act (str or function, optional, defaults to "relu2") — The non-linear activation function (function or string) in the decoder.
max_position_embeddings (int, optional, defaults to 16384) — The maximum sequence length that this model might ever be used with.
image_size (int, optional, defaults to 300) — The input image size.
patch_size (int, optional, defaults to 30) — The input vision transformer encoding patch size.
num_channels (int, optional, defaults to 3) — The input image number of channels.
initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (float, optional, defaults to 1e-05) — The epsilon used by the rms normalization layers.
use_cache (bool, optional, defaults to True) — Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.
tie_word_embeddings (bool, optional, defaults to False) — Whether to tie input and output embeddings.
rope_theta (float, optional, defaults to 25000.0) — The base period of the RoPE embeddings.
rope_scaling (Dict, optional) — Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format is {"type": strategy name, "factor": scaling factor}. When using this flag, don’t update max_position_embeddings to the expected new maximum. See the following thread for more information on how these scaling strategies behave: https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/. This is an experimental feature, subject to breaking API changes in future versions.
qk_layernorm (bool, optional, defaults to True) — Whether or not to normalize the queries and keys after projecting the hidden states.
hidden_dropout (float, optional, defaults to 0.0) — The dropout ratio after applying the MLP to the hidden states.
attention_dropout (float, optional, defaults to 0.0) — The dropout ratio after computing the attention scores.
partial_rotary_factor (float, optional, defaults to 0.5) — Fraction of the query and key dimensions that will have rotary embeddings applied.
pad_token_id (int, optional) — The id of the padding token.
bos_token_id (int, optional, defaults to 1) — The id of the beginning-of-sequence token.
eos_token_id (Union[int, List[int]], optional, defaults to 2) — The id of the end-of-sequence token. Optionally, use a list to set multiple end-of-sequence tokens.
text_config (dict, optional) — Dictionary of configuration options used to initialize the language model.
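The defaults above imply a few derived sizes that are useful when reading tensor shapes. The sketch below computes them from a plain dictionary of the listed defaults; it assumes the per-head dimension is hidden_size divided by num_attention_heads and that partial_rotary_factor applies to that per-head dimension, which is the usual convention but is an assumption here rather than something this page states.

```python
# Defaults copied from the FuyuConfig signature above.
defaults = {
    "hidden_size": 4096,
    "num_attention_heads": 64,
    "partial_rotary_factor": 0.5,
    "image_size": 300,
    "patch_size": 30,
    "num_channels": 3,
}

# Assumed convention: each head gets an equal slice of the hidden size.
head_dim = defaults["hidden_size"] // defaults["num_attention_heads"]

# Assumed convention: only this many dimensions per head receive rotary embeddings.
rotary_dim = int(defaults["partial_rotary_factor"] * head_dim)

# Number of patches along one side of a square image_size x image_size input.
patches_per_side = defaults["image_size"] // defaults["patch_size"]

# Length of one flattened patch: patch_size * patch_size * num_channels.
flat_patch_dim = defaults["patch_size"] ** 2 * defaults["num_channels"]

print(head_dim, rotary_dim, patches_per_side, flat_patch_dim)  # 64 32 10 2700
```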

This is the configuration class to store the configuration of a FuyuForCausalLM. It is used to instantiate a Fuyu model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of adept/fuyu-8b.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

>>> from transformers import FuyuConfig

>>> # Initializing a Fuyu fuyu-8b style configuration
>>> configuration = FuyuConfig()
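The rope_scaling format described in the parameter list can be checked before a configuration is built. The following is a minimal sketch of such a check; `validate_rope_scaling` is a hypothetical helper written for illustration, and the library's own validation may differ in details and error messages.

```python
def validate_rope_scaling(rope_scaling):
    """Check the {"type": ..., "factor": ...} format described for rope_scaling.

    Hypothetical helper: it encodes only the rules stated in the parameter
    description (two keys, type in {linear, dynamic}, float factor > 1).
    """
    if rope_scaling is None:
        return  # No scaling requested.
    if not isinstance(rope_scaling, dict) or set(rope_scaling) != {"type", "factor"}:
        raise ValueError("rope_scaling must be a dict with exactly 'type' and 'factor'")
    if rope_scaling["type"] not in ("linear", "dynamic"):
        raise ValueError("rope_scaling 'type' must be 'linear' or 'dynamic'")
    factor = rope_scaling["factor"]
    if not isinstance(factor, float) or factor <= 1.0:
        raise ValueError("rope_scaling 'factor' must be a float greater than 1")


validate_rope_scaling(None)
validate_rope_scaling({"type": "linear", "factor": 2.0})
```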

FuyuForCausalLM

class transformers.FuyuForCausalLM

( config: FuyuConfig )

Parameters

config (FuyuConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

Fuyu Model with a language modeling head on top for causal language modeling conditioned on image patches and text. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

( input_ids: LongTensor = None image_patches: Tensor = None image_patches_indices: Tensor = None attention_mask: Optional = None position_ids: Optional = None past_key_values: Optional = None inputs_embeds: Optional = None use_cache: Optional = None labels: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None ) → transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.

Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

What are input IDs?
attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
What are attention masks?

If past_key_values is used, optionally only the last input_ids have to be input (see past_key_values).

If you want to change padding behavior, you should read modeling_opt._prepare_decoder_attention_mask and modify to your needs. See diagram 1 in the paper for more information on the default strategy.
image_patches (torch.FloatTensor of shape (batch_size, num_total_patches, patch_size x patch_size x num_channels), optional) — Image patches to be used as continuous embeddings. The patches are flattened and then projected to the hidden size of the model.
image_patches_indices (torch.LongTensor of shape (batch_size, num_total_patches + number_of_newline_tokens + number_of_text_tokens), optional) — Indices indicating at which position the image_patches have to be inserted in inputs_embeds.
position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1].

What are position IDs?
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.num_hidden_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head).

Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

If past_key_values are used, the user can optionally input only the last input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, 1) instead of all input_ids of shape (batch_size, sequence_length).
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the language modeling loss. Indices should either be in [0, ..., config.vocab_size] or -100 (see input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].
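The relationship between image_patches, image_patches_indices and inputs_embeds can be illustrated with plain lists. This is a sketch under an assumed convention, namely that a negative index marks a position that keeps its ordinary token embedding while a non-negative index selects the patch embedding to place there; `scatter_patch_embeddings` is a hypothetical name, not a library function.

```python
def scatter_patch_embeddings(token_embeds, patch_embeds, patch_indices):
    """Place patch embeddings into a sequence of token embeddings.

    Assumed convention for this sketch: patch_indices[i] >= 0 means
    "position i receives patch_embeds[patch_indices[i]]", and a negative
    value means "position i keeps its token embedding".
    """
    merged = [list(vec) for vec in token_embeds]  # copy, leave inputs untouched
    for position, patch_index in enumerate(patch_indices):
        if patch_index >= 0:
            merged[position] = list(patch_embeds[patch_index])
    return merged


# Four sequence positions with hidden size 2: two image patches, then two text tokens.
token_embeds = [[0.0, 0.0], [0.0, 0.0], [0.5, 0.5], [9.0, 9.0]]
patch_embeds = [[1.0, 1.5], [2.0, 2.5]]
patch_indices = [0, 1, -1, -1]

merged = scatter_patch_embeddings(token_embeds, patch_embeds, patch_indices)
print(merged)  # [[1.0, 1.5], [2.0, 2.5], [0.5, 0.5], [9.0, 9.0]]
```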

Returns

transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)

A transformers.modeling_outputs.CausalLMOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (FuyuConfig) and inputs.

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss (for next-token prediction).
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.num_hidden_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head).

Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
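To show how labels set to -100 drop out of the loss, here is a minimal pure-Python sketch of a mean cross-entropy that skips ignored positions. It assumes the labels are already aligned with the logits (any shifting for next-token prediction is taken as done), and it returns 0.0 when no position is supervised, which is a choice made for this sketch only.

```python
import math

IGNORE_INDEX = -100


def next_token_loss(logits, labels):
    """Mean cross-entropy over positions whose label is not IGNORE_INDEX.

    logits: list of per-position score lists, one score per vocabulary id.
    labels: list of target ids, assumed already aligned with logits.
    """
    total, count = 0.0, 0
    for scores, target in zip(logits, labels):
        if target == IGNORE_INDEX:
            continue  # ignored positions contribute nothing to the loss
        peak = max(scores)
        # Numerically stable log-sum-exp of the scores.
        log_z = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_z - scores[target]
        count += 1
    return total / count if count else 0.0


logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
labels = [-100, 1, 2]  # the first position (e.g. an image token) is ignored
loss = next_token_loss(logits, labels)
print(loss)
```

Changing the scores at the ignored position leaves the result unchanged, which is the behavior the labels description refers to.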

The FuyuForCausalLM forward method overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Examples:

>>> from transformers import FuyuProcessor, FuyuForCausalLM
>>> from PIL import Image
>>> import requests

>>> processor = FuyuProcessor.from_pretrained("adept/fuyu-8b")
>>> model = FuyuForCausalLM.from_pretrained("adept/fuyu-8b")

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> prompt = "Generate a coco-style caption.\n"

>>> inputs = processor(text=prompt, images=image, return_tensors="pt")
>>> outputs = model(**inputs)

>>> generated_ids = model.generate(**inputs, max_new_tokens=7)
>>> # Keep only the newly generated tokens, dropping the prompt and image positions
>>> new_ids = generated_ids[:, inputs["input_ids"].shape[1]:]
>>> generation_text = processor.batch_decode(new_ids, skip_special_tokens=True)
>>> print(generation_text[0])

FuyuImageProcessor

class transformers.FuyuImageProcessor

( do_resize: bool = True size: Optional = None resample: Resampling = <Resampling.BILINEAR: 2> do_pad: bool = True padding_value: float = 1.0 padding_mode: str = 'constant' do_normalize: bool = True image_mean: Union = 0.5 image_std: Union = 0.5 do_rescale: bool = True rescale_factor: float = 0.00392156862745098 patch_size: Optional = None **kwargs )

Parameters

do_resize (bool, optional, defaults to True) — Whether to resize the image to size.
size (Dict[str, int], optional, defaults to {"height": 1080, "width": 1920}) — Dictionary in the format {"height": int, "width": int} specifying the size of the output image.
resample (PILImageResampling, optional, defaults to Resampling.BILINEAR) — PILImageResampling filter to use when resizing the image e.g. PILImageResampling.BILINEAR.
do_pad (bool, optional, defaults to True) — Whether to pad the image to size.
padding_value (float, optional, defaults to 1.0) — The value to pad the image with.
padding_mode (str, optional, defaults to "constant") — The padding mode to use when padding the image.
do_normalize (bool, optional, defaults to True) — Whether to normalize the image.
image_mean (float, optional, defaults to 0.5) — The mean to use when normalizing the image.
image_std (float, optional, defaults to 0.5) — The standard deviation to use when normalizing the image.
do_rescale (bool, optional, defaults to True) — Whether to rescale the image.
rescale_factor (float, optional, defaults to 1 / 255) — The factor to use when rescaling the image.
patch_size (Dict[str, int], optional, defaults to {"height": 30, "width": 30}) — Dictionary in the format {"height": int, "width": int} specifying the size of the patches.
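With the defaults listed above, rescaling followed by normalization maps raw 8-bit pixel values into roughly the range [-1, 1]. The sketch below works through that arithmetic for a single value; it assumes rescaling is applied before normalization, and the function name is hypothetical.

```python
def rescale_and_normalize(pixel, rescale_factor=1 / 255, mean=0.5, std=0.5):
    """Apply rescale then normalize to one pixel value.

    Assumed order for this sketch: multiply by rescale_factor first,
    then subtract the mean and divide by the standard deviation.
    """
    return (pixel * rescale_factor - mean) / std


# Black maps to about -1, mid-gray to about 0, white to about +1.
for raw in (0, 127.5, 255):
    print(raw, round(rescale_and_normalize(raw), 6))
```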

This class should handle the image processing part before the main FuyuForCausalLM. In particular, it should handle:

Processing Images: Takes a batch of images as input. If the images are variable-sized, it resizes them based on the desired patch dimensions. The image output always has (img_h, img_w) of (1080, 1920). It then splits these images into patches using the patchify_image function.
Creating Image Input IDs: For each patch, a placeholder ID is given to identify where these patches belong in a token sequence. For variable-sized images, each line of patches is terminated with a newline ID.
Image Patch Indices: For each image patch, the code maintains an index where these patches should be inserted in a token stream.
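The placeholder IDs, newline IDs and patch indices described above can be sketched for a small grid. The ids used here are made up for illustration, and marking non-patch positions with -1 is an assumed convention for this sketch; `build_image_token_layout` is a hypothetical helper, not the image processor's actual code.

```python
def build_image_token_layout(rows, cols, placeholder_id, newline_id):
    """Return (image_input_ids, image_patch_indices) for one image.

    placeholder_id and newline_id are hypothetical ids chosen by the caller.
    Patch positions get a running patch index; newline positions get -1,
    meaning "no patch embedding is inserted here" in this sketch.
    """
    input_ids, patch_indices = [], []
    next_patch = 0
    for _ in range(rows):
        for _ in range(cols):
            input_ids.append(placeholder_id)
            patch_indices.append(next_patch)
            next_patch += 1
        # Close the row of patches with a newline token.
        input_ids.append(newline_id)
        patch_indices.append(-1)
    return input_ids, patch_indices


ids, idx = build_image_token_layout(rows=2, cols=3, placeholder_id=9000, newline_id=9001)
print(ids)  # [9000, 9000, 9000, 9001, 9000, 9000, 9000, 9001]
print(idx)  # [0, 1, 2, -1, 3, 4, 5, -1]
```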

__call__

( images **kwargs )

Preprocess an image or a batch of images.

FuyuProcessor

class transformers.FuyuProcessor

( image_processor tokenizer )

Parameters

image_processor (FuyuImageProcessor) — The image processor is a required input.
tokenizer (LlamaTokenizerFast) — The tokenizer is a required input.

Constructs a Fuyu processor which wraps a Fuyu image processor and a Llama tokenizer into a single processor.

FuyuProcessor offers all the functionalities of FuyuImageProcessor and LlamaTokenizerFast. See __call__() and decode() for more information.

__call__

( text = None images = None add_special_tokens: bool = True return_attention_mask: bool = True padding: Union = False truncation: Union = None max_length: Optional = None stride: int = 0 pad_to_multiple_of: Optional = None return_overflowing_tokens: bool = False return_special_tokens_mask: bool = False return_offsets_mapping: bool = False return_token_type_ids: bool = False return_length: bool = False verbose: bool = True return_tensors: Union = None **kwargs ) → FuyuBatchEncoding

Parameters

text (str, List[str]) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
images (PIL.Image.Image, List[PIL.Image.Image]) — The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch tensor. In case of a NumPy array/PyTorch tensor, each image should be of shape (C, H, W), where C is the number of channels, and H and W are the image height and width.

Returns

FuyuBatchEncoding

A FuyuBatchEncoding with the following fields:

input_ids — Tensor of token ids to be fed to a model. Returned when text is not None.
image_patches — List of tensors of image patches. Returned when images is not None.
image_patches_indices — Tensor of indices where patch embeddings have to be inserted by the model.
attention_mask — List of indices specifying which tokens should be attended to by the model when return_attention_mask=True.

Main method to prepare one or several sequence(s) and image(s) for the model. This method forwards the text and kwargs arguments to LlamaTokenizerFast’s __call__() if text is not None to encode the text. To prepare the image(s), this method forwards the images and kwargs arguments to FuyuImageProcessor’s __call__() if images is not None. Please refer to the docstring of the above two methods for more information.
