The Fuyu model was created by ADEPT, and authored by Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, Sağnak Taşırlar.
The authors introduced Fuyu-8B, a decoder-only multimodal model based on the classic transformers architecture, with query and key normalization. A linear encoder is added to create multimodal embeddings from image inputs.
By treating image tokens like text tokens and using a special image-newline character, the model knows when an image line ends. Image positional embeddings are removed. This avoids the need for different training phases for various image resolutions. With 8 billion parameters and licensed under CC-BY-NC, Fuyu-8B is notable for its ability to handle both text and images, its impressive context size of 16K, and its overall performance.
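As a rough illustration of the image-newline idea, the sketch below counts how many sequence positions an image would occupy if every patch is one token and every row of patches ends with one image-newline token. The helper name `image_token_count` is hypothetical, and padding the image up to whole patches is an assumption rather than a description of the original code.

```python
import math


def image_token_count(height: int, width: int, patch_size: int = 30) -> int:
    """Positions an image occupies: one per patch plus one newline per row."""
    rows = math.ceil(height / patch_size)  # assumption: padded up to whole patches
    cols = math.ceil(width / patch_size)
    return rows * (cols + 1)  # each row of patches ends with one image-newline token


for height, width in [(300, 300), (1080, 1920)]:
    print(f"{height}x{width}: {image_token_count(height, width)} positions")
```

Because the row boundaries live in the sequence itself, a different resolution only changes the number of rows and columns, which fits with the removal of image positional embeddings.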
The Fuyu models were trained using bfloat16, but the original inference uses float16. The checkpoints uploaded on the hub use torch_dtype = 'float16', which will be used by the AutoModel API to cast the checkpoints from torch.float32 to torch.float16.
The dtype of the online weights is mostly irrelevant, unless you are using torch_dtype="auto" when initializing a model using model = AutoModelForCausalLM.from_pretrained("path", torch_dtype = "auto"). The reason is that the model will first be downloaded (using the dtype of the checkpoints online) and then cast to the default dtype of torch (torch.float32). Users should specify the torch_dtype they want; if they don’t, it will be torch.float32.
Fine-tuning the model in float16 is not recommended and is known to produce nan; as such, the model should be fine-tuned in bfloat16.
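The text does not say why float16 fine-tuning produces nan. One common cause, assumed here, is overflow from float16's narrow exponent range, which bfloat16 avoids at the cost of precision. The sketch below uses only the standard library; the helper names and the sample value 70000.0 are hypothetical, and bfloat16 is emulated by truncating a float32 rather than rounding it.

```python
import struct


def fits_float16(x: float) -> bool:
    """True if x can be packed as an IEEE half-precision float without overflow."""
    try:
        struct.pack("<e", x)  # "e" is the half-precision format code
        return True
    except OverflowError:
        return False


def to_bfloat16(x: float) -> float:
    """Emulate bfloat16 by keeping the top 16 bits of the float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


activation = 70000.0  # hypothetical large intermediate value
print(fits_float16(65504.0))    # largest finite float16 value
print(fits_float16(activation)) # too large for float16
print(to_bfloat16(activation))  # representable in bfloat16, with less precision
```

In this sketch the value survives in bfloat16 only approximately, which is the usual trade-off: a wider range in exchange for fewer mantissa bits.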
Tips:
To convert the model, clone the original repository, then get the checkpoints:
git clone https://github.com/persimmon-ai-labs/adept-inference
wget path/to/fuyu-8b-model-weights.tar
tar -xvf fuyu-8b-model-weights.tar
python src/transformers/models/fuyu/convert_fuyu_weights_to_hf.py --input_dir /path/to/downloaded/fuyu/weights/ --output_dir /output/path \
    --pt_model_path /path/to/fuyu_8b_release/iter_0001251/mp_rank_00/model_optim_rng.pt \
    --ada_lib_path /path/to/adept-inference
For the chat model:
wget https://axtkn4xl5cip.objectstorage.us-phoenix-1.oci.customer-oci.com/n/axtkn4xl5cip/b/adept-public-data/o/8b_chat_model_release.tar
tar -xvf 8b_chat_model_release.tar
Then, the model can be loaded via:
from transformers import FuyuForCausalLM
# from_pretrained is a classmethod, so call it on the class instead of first building a randomly initialized model
model = FuyuForCausalLM.from_pretrained('/output/path')
Inputs need to be passed through a specific Processor to have the correct formats. A processor requires an image_processor and a tokenizer. Hence, inputs can be loaded via:
import io

import requests
from PIL import Image
from transformers import AutoTokenizer
from transformers.models.fuyu.processing_fuyu import FuyuProcessor
from transformers.models.fuyu.image_processing_fuyu import FuyuImageProcessor

# Build the processor from its two required components
tokenizer = AutoTokenizer.from_pretrained('adept-hf-collab/fuyu-8b')
image_processor = FuyuImageProcessor()
processor = FuyuProcessor(image_processor=image_processor, tokenizer=tokenizer)

# The prompt ends with a newline character
text_prompt = "Generate a coco-style caption.\n"
bus_image_url = "https://huggingface.co/datasets/hf-internal-testing/fixtures-captioning/resolve/main/bus.png"
# Download the image and decode it from the response bytes
bus_image_pil = Image.open(io.BytesIO(requests.get(bus_image_url).content))
inputs_to_model = processor(text=text_prompt, images=bus_image_pil)
This model was contributed by Molbap. The original code can be found at https://github.com/persimmon-ai-labs/adept-inference.
Fuyu uses a sentencepiece-based tokenizer with a Unigram model. It supports byte fallback, which is only available in tokenizers==0.14.0 for the fast tokenizer. The LlamaTokenizer is used as it is a standard wrapper around sentencepiece.
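To show what byte fallback means in practice, here is a toy sketch: any character missing from the vocabulary is emitted as one token per UTF-8 byte instead of a single unknown token. This is a character-level simplification (a real Unigram tokenizer works on subword pieces), and the function name and toy vocabulary are hypothetical.

```python
def tokenize_with_byte_fallback(text: str, vocab: set[str]) -> list[str]:
    """Toy tokenizer: characters outside the vocabulary become UTF-8 byte tokens."""
    tokens = []
    for ch in text:
        if ch in vocab:
            tokens.append(ch)
        else:
            # One token per byte, written in the <0xNN> style used for byte pieces
            tokens.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return tokens


toy_vocab = {"c", "a", "f"}  # hypothetical vocabulary without "é"
pieces = tokenize_with_byte_fallback("café", toy_vocab)
print(pieces)
```

With byte fallback, any input string can be encoded and decoded without loss, at the cost of longer sequences for rare characters.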
The authors suggest using the following prompt for image captioning: f"Generate a coco-style caption.\n"
FuyuConfig( vocab_size = 262144 hidden_size = 4096 intermediate_size = 16384 num_hidden_layers = 36 num_attention_heads = 64 hidden_act = 'relu2' max_position_embeddings = 16384 image_size = 300 patch_size = 30 num_channels = 3 initializer_range = 0.02 layer_norm_eps = 1e-05 use_cache = True tie_word_embeddings = False rope_theta = 25000.0 rope_scaling = None qk_layernorm = True hidden_dropout = 0.0 attention_dropout = 0.0 partial_rotary_factor = 0.5 pad_token_id = None bos_token_id = 1 eos_token_id = 2 text_config = None **kwargs )
Parameters
vocab_size (int, optional, defaults to 262144): Vocabulary size of the Fuyu model. Defines the number of different tokens that can be represented by the input_ids passed when calling FuyuForCausalLM.
hidden_size (int, optional, defaults to 4096): Dimension of the hidden representations.
intermediate_size (int, optional, defaults to 16384): Dimension of the MLP representations.
num_hidden_layers (int, optional, defaults to 36): Number of hidden layers in the Transformer decoder.
num_attention_heads (int, optional, defaults to 64): Number of attention heads for each attention layer in the Transformer decoder.
hidden_act (str or function, optional, defaults to "relu2"): The non-linear activation function (function or string) in the decoder.
max_position_embeddings (int, optional, defaults to 16384): The maximum sequence length that this model might ever be used with.
image_size (int, optional, defaults to 300): The input image size.
patch_size (int, optional, defaults to 30): The input vision transformer encoding patch size.
num_channels (int, optional, defaults to 3): The input image number of channels.
initializer_range (float, optional, defaults to 0.02): The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (float, optional, defaults to 1e-05): The epsilon used by the rms normalization layers.
use_cache (bool, optional, defaults to True): Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.
tie_word_embeddings (bool, optional, defaults to False): Whether to tie input and output embeddings.
rope_theta (float, optional, defaults to 25000.0): The base period of the RoPE embeddings.
rope_scaling (Dict, optional): Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format is {"type": strategy name, "factor": scaling factor}. When using this flag, don’t update max_position_embeddings to the expected new maximum. See the following thread for more information on how these scaling strategies behave: https://www.reddit.com/r/LocalFuyu/comments/14mrgpr/dynamically_scaled_rope_further_increases/. This is an experimental feature, subject to breaking API changes in future versions.
qk_layernorm (bool, optional, defaults to True): Whether or not to normalize the queries and keys after projecting the hidden states.
hidden_dropout (float, optional, defaults to 0.0): The dropout ratio after applying the MLP to the hidden states.
attention_dropout (float, optional, defaults to 0.0): The dropout ratio after computing the attention scores.
partial_rotary_factor (float, optional, defaults to 0.5): Fraction of the query and key dimensions that will have rotary embeddings applied.
pad_token_id (int, optional): The id of the padding token.
bos_token_id (int, optional, defaults to 1): The id of the beginning-of-sequence token.
eos_token_id (Union[int, List[int]], optional, defaults to 2): The id of the end-of-sequence token. Optionally, use a list to set multiple end-of-sequence tokens.
text_config (dict, optional): Dictionary of configuration options used to initialize the language model.
This is the configuration class to store the configuration of a FuyuForCausalLM. It is used to instantiate a Fuyu model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the adept/fuyu-8b.
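A few quantities follow directly from the documented defaults. The sketch below derives them with plain arithmetic; `ToyFuyuConfig` is a hypothetical stand-in rather than the real configuration class, and the formulas (head size as hidden size divided by heads, rotary size as a fraction of the head size) are the usual conventions, assumed here.

```python
from dataclasses import dataclass


@dataclass
class ToyFuyuConfig:
    """Hypothetical stand-in holding a few of the documented defaults."""
    hidden_size: int = 4096
    num_attention_heads: int = 64
    partial_rotary_factor: float = 0.5
    image_size: int = 300
    patch_size: int = 30
    num_channels: int = 3


cfg = ToyFuyuConfig()
head_dim = cfg.hidden_size // cfg.num_attention_heads      # size of one attention head
rotary_dim = int(cfg.partial_rotary_factor * head_dim)     # dims that get rotary embeddings
patches_per_side = cfg.image_size // cfg.patch_size        # patches along one image edge
patch_vector_len = cfg.patch_size ** 2 * cfg.num_channels  # length of one flattened patch

print(head_dim, rotary_dim, patches_per_side, patch_vector_len)
```

The last value is the length of the flattened patch vector that the linear encoder projects to the hidden size.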
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
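The "linear" strategy accepted by rope_scaling can be pictured as dividing each position by the scaling factor before computing the rotary angles. The following is a minimal sketch of that idea under the standard RoPE angle formula, not the library's implementation; the function name and the choice of dim=32 are assumptions, and the "dynamic" strategy is not shown.

```python
def rope_angles(position: int, dim: int, theta: float = 25000.0, factor: float = 1.0) -> list[float]:
    """Rotation angles for one position; linear scaling divides the position by `factor`."""
    scaled_position = position / factor
    # One angle per pair of dimensions, with frequencies decaying geometrically
    return [scaled_position / (theta ** (2 * i / dim)) for i in range(dim // 2)]


unscaled = rope_angles(position=2, dim=32)
scaled = rope_angles(position=4, dim=32, factor=2.0)
print(unscaled[:2])
print(scaled[:2])
```

With factor 2.0, position 4 receives the same angles that position 2 has without scaling, so a sequence twice as long stays within the range of angles seen without scaling.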
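FuyuForCausalLM( config: FuyuConfig )
Parameters
config (FuyuConfig): Model configuration class with all the parameters of the model.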
Fuyu Model with a language modeling head on top for causal language modeling conditioned on image patches and text. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward( input_ids: LongTensor = None image_patches: Tensor = None image_patches_indices: Tensor = None attention_mask: Optional = None position_ids: Optional = None past_key_values: Optional = None inputs_embeds: Optional = None use_cache: Optional = None labels: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None ) → transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)
Parameters
input_ids (torch.LongTensor of shape (batch_size, sequence_length)): Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional): Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked. If past_key_values is used, optionally only the last input_ids have to be input (see past_key_values). If you want to change padding behavior, you should read modeling_opt._prepare_decoder_attention_mask and modify to your needs. See diagram 1 in the paper for more information on the default strategy.
image_patches (torch.FloatTensor of shape (batch_size, num_total_patches, patch_size x patch_size x num_channels), optional): Image patches to be used as continuous embeddings. The patches are flattened and then projected to the hidden size of the model.
image_patches_indices (torch.LongTensor of shape (batch_size, num_total_patches + number_of_newline_tokens + number_of_text_tokens, patch_size x patch_size x num_channels), optional): Indices indicating at which position the image_patches have to be inserted in inputs_embeds.
position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional): Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True): Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding. If past_key_values are used, the user can optionally input only the last input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, 1) instead of all input_ids of shape (batch_size, sequence_length).
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional): Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
use_cache (bool, optional): If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
output_attentions (bool, optional): Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional): Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional): Whether or not to return a ModelOutput instead of a plain tuple.
labels (torch.LongTensor of shape (batch_size, sequence_length), optional): Labels for computing the language modeling loss. Indices should either be in [0, ..., config.vocab_size] or -100 (see input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].
Returns
transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)
A transformers.modeling_outputs.CausalLMOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (FuyuConfig) and inputs.
loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided): Language modeling loss (for next-token prediction).
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)): Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True): Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
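The interaction between logits, labels, and the -100 ignore value can be sketched in plain Python. This is an illustration of next-token cross-entropy under the common convention that the logits at step t are scored against the label at step t + 1; that shift and the helper name `next_token_loss` are assumptions, not a transcription of the model's code.

```python
import math

IGNORE_INDEX = -100  # labels with this value do not contribute to the loss


def next_token_loss(logits: list[list[float]], labels: list[int]) -> float:
    """Mean cross-entropy where logits[t] predicts labels[t + 1]."""
    total, count = 0.0, 0
    for step_logits, target in zip(logits[:-1], labels[1:]):
        if target == IGNORE_INDEX:
            continue  # masked position, skipped entirely
        log_norm = math.log(sum(math.exp(x) for x in step_logits))
        total += log_norm - step_logits[target]  # negative log-probability of the target
        count += 1
    return total / count if count else float("nan")


logits = [[0.0, 0.0, 0.0, 0.0]] * 3  # uniform scores over a toy vocabulary of 4 tokens
labels = [-100, 2, -100]             # only the middle token is supervised
loss = next_token_loss(logits, labels)
print(loss)
```

Setting the labels of image and prompt positions to -100 is a typical way to restrict the loss to the text the model should learn to generate.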
The FuyuForCausalLM forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Examples:
>>> from transformers import FuyuProcessor, FuyuForCausalLM
>>> from PIL import Image
>>> import requests
>>> processor = FuyuProcessor.from_pretrained("adept/fuyu-8b")
>>> model = FuyuForCausalLM.from_pretrained("adept/fuyu-8b")
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> prompt = "Generate a coco-style caption.\n"
>>> inputs = processor(text=prompt, images=image, return_tensors="pt")
>>> outputs = model(**inputs)
>>> generated_ids = model.generate(**inputs, max_new_tokens=7)
>>> generation_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generation_text)
'A bus parked on the side of a road.'
FuyuImageProcessor( do_resize: bool = True size: Optional = None resample: Resampling = <Resampling.BILINEAR: 2> do_pad: bool = True padding_value: float = 1.0 padding_mode: str = 'constant' do_normalize: bool = True image_mean: Union = 0.5 image_std: Union = 0.5 do_rescale: bool = True rescale_factor: float = 0.00392156862745098 patch_size: Optional = None **kwargs )
Parameters
do_resize (bool, optional, defaults to True): Whether to resize the image to size.
size (Dict[str, int], optional, defaults to {"height": 1080, "width": 1920}): Dictionary in the format {"height": int, "width": int} specifying the size of the output image.
resample (PILImageResampling, optional, defaults to Resampling.BILINEAR): PILImageResampling filter to use when resizing the image, e.g. PILImageResampling.BILINEAR.
do_pad (bool, optional, defaults to True): Whether to pad the image to size.
padding_value (float, optional, defaults to 1.0): The value to pad the image with.
padding_mode (str, optional, defaults to "constant"): The padding mode to use when padding the image.
do_normalize (bool, optional, defaults to True): Whether to normalize the image.
image_mean (float, optional, defaults to 0.5): The mean to use when normalizing the image.
image_std (float, optional, defaults to 0.5): The standard deviation to use when normalizing the image.
do_rescale (bool, optional, defaults to True): Whether to rescale the image.
rescale_factor (float, optional, defaults to 1 / 255): The factor to use when rescaling the image.
patch_size (Dict[str, int], optional, defaults to {"height": 30, "width": 30}): Dictionary in the format {"height": int, "width": int} specifying the size of the patches.
This class should handle the image processing part before the main FuyuForCausalLM. In particular, it should handle:
Processing Images: Taking a batch of images as input. If the images are variable-sized, it resizes them based on the desired patch dimensions. The image output is always img_h, img_w of (1080, 1920). Then, it patches up these images using the patchify_image function.
Creating Image Input IDs: For each patch, a placeholder ID is given to identify where these patches belong in a token sequence. For variable-sized images, each line of patches is terminated with a newline ID.
Image Patch Indices: For each image patch, the code maintains an index where these patches should be inserted in a token stream.
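The last two items can be sketched together: placeholder IDs mark where patches belong, a newline ID ends each row, and a parallel list of patch indices records which patch goes at which position. The ID values, the use of -1 for "no patch here", and both helper names are hypothetical choices for illustration.

```python
PLACEHOLDER_ID = -2  # hypothetical id meaning "an image patch goes here"
NEWLINE_ID = -3      # hypothetical id ending each row of patches


def build_image_ids(rows: int, cols: int) -> list[int]:
    """One placeholder per patch, with a newline id closing every row."""
    ids = []
    for _ in range(rows):
        ids.extend([PLACEHOLDER_ID] * cols)
        ids.append(NEWLINE_ID)
    return ids


def build_patch_indices(ids: list[int]) -> list[int]:
    """For each position, the index of the patch inserted there, or -1 if none."""
    indices, next_patch = [], 0
    for token_id in ids:
        if token_id == PLACEHOLDER_ID:
            indices.append(next_patch)
            next_patch += 1
        else:
            indices.append(-1)
    return indices


text_ids = [101, 102]  # hypothetical text token ids that follow the image
ids = build_image_ids(rows=2, cols=2) + text_ids
patch_indices = build_patch_indices(ids)
print(ids)
print(patch_indices)
```

A model can then replace the embedding at every position whose patch index is not -1 with the projected patch, leaving newline and text positions untouched.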
Preprocess an image or a batch of images.
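With the documented defaults, the per-pixel arithmetic is small enough to write out. The sketch below assumes rescaling happens before normalization and that padding only ever adds rows and columns up to the target size; both helper names are hypothetical.

```python
def preprocess_pixel(value: int, rescale_factor: float = 1 / 255,
                     mean: float = 0.5, std: float = 0.5) -> float:
    """Rescale an 8-bit pixel to [0, 1], then normalize it to about [-1, 1]."""
    return (value * rescale_factor - mean) / std


def padding_needed(height: int, width: int,
                   target_height: int = 1080, target_width: int = 1920) -> tuple[int, int]:
    """Rows and columns of padding required to reach the target size."""
    return max(target_height - height, 0), max(target_width - width, 0)


print(preprocess_pixel(0), preprocess_pixel(255))
print(padding_needed(720, 1280))
```

Under these assumptions, a padding value of 1.0 corresponds to the brightest normalized pixel value if padding is applied after normalization.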
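FuyuProcessor( image_processor, tokenizer )
Parameters
image_processor (FuyuImageProcessor): The image processor is a required input.
tokenizer (LlamaTokenizerFast): The tokenizer is a required input.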
Constructs a Fuyu processor which wraps a Fuyu image processor and a Llama tokenizer into a single processor.
FuyuProcessor offers all the functionalities of FuyuImageProcessor and LlamaTokenizerFast. See the __call__() and decode() for more information.
__call__( text = None images = None add_special_tokens: bool = True return_attention_mask: bool = True padding: Union = False truncation: Union = None max_length: Optional = None stride: int = 0 pad_to_multiple_of: Optional = None return_overflowing_tokens: bool = False return_special_tokens_mask: bool = False return_offsets_mapping: bool = False return_token_type_ids: bool = False return_length: bool = False verbose: bool = True return_tensors: Union = None **kwargs ) → FuyuBatchEncoding
Parameters
text (str, List[str]): The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
images (PIL.Image.Image, List[PIL.Image.Image]): The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch tensor. In case of a NumPy array/PyTorch tensor, each image should be of shape (C, H, W), where C is a number of channels, H and W are image height and width.
Returns
FuyuBatchEncoding
A FuyuBatchEncoding with the following fields:
input_ids: returned when text is not None.
image_patches: returned when images is not None.
attention_mask: returned when return_attention_mask=True.
Main method to prepare for the model one or several sequence(s) and image(s). This method forwards the text and kwargs arguments to LlamaTokenizerFast’s __call__() if text is not None to encode the text. To prepare the image(s), this method forwards the images and kwargs arguments to FuyuImageProcessor’s __call__() if images is not None. Please refer to the docstring of the above two methods for more information.
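The forwarding behaviour described above can be mimicked with a toy class: text goes to a tokenizer callable, images go to an image-processor callable, and the results are merged into one encoding. `ToyProcessor` and both stand-in callables are hypothetical; the real processor does more work, such as combining image and text tokens into a single sequence.

```python
from typing import Any, Callable, Optional


class ToyProcessor:
    """Hypothetical processor that forwards text and images to two callables."""

    def __init__(self, image_processor: Callable[[Any], dict], tokenizer: Callable[[str], dict]):
        self.image_processor = image_processor
        self.tokenizer = tokenizer

    def __call__(self, text: Optional[str] = None, images: Any = None) -> dict:
        if text is None and images is None:
            raise ValueError("Provide text, images, or both.")
        encoding = {}
        if text is not None:
            encoding.update(self.tokenizer(text))  # text is handled by the tokenizer
        if images is not None:
            encoding.update(self.image_processor(images))  # images by the image processor
        return encoding


def toy_tokenizer(text: str) -> dict:
    return {"input_ids": [ord(ch) for ch in text]}  # stand-in: code points as ids


def toy_image_processor(images: Any) -> dict:
    return {"image_patches": images}  # stand-in: passes the data through


processor = ToyProcessor(toy_image_processor, toy_tokenizer)
text_only = processor(text="hi")
both = processor(text="hi", images=[[0.0]])
print(text_only)
print(sorted(both))
```

Keeping the two components behind one call means user code passes raw text and images once and receives everything the model's forward method expects.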