The BridgeTower model was proposed in BridgeTower: Building Bridges Between Encoders in Vision-Language Representative Learning by Xiao Xu, Chenfei Wu, Shachar Rosenman, Vasudev Lal, Wanxiang Che, Nan Duan. The goal of this model is to build a bridge between each uni-modal encoder and the cross-modal encoder to enable comprehensive and detailed interaction at each layer of the cross-modal encoder thus achieving remarkable performance on various downstream tasks with almost negligible additional performance and computational costs.

Vision-Language (VL) models with the TWO-TOWER architecture have dominated visual-language representation learning in recent years. Current VL models either use lightweight uni-modal encoders and learn to extract, align and fuse both modalities simultaneously in a deep cross-modal encoder, or feed the last-layer uni-modal representations from the deep pre-trained uni-modal encoders into the top cross-modal encoder. Both approaches potentially restrict vision-language representation learning and limit model performance. In this paper, we propose BRIDGETOWER, which introduces multiple bridge layers that build a connection between the top layers of uni-modal encoders and each layer of the crossmodal encoder. This enables effective bottom-up cross-modal alignment and fusion between visual and textual representations of different semantic levels of pre-trained uni-modal encoders in the cross-modal encoder. Pre-trained with only 4M images, BRIDGETOWER achieves state-of-the-art performance on various downstream vision-language tasks. In particular, on the VQAv2 test-std set, BRIDGETOWER achieves an accuracy of 78.73%, outperforming the previous state-of-the-art model METER by 1.09% with the same pre-training data and almost negligible additional parameters and computational costs. Notably, when further scaling the model, BRIDGETOWER achieves an accuracy of 81.15%, surpassing models that are pre-trained on orders-of-magnitude larger datasets.

Usage tips and examples

BridgeTower consists of a visual encoder, a textual encoder and cross-modal encoder with multiple lightweight bridge layers. The goal of this approach was to build a bridge between each uni-modal encoder and the cross-modal encoder to enable comprehensive and detailed interaction at each layer of the cross-modal encoder. In principle, one can apply any visual, textual or cross-modal encoder in the proposed architecture.

BridgeTowerConfig

BridgeTower

Overview

Usage tips and examples

BridgeTowerConfig

class transformers.BridgeTowerConfig

from_text_vision_configs

BridgeTowerTextConfig

class transformers.BridgeTowerTextConfig

BridgeTowerVisionConfig

class transformers.BridgeTowerVisionConfig

BridgeTowerImageProcessor

class transformers.BridgeTowerImageProcessor

preprocess

BridgeTowerProcessor

class transformers.BridgeTowerProcessor

__call__

BridgeTowerModel

class transformers.BridgeTowerModel

forward

BridgeTowerForContrastiveLearning

class transformers.BridgeTowerForContrastiveLearning

forward

BridgeTowerForMaskedLM

class transformers.BridgeTowerForMaskedLM

forward

BridgeTowerForImageAndTextRetrieval

class transformers.BridgeTowerForImageAndTextRetrieval

forward

call