The DeiT model was proposed in Training data-efficient image transformers & distillation through attention by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. The Vision Transformer (ViT), introduced in Dosovitskiy et al., 2020, showed that a Transformer encoder (BERT-like) can match or even outperform existing convolutional neural networks. However, the ViT models introduced in that paper required training on expensive infrastructure for multiple weeks, using external data. DeiT (data-efficient image transformers) models are more efficiently trained transformers for image classification, requiring far less data and far fewer computing resources than the original ViT models.

The abstract from the paper is the following:

Recently, neural networks purely based on attention were shown to address image understanding tasks such as image classification. However, these visual transformers are pre-trained with hundreds of millions of images using an expensive infrastructure, thereby limiting their adoption. In this work, we produce a competitive convolution-free transformer by training on Imagenet only. We train them on a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop evaluation) on ImageNet with no external data. More importantly, we introduce a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention. We show the interest of this token-based distillation, especially when using a convnet as a teacher. This leads us to report results competitive with convnets for both Imagenet (where we obtain up to 85.2% accuracy) and when transferring to other tasks. We share our code and models.
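
To make the teacher-student idea above more concrete, here is a minimal plain-Python sketch of a hard-label distillation objective of the kind the paper describes: the class-token head is trained against the ground-truth label, the distillation-token head is trained against the teacher's predicted label, and the two cross-entropy terms are weighted equally. The function names and the toy logits are illustrative assumptions, not part of the library, and a real training setup would compute this on batched tensors.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Cross-entropy of one logit vector against one integer class index."""
    return -math.log(softmax(logits)[target])


def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, label):
    """Equal-weight sum of two cross-entropy terms (illustrative helper).

    cls_logits:     student logits from the class-token head
    dist_logits:    student logits from the distillation-token head
    teacher_logits: logits of the (frozen) teacher for the same image
    label:          ground-truth class index
    """
    # The teacher's hard decision is simply its argmax class.
    teacher_label = max(range(len(teacher_logits)), key=teacher_logits.__getitem__)
    loss_cls = cross_entropy(cls_logits, label)            # supervised by the true label
    loss_dist = cross_entropy(dist_logits, teacher_label)  # supervised by the teacher
    return 0.5 * loss_cls + 0.5 * loss_dist


# Toy example with three classes; all numbers are made up for illustration.
loss = hard_distillation_loss(
    cls_logits=[2.0, 0.5, -1.0],
    dist_logits=[0.1, 1.5, 0.3],
    teacher_logits=[0.2, 3.0, 0.1],
    label=0,
)
print(f"hard distillation loss: {loss:.4f}")
```

Because the two heads receive different targets, the distillation token can learn to reproduce the teacher's decision even when it disagrees with the ground-truth label.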

This model was contributed by nielsr. The TensorFlow version of this model was added by amyeroberts.

Usage tips
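
As a rough illustration of the distillation token in practice, the sketch below assumes PyTorch and the `transformers` library are installed. It builds a tiny, randomly initialized `DeiTForImageClassificationWithTeacher` (the sizes are arbitrary assumptions chosen to run quickly, not a released checkpoint), so the predictions themselves are meaningless. It only shows that the encoder sequence carries two extra tokens on top of the image patches, and that the model exposes separate logits for the class head and the distillation head alongside a combined prediction.

```python
import torch
from transformers import DeiTConfig, DeiTForImageClassificationWithTeacher

# Tiny configuration with hypothetical sizes; weights are random, not pretrained.
config = DeiTConfig(
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    image_size=32,
    patch_size=16,
    num_labels=5,
)
model = DeiTForImageClassificationWithTeacher(config).eval()

# One random RGB "image" at the configured resolution.
pixel_values = torch.randn(1, 3, config.image_size, config.image_size)

with torch.no_grad():
    outputs = model(pixel_values)
    # Run the underlying encoder alone to inspect the token sequence.
    hidden = model.base_model(pixel_values).last_hidden_state

# The sequence holds the class token, the distillation token, and the patches.
num_patches = (config.image_size // config.patch_size) ** 2
seq_len = hidden.shape[1]
print("sequence length:", seq_len, "patches + 2:", num_patches + 2)

# The combined logits are the average of the two heads' logits.
fused = (outputs.cls_logits + outputs.distillation_logits) / 2
print("logits match average of heads:", torch.allclose(outputs.logits, fused))
```

With a pretrained checkpoint, the same output fields are available; only the construction of the model and the image preprocessing would differ.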

Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with DeiT.

If you’re interested in submitting a resource to be included here, please feel free to open a Pull Request and we’ll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

DeiTConfig

DeiTFeatureExtractor

DeiTImageProcessor
