As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models on the edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performance on a wide range of tasks like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge distillation during the pretraining phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pretraining, we introduce a triple loss combining language modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device study.
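The triple loss mentioned in the abstract can be illustrated with a minimal, dependency-free sketch for a single masked position. This is not the authors' training code: the function names, the KL-divergence form of the distillation term, the T² rescaling, the default temperature of 2.0, and the equal default weights are all assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) between temperature-softened distributions."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    kl = sum(
        t * (math.log(t) - math.log(s))
        for t, s in zip(teacher, student)
        if t > 0.0
    )
    # Rescaling by T^2 is a common convention; assumed here, not taken from the text.
    return kl * temperature ** 2


def masked_lm_loss(student_logits, target_index):
    """Cross-entropy of the student's prediction against the true token id."""
    return -math.log(softmax(student_logits)[target_index])


def cosine_distance_loss(student_hidden, teacher_hidden):
    """1 - cosine similarity between student and teacher hidden vectors."""
    dot = sum(s * t for s, t in zip(student_hidden, teacher_hidden))
    norm_s = math.sqrt(sum(s * s for s in student_hidden))
    norm_t = math.sqrt(sum(t * t for t in teacher_hidden))
    return 1.0 - dot / (norm_s * norm_t)


def triple_loss(student_logits, teacher_logits, target_index,
                student_hidden, teacher_hidden,
                w_distill=1.0, w_mlm=1.0, w_cos=1.0, temperature=2.0):
    """Weighted sum of the three terms; the weights are illustrative placeholders."""
    return (
        w_distill * distillation_loss(student_logits, teacher_logits, temperature)
        + w_mlm * masked_lm_loss(student_logits, target_index)
        + w_cos * cosine_distance_loss(student_hidden, teacher_hidden)
    )
```

When the student reproduces the teacher's logits and its hidden vector points in the same direction as the teacher's, the distillation and cosine terms vanish and only the language modeling term remains.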

This model was contributed by victorsanh. The JAX version of this model was contributed by kamalkraj. The original code can be found here.

Usage tips

Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with DistilBERT. If you’re interested in submitting a resource to be included here, please feel free to open a Pull Request and we’ll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

Combining DistilBERT and Flash Attention 2

First, make sure to install the latest version of Flash Attention 2.

Also make sure that your hardware is compatible with Flash Attention 2. Read more about it in the official documentation of the flash-attn repository. Finally, make sure to load your model in half-precision (e.g. torch.float16).
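The loading steps can be sketched as follows. This is a hedged example, not an official recipe: the helper names `choose_attn_implementation` and `load_distilbert`, the fallback to the "eager" attention implementation, and the checkpoint name `distilbert/distilbert-base-uncased` are assumptions that do not come from this section. The package check only confirms that `flash_attn` is installed; it does not verify hardware compatibility.

```python
import importlib.util


def choose_attn_implementation():
    """Return "flash_attention_2" if the flash_attn package is installed, else "eager"."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "eager"


def load_distilbert(checkpoint="distilbert/distilbert-base-uncased"):
    """Load a DistilBERT checkpoint, using Flash Attention 2 in half-precision when possible."""
    # Heavy imports are deferred so the helper above works without torch installed.
    import torch
    from transformers import AutoModel, AutoTokenizer

    # Flash Attention 2 needs a CUDA device; otherwise fall back to eager attention.
    use_fa2 = (
        choose_attn_implementation() == "flash_attention_2"
        and torch.cuda.is_available()
    )

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModel.from_pretrained(
        checkpoint,
        # Half-precision is required for Flash Attention 2; keep float32 otherwise.
        torch_dtype=torch.float16 if use_fa2 else torch.float32,
        attn_implementation="flash_attention_2" if use_fa2 else "eager",
    )
    if use_fa2:
        model = model.to("cuda")
    return tokenizer, model
```

Calling `load_distilbert()` downloads the checkpoint from the Hub on first use, so it needs network access; inputs produced by the tokenizer should be moved to the same device as the model before the forward pass.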


DistilBertConfig

class transformers.DistilBertConfig

DistilBertTokenizer

class transformers.DistilBertTokenizer

build_inputs_with_special_tokens

convert_tokens_to_string

create_token_type_ids_from_sequences

get_special_tokens_mask

DistilBertTokenizerFast

class transformers.DistilBertTokenizerFast

build_inputs_with_special_tokens

create_token_type_ids_from_sequences

DistilBertModel

class transformers.DistilBertModel

forward

DistilBertForMaskedLM

class transformers.DistilBertForMaskedLM

forward

DistilBertForSequenceClassification

class transformers.DistilBertForSequenceClassification

forward

DistilBertForMultipleChoice

class transformers.DistilBertForMultipleChoice

forward

DistilBertForTokenClassification

class transformers.DistilBertForTokenClassification

forward

DistilBertForQuestionAnswering

class transformers.DistilBertForQuestionAnswering

forward

TFDistilBertModel

class transformers.TFDistilBertModel

call

TFDistilBertForMaskedLM

class transformers.TFDistilBertForMaskedLM

call

TFDistilBertForSequenceClassification

class transformers.TFDistilBertForSequenceClassification

call

TFDistilBertForMultipleChoice

class transformers.TFDistilBertForMultipleChoice

call

TFDistilBertForTokenClassification

class transformers.TFDistilBertForTokenClassification

call

TFDistilBertForQuestionAnswering

class transformers.TFDistilBertForQuestionAnswering

call

FlaxDistilBertModel

class transformers.FlaxDistilBertModel

__call__

FlaxDistilBertForMaskedLM

class transformers.FlaxDistilBertForMaskedLM

__call__

FlaxDistilBertForSequenceClassification

class transformers.FlaxDistilBertForSequenceClassification

__call__

FlaxDistilBertForMultipleChoice

class transformers.FlaxDistilBertForMultipleChoice

__call__

FlaxDistilBertForTokenClassification

class transformers.FlaxDistilBertForTokenClassification

__call__

FlaxDistilBertForQuestionAnswering

class transformers.FlaxDistilBertForQuestionAnswering

__call__
