This is a reading note on the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Most sentences are copied and pasted from the original paper. XD
Abstract
BERT, which stands for Bidirectional Encoder Representations from Transformers, is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. It can be fine-tuned with just one additional output layer to create models for various tasks.
1 Introduction
Language model pre-training has been shown to be effective for many NLP tasks, including sentence-level tasks such as natural language inference and paraphrasing, as well as token-level tasks such as named entity recognition and question answering.
There are two existing strategies for applying pre-trained language representations to downstream tasks:
- feature-based, such as ELMo (Peters et al., 2018a), uses task-specific architectures that include the pre-trained representations as additional features;
- fine-tuning, such as the Generative Pre-trained Transformer (OpenAI GPT) (Radford et al., 2018), introduces minimal task-specific parameters, and is trained on the downstream tasks by simply fine-tuning all pretrained parameters (a sketch contrasting the two follows this list).
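To make the distinction concrete, here is a minimal sketch contrasting the two strategies, assuming PyTorch and the Hugging Face transformers package (neither is part of the paper; model names and the task head are illustrative):

```python
# Sketch contrasting feature-based and fine-tuning transfer, assuming
# PyTorch and Hugging Face `transformers` (not part of the paper).
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")
inputs = tokenizer("A running example sentence.", return_tensors="pt")

# Feature-based (ELMo-style): keep the encoder frozen and feed its hidden
# states into a separate task-specific model as additional input features.
with torch.no_grad():
    features = encoder(**inputs).last_hidden_state  # (1, seq_len, 768)

# Fine-tuning (GPT/BERT-style): add a small output layer and update *all*
# parameters, the pre-trained encoder included, on the downstream task.
classifier = torch.nn.Linear(encoder.config.hidden_size, 2)
optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(classifier.parameters()), lr=2e-5
)
```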
The authors argue that current techniques restrict the power of the pre-trained representations, especially for the fine-tuning approaches. The main limitation is that standard language models are unidirectional, which limits the choice of architectures that can be used during pre-training. Such restrictions are sub-optimal for sentence-level tasks and can be very harmful when applying fine-tuning based approaches to token-level tasks such as question answering, where it is crucial to incorporate context from both directions.
BERT alleviates the previously mentioned unidirectionality constraint by using a “masked language model” (MLM) pre-training objective; in addition, a “next sentence prediction” (NSP) task jointly pre-trains text-pair representations.
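As a concrete illustration, the paper masks 15% of the input tokens; of the selected positions, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged. Below is a minimal sketch of such a masking routine (the function name and the -100 "ignore" label are illustrative conventions, not the authors' code):

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, mask_prob=0.15):
    """Illustrative MLM masking: returns (corrupted_input, labels).
    Labels are -100 (a common "ignore in the loss" convention) for
    positions that were not selected for prediction."""
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:      # select ~15% of positions
            labels[i] = tok                  # the model must predict the original token
            r = random.random()
            if r < 0.8:                      # 80%: replace with [MASK]
                inputs[i] = mask_id
            elif r < 0.9:                    # 10%: replace with a random token
                inputs[i] = random.randrange(vocab_size)
            # remaining 10%: keep the original token unchanged
    return inputs, labels
```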
The contributions of this paper include:
- demonstrating the importance of bidirectional pre-training for language representations;
- showing that pre-trained representations reduce the need for many heavily-engineered task-specific architectures;
- advancing the state of the art for eleven NLP tasks.
2 Related Work
This section briefly reviews the most widely-used approaches in the history of pre-training general language representations.
2.1 Unsupervised Feature-based Approaches
ELMo and its predecessor generalize traditional word embedding research along a different dimension. They extract context-sensitive features from a left-to-right and a right-to-left language model. The contextual representation of each token is the concatenation of the left-to-right and right-to-left representations.
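To make the concatenation concrete, here is a toy PyTorch sketch (sizes and names are illustrative, not ELMo's actual implementation) of combining left-to-right and right-to-left contextual states:

```python
import torch
import torch.nn as nn

# Toy illustration: concatenate left-to-right and right-to-left contextual
# states, as in ELMo-style feature extraction (shapes are illustrative).
hidden = 128
fwd_lstm = nn.LSTM(input_size=100, hidden_size=hidden, batch_first=True)
bwd_lstm = nn.LSTM(input_size=100, hidden_size=hidden, batch_first=True)

embeddings = torch.randn(1, 10, 100)                # (batch, seq_len, emb_dim)
fwd_out, _ = fwd_lstm(embeddings)                   # left-to-right pass
bwd_out, _ = bwd_lstm(torch.flip(embeddings, [1]))  # right-to-left pass
bwd_out = torch.flip(bwd_out, [1])                  # re-align to original order

contextual = torch.cat([fwd_out, bwd_out], dim=-1)  # (1, 10, 2 * hidden)
```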
2.2 Unsupervised Fine-tuning Approaches
Sentence or document encoders which produce contextual token representations have been pre-trained from unlabeled text and fine-tuned for a supervised downstream task. The advantage of these approaches is that few parameters need to be learned from scratch. Left-to-right language modeling and auto-encoder objectives have been used for pre-training such models.
2.3 Transfer Learning from Supervised Data
There has also been work showing effective transfer from supervised tasks with large datasets, such as natural language inference (Conneau et al., 2017) and machine translation (McCann et al., 2017).
3 BERT
There are two steps in the framework:
- pre-training: the model is trained on unlabeled data over different pre-training tasks;
- fine-tuning: the BERT model is first initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labeled data from the downstream tasks.
The question-answering example in Figure 1 will serve as a running example for this section.
The unified architecture across different tasks is a distinctive feature of BERT. There is a minimal difference between the pre-trained architecture and the final downstream architecture.
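To connect this to the question-answering running example, below is a sketch of a downstream QA model, assuming the Hugging Face transformers API (not the authors' code): the only parameters that are not pre-trained are a single span-prediction output layer, and everything is fine-tuned together.

```python
import torch
from transformers import BertForQuestionAnswering, BertTokenizer

# Sketch (not the authors' code): for extractive QA the downstream model is the
# pre-trained encoder plus one span-prediction output layer; the span head is
# newly initialized, so predictions are only meaningful after fine-tuning.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")

batch = tokenizer("Who wrote BERT?",
                  "BERT was written by researchers at Google.",
                  return_tensors="pt")
outputs = model(**batch)
start = int(outputs.start_logits.argmax())   # predicted answer start position
end = int(outputs.end_logits.argmax())       # predicted answer end position
answer = tokenizer.decode(batch["input_ids"][0, start:end + 1])
```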
Model Architecture: BERT’s model architecture is a multi-layer bidirectional Transformer encoder based on the original implementation described in Vaswani et al. (2017).
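For reference, the paper defines two sizes: BERT-BASE (L=12 layers, H=768 hidden size, A=12 attention heads, 110M parameters) and BERT-LARGE (L=24, H=1024, A=16, 340M parameters). Below is a rough sketch of a BERT-BASE-sized encoder stack using PyTorch built-ins; the real implementation differs in its embeddings, layer-norm placement, and other details.

```python
import torch.nn as nn

# Rough sketch of a BERT-BASE-sized encoder stack using PyTorch built-ins
# (the actual BERT implementation differs in embeddings, norm placement, etc.).
layer = nn.TransformerEncoderLayer(
    d_model=768,            # hidden size H
    nhead=12,               # attention heads A
    dim_feedforward=3072,   # feed-forward size, 4 * H
    activation="gelu",
    batch_first=True,
)
encoder = nn.TransformerEncoder(layer, num_layers=12)  # L = 12 layers
```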