Embeddings are a key tool for transfer learning in NLP. Earlier this year, the paper “Deep contextualized word representations” (2018) introduced ELMo, a new technique for embedding words into a real vector space using bidirectional LSTMs trained on a language modeling objective. In addition to beating previous performance benchmarks, using ELMo as a pre-trained embedding for other NLP tasks can reduce the amount of training data needed by up to 10x.

A sequence of works appeared in rapid succession this summer: first ULMFiT by Howard and Ruder, then GPT from Radford et al., both matching or surpassing ELMo by using different architectures (e.g. a Transformer instead of a bi-LSTM for GPT) or different fine-tuning methods (e.g. discriminative fine-tuning in ULMFiT).

Figure: from the BERT paper (arXiv:1810.04805)

But these methods all share a common feature: language modeling (LM) is the unsupervised learning task used in the pre-training stage. In language modeling, the goal is to predict the next word in a sentence conditioned on the previous words in that sentence, $P(x_i \mid x_1, \dots, x_{i-1})$, such that the resulting distribution over sentences, $P(x_1, \dots, x_n)$, is as close as possible to the distribution over sentences in the original training corpus. Intuitively, a good language model should capture significant information about not only individual words but also relationships between words based on their positions in sentences relative to each other.
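
To make the factorization concrete, here is a minimal sketch of a toy language model that scores a sentence as a product of next-word probabilities. It is an illustration only: it assumes a bigram simplification (each word is conditioned on just the single previous word, not the full history), and the corpus, the `<s>`/`</s>` boundary markers, and all function names are made up for this example.

```python
from collections import Counter, defaultdict

def train_bigram_lm(sentences):
    """Count how often each word follows each previous word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # <s> and </s> are hypothetical start/end markers for this sketch.
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def next_word_prob(counts, prev, word):
    """Estimate P(word | prev) from raw counts (no smoothing)."""
    total = sum(counts[prev].values())
    return counts[prev][word] / total if total else 0.0

def sentence_prob(counts, sentence):
    """Chain rule: multiply the next-word probabilities, left to right."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= next_word_prob(counts, prev, word)
    return prob

# Toy corpus, invented for illustration.
corpus = ["the cat sat", "the dog sat", "the cat ran"]
lm = train_bigram_lm(corpus)
print(sentence_prob(lm, "the cat sat"))
print(sentence_prob(lm, "sat the cat"))
```

Note that each prediction only ever looks at words to its left. A neural language model replaces the count table with a learned network and conditions on the whole prefix, but the left-to-right structure is the same.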


BERT is different from ELMo and company primarily because it targets a different training objective. The main limitation of the earlier works is an inability to take into account both the left and right contexts of the target word, since the language modeling objective proceeds from left to right, adding successive words to a sentence. Even ELMo, which uses a bidirectional LSTM, simply concatenates the left-to-right and right-to-left information, meaning that the representation can’t take advantage of both left and right contexts simultaneously.

BERT replaces language modeling with a modified objective that its authors call “masked language modeling”. Under this objective, each word in a sentence is randomly erased and replaced with a special mask token (“masked”) with a small probability, 15%. Then, a Transformer is used to generate a prediction for each masked word based on the unmasked words surrounding it, both to the left and to the right.
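
The masking step described above can be sketched in a few lines of Python. This is a simplified illustration, not the paper's exact procedure: it assumes whitespace tokenization and independent per-token masking, and the `mask_tokens` helper, the `"[MASK]"` string, and the `seed` parameter are names chosen for this example.

```python
import random

MASK_TOKEN = "[MASK]"  # placeholder string for the special mask token

def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Replace each token with MASK_TOKEN with probability mask_prob.

    Returns the masked sequence and a dict mapping each masked position
    to the original token, which is what the model must predict.
    """
    rng = random.Random(seed)
    masked = []
    targets = {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, targets = mask_tokens(tokens, seed=0)
print(masked)   # input the model sees
print(targets)  # positions and words it is trained to recover
```

Because the prediction targets sit in the middle of the sequence, the model can use the unmasked words on both sides of each masked position, which is the key difference from the left-to-right objective.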

Using this new objective, BERT is able to achieve state-of-the-art performance on a variety of tasks in the GLUE benchmark. Furthermore, the amount of task-specific customization is extremely limited, suggesting that the information needed to accomplish all of these tasks is contained in the BERT embedding, and in a very explicit form.

For additional details on BERT, including its other pre-training objective, called “next-sentence prediction”, check out the paper (arXiv:1810.04805).