John Inacay, Mike Wang, and Wiley Wang (All authors contributed equally)

1. Why are Transformers Important?

Transformers have taken the world of Natural Language Processing (NLP) and Deep Learning by storm. This neural network architecture was introduced in the 2017 paper Attention Is All You Need as an alternative to recurrent sequence models, built entirely around the attention mechanism, and it has quickly become a dominant technique in NLP. Google’s BERT and OpenAI’s GPT-3 are both state-of-the-art language models based predominantly on the Transformer architecture.

Before the Transformer architecture was introduced, NLP relied on many different problem-specific models, one for each task. Now it is common to use a single model as the backbone for many different tasks. Computer Vision offers an analogy: Convolutional Neural Networks (CNNs) are commonly used for problems such as Object Detection, Image Classification, and Instance Segmentation. The CNN that serves as the network’s backbone generally extracts intermediate-level features such as edges and blobs within the image. These features are valuable to many different Computer Vision tasks, which allows users to apply the same network to many different problems through transfer learning. Similarly, Transformer architectures such as BERT generally extract intermediate-level features, such as syntax and word embeddings, that are useful for many different tasks, including sentiment classification and machine translation. Transformer architectures such as BERT let users apply the same pretrained network to new problems and achieve significantly higher performance than before.

2. What is Attention? And how does it work in Transformers?

Attention models are networks that weight signals by importance. For a language example, consider the sentence from Cheng et al., 2016: “The FBI is chasing a criminal on the run”. When reading the sentence, a reader focuses on certain words because of their relationship to the current word. When reading the word “criminal”, the words “FBI” and “chasing” carry a strong weighting, along with the immediately nearby words. A reader may do this unconsciously, but a neural network needs components specifically designed to replicate this attention to certain words.

Attention as a model mechanism attempts to replicate this focus on relevant information as a signal within the network architecture. Transformers accomplish this by using scaled dot-product attention to calculate focus as a vector of importance “scores”. Consider a neural network composed of an encoding network, which transforms an input into an intermediate embedding, and a decoding network, which produces the output for a task.

Let s_t be the hidden state of the decoder and h_i the hidden state of the encoder. The score between them can be formulated as their dot product. We scale this dot product by the square root of the vector dimension n, because for large n an unscaled dot product can grow large enough to saturate the softmax used further down the network, which causes extremely small gradients. With this formulation we get the Scaled Dot Product function:

$$\mathrm{score}(s_t, h_i) = \frac{s_t^\top h_i}{\sqrt{n}}$$

(See this post for more information on attention)
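To make the effect of the scaling concrete, here is a minimal NumPy sketch. The dimension n = 512, the ten encoder states, and the random vectors are illustrative assumptions, not values from the post. It compares the softmax weights produced by raw dot products with those produced by scaled ones.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array of scores."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
n = 512                               # vector dimension (illustrative choice)
s_t = rng.standard_normal(n)          # one decoder hidden state
h = rng.standard_normal((10, n))      # ten encoder hidden states h_i

raw_scores = h @ s_t                  # plain dot products; their spread grows with n
scaled_scores = raw_scores / np.sqrt(n)

raw_weights = softmax(raw_scores)
scaled_weights = softmax(scaled_scores)

# Without scaling, the largest weight dominates and the softmax saturates;
# with scaling, the weights stay more evenly spread.
print("largest weight without scaling:", raw_weights.max())
print("largest weight with scaling:   ", scaled_weights.max())
```

When one weight is close to 1 and the rest are close to 0, the softmax is in a flat region where gradients are tiny, which is the problem the scaling is meant to avoid.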

Transformers use this scaled dot product scoring with 3 learnable weight layers that are applied to the same encoded input. Their outputs are called the Key (K), Query (Q), and Value (V) embeddings, each of dimension d_k. The Query and Key embeddings play the roles of s_t and h_i in the scaled dot product formula. Their scaled dot products are the scores fed to the softmax, and the resulting weights are applied to the Value embeddings, so the final attention embedding is as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
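A minimal NumPy sketch of this computation for a single attention head might look like the following. The sizes, the random input, and the random weight matrices W_q, W_k, and W_v are assumptions for illustration only; in a real Transformer the weights are learned and the computation is batched and split over several heads.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Return softmax(Q K^T / sqrt(d_k)) V together with the attention weights."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (seq_len, seq_len) importance scores
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 8          # illustrative sizes
X = rng.standard_normal((seq_len, d_model))   # the encoded input

# Three weight matrices applied to the same input (random here, learned in practice).
W_q = rng.standard_normal((d_model, d_k))
W_k = rng.standard_normal((d_model, d_k))
W_v = rng.standard_normal((d_model, d_k))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)    # one d_k-dimensional attention embedding per position
```

Each row of `weights` says how strongly one position attends to every other position, which is the learned counterpart of the reader focusing on “FBI” and “chasing” when reading “criminal”.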

3. What is Time2Vec?

In a Transformer sequence-to-sequence network, we often need to encode time, or position. In the original Transformer, position vectors encode the position of each word and the distances between words. They can be added to or concatenated with the word embeddings.

In the 2019 paper “Time2Vec: Learning a Vector Representation of Time”, the time vector is learned. The time function is constructed to have the following properties: periodicity, invariance to time rescaling, and simplicity.

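Time2Vec defines the time vector as:

$$\mathrm{t2v}(\tau)[i] = \begin{cases} \omega_i \tau + \varphi_i, & \text{if } i = 0 \\ \mathcal{F}(\omega_i \tau + \varphi_i), & \text{if } 1 \le i \le k \end{cases}$$

where t2v(τ)[i] is the i-th element of the vector, τ is the notion of time, ω_i and φ_i are learnable parameters, and F is a periodic activation function (sine in the paper). In a neural network, it is represented as a learned layer.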
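As a rough sketch of what such a layer computes, the NumPy function below builds a Time2Vec embedding with one linear component and k sine components. The function name `time2vec`, the value k = 4, and the random ω and φ are illustrative assumptions; in a trained model these parameters would be learned by backpropagation.

```python
import numpy as np

def time2vec(tau, omega, phi):
    """Map scalar times to (k + 1)-dimensional vectors.

    tau:   array of shape (n,) holding scalar time values
    omega: frequencies, shape (k + 1,)
    phi:   phase shifts, shape (k + 1,)
    Element 0 is linear in tau; elements 1..k are periodic (sine).
    """
    tau = np.asarray(tau, dtype=float).reshape(-1, 1)
    z = tau * omega + phi            # broadcasts to shape (n, k + 1)
    out = np.sin(z)                  # periodic components
    out[:, 0] = z[:, 0]              # keep the i = 0 component linear
    return out

rng = np.random.default_rng(0)
k = 4                                # number of periodic components (illustrative)
omega = rng.standard_normal(k + 1)   # random here, learned in practice
phi = rng.standard_normal(k + 1)

tau = np.arange(6)                   # six time steps: 0, 1, ..., 5
embedding = time2vec(tau, omega, phi)
print(embedding.shape)               # one (k + 1)-dimensional vector per time step
```

The linear element lets the model capture non-periodic progression of time, while the sine elements capture repeating patterns at different frequencies.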

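The position encoding of the original Transformer shares a similar intent: it also represents position with sinusoids, but at fixed rather than learned frequencies. “The Illustrated Transformer” includes a visualization of this encoding.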
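For reference, the sinusoidal encoding described in Attention Is All You Need can be sketched in a few lines of NumPy. The function name and the sizes below are illustrative assumptions, and the sketch assumes an even embedding dimension.

```python
import numpy as np

def sinusoidal_position_encoding(num_positions, d_model):
    """Fixed sinusoidal position encoding, assuming d_model is even.

    Even dimensions use sin(pos / 10000^(2i / d_model)),
    odd dimensions use cos of the same angle.
    """
    positions = np.arange(num_positions)[:, None]      # shape (P, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]      # the values 2i, shape (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)

    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_position_encoding(num_positions=50, d_model=16)
print(pe.shape)   # one d_model-dimensional vector per position
```

Unlike Time2Vec, nothing here is learned: the frequencies are fixed in advance, and the resulting matrix is simply added to the word embeddings.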

4. How do you use BERT?

Transfer learning is widely used in both Computer Vision and Natural Language Processing. It is the process of taking a network that has already been pretrained on one task (for example, BERT was pretrained on language modeling with a large dataset) and fine-tuning it on a specific new task. With Transformer-based architectures such as BERT, transfer learning is commonly used to adapt the network to applications such as sentiment classification and machine translation (for example, translating English to French). One advantage of fine-tuning an existing network is that the new task often needs far fewer training examples than training a network from scratch would. In addition, fine-tuning usually produces significantly higher performance than training from scratch. This higher performance suggests that features learned on the previous task are often still useful and can be reused on the new task.

For a Natural Language Processing problem, an easy starting point is to take a pretrained BERT network from HuggingFace and fine-tune it on your specific problem. Example applications include sentiment analysis and spam detection. For real-world problems that require faster processing, DistilBERT is a smaller network that runs 60% faster while still achieving 95% of the original BERT’s performance.
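As a hedged starting point, the sketch below uses the `pipeline` helper from the HuggingFace `transformers` package to run an already fine-tuned sentiment classifier. The helper names `classify_sentiment` and `format_prediction` are hypothetical, introduced only for this example. It assumes `transformers` and a backend such as PyTorch are installed, and the first call downloads a default checkpoint, so it needs network access.

```python
def classify_sentiment(texts):
    """Run a pretrained sentiment classifier over a list of strings.

    Returns a list of dicts, each with a "label" and a "score" key.
    """
    # Imported lazily so this module still loads where transformers is not installed.
    from transformers import pipeline

    # With no model argument, the library picks a default fine-tuned checkpoint
    # for the task; pass model="..." to choose a specific one.
    classifier = pipeline("sentiment-analysis")
    return classifier(texts)

def format_prediction(text, prediction):
    """Turn one pipeline result into a readable line (hypothetical helper)."""
    return f"{prediction['label']} ({prediction['score']:.2f}): {text}"

# Example usage (downloads a model on first run):
#   texts = ["I loved this movie.", "The plot made no sense."]
#   for text, pred in zip(texts, classify_sentiment(texts)):
#       print(format_prediction(text, pred))
```

This covers inference with a model someone else has already fine-tuned. Fine-tuning on your own labeled data takes more setup, and the HuggingFace documentation is the place to check for the current training APIs.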

5. Scalability

Unlike Recurrent Neural Networks (RNNs) such as Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs), Transformers do not need to process sequential data in order. Before the advent of Transformers, researchers had added the attention mechanism to LSTMs and GRUs and achieved significant increases in performance. People then realized that the attention mechanism was powerful even without the recurrence of RNNs, which led to the creation of the Transformer architecture. Because every position in a sequence can be processed in parallel rather than one step after another, the Transformer scales much better than RNNs as the input sequence grows longer. This parallelizability is one of the properties that allowed OpenAI’s GPT-3 to scale successfully to 175 billion parameters.

6. What is DETR?

Attention is a powerful network representation that is used beyond NLP and sequence-to-sequence problems. One example, DETR, applies it to 2D object detection.

Detection Transformer (DETR) is an object detection network from Facebook Research that uses a Transformer head to produce multi-class bounding box detections. Just as Transformers weight the importance of elements in a sequence of inputs in NLP, they can weight the importance of 2D x-y coordinates in the CV domain. Instead of a time encoding like Time2Vec, DETR uses a 2D positional encoding of the location of each part of the image.

Interestingly, the model formulates the problem differently from traditional object detection frameworks. Instead of dividing an image into patches and proposing a set number of bounding boxes per patch, the network directly outputs a fixed number of bounding boxes. The former approach requires post-processing such as non-maximum suppression (NMS), while the latter only requires filtering each predicted box by its confidence value.
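The confidence filtering step can be sketched as follows. This is not DETR’s actual code: the function name, the 0.7 threshold, and the toy logits and boxes are assumptions for illustration. The sketch assumes the network emits one row of class logits per predicted box, with the last column reserved for “no object”.

```python
import numpy as np

def filter_detections(class_logits, boxes, threshold=0.7):
    """Keep only the predicted boxes whose best class probability exceeds threshold.

    class_logits: shape (N, C + 1); the last column is the "no object" class
    boxes:        shape (N, 4); one box per prediction
    """
    shifted = class_logits - class_logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=-1, keepdims=True)    # softmax over classes

    object_probs = probs[:, :-1]                  # drop the "no object" column
    scores = object_probs.max(axis=-1)
    labels = object_probs.argmax(axis=-1)
    keep = scores > threshold
    return boxes[keep], labels[keep], scores[keep]

# Toy output: 4 predictions over 2 object classes plus "no object".
class_logits = np.array([
    [5.0, 0.0, 0.0],   # confident: class 0
    [0.0, 0.0, 5.0],   # confident: no object
    [0.0, 4.0, 0.0],   # confident: class 1
    [1.0, 1.0, 1.0],   # uncertain
])
boxes = np.array([
    [0.1, 0.1, 0.4, 0.4],
    [0.5, 0.5, 0.6, 0.6],
    [0.2, 0.6, 0.5, 0.9],
    [0.0, 0.0, 1.0, 1.0],
])

kept_boxes, kept_labels, kept_scores = filter_detections(class_logits, boxes)
print(kept_labels)   # classes of the boxes that survive the filter
```

No pairwise overlap comparison between boxes is needed here, which is the simplification relative to NMS-based pipelines.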

7. Implication of Transformers

The Transformer architecture represents the state of the art in Natural Language Processing and demonstrates how powerful the attention mechanism can be. In addition, the inherent parallelizability of the Transformer allows us to build much larger neural networks and train them on larger datasets. By combining the attention mechanism with increased scalability, Transformers have transformed the way we understand NLP.

8. Resources

The Illustrated Transformer

HuggingFace Transformers

A Visual Guide to Using BERT for the First Time

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

Transformer (machine learning model)

Time2Vec: Learning a Vector Representation of Time

Stock predictions with state-of-the-art Transformer and Time Embeddings

We’re a team of Machine Learning Engineers exploring and researching deep learning technologies