Transformer: The Self-Attention Mechanism
Published in Machine Intelligence and Deep Learning · May 2, 2022
This post gives a brief overview of the popular self-attention mechanism introduced in the ‘Attention is All You Need’ paper, which has become a de-facto building block of machine learning models nowadays.
A video presentation is available at this link.
The implementation can be found at this link.
Blog by Zubaidah Al-Mashhadani and Sudipto Baul
Intuition of the paper:
The paper proposes a simple network architecture, the Transformer, based solely on the attention mechanism. Thanks to parallelization, its translations are superior in quality while taking significantly less time to train than sequential models such as recurrent neural networks. Moreover, the paper shows that the transformer generalizes well to other tasks with both large and limited training data.
So, what is the transformer?
In the simplest view, the transformer is basically a black box that takes the input we want to translate and produces the translated output. Looking inside the black box, we see that it consists of an encoder component and a decoder component.
The encoding component is a stack of six encoders, and the decoding component is likewise a stack of six decoders. The encoders all share an identical structure: every encoder has two sublayers, a Self-Attention layer and a Feed-Forward Neural Network. The input first goes through the self-attention layer and then through the feed-forward layer. The decoder has the same two layers plus a third layer that sits between them, called the Encoder-Decoder Attention layer, which helps the decoder focus on relevant parts of the input sentence.
Let’s take a step back to see how exactly the model deals with an input sentence to produce the desired translation. In Natural Language Processing (NLP), words are generally treated as distinct inputs that are related to one another through their meanings. Since words cannot be passed to a model or neural network without first being encoded into numerical form, we transform each word into a vector using an embedding algorithm. There are multiple approaches to generating word embeddings, which can mainly be split into two categories: probabilistic approaches and count-based approaches. The choice of word embedding is one of the most significant preprocessing steps when performing an NLP task. Some of the widely used embedding algorithms are word2vec, GloVe, and BERT.
The embedding only happens in the bottom-most encoder; every encoder receives a list of vectors of size 512. After embedding the words of the input sentence, each vector flows through the two layers of the encoder. Here we can observe a very important property of the transformer: it works in parallel. In other words, the word at each position flows through its own path in the encoder. In an RNN such as an LSTM, the flow of data is sequential rather than parallel, which consumes more time; hence, transformers are much faster thanks to this parallelization. Figure 1 shows the architecture of the transformer.
Figure 1.
Transformer Architecture
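Before stepping into the encoder, here is a minimal sketch of the embedding step described above (PyTorch; the toy vocabulary and sentence are purely illustrative): each word is mapped to a 512-dimensional vector, which is exactly what the bottom-most encoder receives.

```python
import torch
import torch.nn as nn

d_model = 512                                   # embedding size used in the paper
vocab = {"<pad>": 0, "je": 1, "suis": 2, "etudiant": 3}   # toy vocabulary (assumption)

embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=d_model)

# a single tokenized sentence -> indices -> one 512-d vector per word
tokens = torch.tensor([[vocab["je"], vocab["suis"], vocab["etudiant"]]])  # shape (1, 3)
x = embedding(tokens)                           # shape (1, 3, 512)
print(x.shape)                                  # torch.Size([1, 3, 512])
```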
The Encoder:
The encoder maps an input sequence of symbol representations (x1, x2, …, xn) to a sequence of continuous representations z = (z1, z2, …, zn). Given z, the decoder generates an output sequence (y1, …, ym) of symbols one element at a time. The transformer implements this using stacked self-attention and point-wise, fully connected layers for both the encoder and the decoder. As explained earlier, the encoder receives a list of vectors as input. It processes this list by passing it through a self-attention layer, then through a feed-forward neural network, and sends the output up to the following encoder.
We have mentioned self-attention multiple times so far, but what do we mean by self-attention? To understand this, let’s explore the idea of attention first. As humans, we use our visual attention to focus on important features when looking at a picture. In the cat image shown below, for example, the features marked by the red squares let us infer what the image shows, while we pay little attention to the background regions marked by the purple rectangles; moreover, if we cover those important features, we probably will not be able to make a correct inference.
In a similar way, we can describe the relationship between words in a given sentence. As illustrated below, when we see “watching” we expect to encounter something like a movie, a show, or a play very soon. Meanwhile, the word “French” describes the show, but it is not directly related to “watching”. Self-attention basically allows the model to look at the other positions in the input sequence while processing each word, which leads to a better encoding.
Self-Attention:
Now that we have discussed the importance and role of self-attention in transformers, it is time to explore its calculation and implementation step by step (a short code sketch follows the list):
1- Create three vectors for each word by multiplying each token embedding (1x512) by three trained matrices, the Query, Key, and Value matrices, each of size (512x64). This gives three vectors per word, each of size (1x64).
2- Calculate the score (weight) by taking the dot product of the query vector with the key vectors of the other words (tokens). The score determines how much focus to place on other parts of the input as we encode the word at a specific position.
3- Divide the score by the square root of the dimension of the key vector, which equals 8 in this case.
4- Apply the SoftMax operation to normalize the scores between 0 and 1 so that all weights add up to 1.
5- Next, multiply each value vector by its score.
6- Finally, sum up the weighted value vectors to obtain the encoding of the specified token.
7- Repeating this for all words, we end up with an attention map that fully encodes the words using attention.
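The snippet below is a minimal sketch of steps 1-7 for a single attention head, assuming d_model = 512 and d_k = 64 as in the paper; the matrices W_q, W_k, W_v stand in for trained weights and are random here for illustration only.

```python
import math
import torch
import torch.nn.functional as F

d_model, d_k = 512, 64
seq_len = 5                                   # e.g. a 5-word sentence

x = torch.randn(seq_len, d_model)             # embedded input tokens
W_q = torch.randn(d_model, d_k)               # stand-ins for the trained
W_k = torch.randn(d_model, d_k)               # projection matrices
W_v = torch.randn(d_model, d_k)

# 1) create query, key, value vectors for every word
q, k, v = x @ W_q, x @ W_k, x @ W_v           # each (seq_len, d_k)

# 2) + 3) dot-product scores, scaled by sqrt(d_k) = 8
scores = q @ k.T / math.sqrt(d_k)             # (seq_len, seq_len)

# 4) softmax so the weights for each word sum to 1
weights = F.softmax(scores, dim=-1)

# 5) + 6) weight the value vectors and sum them up
z = weights @ v                               # (seq_len, d_k): one encoding per word

# 7) 'weights' is the attention map over all words
print(z.shape, weights.shape)
```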
Matrix calculation of Self-Attention:
We start by calculating the Query, Key, and Value matrices. These are obtained by multiplying the matrix of packed input embeddings X by the trained weight matrices (WQ, WK, WV), as shown below:
Q = X · WQ (queries)
K = X · WK (keys)
V = X · WV (values)
Figure 2. Matrix Calculations of Self-Attention
Here, each row of the matrix X corresponds to a word in the input sentence.
Next, obtaining the scores by multiplying the Query matrix by the Key matrix, dividing by the square root of the key dimension, applying the SoftMax operation, and multiplying the result by the Value matrix are all done in one step, using the same function as in the original paper [1]:
Attention(Q, K, V) = SoftMax(Q · K^T / sqrt(d_k)) · V
Multi-Headed Attention:
The self-attention layer is refined further by the addition of “multi-headed” attention. This improves the performance of the attention layer by expanding the model’s ability to focus on different positions while encoding, leading to better predictions. Moreover, it gives the attention layer multiple “representation subspaces”: in multi-headed attention we have multiple sets of Query, Key, and Value weight matrices instead of one. The transformer uses eight attention heads, which leads to eight sets of Q, K, V matrices and, eventually, eight Z matrices, since the attention is calculated separately in each of the eight heads.
This raises a challenge! The feed-forward layer is not expecting eight matrices; it expects a single matrix. To overcome this, we first concatenate the Z matrices from all the attention heads and then multiply by a weight matrix WO that was trained jointly with the model. The result is a single Z matrix that captures the information from all the attention heads, and this matrix is passed to the feed-forward layer.
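A minimal sketch of this concatenate-and-project step, assuming eight heads with d_k = 64 each; the per-head outputs Z_0 … Z_7 would come from the single-head computation above, and W_o stands in for the jointly trained output matrix.

```python
import torch

num_heads, d_k, d_model = 8, 64, 512
seq_len = 5

# stand-ins for the eight per-head outputs Z_0 ... Z_7
z_heads = [torch.randn(seq_len, d_k) for _ in range(num_heads)]

# concatenate along the feature dimension: (seq_len, 8 * 64) = (seq_len, 512)
z_cat = torch.cat(z_heads, dim=-1)

# project with W_o (trained jointly with the model) back to d_model
W_o = torch.randn(num_heads * d_k, d_model)
z = z_cat @ W_o                               # (seq_len, 512), fed to the feed-forward layer
print(z.shape)
```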
Positional Encoding:
As we discussed earlier, each word is represented by a vector produced by an embedding algorithm, which yields a token embedding. Since the transformer works in parallel, it needs another way to keep track of each word’s position within the sentence. The transformer overcomes this issue by adding a vector to each input embedding. These vectors follow a pattern that the model learns, which helps it determine the position of each word, or the distance between different words in the sentence. The idea is that these vectors provide meaningful distances between the embedding vectors once they are projected into the Query, Key, and Value vectors and used in the dot-product attention. This is called positional encoding. For example, if the embedding has a dimensionality of 4, the positional encoding would look as follows:
“Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add ‘positional encodings’ to the input embeddings…”
— ‘Attention is All You Need’ [1]
Figure 3. Positional
Encoding
It’s important
to note that the positional encodings have the same dimensions as the
embeddings so that both can be summed as shown in the previous example.
Sine and Cosine
functions of different frequencies are used:
PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
where pos is the position and i is the dimension index, so that each dimension of the positional encoding corresponds to a sinusoid.
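A minimal sketch of the sinusoidal encoding above: it builds a (max_len, d_model) table of positional vectors that is simply added to the input embeddings.

```python
import torch

def positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Sinusoidal positional encodings:
    PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))."""
    pos = torch.arange(max_len).unsqueeze(1).float()              # (max_len, 1)
    i = torch.arange(0, d_model, 2).float()                       # even dimensions
    angle = pos / torch.pow(torch.tensor(10000.0), i / d_model)   # (max_len, d_model/2)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angle)                                # even dims get sine
    pe[:, 1::2] = torch.cos(angle)                                # odd dims get cosine
    return pe

pe = positional_encoding(max_len=100, d_model=512)
# x = embedding(tokens) + pe[:tokens.size(1)]   # added to the embeddings before the encoder
print(pe.shape)                                 # torch.Size([100, 512])
```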
Each sublayer in each encoder, i.e. the self-attention layer and the FFNN, has a residual connection around it, followed by a layer-normalization step.
To put everything we have learnt about the encoder into one big picture, we can see it as follows:
Figure 4. Encoder
Architecture
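Putting the pieces together, the following is a minimal sketch of one encoder layer (self-attention, add & norm, feed-forward, add & norm). It uses PyTorch’s built-in multi-head attention rather than the authors’ implementation; d_ff = 2048 is the inner feed-forward size from the paper.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: self-attention and FFN, each wrapped in a
    residual connection followed by layer normalization."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.self_attn(x, x, x)   # queries, keys, values all come from x
        x = self.norm1(x + attn_out)            # residual connection + layer norm
        x = self.norm2(x + self.ffn(x))         # same pattern around the feed-forward layer
        return x

layer = EncoderLayer()
out = layer(torch.randn(1, 5, 512))             # (batch, seq_len, d_model)
print(out.shape)
```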
The Decoder:
We saw that the encoder takes the input sequence and processes it. The output of the top encoder is transformed into a set of attention vectors K and V. These are used by each decoder in its “encoder-decoder attention” layer, which helps the decoder focus on suitable positions in the input sentence. The self-attention layers in the decoder are slightly different from the ones in the encoder: in the decoder, the self-attention layer is only allowed to attend to earlier positions in the output sequence. This is achieved by masking the future positions before the SoftMax step in the self-attention calculation. Moreover, the “Encoder-Decoder Attention” layer works like multi-headed self-attention, except that it creates its Queries matrix from the layer below it and takes the Keys and Values matrices from the output of the encoder.
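A minimal sketch of the masking step described above: positions to the right of the current one are set to negative infinity before the SoftMax, so their attention weights become zero.

```python
import torch
import torch.nn.functional as F

seq_len = 5
scores = torch.randn(seq_len, seq_len)          # raw decoder self-attention scores

# upper-triangular mask: True above the diagonal marks "future" positions
future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(future, float("-inf"))

weights = F.softmax(scores, dim=-1)             # future positions now receive weight 0
print(weights)                                  # lower-triangular attention pattern
```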
Final Linear and SoftMax Layer:
The output of the decoder stack is a vector of floats that we need to turn into a word. This is the role of the final Linear layer, which is followed by a SoftMax layer. The Linear layer is a fully connected neural network that projects the vector produced by the stack of decoders into a much larger vector called the logits vector; each cell of the logits vector corresponds to the score of a unique word. The SoftMax layer then turns those scores into probabilities that add up to 1.0. Finally, the cell with the highest probability is chosen, and the word associated with it is the output for this time step.
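A minimal sketch of this final projection, assuming a hypothetical vocabulary of 10,000 words: a linear layer turns the decoder output vector into a logits vector, and SoftMax turns the logits into probabilities.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, vocab_size = 512, 10000                # vocabulary size is illustrative
generator = nn.Linear(d_model, vocab_size)      # final fully connected layer

decoder_out = torch.randn(1, d_model)           # decoder output for one time step
logits = generator(decoder_out)                 # one score per vocabulary word
probs = F.softmax(logits, dim=-1)               # probabilities summing to 1.0
next_word = probs.argmax(dim=-1)                # index of the most probable word
print(next_word)
```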
Experimentation:
The original model in the paper [1] was trained on the standard WMT 2014 English-German dataset, consisting of 4.5 million sentence pairs, and the English-French dataset, consisting of 36M sentence pairs. Sentence pairs were batched together by approximate sequence length. The Adam optimizer was used with a variable learning rate. Dropout was applied to the output of each sub-layer before it was added to the sub-layer input and normalized. Dropout was also applied to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. Label smoothing was applied as well.
Modification for the project:
For this course project, we applied the model to the standard WMT 2014 English-German dataset only, following the code of [2]. Parts of the training process were also modified: the positional encoding was learned instead of static, a static learning rate was used, and no label smoothing was employed.
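A minimal sketch of the learned positional encoding we used in the modified model (the vocabulary size and maximum length are illustrative): positions 0 … seq_len-1 are looked up in a trainable embedding table instead of the fixed sinusoids.

```python
import torch
import torch.nn as nn

d_model, max_len = 512, 100
tok_embedding = nn.Embedding(10000, d_model)    # token embeddings (vocab size illustrative)
pos_embedding = nn.Embedding(max_len, d_model)  # learned positional embeddings

tokens = torch.randint(0, 10000, (1, 7))        # a batch with one 7-token sentence
positions = torch.arange(tokens.size(1)).unsqueeze(0)   # [[0, 1, ..., 6]]
x = tok_embedding(tokens) + pos_embedding(positions)    # (1, 7, 512)
print(x.shape)
```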
Results:
Table 1 below presents the different variations of hyperparameters of the transformer model, together with the perplexity (PPL), BLEU score, and number of trainable parameters for each version on the EN-DE dataset. At the top, with the lowest hyperparameter values and hence the lowest number of parameters, is the base model. At the bottom is the big version of the transformer, with the highest number of parameters as a consequence of the higher hyperparameter values; it achieves the best BLEU and PPL scores.
Table 1. Variation
of the transformer model [1]
A comparison of the transformer’s results with the models introduced in previous works is given in Table 2 below. The BLEU score is given for the English-German (EN-DE) and English-French (EN-FR) tasks. The training cost is also provided for the stated tasks in terms of floating point operations (FLOPs).
Table 2. Comparison
of transformer with other models [1]
Modified model’s
results:
We achieved a BLEU score of 35.38 and a perplexity (PPL) of 5.238 for the modified version of the model on the EN-DE translation task, compared to a 26.4 BLEU score and 4.33 PPL for the original model. The better BLEU score of the modified version is probably due to the use of a learned positional embedding instead of a static one.
Working of the
attention mechanism:
Figure 5 below shows the translation of a sentence from German (src) to English (predicted trg) using the modified model. The true translation of the sentence (trg) is also given for comparison. The model translates the sentence almost perfectly.
Figure 5.
Translation of a sentence using transformer
To understand the attention mechanism in greater depth, the attention weights of each head were extracted from the model for the above translation. The attention matrices formed by the attention weights over the translation of each word (EN-DE) for the eight heads used in the model are given in Figure 6 (a lighter color means a higher value). It can be observed that the attention values are mostly higher along the diagonal, showing that the model attends to the corresponding word most of the time. However, there are some lightly shaded off-diagonal areas, representing the attention given to neighboring words when translating a particular word. Also, the attention given to the words differs across heads, showing that each head picks up importance from a different perspective during translation. For example, the true translation of the German word ‘ein’ is ‘a’; from the figure, while translating ‘ein’, some heads give attention to ‘a’ and others do not.
Figure 6. Attention
matrices formed by the heads for a translation
Conclusion:
The Transformer is the first sequence transduction model based entirely on attention, replacing the recurrent layers with multi-headed self-attention. For translation tasks, it can be trained significantly faster than recurrent or convolutional architectures. It outperforms all previously reported ensembles, achieving a new state-of-the-art result on the WMT 2014 English-German translation task. The model can be extended to problems involving other input/output modalities such as images, audio, and video, and local, restricted attention mechanisms can also be investigated.
References:
[1] Vaswani, Ashish, et al. “Attention is all you need.” Advances in Neural Information Processing Systems 30 (2017).
[2] https://github.com/bentrevett/pytorch-seq2seq/attention