The fate of the world (well, at least of generative AI)
rests on whether QK^T or Q^TK is the correct dot product.
Question to Bard:
AI, in the Google "Attention is all you need" paper, Attention(Q, K, V) = softmax(QK^T ... etc.).
I think they call QK^T a dot product. However, in their text, Goodfellow and Bengio define the dot product as Q^TK.
Which expression uses the proper transpose? Are both accurate? I think not.
You are absolutely right! In the
context of the "Attention is all you need" paper and similar
attention mechanisms, the proper way to calculate the dot product is QK^T, not
Q^TK.
Here's why:
• Q and K are matrices of queries and keys, respectively. Each row of these matrices is a single query or key vector.
• The dot product between two vectors q_i (the i-th row of Q) and k_j (the j-th row of K) measures the similarity between them.
• To compute the similarity between every query and every key, we need the dot product of every row of Q with every row of K. Transposing K turns its rows into columns, so the matrix multiplication QK^T computes exactly these pairwise products.
Therefore, QK^T accurately captures the pairwise dot
products between all queries and keys.
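A quick way to see this is to build small Q and K matrices and check an entry directly. Here is a minimal NumPy sketch; the sizes n_queries, n_keys, and d_k are illustrative, not taken from the paper.

```python
import numpy as np

n_queries, n_keys, d_k = 3, 4, 8      # illustrative sizes
Q = np.random.randn(n_queries, d_k)   # each row is a query vector
K = np.random.randn(n_keys, d_k)      # each row is a key vector

scores = Q @ K.T                      # shape (n_queries, n_keys)

# Entry (i, j) is exactly the dot product of query i with key j.
i, j = 1, 2
assert np.isclose(scores[i, j], np.dot(Q[i], K[j]))
print(scores.shape)                   # (3, 4)
```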
On the other hand, Q^TK would calculate the dot product of each column of Q with each column of K. This is not what the attention mechanism intends to do: it would not compare individual queries and keys, but would instead aggregate each feature dimension across all queries and all keys.
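To make the contrast concrete, here is the same kind of sketch for Q^TK (again with illustrative sizes; note the product only exists here if the number of queries equals the number of keys):

```python
import numpy as np

n, d_k = 3, 8                 # Q^TK requires equally many queries and keys
Q = np.random.randn(n, d_k)   # rows are queries
K = np.random.randn(n, d_k)   # rows are keys

wrong = Q.T @ K               # shape (d_k, d_k), not (n, n)

# Entry (a, b) sums feature dimension a over all queries against
# feature dimension b over all keys -- no query-i-versus-key-j comparison.
print(wrong.shape)            # (8, 8)
```

If the number of queries and keys differed, Q^TK would not even be a well-defined product in this row-wise layout, which is another hint that QK^T is the intended form.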
So, while both expressions involve a transpose, QK^T is the correct way to compute the dot products in the context of attention mechanisms. Goodfellow and Bengio define the dot product for individual vectors written as columns, so a single query-key dot product is q^T k in their notation; when the queries and keys are stacked as the rows of Q and K, as in the paper, the matrix of all those pairwise dot products is QK^T, not Q^TK.
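The two conventions are easy to reconcile in a few lines; the following sketch assumes NumPy and a single query-key pair:

```python
import numpy as np

d_k = 8
q = np.random.randn(d_k)   # a single query vector
k = np.random.randn(d_k)   # a single key vector

# Textbook (column-vector) convention: the dot product is q^T k.
textbook = q.reshape(-1, 1).T @ k.reshape(-1, 1)   # 1x1 matrix

# Paper (row-stacked) convention: the same scalar appears in QK^T.
Q = q.reshape(1, -1)   # one query stored as a row
K = k.reshape(1, -1)   # one key stored as a row
paper = Q @ K.T        # 1x1 matrix

assert np.isclose(textbook[0, 0], paper[0, 0])
```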
I hope this clarifies the confusion! Feel free to ask if
you have any further questions about attention mechanisms or the
"Attention is all you need" paper.