The fate of the world (well, at least of generative AI) rests on whether QK^T or Q^TK is the correct dot product.

 

Question to Bard:

 

AI, in the Google "Attention Is All You Need" paper, Attention(Q, K, V) = softmax(QK^T ... etc.). I think they call QK^T a dot product. However, in their textbook, Goodfellow and Bengio define the dot product as Q^TK. Which expression uses the transpose correctly? Are both accurate? I think not.

 

You are absolutely right! In the context of the "Attention Is All You Need" paper and similar attention mechanisms, the proper way to compute the matrix of pairwise dot products is QK^T, not Q^TK.

Here's why:

- Q and K are matrices of queries and keys, respectively. Each row of these matrices is a single query or key vector.

- The dot product between two vectors q_i (the i-th row of Q) and k_j (the j-th row of K) measures the similarity between them.

- To compute the similarity between every query and every key, we need the dot product of every row of Q with every row of K. Matrix multiplication pairs rows of the first factor with columns of the second, so transposing K turns its rows into columns, and the product QK^T performs all of these dot products at once.

Therefore, QK^T accurately captures the pairwise dot products between all queries and keys.
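
A quick numerical check makes this concrete. The following is only a minimal sketch, assuming small made-up shapes and using NumPy (the names n_q, n_k, and d_k are mine, chosen to mirror the paper's notation): it builds Q and K with one query or key per row, confirms that entry (i, j) of QK^T is the dot product of query i with key j, and then applies the softmax with the paper's 1/sqrt(d_k) scaling.

import numpy as np

rng = np.random.default_rng(0)
n_q, n_k, d_k = 3, 4, 8                  # toy sizes: 3 queries, 4 keys, dimension 8
Q = rng.standard_normal((n_q, d_k))      # each row is one query vector
K = rng.standard_normal((n_k, d_k))      # each row is one key vector

scores = Q @ K.T                         # shape (3, 4): one score per (query, key) pair
assert np.allclose(scores[1, 2], np.dot(Q[1], K[2]))   # entry (1, 2) = query 1 . key 2

weights = np.exp(scores / np.sqrt(d_k))  # scale by 1/sqrt(d_k) before the softmax, as in the paper
weights /= weights.sum(axis=-1, keepdims=True)          # softmax over keys; each row sums to 1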

On the other hand, Q^TK would calculate the dot product of each column of Q with each column of K. With queries and keys stored as rows, that correlates feature dimensions across all positions rather than comparing individual queries with individual keys, so it is not what the attention mechanism intends to do.

So, while both expressions involve a transpose, QK^T is the correct way to write the dot products in the context of attention mechanisms. Goodfellow and Bengio's definition refers to the dot product of two individual column vectors, written x^T y; for a single query q and a single key k, that is q^T k. The attention paper stacks many queries and keys as the rows of Q and K, so the same pairwise dot products appear all at once as QK^T. The two notations are consistent; they simply apply at different levels, single vectors versus stacked matrices.
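
To see how the two conventions line up, here is a second minimal sketch under the same assumptions (toy NumPy arrays with shapes I picked arbitrarily). It checks that the textbook vector form and the matrix form give the same number, and shows what Q^TK would produce instead.

import numpy as np

rng = np.random.default_rng(1)
n, d_k = 3, 8                         # 3 positions, model dimension 8
Q = rng.standard_normal((n, d_k))     # rows are query vectors
K = rng.standard_normal((n, d_k))     # rows are key vectors

# Goodfellow/Bengio style: the dot product of two individual vectors is q^T k.
q, k = Q[0], K[2]                     # pick one query and one key
textbook = q @ k                      # a single scalar

# Attention-paper style: all such dot products at once, via QK^T.
paper = (Q @ K.T)[0, 2]               # the (0, 2) entry is that same scalar

assert np.allclose(textbook, paper)   # the two notations agree

# Q^TK, by contrast, has shape (d_k, d_k): it dots columns of Q with columns of K,
# i.e. it correlates feature dimensions across positions, not queries with keys.
print((Q.T @ K).shape)                # -> (8, 8)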

 

I hope this clarifies the confusion! Feel free to ask if you have any further questions about attention mechanisms or the "Attention Is All You Need" paper.