Attention Implementation

It is convenient to let rows index the input (sequence/token) dimension and columns index the channel (feature) dimension.

Cross Attention

Take UNINEXT as an example: we have features extracted from the visual image and from the prompt, denoted $F_v$ and $F_p$.

$F_v$ has shape $n \times m$ ($n$ visual tokens, $m$ channels).
$F_p$ has shape $L \times d$ ($L$ prompt tokens, $d$ channels).
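To make the shape bookkeeping concrete, here is a small NumPy sketch under the row/column convention above. The toy sizes, the shared attention width `d_k`, and the projection names `W_q`, `W_k`, `W_v` are hypothetical; random matrices stand in for learned weights. Because rows are tokens, every linear layer is a right-multiplication that acts on the channel axis.

```python
import numpy as np

# Hypothetical toy sizes (not taken from UNINEXT).
n, m = 6, 8    # visual tokens, visual channels
L, d = 4, 5    # prompt tokens, prompt channels
d_k = 16       # assumed shared width for queries and keys

rng = np.random.default_rng(0)
F_v = rng.standard_normal((n, m))  # rows = visual tokens, columns = channels
F_p = rng.standard_normal((L, d))  # rows = prompt tokens, columns = channels

# Random stand-ins for learned projections; each maps channels -> d_k.
W_q = rng.standard_normal((m, d_k))  # queries come from the visual features
W_k = rng.standard_normal((d, d_k))  # keys come from the prompt features
W_v = rng.standard_normal((d, d_k))  # values come from the prompt features

Q = F_v @ W_q  # (n, d_k)
K = F_p @ W_k  # (L, d_k)
V = F_p @ W_v  # (L, d_k)

scores = Q @ K.T  # (n, L): one row of scores per visual token
print(Q.shape, K.shape, V.shape, scores.shape)  # (6, 16) (4, 16) (4, 16) (6, 4)
```

The score matrix has one row per visual token and one column per prompt token, so the softmax in cross attention is taken along each row.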

We want to use cross attention to update $F_v$ with $F_p$ via a residual connection: $$ F_v' = F_v + \mathrm{CrossAttention}(F_v, F_p) $$ Here $F_v$ serves as the query, and $F_p$ serves as both the key and the value.

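The attention itself is computed as $$ \mathrm{CrossAttention}(F_v, F_p) = \mathrm{softmax}\left(\frac{Q_{F_v} K_{F_p}^T}{\sqrt{d_k}}\right) V_{F_p} $$ where $Q_{F_v} = F_v W_Q$, $K_{F_p} = F_p W_K$ and $V_{F_p} = F_p W_V$ are learned linear projections, and $d_k$ is the width of the queries and keys. The output is an $n \times d_{out}$ matrix, where $n$ is the length of the query sequence and $d_{out}$ is the output dimension. If $d_{out}$ equals $m$, the channel dimension of $F_v$, the output can be added to $F_v$ directly; otherwise a linear layer first projects it to $m$ channels.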
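A minimal single-head NumPy sketch of the cross-attention formula and the residual update follows. It assumes one attention head with no normalization or dropout, which is a simplification and not necessarily how UNINEXT implements its fusion. The function name `cross_attention`, the weights `W_q`, `W_k`, `W_v`, `W_o`, and the toy sizes are hypothetical, and random matrices stand in for learned weights. Since $d_{out} \neq m$ in this example, the optional output projection `W_o` maps the result back to $m$ channels before the residual addition.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max along the axis for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(X_q, X_kv, W_q, W_k, W_v, W_o=None):
    """Single-head cross attention: X_q (n, m) attends to X_kv (L, d)."""
    Q = X_q @ W_q    # (n, d_k)
    K = X_kv @ W_k   # (L, d_k)
    V = X_kv @ W_v   # (L, d_out)
    d_k = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)  # (n, L), each row sums to 1
    out = A @ V      # (n, d_out)
    if W_o is not None:
        out = out @ W_o  # (n, m): project back to the query's channel width
    return out

rng = np.random.default_rng(0)
n, m, L, d, d_k, d_out = 6, 8, 4, 5, 16, 12  # hypothetical toy sizes
F_v = rng.standard_normal((n, m))
F_p = rng.standard_normal((L, d))
W_q = rng.standard_normal((m, d_k)) / np.sqrt(m)
W_k = rng.standard_normal((d, d_k)) / np.sqrt(d)
W_v = rng.standard_normal((d, d_out)) / np.sqrt(d)
W_o = rng.standard_normal((d_out, m)) / np.sqrt(d_out)  # needed because d_out != m

F_v_new = F_v + cross_attention(F_v, F_p, W_q, W_k, W_v, W_o)
print(F_v_new.shape)  # (6, 8)
```

As a quick sanity check of the residual path, setting `W_v` to zeros makes the attention output zero, so the update reduces to $F_v' = F_v$.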

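Bi-directional attention additionally applies cross attention in the reverse direction, so the prompt features are also updated by the visual features: $$ F_v' = F_v + \mathrm{CrossAttention}(F_v, F_p) $$ $$ F_p' = F_p + \mathrm{CrossAttention}(F_p, F_v) $$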
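The bi-directional update can be sketched the same way, again assuming single-head attention, hypothetical names and toy sizes, and random weights in place of learned ones. Each direction gets its own projections, and here the value projection maps directly to the query side's channel width, so the residual addition works without a separate output layer. Following the two formulas, both updates read the original $F_v$ and $F_p$.

```python
import numpy as np

def cross_attention(X_q, X_kv, W_q, W_k, W_v):
    """Single-head cross attention; W_v maps X_kv's channels to X_q's width."""
    Q, K, V = X_q @ W_q, X_kv @ W_k, X_kv @ W_v
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    s = s - s.max(axis=-1, keepdims=True)  # numerical stability
    A = np.exp(s)
    A = A / A.sum(axis=-1, keepdims=True)  # row-wise softmax
    return A @ V

def init(shape, rng):
    # Random stand-in for a learned weight matrix (hypothetical helper).
    return rng.standard_normal(shape) / np.sqrt(shape[0])

rng = np.random.default_rng(0)
n, m, L, d, d_k = 6, 8, 4, 5, 16  # hypothetical toy sizes
F_v = rng.standard_normal((n, m))
F_p = rng.standard_normal((L, d))

# Separate (hypothetical) weights for each direction.
v2p = dict(W_q=init((m, d_k), rng), W_k=init((d, d_k), rng), W_v=init((d, m), rng))
p2v = dict(W_q=init((d, d_k), rng), W_k=init((m, d_k), rng), W_v=init((m, d), rng))

# Both updates use the original features, matching the two formulas.
F_v_new = F_v + cross_attention(F_v, F_p, **v2p)  # visual queries, prompt keys/values
F_p_new = F_p + cross_attention(F_p, F_v, **p2v)  # prompt queries, visual keys/values
print(F_v_new.shape, F_p_new.shape)  # (6, 8) (4, 5)
```

Computing both directions from the original features keeps the two updates independent, so they could also run in parallel.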
