[CPU] GQA supports head_sink input for smooth softmax (microsoft#25269)
### Description
This is an extension of the [Smooth Softmax](microsoft#21867) feature.
The difference is that each head has a learnable smooth factor that is added to the denominator of the softmax. The smooth factor acts like an extra element participating in the softmax.
The smooth factor is used in the softmax as follows:
```math
\mathrm{softmax}_{i} = \frac{\exp(x_{i})}{\exp(s) + \sum_{j} \exp(x_{j})}
```
The `head_sink` input is a float tensor whose length equals the number of attention heads. For the h-th head, `head_sink[h]` is used as the smooth factor s. When `head_sink` is not provided, the constant 0 is used as the smooth factor s.
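For clarity, here is a minimal NumPy sketch of the formula above. It is not the MLAS kernel; the function name `smooth_softmax` and the assumed tensor shapes are illustrative assumptions only.

```python
import numpy as np

def smooth_softmax(scores, head_sink=None):
    """Reference smooth softmax over the last axis.

    scores:    attention scores, shape (num_heads, seq_len, total_seq_len) (assumed layout)
    head_sink: optional per-head smooth factor s, shape (num_heads,);
               when None, s = 0 is used for every head.
    """
    num_heads = scores.shape[0]
    if head_sink is None:
        head_sink = np.zeros(num_heads, dtype=scores.dtype)

    # Subtract the per-row max (including the sink) for numerical stability.
    s = head_sink.reshape(num_heads, 1, 1)
    row_max = np.maximum(scores.max(axis=-1, keepdims=True), s)
    exp_scores = np.exp(scores - row_max)
    exp_sink = np.exp(s - row_max)

    # The sink term only enlarges the denominator; it contributes no output weight.
    return exp_scores / (exp_sink + exp_scores.sum(axis=-1, keepdims=True))
```

With `head_sink=None` this reduces to a slightly dampened standard softmax (denominator gains `exp(0 - row_max)`), matching the smooth factor s = 0 case described above.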
Changes:
- [x] Update operator spec to add an optional new input `head_sink`
- [x] Implement CPU (MLAS) kernel.
- [x] Update test_gqa_cpu.py to test it.
CUDA kernel will be updated later in a separate PR.