[KVCache] Per Layer Sliding Window #17928
Conversation
Updated main from 3fb27bb to 936d500
With some further testing and investigation, there is an additional MLC-LLM/TVM bug related to excessive prefilling (even without the per-layer sliding window changes outlined here) that may be causing inference slowdown.
LGTM, thank you @joshua-j-hong!
I just see some conflicts with upstream; we likely need to do a rebase. Related changes are the recent FFI refactor and a namespace rename from …
Conflicts and tests are fixed. The current plan is a follow-up change that will add optional parameters to the KVCache, some of which will be for the per-layer sliding window. This will ensure that no values need to be hardcoded and that the KVCache remains backwards compatible.
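To make the backwards-compatibility point concrete, here is a minimal sketch of what such optional parameters could look like. This is not the actual KVCache interface; `KVCacheConfig`, its field names, and the example values are all hypothetical and only illustrate the idea of opt-in, non-hardcoded settings.

```python
# Hypothetical sketch only -- not the real TVM KVCache API. It illustrates
# per-layer sliding-window settings (and a separate RoPE theta for local
# layers, as Gemma3 needs) passed as *optional* arguments, so existing
# callers keep working and nothing has to be hardcoded.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class KVCacheConfig:
    num_layers: int
    page_size: int = 16
    # Optional per-layer sliding-window sizes; None keeps today's behavior
    # (all layers global), preserving backwards compatibility.
    per_layer_window_size: Optional[List[Optional[int]]] = None
    # Optional RoPE theta for local sliding-window layers (Gemma3 uses a
    # different value for these than for global layers).
    local_rope_theta: Optional[float] = None

    def window_size(self, layer: int) -> Optional[int]:
        """Window size for a given layer, or None for a global layer."""
        if self.per_layer_window_size is None:
            return None
        return self.per_layer_window_size[layer]

# Old call sites stay valid:
cfg = KVCacheConfig(num_layers=32)

# New call sites can opt in to per-layer sliding windows:
cfg_gemma = KVCacheConfig(
    num_layers=32,
    per_layer_window_size=[1024] * 31 + [None],
    local_rope_theta=10_000.0,
)
```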
Adds per-layer sliding window functionality to the KV cache. Correctness is mostly achieved, but there are some cases where single tokens are strange. The corresponding MLC-LLM PR is mlc-ai/mlc-llm#3248.

A full list of changes and additions is below:

- Add a new attention type for the per-layer sliding window, called `MHA_SLIDING`
- Add corresponding vectors for per-layer sliding window offset calculations (a minimal sketch of the windowing logic follows this list)
- For a KV cache with sliding window attention enabled, the regular sliding window is disabled to prevent page eviction
- Gemma3 has different RoPE parameters for local sliding window layers. These should be passed as parameters to the KVCache, but currently the values are hardcoded
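For readers unfamiliar with per-layer sliding windows, the sketch below shows the basic offset/window computation the list above refers to: sliding-window layers only attend to the last `window_size` positions, while global layers attend to the full causal prefix. The function name, the layer pattern, and the window size are illustrative assumptions, not the actual kernel code in this PR.

```python
# Illustrative sketch of per-layer sliding-window attention ranges.
# All names and sizes here are assumptions for explanation only.
from typing import List, Optional

def attendable_range(q_pos: int, window_size: Optional[int]) -> range:
    """KV positions that query position `q_pos` may attend to.

    `window_size=None` means a global (full causal) layer; an integer means
    a sliding-window layer that only sees the last `window_size` positions.
    """
    if window_size is None:
        return range(0, q_pos + 1)           # full causal attention
    start = max(0, q_pos - window_size + 1)  # per-layer sliding window offset
    return range(start, q_pos + 1)

# Example: Gemma3-style interleaving of local and global layers
# (window size and pattern are illustrative only).
layer_window_sizes: List[Optional[int]] = [1024, 1024, 1024, 1024, 1024, None] * 4

for layer, w in enumerate(layer_window_sizes[:6]):
    r = attendable_range(q_pos=4096, window_size=w)
    print(f"layer {layer}: attends to positions [{r.start}, {r.stop})")
```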