Conversation

@joshua-j-hong (Contributor) commented May 7, 2025

Adds per-layer sliding window functionality to the KV cache. Correctness is mostly achieved, but there are some cases where individual tokens still come out wrong. The corresponding MLC-LLM PR is mlc-ai/mlc-llm#3248

A full list of changes and additions is below:

  • Add a new attention type for per-layer sliding window, called `MHA_SLIDING`
  • Add corresponding vectors for per-layer sliding window offset calculations (see the sketch after this list)
  • For KV caches with per-layer sliding window attention enabled, the regular (global) sliding window is disabled to prevent page eviction
  • Gemma3 uses different RoPE parameters for its local sliding window layers. These should be passed as parameters to the KVCache, but are currently hardcoded
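
For illustration, here is a minimal C++ sketch of the per-layer attention kind and window-offset idea. Only `MHA_SLIDING` is a name from this PR; the enum, function, and vector names below are hypothetical simplifications under assumed semantics, not the actual TVM runtime code.

```cpp
// Sketch only: a per-layer attention kind plus the sliding-window
// start offset each layer attends from. Names other than MHA_SLIDING
// are hypothetical.
#include <algorithm>
#include <cstdint>
#include <vector>

// Attention kind per layer; kMHASliding marks layers that use a
// local (sliding window) attention pattern.
enum class AttnKind : int {
  kMHA = 0,
  kMLA = 1,
  kMHASliding = 2,  // per-layer sliding window (MHA_SLIDING)
};

// For a layer with window size `window`, only the last `window` tokens
// are attended to, so attention starts `window` tokens from the end.
int64_t SlidingWindowStart(int64_t seq_len, int64_t window) {
  return std::max<int64_t>(0, seq_len - window);
}

// Per-layer start offsets: global-attention layers attend from
// position 0, sliding-window layers only from their window start.
std::vector<int64_t> ComputeAttnStarts(const std::vector<AttnKind>& kinds,
                                       const std::vector<int64_t>& windows,
                                       int64_t seq_len) {
  std::vector<int64_t> starts(kinds.size(), 0);
  for (size_t i = 0; i < kinds.size(); ++i) {
    if (kinds[i] == AttnKind::kMHASliding) {
      starts[i] = SlidingWindowStart(seq_len, windows[i]);
    }
  }
  return starts;
}
```

Keeping the kinds and window sizes in parallel per-layer vectors lets global-attention and sliding-window layers share one KV cache while each layer masks its own attendable range.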

@joshua-j-hong joshua-j-hong changed the title KV Cache Per Layer Sliding Window [KVCache] Per Layer Sliding Window May 7, 2025
@joshua-j-hong joshua-j-hong force-pushed the jjhong_KV_alt_sliding_window branch from 3fb27bb to 936d500 on May 8, 2025 03:42
@joshua-j-hong (Contributor, Author) commented May 8, 2025

Further testing and investigation shows an additional MLC-LLM/TVM bug related to excessive prefilling (present even without the per-layer sliding window changes outlined here) that may be causing the inference slowdown.

@MasterJH5574 (Contributor) left a comment

LGTM, thank you @joshua-j-hong!

@MasterJH5574 (Contributor) commented Jun 11, 2025

Just saw some conflicts with upstream; we likely need to rebase. The related upstream changes are the recent FFI refactor and the namespace rename from `relax_vm` to `vm`. I'll check the PR again after it's updated.

@joshua-j-hong (Contributor, Author) commented

Conflicts and tests are fixed. The current plan is a future change that adds optional parameters to the KVCache, some of which will cover the per-layer sliding window. This will remove the need for hardcoded values and keep the KVCache backwards compatible.
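
As a hedged illustration of that plan, here is a sketch of what such optional KVCache parameters might look like. All names here are hypothetical, not from the PR; the point is only that the defaults reproduce the old behavior, so existing callers keep working unchanged.

```cpp
// Sketch only: hypothetical optional KV cache parameters. Defaults
// preserve the old behavior, so code that predates per-layer sliding
// window support needs no changes. Requires C++17 for std::optional.
#include <cstdint>
#include <optional>
#include <vector>

struct KVCacheConfig {
  int64_t num_layers = 0;
  // std::nullopt means every layer uses full (global) attention,
  // matching the behavior before per-layer sliding window existed.
  std::optional<std::vector<int64_t>> per_layer_window_size;
  // Optional RoPE theta override for local sliding-window layers
  // (Gemma3 uses a different value there); std::nullopt falls back
  // to the global RoPE parameters instead of a hardcoded constant.
  std::optional<double> local_rope_theta;
};
```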

@MasterJH5574 merged commit 23bcbc5 into apache:main on Jun 23, 2025 (10 checks passed)
ShiboXing pushed a commit to ShiboXing/tvm that referenced this pull request on Aug 10, 2025