Commit 6c7a5ad

KVCrush method for cache eviction
1 parent f55880d commit 6c7a5ad

14 files changed (+1085, -27 lines)

site/docs/concepts/optimization-techniques/kvcache-eviction-algorithm.md

Lines changed: 83 additions & 0 deletions
@@ -60,3 +60,86 @@ It can be enabled by setting the `CacheEvictionConfig.apply_rotation` field to `
* Cache rotation is only targeted for the regular, linear LLaMa-like RoPE application and may degrade accuracy on models that use other RoPE schemes.

* Cache rotation is currently only supported for the models with uniform V embedding sizes across the layers.

## (Optional) KVCrush

KVCrush extends the standard H2O/SnapKV eviction: instead of simply evicting the lowest-scoring blocks, it uses clustering analysis to select the most representative blocks from the evictable area and retains them in the cache.

### Algorithm Overview

1. **Indicator Creation**: Generate binary indicators for tokens based on their importance scores.
2. **Anchor Point Generation**: Create a reference pattern (the anchor point) using one of the configurable modes.
3. **Distance Calculation**: Measure the Hamming distance between each block's pattern and the anchor point.
4. **Representative Selection**: Select the blocks that best represent the diversity of the context.
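The four steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the `KVCrushAlgorithm` code added by this commit (its source is not shown here): the mean-score threshold in step 1, the use of the `MEAN` anchor in step 2, and the evenly spaced pick over distance-sorted blocks in step 4 are assumptions, and every function name is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical, simplified sketch of the four KVCrush steps; it is NOT the
// KVCrushAlgorithm implementation from this commit.
using Pattern = std::vector<int>;  // one 0/1 indicator per token of a block

// Step 1: binary indicators. Assumption: a token is marked 1 when its importance
// score is at least the mean score of all candidate tokens.
std::vector<Pattern> make_block_patterns(const std::vector<float>& token_scores, std::size_t block_size) {
    std::vector<Pattern> patterns;
    if (token_scores.empty() || block_size == 0) return patterns;
    const float mean = std::accumulate(token_scores.begin(), token_scores.end(), 0.0f) / token_scores.size();
    for (std::size_t start = 0; start + block_size <= token_scores.size(); start += block_size) {
        Pattern p(block_size);
        for (std::size_t i = 0; i < block_size; ++i) p[i] = token_scores[start + i] >= mean ? 1 : 0;
        patterns.push_back(p);
    }
    return patterns;
}

// Step 2: anchor point in MEAN mode: the majority value at each position (ties -> 1).
Pattern make_mean_anchor(const std::vector<Pattern>& patterns) {
    if (patterns.empty()) return {};
    Pattern anchor(patterns.front().size(), 0);
    for (std::size_t i = 0; i < anchor.size(); ++i) {
        std::size_t ones = 0;
        for (const auto& p : patterns) ones += static_cast<std::size_t>(p[i]);
        anchor[i] = 2 * ones >= patterns.size() ? 1 : 0;
    }
    return anchor;
}

// Step 3: Hamming distance = number of positions where the two patterns differ.
std::size_t hamming(const Pattern& a, const Pattern& b) {
    std::size_t d = 0;
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i) d += (a[i] != b[i]) ? 1 : 0;
    return d;
}

// Step 4: representative selection. Assumption: order blocks by distance to the
// anchor and take `budget` of them spread evenly from nearest to farthest, so
// that blocks with different patterns are kept rather than only similar ones.
std::vector<std::size_t> select_representatives(const std::vector<Pattern>& patterns, std::size_t budget) {
    const std::size_t n = patterns.size();
    const std::size_t m = std::min(budget, n);
    if (m == 0) return {};
    const Pattern anchor = make_mean_anchor(patterns);
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(), [&](std::size_t x, std::size_t y) {
        return hamming(patterns[x], anchor) < hamming(patterns[y], anchor);
    });
    std::vector<std::size_t> chosen;
    for (std::size_t k = 0; k < m; ++k)
        chosen.push_back(order[m == 1 ? 0 : k * (n - 1) / (m - 1)]);
    return chosen;
}
```

For example, with 4 blocks of 4 tokens whose indicator patterns are 1100, 1100, 0011 and 1010, the MEAN anchor is 1110, and a budget of 2 keeps blocks 0 and 2: the nearest block and the farthest block from the anchor.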

### Configuration
Set up the KVCrush parameters in a `KVCrushConfig` and pass it to `CacheEvictionConfig`. The following sample code allocates a KVCrush budget of 2 blocks and uses the `MEAN` anchor point mode:
```cpp
const ov::genai::CacheEvictionConfig EXAMPLE_CACHE_EVICTION_CONFIG =
    {32, 32, 192, ov::genai::AggregationMode::NORM_SUM, false, 8,
     ov::genai::KVCrushConfig(2, ov::genai::KVCrushAnchorPointMode::MEAN)};
```
```python
CacheEvictionConfig(
    start_size=32,
    recent_size=128,
    max_cache_size=448,
    aggregation_mode=AggregationMode.NORM_SUM,
    apply_rotation=False,
    snapkv_window_size=8,
    kvcrush_config=KVCrushConfig(budget=2, anchor_point_mode=KVCrushAnchorPointMode.MEAN)
)
```
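The sizes in the two samples can be checked by hand. The sketch below mirrors the `OPENVINO_ASSERT` checks that `CacheEvictionConfig` performs in this commit, plus the block-alignment check on `start_size` from `CacheEvictionAlgorithm`, and converts the KVCrush budget from blocks to tokens. The function names are hypothetical, and the block size of 32 tokens used below is an assumption for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical mirror of the size checks done by CacheEvictionConfig and
// CacheEvictionAlgorithm; returns the size of the evictable area in tokens.
std::size_t evictable_tokens(std::size_t start_size, std::size_t recent_size, std::size_t max_cache_size,
                             std::size_t block_size) {
    if (start_size == 0 || recent_size == 0 || max_cache_size == 0)
        throw std::invalid_argument("start_size, recent_size and max_cache_size must be non-zero");
    if (max_cache_size <= start_size + recent_size)
        throw std::invalid_argument("max_cache_size must be larger than start_size + recent_size");
    if (block_size == 0 || start_size % block_size != 0)
        throw std::invalid_argument("start_size must be a multiple of the block size");
    return max_cache_size - start_size - recent_size;
}

// The KVCrush budget is expressed in blocks; convert it to tokens.
std::size_t kvcrush_budget_tokens(std::size_t budget_blocks, std::size_t block_size) {
    return budget_blocks * block_size;
}
```

Under the assumed block size of 32, the C++ sample leaves a 128-token (4-block) evictable area and the Python sample a 288-token (9-block) area, while a KVCrush budget of 2 blocks corresponds to 64 tokens.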

**Anchor Point Modes:**
- `RANDOM`: Random binary pattern (reproducible through `KVCrushConfig::rng_seed`)
- `ZEROS`: All-zeros pattern
- `ONES`: All-ones pattern
- `MEAN`: Mean of the block indicators, taking the majority value at each position
- `ALTERNATE`: Alternating 0-1 pattern
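The anchor point modes listed above can be illustrated with a small helper. It uses a local `AnchorMode` enum with the same enumerators as `KVCrushAnchorPointMode`; the helper itself is hypothetical, and two details are assumptions: `ALTERNATE` starts with 0, and `MEAN` breaks ties towards 1.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Local stand-in with the same enumerators as ov::genai::KVCrushAnchorPointMode.
enum class AnchorMode { RANDOM, ZEROS, ONES, MEAN, ALTERNATE };

using Pattern = std::vector<int>;  // one 0/1 value per position

// Hypothetical helper (not part of OpenVINO GenAI): builds an anchor of `length`
// bits. `patterns` is only read in MEAN mode and `seed` only in RANDOM mode.
Pattern make_anchor(AnchorMode mode, std::size_t length, const std::vector<Pattern>& patterns = {},
                    std::size_t seed = 0) {
    Pattern anchor(length, 0);  // ZEROS needs no further work
    switch (mode) {
    case AnchorMode::ZEROS:
        break;
    case AnchorMode::ONES:
        anchor.assign(length, 1);
        break;
    case AnchorMode::ALTERNATE:  // assumption: the pattern starts with 0
        for (std::size_t i = 0; i < length; ++i) anchor[i] = static_cast<int>(i % 2);
        break;
    case AnchorMode::RANDOM: {  // same seed -> same anchor
        std::mt19937 rng(static_cast<std::mt19937::result_type>(seed));
        std::bernoulli_distribution coin(0.5);
        for (auto& bit : anchor) bit = coin(rng) ? 1 : 0;
        break;
    }
    case AnchorMode::MEAN:  // majority value per position across blocks (ties -> 1)
        if (patterns.empty()) break;
        for (std::size_t i = 0; i < length; ++i) {
            std::size_t ones = 0;
            for (const auto& p : patterns) ones += static_cast<std::size_t>(p.at(i));
            anchor[i] = 2 * ones >= patterns.size() ? 1 : 0;
        }
        break;
    }
    return anchor;
}
```

Only `RANDOM` and `MEAN` depend on anything besides the length: `RANDOM` on the seed, which keeps runs reproducible, and `MEAN` on the current block patterns, which makes the anchor track the context being compressed.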

### Performance Comparison on LongBench

**Note:** Values in **`this style`** are equal to or better than the corresponding value of the "512, 0" configuration (a 512-token budget with KVCrush disabled).

#### H2O
The following table compares the accuracy of standard H2O eviction with H2O combined with KVCrush.

Configuration format: H2O budget (tokens), KVCrush budget (tokens/block size; for example, 128/32 is 128 tokens, or 4 blocks of 32 tokens), anchor point mode. In every KVCrush configuration the two budgets add up to 512 tokens, the same total as "512, 0".

| Configuration | qasper | samsum | trec |
|---------------|--------|--------|------|
| **FP16 (baseline)** | 21.43 | 34.83 | 1.00 |
| **512, 0** | 12.40 | 34.39 | 0.50 |
| **384, 128/32, MEAN** | **`12.91`** | 34.15 | **`0.50`** |
| **384, 128/32, ALTERNATE** | **`12.55`** | **`34.39`** | **`0.50`** |
| **384, 128/32, RANDOM** | 12.25 | 34.16 | **`0.50`** |
| **480, 32/32, MEAN** | **`12.54`** | 33.79 | **`1.00`** |
| **480, 32/32, ALTERNATE** | **`12.49`** | **`34.59`** | **`1.00`** |
| **480, 32/32, RANDOM** | 12.37 | **`34.83`** | **`0.50`** |
| **448, 64/32, MEAN** | **`12.85`** | **`34.61`** | **`1.00`** |
| **448, 64/32, ALTERNATE** | **`12.61`** | **`34.41`** | **`1.00`** |
| **448, 64/32, RANDOM** | **`12.43`** | 34.38 | **`1.00`** |
| **KVCrush - Best** | **`12.91`** | **`34.83`** | **`1.00`** |
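As a quick consistency check on the H2O table, the sketch below encodes its nine KVCrush rows, checks that each configuration's two budgets add up to 512 tokens, and recomputes the "KVCrush - Best" row as the column-wise maximum. The scores are copied from the table; reading "128/32" as 128 tokens at a block size of 32 is an assumption, and the names are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// One KVCrush row of the H2O table. Assumption: "128/32" is read as a KVCrush
// budget of 128 tokens at a block size of 32 (i.e. 4 blocks).
struct H2ORow {
    std::size_t h2o_tokens;
    std::size_t kvcrush_tokens;
    std::array<double, 3> scores;  // qasper, samsum, trec
};

// Rows in table order (MEAN, ALTERNATE, RANDOM for each budget split).
const std::vector<H2ORow> kH2OKVCrushRows = {
    {384, 128, {12.91, 34.15, 0.50}}, {384, 128, {12.55, 34.39, 0.50}}, {384, 128, {12.25, 34.16, 0.50}},
    {480, 32, {12.54, 33.79, 1.00}},  {480, 32, {12.49, 34.59, 1.00}},  {480, 32, {12.37, 34.83, 0.50}},
    {448, 64, {12.85, 34.61, 1.00}},  {448, 64, {12.61, 34.41, 1.00}},  {448, 64, {12.43, 34.38, 1.00}},
};

// Total token budget of a configuration: eviction budget plus KVCrush budget.
std::size_t total_budget(const H2ORow& row) { return row.h2o_tokens + row.kvcrush_tokens; }

// Column-wise maximum over all KVCrush rows, i.e. the "KVCrush - Best" row.
std::array<double, 3> best_scores(const std::vector<H2ORow>& rows) {
    if (rows.empty()) return {};
    std::array<double, 3> best = rows.front().scores;
    for (const auto& row : rows)
        for (std::size_t c = 0; c < best.size(); ++c) best[c] = std::max(best[c], row.scores[c]);
    return best;
}
```

Note that the "Best" row combines the best value of each column independently, so no single configuration reaches all three values at once.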

#### SnapKV
The following table compares the accuracy of standard SnapKV eviction with SnapKV combined with KVCrush.

Configuration format: SnapKV budget (tokens), KVCrush budget (tokens/block size; for example, 128/32 is 128 tokens, or 4 blocks of 32 tokens), anchor point mode. In every KVCrush configuration the two budgets add up to 512 tokens, the same total as "512, 0".

| Configuration | qasper | samsum | trec |
|---------------|--------|--------|------|
| **FP16 (baseline)** | 21.43 | 34.83 | 0.50 |
| **512, 0** | 12.33 | 34.21 | 1.00 |
| **384, 128/32, MEAN** | **`12.78`** | **`34.32`** | **`1.00`** |
| **384, 128/32, ALTERNATE** | 11.87 | **`34.42`** | **`1.00`** |
| **384, 128/32, RANDOM** | **`12.66`** | 34.05 | 0.50 |
| **480, 32/32, MEAN** | **`12.97`** | 34.12 | 0.50 |
| **480, 32/32, ALTERNATE** | **`13.14`** | **`34.22`** | 0.50 |
| **480, 32/32, RANDOM** | **`13.01`** | **`34.40`** | 0.50 |
| **448, 64/32, MEAN** | **`12.83`** | **`34.69`** | 0.50 |
| **448, 64/32, ALTERNATE** | **`13.57`** | **`34.55`** | **`1.00`** |
| **448, 64/32, RANDOM** | **`13.38`** | **`34.26`** | **`1.00`** |
| **KVCrush - Best** | **`13.57`** | **`34.69`** | **`1.00`** |
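The highlighting rule from the note above can be written down directly. The sketch below applies it to the KVCrush rows of the SnapKV table (scores copied from the table, names hypothetical), assuming that higher is better for all three metrics.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Scores (qasper, samsum, trec) copied from the SnapKV table.
using Scores = std::array<double, 3>;

const Scores kSnapKVReference = {12.33, 34.21, 1.00};  // the "512, 0" row

// KVCrush rows in table order: 384/128, 480/32, 448/64 budgets, each with MEAN, ALTERNATE, RANDOM.
const std::vector<Scores> kSnapKVKVCrushRows = {
    {12.78, 34.32, 1.00}, {11.87, 34.42, 1.00}, {12.66, 34.05, 0.50},
    {12.97, 34.12, 0.50}, {13.14, 34.22, 0.50}, {13.01, 34.40, 0.50},
    {12.83, 34.69, 0.50}, {13.57, 34.55, 1.00}, {13.38, 34.26, 1.00},
};

// The table's highlighting rule: a cell is highlighted when it is equal to or
// better than the same column of the reference row.
std::array<bool, 3> highlighted(const Scores& row, const Scores& reference) {
    std::array<bool, 3> flags{};
    for (std::size_t c = 0; c < row.size(); ++c) flags[c] = row[c] >= reference[c];
    return flags;
}

// Number of rows that match or beat the reference, per column.
std::array<std::size_t, 3> count_highlighted(const std::vector<Scores>& rows, const Scores& reference) {
    std::array<std::size_t, 3> counts{};
    for (const auto& row : rows) {
        const auto flags = highlighted(row, reference);
        for (std::size_t c = 0; c < flags.size(); ++c) counts[c] += flags[c] ? 1 : 0;
    }
    return counts;
}
```

Applied to the SnapKV rows, the rule highlights 8 qasper, 7 samsum and 4 trec cells, which matches the highlighting in the table.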

src/cpp/include/openvino/genai/cache_eviction.hpp

Lines changed: 68 additions & 2 deletions
@@ -19,14 +19,76 @@ enum class AggregationMode {
     * of a given token in cache */
 };
 
+/**
+ * @brief Represents how anchor points are formed in the KVCrush cache eviction algorithm
+ */
+enum class KVCrushAnchorPointMode {
+    RANDOM,    /**< In this mode the anchor point is a random binary vector of 0s and 1s */
+    ZEROS,     /**< In this mode the anchor point is a vector of 0s */
+    ONES,      /**< In this mode the anchor point is a vector of 1s */
+    MEAN,      /**< In this mode the anchor point is a binary vector of 0s and 1s, where each value is the majority
+                    value at that position across the blocks */
+    ALTERNATE  /**< In this mode the anchor point is a vector of alternating 0s and 1s */
+};
+
+/**
+ * @class KVCrushConfig
+ * @brief Configuration class for the KVCrush cache eviction algorithm.
+ *
+ * This class encapsulates the configuration parameters for KVCrush,
+ * including the cache budget, anchor point mode, and random seed.
+ */
+class KVCrushConfig {
+public:
+    KVCrushConfig() = default;
+
+    /**
+     * @brief Constructs a KVCrushConfig with the specified parameters.
+     * @param budget_ The KVCrush cache budget, i.e. the number of blocks to retain.
+     * @param anchor_point_mode_ The anchor point mode for KVCrush (see KVCrushAnchorPointMode).
+     * @param rng_seed_ Optional random seed for reproducibility (default is 0).
+     */
+    KVCrushConfig(size_t budget_, KVCrushAnchorPointMode anchor_point_mode_, size_t rng_seed_ = 0)
+        : budget(budget_),
+          anchor_point_mode(anchor_point_mode_),
+          rng_seed(rng_seed_) {}
+
+    /** KVCrush cache budget, in blocks */
+    std::size_t budget = 0;
+    /** KVCrush anchor point mode */
+    KVCrushAnchorPointMode anchor_point_mode = KVCrushAnchorPointMode::RANDOM;
+    /** Random seed, for reproducibility */
+    size_t rng_seed = 0;
+
+    std::size_t get_budget() const {
+        return budget;
+    }
+};
+
 /**
  * @brief Configuration struct for the cache eviction algorithm.
  */
 class CacheEvictionConfig {
 public:
     CacheEvictionConfig() = default;
 
-    CacheEvictionConfig(size_t start_size, size_t recent_size, size_t max_cache_size, AggregationMode aggregation_mode_, bool apply_rotation_ = false, size_t snapkv_window_size_ = 8) : aggregation_mode(aggregation_mode_), apply_rotation(apply_rotation_), snapkv_window_size(snapkv_window_size_), m_start_size(start_size), m_recent_size(recent_size), m_max_cache_size(max_cache_size) {
+    CacheEvictionConfig(size_t start_size,
+                        size_t recent_size,
+                        size_t max_cache_size,
+                        AggregationMode aggregation_mode_,
+                        bool apply_rotation_ = false,
+                        size_t snapkv_window_size_ = 8,
+                        const KVCrushConfig& kvcrush_config_ = KVCrushConfig(0, KVCrushAnchorPointMode::RANDOM))
+        : aggregation_mode(aggregation_mode_),
+          apply_rotation(apply_rotation_),
+          snapkv_window_size(snapkv_window_size_),
+          kvcrush_config(kvcrush_config_),
+          m_start_size(start_size),
+          m_recent_size(recent_size),
+          m_max_cache_size(max_cache_size) {
         OPENVINO_ASSERT(start_size, "CacheEvictionConfig.start_size must be non-zero");
         OPENVINO_ASSERT(recent_size, "CacheEvictionConfig.recent_size must be non-zero");
         OPENVINO_ASSERT(max_cache_size, "CacheEvictionConfig.max_cache_size must be non-zero");
@@ -35,7 +97,6 @@ class CacheEvictionConfig {
         OPENVINO_ASSERT(max_cache_size > (start_size + recent_size),
                         "CacheEvictionConfig.max_cache_size must be larger than CacheEvictionConfig.start_size + CacheEvictionConfig.recent_size");
         m_evictable_size = m_max_cache_size - m_start_size - m_recent_size;
-
     }
 
     /** @return Number of tokens between the "start" and "recent" areas of KV cache that
@@ -76,6 +137,11 @@ class CacheEvictionConfig {
      * following the SnapKV article approach (https://arxiv.org/abs/2404.14469). **/
     size_t snapkv_window_size = 8;
 
+    /** KVCrush configuration for this cache eviction algorithm.
+     * KVCrush is an additional mechanism that allows retaining some tokens in the cache
+     * even if they are not among the most important ones. */
+    KVCrushConfig kvcrush_config;
+
 private:
     /** Number of tokens in the *beginning* of KV cache that should be retained
      * in the KV cache for this sequence during generation. Must be non-zero and a multiple of the KV cache block size for

src/cpp/src/continuous_batching/cache_eviction.cpp

Lines changed: 33 additions & 1 deletion
@@ -191,7 +191,7 @@ namespace ov::genai {
     CacheEvictionAlgorithm::CacheEvictionAlgorithm(const CacheEvictionConfig &eviction_config, size_t block_size,
                                                    size_t num_decoder_layers, size_t max_pool_window_size) :
             m_eviction_config(eviction_config), m_block_size(block_size), m_num_decoder_layers(num_decoder_layers),
-            m_score_manager(block_size, num_decoder_layers, max_pool_window_size, eviction_config.aggregation_mode, eviction_config.get_start_size() / block_size)
+            m_score_manager(block_size, num_decoder_layers, max_pool_window_size, eviction_config.aggregation_mode, eviction_config.get_start_size() / block_size), m_kvcrush_algo(eviction_config.kvcrush_config, block_size)
     {
         OPENVINO_ASSERT(!(m_eviction_config.get_start_size() % m_block_size),
                         "CacheEvictionConfig.start_size in tokens must be a multiple of block size ", m_block_size);
@@ -236,6 +236,38 @@ namespace ov::genai {
         size_t num_blocks_to_evict = get_num_blocks_to_evict(decoder_layer_idx);
         auto evicted_block_indices = get_indices_of_blocks_to_evict(scores_for_all_evictable_blocks, num_blocks_to_evict);
 
+        // KVCrush: start
+        bool should_apply_kvcrush = (m_eviction_config.kvcrush_config.budget > 0) &&
+                                    (evicted_block_indices.size() >= m_eviction_config.kvcrush_config.budget);
+        if (should_apply_kvcrush) {
+            size_t num_tokens_in_evictable_blocks = scores_for_all_evictable_blocks.size() * m_block_size;
+
+            auto kvcrush_retained_block_indices = m_kvcrush_algo.get_indices_of_blocks_to_retain_using_kvcrush(
+                num_tokens_in_evictable_blocks,
+                evicted_block_indices,
+                m_score_manager.get_scores()[decoder_layer_idx]);
+
+            // Remove the indices in kvcrush_retained_block_indices from evicted_block_indices
+            if (!kvcrush_retained_block_indices.empty()) {
+                // Put the retained indices into a hash set for O(1) membership checks
+                std::unordered_set<std::size_t> retained_set(kvcrush_retained_block_indices.begin(),
+                                                             kvcrush_retained_block_indices.end());
+
+                // Create a new vector containing only the indices not in retained_set
+                std::vector<std::size_t> filtered_evicted_indices;
+                filtered_evicted_indices.reserve(evicted_block_indices.size());
+
+                for (const auto& idx : evicted_block_indices) {
+                    if (retained_set.find(idx) == retained_set.end()) {
+                        filtered_evicted_indices.push_back(idx);
+                    }
+                }
+                // Replace the original vector with the filtered one
+                evicted_block_indices = std::move(filtered_evicted_indices);
+            }
+        }
+        // KVCrush: end
+
         m_num_evicted_tokens += evicted_block_indices.size() * m_block_size;
 
         // No longer need to track the overall "heavy-hitter" attention scores for freshly evicted blocks
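The KVCrush step in `CacheEvictionAlgorithm` boils down to a gating condition and a set difference. The sketch below restates both as free functions with hypothetical names and writes the filtering with the erase-remove idiom instead of a temporary vector; like the loop in the diff, it keeps the remaining evicted indices in their original order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Same condition as `should_apply_kvcrush` in the diff: KVCrush runs only when it
// has a non-zero budget and at least `budget` blocks are about to be evicted.
bool should_apply_kvcrush(std::size_t budget, std::size_t num_evicted_blocks) {
    return budget > 0 && num_evicted_blocks >= budget;
}

// Drops every evicted block index that KVCrush decided to retain,
// preserving the relative order of the remaining indices.
void remove_retained(std::vector<std::size_t>& evicted, const std::vector<std::size_t>& retained) {
    const std::unordered_set<std::size_t> retained_set(retained.begin(), retained.end());
    evicted.erase(std::remove_if(evicted.begin(), evicted.end(),
                                 [&](std::size_t idx) { return retained_set.count(idx) > 0; }),
                  evicted.end());
}
```

As in the original loop, a retained index that is not in the evicted list has no effect, and each retained block lowers the amount added to `m_num_evicted_tokens` by `m_block_size` tokens.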

src/cpp/src/continuous_batching/cache_eviction.hpp

Lines changed: 2 additions & 0 deletions
@@ -11,6 +11,7 @@
 #include "openvino/openvino.hpp"
 #include "continuous_batching/attention_output.hpp"
 #include "openvino/genai/cache_eviction.hpp"
+#include "continuous_batching/kvcrush.hpp"
 
 namespace ov::genai {
 
@@ -188,6 +189,7 @@ class CacheEvictionAlgorithm {
     void remove_scores_of_evicted_blocks(const std::vector<std::size_t>& evicted_block_indices, size_t decoder_layer_idx);
 
     CacheEvictionConfig m_eviction_config;
+    KVCrushAlgorithm m_kvcrush_algo;
     std::size_t m_block_size;
     std::size_t m_num_evicted_tokens = 0;
     std::size_t m_num_decoder_layers;
