Why attention-aware eviction beats random eviction (with data)

Source: DEV Community
At high eviction rates, choosing which tokens to drop matters enormously. Here is what the numbers show.

The experiment

We ran KV cache eviction at two rates on Llama-3-8B, measuring perplexity degradation (lower is better) versus a full-cache baseline:

| Eviction rate | Importance-based | Random | Advantage |
|---|---|---|---|
| 70% | +2.59% PPL | +3.86% PPL | 1.27 pp |
| 80% | +3.61% PPL | +5.13% PPL | 1.52 pp |

The gap grows as you evict more. At 70% eviction the importance scorer saves you 1.27 percentage points of perplexity. Push to 80% and it saves 1.52 pp. This is not a coincidence.

Why it happens

Random eviction is memoryless: it has the same probability of dropping the single token that unlocks subject-verb agreement across 400 tokens as it does of dropping a filler word. The attention-aware scorer assigns each token an importance weight based on how much accumulated attention mass it has received across all heads. Tokens that many heads consistently attend to survive; tokens that nobody looks at get evicted first. At low