The 12 approaches I tested before finding one that works

Source: DEV Community
I keep seeing ML papers that only show the final method. No dead ends, no "we tried X and it was a disaster." Just polished results on a polished pipeline. This is the opposite of that. Here is a complete record of every approach I tested while building NexusQuant, a KV cache compressor for LLMs, including the ones that failed spectacularly.

The problem I was trying to solve

The KV cache is what limits LLM context windows. A 70B model on a single A100 can handle roughly 128K tokens before running out of memory. Every approach I tested was measuring one thing: can I cut that memory footprint without hurting model quality? The metric is perplexity delta (lower is better; negative means quality improved). The target was less than 1% degradation at a meaningful compression ratio.

Attempt 1: PCA rotation before quantization

Reasoning: if I rotate the KV vectors into a principal-component basis, the quantization grid should align better with the actual data distribution.

Result: 3x worse perplexity than baseline.
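The memory ceiling quoted in the problem statement is easy to sanity-check. A quick sketch, assuming a Llama-70B-style GQA layout (80 layers, 8 KV heads, head_dim 128, fp16 cache) — none of these numbers come from the post itself:

```python
# Back-of-the-envelope KV cache size for a 70B-class model.
# ASSUMED config (Llama-70B-style GQA), not taken from the post:
N_LAYERS, N_KV_HEADS, HEAD_DIM, BYTES = 80, 8, 128, 2  # 2 bytes = fp16

# Keys and values are both cached, hence the leading factor of 2.
per_token_bytes = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES
total_gib = per_token_bytes * 128 * 1024 / 2**30  # cache at 128K tokens

print(f"{per_token_bytes / 1024:.0f} KiB/token, "
      f"{total_gib:.0f} GiB at 128K tokens")
# → 320 KiB/token, 40 GiB at 128K tokens
```

With 4-bit weights (roughly 35 GiB for 70B parameters), 40 GiB of cache plus weights and activations roughly fills an 80 GiB A100, which is consistent with the ~128K-token ceiling above.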
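The mechanics of Attempt 1 can be sketched in a few lines: rotate into the PCA basis, quantize, then rotate back. This is a minimal illustration on synthetic vectors, not the actual NexusQuant code — the dimensions, sample count, and per-dimension int4 min-max quantizer are my assumptions, and it measures reconstruction error rather than perplexity:

```python
import numpy as np

def quantize_dequantize(x, bits=4):
    # Per-dimension uniform (min-max) quantization, then dequantize.
    lo, hi = x.min(axis=0), x.max(axis=0)
    scale = (hi - lo) / (2**bits - 1)
    scale = np.where(scale == 0, 1.0, scale)  # avoid div-by-zero on flat dims
    return np.round((x - lo) / scale) * scale + lo

rng = np.random.default_rng(0)
# Synthetic stand-in for KV vectors, with correlated dimensions
# (head_dim=64 and 4096 samples are arbitrary choices).
mix = rng.normal(size=(64, 64))
kv = rng.normal(size=(4096, 64)) @ mix

# PCA rotation: eigenbasis of the covariance matrix.
cov = np.cov(kv, rowvar=False)
_, R = np.linalg.eigh(cov)            # columns of R are principal axes

recon = quantize_dequantize(kv @ R) @ R.T   # rotate, quantize, rotate back
baseline = quantize_dequantize(kv)          # quantize directly, no rotation

mse_rot = float(np.mean((recon - kv) ** 2))
mse_base = float(np.mean((baseline - kv) ** 2))
print(f"rotated MSE {mse_rot:.4f} vs direct MSE {mse_base:.4f}")
```

Reconstruction MSE in a toy setting says nothing definitive about downstream perplexity, which is exactly the gap the 3x result above exposed.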