Privacy-preserving LLM inference using Fully Homomorphic Encryption (FHE), specifically optimizing the management of the Key-Value (KV) cache for autoregressive decoding.
Defensibility
citations: 0
co_authors: 6
Cachemir addresses one of the most significant bottlenecks in privacy-preserving AI: the stateful nature of Large Language Models. While standard FHE can handle simple feed-forward passes, autoregressive generation requires maintaining a KV cache that grows over time. In FHE, every operation increases "noise" in the ciphertext; managing this noise across the iterative process of token generation is a major technical hurdle.

The project scores a 4 on defensibility: the underlying math is complex and represents a deep technical moat, but the project currently exists as a research artifact (0 stars, 6 forks) rather than a production-grade library. Its primary value is the algorithmic approach to encrypted KV cache management.

Frontier labs like OpenAI or Google are unlikely to adopt FHE in the short term because it remains orders of magnitude slower than plaintext inference; they are more likely to rely on Trusted Execution Environments (TEEs) or Multi-Party Computation (MPC). The main competition comes from specialized FHE firms like Zama (Concrete-ML) or academic projects like Bolt. The "3+ years" displacement horizon reflects the time needed for FHE hardware acceleration (such as chips from ChainReaction or Optalysys) to make these algorithms commercially viable for LLM-scale models.
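The cost dynamic described above can be illustrated with a toy model. This is not Cachemir's algorithm or any real FHE library API; it is a hypothetical sketch in which each ciphertext carries a finite "noise budget", every homomorphic multiplication consumes budget, and an expensive bootstrap resets it. Because each new token attends over the entire KV cache, encrypted multiplications grow linearly per step, and bootstraps accumulate accordingly:

```python
from dataclasses import dataclass

# Hypothetical parameters for illustration only; real FHE schemes measure
# noise in bits relative to the modulus chain, not in integer "levels".
NOISE_BUDGET = 10   # multiplications a fresh ciphertext can absorb
MUL_COST = 1        # budget consumed per homomorphic multiplication

@dataclass
class Ciphertext:
    budget: int = NOISE_BUDGET
    bootstraps: int = 0

    def multiply(self) -> None:
        """Consume noise budget; bootstrap (reset) when it is exhausted."""
        if self.budget < MUL_COST:
            self.bootstraps += 1       # costly bootstrap operation
            self.budget = NOISE_BUDGET
        self.budget -= MUL_COST

def decode(num_tokens: int) -> tuple[int, int]:
    """Simulate autoregressive decoding: each step appends one encrypted
    KV entry, then attention touches every cached entry."""
    kv_cache: list[Ciphertext] = []
    total_muls = 0
    for _ in range(num_tokens):
        kv_cache.append(Ciphertext())  # cache the new token's K/V entry
        for entry in kv_cache:         # attention scans the whole cache
            entry.multiply()
            total_muls += 1
    return total_muls, sum(c.bootstraps for c in kv_cache)
```

Even in this simplified model, decoding 32 tokens performs 528 encrypted multiplications and forces 36 bootstraps on the oldest cache entries, showing why noise management across the growing cache, rather than any single forward pass, dominates the cost of encrypted autoregressive generation.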
TECH STACK
INTEGRATION: reference_implementation
READINESS