Your GPUs Are Working Hard. They’re Just Not Doing Useful Work.
Here’s a number worth sitting with: modern AI inference systems commonly show 100% GPU utilization while wasting 60% of their compute cycles recalculating context they already processed. The GPUs aren’t idle. They’re redlining in neutral — burning power, burning budget, and burning user patience to redo work the system shouldn’t have discarded in the first place.
At AI Infrastructure Field Day 5, Solidigm and MinIO presented a precise diagnosis of this problem and a production-ready architectural response. Their collaboration reframes high-density NVMe storage not as a data repository but as distributed shared context memory — a new infrastructure tier designed to eliminate the recomputation tax that is quietly making AI inference economically indefensible at scale.
The Recomputation Tax: What It Costs and Why It Compounds
A standard NVIDIA H100 ships with 80GB of High Bandwidth Memory (HBM). A meaningful portion of that capacity goes to the model weights, leaving limited room for the Key-Value (KV) cache: the in-session memory that stores conversation context while a user interacts with an AI application. When concurrent requests scale up and HBM fills, the system evicts existing context to make room for incoming sessions. When a user whose context was evicted returns, the GPU must reprocess the entire prompt history from scratch before generating a single new token. That reprocessing phase is called prefill, and the redundant repetition of it is what the Solidigm/MinIO team calls the recomputation tax.
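To see why HBM fills so quickly, a back-of-the-envelope sketch helps. The model dimensions below (32 layers, 8 KV heads, head dimension 128, 16-bit values, 8 billion parameters) and the 32K-token context are illustrative assumptions, not figures from the presentation, and the calculation ignores activations and runtime overhead.

```python
GB = 10**9  # decimal gigabytes, matching how HBM capacity is quoted


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_value: int) -> int:
    """KV cache bytes stored for a single token of context."""
    # Each layer keeps one key vector and one value vector per KV head.
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def sessions_that_fit(hbm_bytes: int, weight_bytes: int, per_token: int, context_tokens: int) -> int:
    """Whole sessions whose full context fits in the HBM left after weights."""
    free_bytes = hbm_bytes - weight_bytes
    return free_bytes // (per_token * context_tokens)


# Hypothetical model: every number here is an assumption for illustration.
PER_TOKEN = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, bytes_per_value=2)
WEIGHT_BYTES = 8 * 10**9 * 2  # 8B parameters at 2 bytes each
SESSIONS = sessions_that_fit(80 * GB, WEIGHT_BYTES, PER_TOKEN, context_tokens=32_768)

if __name__ == "__main__":
    print(PER_TOKEN, SESSIONS)
```

Under these assumptions each token costs 128 KiB of cache and a full 32K-token session costs 4 GiB, so only 14 such sessions fit alongside the weights on one 80GB GPU. Larger models and longer contexts shrink that number further, which is why eviction starts early under concurrent load.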
The cost compounds along every dimension. Time to First Token (TTFT), the latency users actually feel, can climb from seconds to over a minute as context lengths grow. GPU cycles that should generate new output are instead spent rebuilding context the system already computed. And in a public cloud environment running H200 nodes, the all-in cost of that inefficiency runs to roughly $2 million per node per year. That’s not a rounding error. It’s a structural flaw in how most AI inference infrastructure was designed, and it exists because that infrastructure was designed before multi-turn agentic workloads became the dominant use case.
G3.5: The Memory Tier That Shouldn’t Exist — But Has To
The Solidigm/MinIO solution introduces what they call the G3.5 tier: a layer positioned between local HBM and traditional long-term object storage, purpose-built to handle the petabyte-scale memory requirements of modern AI superpods. A superpod running thousands of concurrent GPU sessions may require 16 to 20 petabytes of shared memory to maintain context coherence across users — a scale that DRAM cannot reach physically or financially.
The software layer at the center of this architecture is MemKV (MKV), a distributed shared context memory system that communicates directly with NVMe drives. It uses io_uring for asynchronous I/O submission and O_DIRECT to bypass the kernel page cache, avoiding file system overhead on the data path. MKV doesn’t behave like traditional enterprise storage. It behaves like a memory manager, because that’s precisely the problem it was designed to solve.
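For readers who have not used O_DIRECT, the sketch below shows the alignment discipline it imposes. It is not MKV's code, and MKV's internals are not described in the source. The block size, the helper names, and the fallback to buffered I/O are all assumptions made for illustration. Python's standard library has no io_uring binding, so this covers only the O_DIRECT half, on Linux, using ordinary blocking calls.

```python
import mmap
import os
import tempfile

BLOCK = 4096  # assumed block size; O_DIRECT needs aligned buffers, lengths, and offsets


def open_maybe_direct(path: str, flags: int) -> tuple[int, bool]:
    """Open `path` with O_DIRECT if the platform and filesystem allow it."""
    direct = getattr(os, "O_DIRECT", 0)  # not defined on every platform
    if direct:
        try:
            return os.open(path, flags | direct, 0o600), True
        except OSError:
            pass  # some filesystems reject O_DIRECT; fall back to buffered I/O
    return os.open(path, flags, 0o600), False


def put_block(fd: int, slot: int, payload: bytes) -> None:
    """Write one fixed-size block at a block-aligned offset."""
    buf = mmap.mmap(-1, BLOCK)      # anonymous mappings are page-aligned
    buf[: len(payload)] = payload   # the rest of the block stays zero-filled
    os.pwrite(fd, buf, slot * BLOCK)
    buf.close()


def get_block(fd: int, slot: int, length: int) -> bytes:
    """Read one block into an aligned buffer and return its first `length` bytes."""
    buf = mmap.mmap(-1, BLOCK)
    os.preadv(fd, [buf], slot * BLOCK)
    data = buf[:length]
    buf.close()
    return data


def demo() -> bytes:
    payload = b"session-42 kv state"
    path = os.path.join(tempfile.mkdtemp(), "kv.bin")
    fd, _is_direct = open_maybe_direct(path, os.O_RDWR | os.O_CREAT)
    try:
        put_block(fd, 3, payload)
        return get_block(fd, 3, len(payload))
    finally:
        os.close(fd)
```

The fixed-size, slot-addressed layout is a deliberate simplification: it keeps every transfer aligned without a file system allocating or tracking variable-length extents, which is the kind of overhead the paragraph above describes avoiding.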
Paired with Solidigm’s D5-P5336 QLC drives, which offer up to 122TB per drive (among the highest densities available in production NVMe), MKV allows GPUs to offload KV caches and retrieve them at near line-rate speeds rather than recomputing them from scratch. The benchmark results are not incremental improvements: MKV increased concurrent request handling by 43x, reduced Time to First Token by 27x, and demonstrated linear scalability to 12 terabytes per second of aggregate throughput while supporting up to 16,000 concurrent sessions. The GPU utilization that remains goes to decoding, the work that delivers value, rather than prefill recomputation.
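The difference between discarding evicted context and offloading it can be seen in a toy simulation. This is a minimal sketch, not MKV's design: the class name, the LRU policy, the slot counts, and the in-memory dictionary standing in for the NVMe-backed tier are all assumptions chosen to make the behavior visible.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy model: a small LRU 'HBM' tier, optionally backed by a shared tier."""

    def __init__(self, hbm_slots: int, spill: bool):
        self.hbm = OrderedDict()  # session_id -> context, oldest first
        self.shared = {}          # stand-in for the NVMe-backed shared tier
        self.hbm_slots = hbm_slots
        self.spill = spill
        self.prefills = 0         # full prompt reprocessing passes
        self.fetches = 0          # contexts restored from the shared tier

    def _admit(self, session_id, context):
        self.hbm[session_id] = context
        self.hbm.move_to_end(session_id)
        while len(self.hbm) > self.hbm_slots:
            victim, victim_context = self.hbm.popitem(last=False)
            if self.spill:
                self.shared[victim] = victim_context  # offload instead of discard

    def request(self, session_id) -> str:
        if session_id in self.hbm:
            self.hbm.move_to_end(session_id)
            return "hit"
        if session_id in self.shared:
            self.fetches += 1
            self._admit(session_id, self.shared.pop(session_id))
            return "fetch"
        self.prefills += 1  # context is gone (or new): rebuild it from the prompt
        self._admit(session_id, f"kv:{session_id}")
        return "prefill"


def simulate(spill: bool, sessions: int = 4, rounds: int = 3, hbm_slots: int = 2):
    cache = TieredKVCache(hbm_slots, spill)
    for _ in range(rounds):
        for session_id in range(sessions):
            cache.request(session_id)
    return cache.prefills, cache.fetches
```

With four sessions cycling through two HBM slots for three rounds, the cache without a spill tier runs prefill on all 12 requests, while the cache with a spill tier runs prefill only on the 4 first-time requests and serves the other 8 by fetching. The real system's gains depend on fetch latency being much lower than prefill latency, which this toy model does not measure.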
What Deployment Actually Requires
The performance profile demands infrastructure discipline that teams shouldn’t underestimate. Moving data at 400Gbps to 800Gbps turns network tuning into a precision exercise. Solidigm noted that at these speeds, even the physical length of fiber cables affects buffer math, because longer runs add propagation delay and therefore more data in flight for buffers to absorb. TCP, the standard transport, carries CPU overhead that becomes meaningful at these rates, and moving to RDMA to bypass CPU involvement adds its own tuning complexity. These aren’t dealbreakers, but they’re genuine engineering constraints that architects need to plan around rather than discover in production.
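The cable-length point comes down to the bandwidth-delay product: the amount of data in flight on a link equals its rate multiplied by the round-trip time. The sketch below is a rough illustration, assuming light travels through fiber at about 2.0e8 meters per second (roughly two-thirds of its speed in a vacuum) and counting propagation delay only. Switch, NIC, and serialization latency are ignored, and the cable lengths are hypothetical.

```python
FIBER_SPEED_M_PER_S = 2.0e8  # assumption: roughly two-thirds of c in glass


def bytes_in_flight(link_gbps: float, fiber_m: float) -> float:
    """Bandwidth-delay product for one link, propagation delay only."""
    rtt_s = 2 * fiber_m / FIBER_SPEED_M_PER_S  # out and back
    return link_gbps * 1e9 * rtt_s / 8         # bits on the wire -> bytes


if __name__ == "__main__":
    for gbps in (400, 800):
        for meters in (3, 30, 100):  # hypothetical cable runs
            print(f"{gbps} Gbps over {meters} m: {bytes_in_flight(gbps, meters):,.0f} bytes")
```

At 400Gbps, a 100-meter run holds about 50 KB in flight from propagation delay alone, and doubling either the link rate or the cable length doubles that figure. Buffers and flow-control thresholds sized for a short in-rack cable can therefore be too small for a longer cross-row one.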
The MKV storage engine itself required a custom approach to small writes, since traditional file systems introduce metadata bottlenecks that defeat the memory-class performance the architecture requires. The team also addressed CXL directly, characterizing it as a promising but premature standard for the innovation velocity that AI infrastructure currently demands. Software-defined solutions like MKV are deployable today against the workloads that exist today — which matters more than architectural elegance that arrives after the problem has compounded.
Why This Matters
AI inference economics are approaching an inflection point. As context windows grow, agentic workloads multiply concurrent sessions, and enterprises demand multi-turn conversation quality at scale, the recomputation tax compounds faster than GPU procurement can compensate. Buying more GPUs to absorb the inefficiency of KV cache eviction is the wrong answer, and at roughly $2 million in wasted utilization per node per year, it’s an expensive one.
The G3.5 memory tier represents a structural correction to an architectural assumption that no longer holds: that HBM alone can carry the context requirements of production-scale agentic AI. It can’t. The scale at which modern superpods operate demands a shared memory layer that lives between the GPU and long-term storage, delivers near line-rate retrieval speeds, and scales linearly as session concurrency grows. Solidigm and MinIO demonstrated at AI Infrastructure Field Day 5 that this layer is buildable today, at production scale, with economics that make the investment straightforward to justify.
The recomputation tax isn’t a GPU problem. It’s an architecture problem. And architecture problems have architecture solutions.