
oMLX, an open-source LLM inference server for Apple Silicon, has launched tiered KV caching on macOS, a feature that targets one of the main bottlenecks of local LLM inference on these devices: lengthy prefill times. The announcement by Alex Cheema highlights the system's ability to avoid redundant prefills, even persisting KV caches to disk across sessions.
The core of this advancement lies in oMLX's tiered KV cache management, which utilizes both hot in-memory (RAM) and cold SSD storage. When the hot cache reaches capacity, less frequently used KV blocks are offloaded to the SSD in safetensors format. This means that if a previously processed prefix reappears in a new request, it can be quickly restored from disk rather than being recomputed from scratch, drastically cutting down on initial token generation time.
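The hot/cold tiering described above can be sketched as a simple LRU cache that spills evicted blocks to disk and promotes them back on a hit. This is an illustrative toy, not oMLX's implementation: the class and method names are invented, and where oMLX writes cold blocks as safetensors files, this sketch writes raw bytes keyed by a prefix identifier.

```python
import os
import tempfile
from collections import OrderedDict

class TieredKVCache:
    """Toy hot-RAM / cold-SSD tiered KV cache (hypothetical sketch).

    Hot tier: an LRU-ordered dict in memory.
    Cold tier: one file per evicted block in a spill directory
    (oMLX uses safetensors format; here we store raw bytes).
    """

    def __init__(self, hot_capacity, spill_dir):
        self.hot_capacity = hot_capacity
        self.spill_dir = spill_dir
        self.hot = OrderedDict()  # prefix-id -> KV block bytes, LRU order

    def _cold_path(self, key):
        return os.path.join(self.spill_dir, f"{key}.bin")

    def put(self, key, block):
        self.hot[key] = block
        self.hot.move_to_end(key)
        # Offload least-recently-used blocks to "SSD" when over capacity.
        while len(self.hot) > self.hot_capacity:
            old_key, old_block = self.hot.popitem(last=False)
            with open(self._cold_path(old_key), "wb") as f:
                f.write(old_block)

    def get(self, key):
        if key in self.hot:              # hot hit: served from RAM
            self.hot.move_to_end(key)
            return self.hot[key]
        path = self._cold_path(key)
        if os.path.exists(path):         # cold hit: restore from disk
            with open(path, "rb") as f:
                block = f.read()
            os.remove(path)
            self.put(key, block)         # promote back into the hot tier
            return block
        return None                      # miss: prefill must be recomputed

# Demo: a hot capacity of 2 forces the oldest block onto "disk".
spill = tempfile.mkdtemp()
cache = TieredKVCache(hot_capacity=2, spill_dir=spill)
cache.put("p1", b"kv-block-1")
cache.put("p2", b"kv-block-2")
cache.put("p3", b"kv-block-3")   # evicts p1 to the spill directory
restored = cache.get("p1")       # cold hit: restored without recomputation
```

The key point the sketch illustrates: a cold hit costs one disk read rather than a full prefill pass, which is the trade that makes SSD offload worthwhile.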
This caching mechanism is particularly valuable on Apple Silicon, where the prefill phase (the compute-bound initial processing of a prompt) can take a long time for large contexts. By persisting KV caches, oMLX ensures that users avoid re-prefilling the same context repeatedly, making local LLMs more practical for demanding applications. Agentic coding workflows benefit most, since each follow-up request resends a growing conversation prefix; reusing the cached prefix yields a substantial reduction in time to first token (TTFT).
oMLX functions as a comprehensive LLM inference server tailored for Apple Silicon, offering features like continuous batching, multi-model serving, and an OpenAI/Anthropic-compatible API. The addition of tiered KV caching is a significant step toward making local LLM inference on Macs efficient enough for everyday use, enabling a more responsive experience and more complex, extended interactions with local AI models.
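Because the API is OpenAI-compatible, existing clients can be pointed at a local oMLX instance. The sketch below builds such a request using only the Python standard library; the host, port, and model name are placeholder assumptions, not values from the announcement:

```python
import json
from urllib import request

def build_chat_request(model, messages, base_url="http://localhost:8080/v1"):
    """Build an OpenAI-style chat completion request for a local server.

    The base_url is a hypothetical local endpoint; substitute whatever
    host/port your oMLX instance actually listens on.
    """
    body = json.dumps({"model": model, "messages": messages}).encode()
    return request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Demo: a follow-up turn in a conversation; with tiered KV caching, the
# shared prefix of this conversation would not be re-prefilled.
req = build_chat_request(
    "some-local-model",
    [{"role": "user", "content": "Hello"}],
)
# urllib.request.urlopen(req) would send it to the running server.
```

The OpenAI-compatible surface means tools already built against hosted APIs can switch to local inference by changing only the base URL.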