Kimi Moonshot’s announcement highlights advances in compute fungibility through the hybrid Kimi Linear model, which improves inference across multiple data centers. The work tackles a long-standing obstacle, KV cache transfer overhead, achieving a 1.54× increase in throughput and a 64% reduction in P90 token transfer time. These efficiency gains, enabled by compact KV caches, are expected to substantially lower the cost of token generation, making Prefill-as-a-Service (PrfaaS) operations practical.

Kimi Linear: Kimi Linear is a hybrid linear attention architecture developed by Moonshot AI’s Kimi team, incorporating Kimi Delta Attention to deliver efficient performance for agentic intelligence and long-context scenarios without compromising quality. It features open-sourced kernels and integration with inference frameworks like vLLM, serving as a drop-in replacement for traditional full attention. In the news, Kimi Linear enables cross-datacenter prefill/decode disaggregation by reducing KV cache size, making heterogeneous hardware inference practical.
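To see why a hybrid architecture shrinks the KV cache, consider a back-of-the-envelope sketch. All model dimensions below are assumptions for illustration, not Kimi Linear’s actual configuration; the point is only that a hybrid design keeping full attention in a fraction of layers shrinks the sequence-length-dependent cache roughly in proportion:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes=2):
    """Full-attention KV cache size: 2 tensors (K and V) per layer,
    each of shape [num_kv_heads, seq_len, head_dim] in fp16/bf16."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

# Hypothetical model: 64 layers, 8 KV heads, head_dim 128, 128k-token context.
full = kv_cache_bytes(num_layers=64, num_kv_heads=8, head_dim=128, seq_len=128_000)

# A hybrid keeping full attention in only 1 of every 4 layers (an assumed
# ratio) caches K/V for just those 16 layers; the linear-attention layers
# hold a fixed-size recurrent state that does not grow with context length.
hybrid = kv_cache_bytes(num_layers=16, num_kv_heads=8, head_dim=128, seq_len=128_000)

print(f"full attention:  {full / 2**30:.2f} GiB")   # 31.25 GiB
print(f"hybrid (1/4):    {hybrid / 2**30:.2f} GiB") # 7.81 GiB
```

Under these assumed dimensions the per-request cache drops by 4×, which is the state that would otherwise have to be shipped between datacenters during disaggregated serving.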
Kimi_Moonshot: Kimi_Moonshot is the official X account for Kimi.ai, a Moonshot AI product line focused on advanced AI models and agentic tools to empower users. It regularly shares technical updates, model releases, and innovations from the Kimi ecosystem. The account posted the news announcing Prefill/Decode advancements powered by Kimi Linear.
Prefill-as-a-Service: Prefill-as-a-Service (PrfaaS) is Moonshot AI’s cross-datacenter serving architecture that separates prefill and decode computations, offloading long-context prefill to specialized clusters across heterogeneous hardware. It addresses KV cache transfer challenges to enable cost-effective token generation. The news validates PrfaaS using a scaled-up Kimi Linear model, demonstrating gains in inference efficiency.
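The disaggregation flow described above can be sketched in a few lines. Bandwidth and cache-size figures here are illustrative assumptions, not Moonshot AI’s measured numbers; the sketch only shows why the KV cache, as the sole state crossing the inter-datacenter link, dominates the transfer component of time-to-first-token:

```python
from dataclasses import dataclass

@dataclass
class PrefillResult:
    """State a prefill cluster hands off so a decode cluster can resume."""
    kv_cache_gb: float

def transfer_seconds(kv_cache_gb, link_gbps):
    """Time to ship the prefill output over a cross-datacenter link
    (GB -> gigabits, divided by link bandwidth in Gbps)."""
    return kv_cache_gb * 8 / link_gbps

LINK_GBPS = 100  # assumed inter-datacenter bandwidth
for name, result in [("full attention", PrefillResult(32.0)),
                     ("compact hybrid cache", PrefillResult(8.0))]:
    secs = transfer_seconds(result.kv_cache_gb, LINK_GBPS)
    print(f"{name}: {secs:.2f} s transfer before decode can start")
```

Shrinking the cache shrinks this transfer time proportionally, which is consistent with the reported P90 token transfer improvements, though the real system involves many factors (compression, overlap with compute, scheduling) not modeled here.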

```json
{
  "Efficiency Enabler": "Compact KV caches from models like Kimi Linear address traditional bottlenecks in distributed inference.",
  "Architecture Advance": "Kimi Linear's hybrid design enhances linear attention with hardware-efficient mechanisms for effective scaling.",
  "Inference Disaggregation": "Prefill/Decode disaggregation allows for multi-datacenter operations by managing prefill on dedicated resources."
}
```