Google speeds up local AI inference by up to 3x with Multi-Token Prediction for Gemma 4

Google has released Multi-Token Prediction (MTP) drafters for its Gemma 4 AI models, enabling up to a 3x increase in inference speed without compromising output quality. The drafters rely on speculative decoding: a lightweight drafter model proposes several tokens ahead, and the main model verifies those proposals in parallel, which relieves the usual one-token-at-a-time bottleneck. The MTP drafters are designed to work with existing software and hardware and are available through platforms such as Hugging Face, Kaggle, and Ollama. The release fits a broader trend in the AI sector of prioritizing efficiency gains from software optimization over simply adding hardware capacity, a trend underscored by the impact of efficiency-focused models such as DeepSeek.
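
To make the draft-and-verify loop concrete, here is a minimal Python sketch of greedy speculative decoding. It is not Google's implementation and does not use any Gemma 4 API: `target_next`, `draft_next`, and the toy arithmetic "models" are hypothetical stand-ins invented for illustration, and the per-token check inside `speculative_decode` stands in for what a real model would compute in a single batched forward pass.

```python
from typing import List, Tuple


def target_next(ctx: List[int]) -> int:
    """Hypothetical 'large' model: deterministic (greedy) next token."""
    return (ctx[-1] * 3 + len(ctx)) % 10


def draft_next(ctx: List[int]) -> int:
    """Hypothetical cheap drafter: matches the target except when the
    context length is a multiple of 4, where it guesses wrong."""
    guess = target_next(ctx)
    return (guess + 1) % 10 if len(ctx) % 4 == 0 else guess


def baseline_decode(prompt: List[int], n_new: int) -> Tuple[List[int], int]:
    """Ordinary decoding: one target pass per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out, n_new


def speculative_decode(prompt: List[int], n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding with k drafted tokens per round.

    Returns the generated sequence and the number of target passes.
    """
    out = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(out) < goal:
        # 1. The drafter proposes k tokens autoregressively (cheap).
        draft: List[int] = []
        ctx = list(out)
        for _ in range(k):
            tok = draft_next(ctx)
            draft.append(tok)
            ctx.append(tok)

        # 2. The target checks all proposals. Counted as ONE pass, on the
        #    assumption that a real model scores every position in parallel.
        target_passes += 1
        ctx = list(out)
        for tok in draft:
            expected = target_next(ctx)
            if tok != expected:
                ctx.append(expected)  # first mismatch: keep the target's token
                break
            ctx.append(tok)           # match: accept the drafted token
        else:
            ctx.append(target_next(ctx))  # all accepted: one extra token for free
        out = ctx
    return out[:goal], target_passes


if __name__ == "__main__":
    prompt = [1, 2, 3]
    base, base_passes = baseline_decode(prompt, 20)
    spec, spec_passes = speculative_decode(prompt, 20, k=4)
    print("identical output:", base == spec)
    print("target passes:", base_passes, "vs", spec_passes)
```

Because every kept token is either confirmed or supplied by the target model, the output matches ordinary greedy decoding exactly. The gain comes only from needing fewer target passes, and its size depends on how often the drafter's guesses are accepted.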

Google: Google is a global technology company that develops internet services, cloud infrastructure, hardware, and advanced AI models through its Google DeepMind and Google AI divisions. In this announcement, Google is introducing Multi-Token Prediction drafters, which use speculative decoding, for its Gemma 4 model family, aiming to significantly speed up local AI inference without requiring new hardware.
Kaggle: Kaggle is an online data science and machine learning community owned by Google, offering datasets, code notebooks, and cloud-based compute for experimentation and education. Here, Kaggle is one of the venues where Google is making Gemma 4 MTP drafters available, lowering friction for practitioners who want to test and benchmark the new speculative decoding setup.
Ollama: Ollama is a developer-focused platform and runtime for running large language models locally on desktops and laptops with simple tooling and model packaging. In this announcement, Ollama is named as a supported environment where Google’s Gemma 4 MTP drafters can be used to accelerate local inference, bringing the speed gains directly to users who favor local-first LLM setups.
Gemma 4: Gemma 4 is Google’s latest family of open-weight multimodal AI models, released under the Apache 2.0 license and designed for on-device and server-grade deployment across sizes from small edge models to larger dense and MoE variants. In this article, Gemma 4 is the target model family that gains major inference-speed improvements via Multi-Token Prediction drafters using speculative decoding, making locally run models more responsive while preserving output quality.
Hugging Face: Hugging Face is an AI platform and open-source ecosystem that hosts machine learning models, datasets, and libraries used by developers to build and deploy AI applications. In this context, it serves as one of the primary distribution hubs where Google has published the Gemma 4 Multi-Token Prediction drafters so developers can easily integrate faster inference into their workflows.

Product: Google’s Gemma 4 release emphasized advanced reasoning and agentic workflows, and the new Multi-Token Prediction drafters extend that focus by optimizing how quickly those capabilities can be served on existing hardware.
Ecosystem: Hugging Face, Kaggle, and Ollama have all been actively promoting Gemma 4 integrations, positioning the model family as a first-class option in open-source and local-deployment tooling ecosystems.
Inference_Trend: Speculative decoding and related multi-token prediction techniques have become a key focus across the AI industry as developers seek software-side gains in latency and throughput instead of relying solely on ever-larger or more expensive hardware.