Google has introduced Multi-Token Prediction (MTP) drafters for its Gemma 4 models, enabling up to three times faster inference on local hardware without compromising output quality. The drafters rely on speculative decoding: a smaller “drafter” model proposes several tokens ahead, and the main model then verifies them together rather than generating one token at a time. MTP drafters are available on platforms such as Hugging Face and Kaggle and support several inference engines, which makes them well suited to responsive applications such as real-time chat and voice interfaces on consumer devices. The release follows market shocks caused by other efficient models and underscores the continued demand for speed and performance in AI technology.

Google: Google, through its DeepMind division, develops lightweight open-weight AI models optimized for local and edge deployment, including multimodal capabilities for text, vision, and audio. The company emphasizes developer accessibility via Apache 2.0 licensing and integrations with popular frameworks. In this release, Google published Multi-Token Prediction drafters for its Gemma 4 models to enable faster inference on consumer hardware without quality loss.
Gemma 4: Gemma 4 is a family of open multimodal models from Google DeepMind, featuring compact variants for on-device use and larger ones for advanced reasoning and agentic workflows. Designed with efficiency in mind, it incorporates optimizations like local-global attention and adaptive processing for diverse inputs. The article focuses on new MTP drafters that accelerate Gemma 4’s inference via speculative decoding while preserving output quality.
speculative decoding: Speculative decoding accelerates language model inference by pairing a fast drafter, which proposes several tokens ahead, with the main model, which verifies those tokens in parallel. It remains a serving optimization compatible with existing architectures and tools. Google’s application in MTP drafters for Gemma 4 brings this technique to mainstream open-source use on everyday hardware.
Multi-Token Prediction: Multi-Token Prediction (MTP) is a speculative decoding architecture in which lightweight drafter models predict multiple tokens ahead, and the target model verifies them efficiently using a shared KV cache. Google tailored MTP drafters specifically for the Gemma 4 family to exploit idle compute resources. This release targets improved responsiveness for local AI applications like chat and voice without altering model behavior.
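
The draft-and-verify loop described in the entries above can be illustrated with a minimal Python sketch. It is not Google's implementation or any inference engine's API: `toy_target`, `toy_drafter`, and `speculative_decode` are hypothetical stand-ins invented for illustration, the "models" are simple deterministic functions over integer tokens, and acceptance uses greedy matching rather than the probabilistic acceptance rule used when sampling. A real engine would also verify all drafted positions in one batched forward pass; here that pass is only counted, not batched.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # maps a context to the next token


def toy_target(ctx: List[Token]) -> Token:
    """Hypothetical 'main model': a deterministic next-token rule."""
    return (ctx[-1] * 3 + 1) % 17


def toy_drafter(ctx: List[Token]) -> Token:
    """Hypothetical 'drafter': agrees with the target except on some contexts."""
    guess = toy_target(ctx)
    return (guess + 1) % 17 if ctx[-1] % 5 == 0 else guess


def greedy_decode(target: NextToken, prompt: List[Token], n: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(n):
        out.append(target(out))
    return out


def speculative_decode(
    target: NextToken,
    drafter: NextToken,
    prompt: List[Token],
    max_new_tokens: int,
    k: int = 4,
) -> Tuple[List[Token], int]:
    """Greedy speculative decoding; returns (tokens, verification passes)."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < max_new_tokens:
        # 1. The drafter cheaply proposes k tokens ahead.
        draft: List[Token] = []
        for _ in range(k):
            draft.append(drafter(out + draft))

        # 2. The target checks the proposals. A real engine would score all
        #    k positions in a single forward pass, so we count one pass here.
        passes += 1
        accepted: List[Token] = []
        for tok in draft:
            expected = target(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                # First mismatch: keep the target's token, drop the rest.
                accepted.append(expected)
                break
        else:
            # Every draft matched, so the same pass yields one bonus token.
            accepted.append(target(out + accepted))
        out.extend(accepted)
    return out[: len(prompt) + max_new_tokens], passes


if __name__ == "__main__":
    tokens, passes = speculative_decode(toy_target, toy_drafter, [2], 20)
    print(tokens == greedy_decode(toy_target, [2], 20), passes)
```

Because every accepted token is either confirmed or replaced by the target, the output matches plain greedy decoding with the target alone, which is the sense in which quality is preserved. The speedup comes from needing fewer target passes than generated tokens whenever the drafter guesses well.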

```json
{
  "Deployment Focus": "MTP drafter optimizations enhance local applications such as real-time chat, voice interfaces, and agentic workflows on consumer hardware.",
  "Model Availability": "MTP drafters for Gemma 4 are accessible on platforms like Hugging Face, Kaggle, and Ollama under the Apache 2.0 license.",
  "Framework Integration": "These drafters are compatible with inference engines such as vLLM, MLX, SGLang, and Hugging Face Transformers."
}
```