StepFun’s new voice AI model, StepAudio 2.5 Realtime, placed first on all five voice AI benchmarks tested in April 2026, outperforming competitors such as GPT Realtime 1.5 and Gemini Live. Developed by the Shanghai-based AI lab that Jiang Daxin founded in 2023, the model processes speech in real time in both Chinese and English and interprets paralinguistic cues such as emotional tone and speaking rate. The release comes as leading developers race to improve the stability of AI personas, using specialized reinforcement learning techniques to keep characters consistent in long-form conversations.
StepFun: StepFun is a Shanghai-based AI laboratory focused on developing high-performing large language models and voice AI systems. It recently released StepAudio 2.5 Realtime, an end-to-end real-time speech model supporting Chinese and English with advanced persona customization features. The lab applies specialized training methods to improve model stability in interactive scenarios.
Jiang Daxin: Jiang Daxin is the founder of StepFun, established in April 2023. He previously spent 16 years at Microsoft leading initiatives including Bing, Cortana, and Azure Cognitive Services. His experience informs StepFun’s approach to building competitive AI models in both text and voice domains.
StepAudio 2.5 Realtime: StepAudio 2.5 Realtime is an end-to-end real-time voice AI model developed by StepFun that processes audio input and output directly without intermediate text conversion. It incorporates training on extensive persona datasets and roleplay-specific reinforcement learning to maintain character consistency during extended interactions. The model also demonstrates strong comprehension of non-verbal acoustic cues such as emotional tone and speaking rate.
```json
{
  "AI Persona Training": "Reinforcement learning techniques focused on persona stability are addressing challenges in maintaining character during extended AI interactions.",
  "Voice AI Competition": "Top AI developers are enhancing real-time voice models, comparing them directly to established systems like those from OpenAI.",
  "Chinese AI Innovation": "AI labs in Shanghai are creating models that prioritize understanding acoustic features alongside dialogue quality."
}
```
