Perplexity has announced enhancements to its search-augmented language models through a novel two-stage post-training pipeline that combines Supervised Fine-Tuning (SFT) with on-policy Reinforcement Learning (RL). The approach improves search accuracy, citation quality, and overall efficiency, allowing Perplexity’s Qwen-based models to match or exceed the factual accuracy of competitors such as GPT models while operating at lower cost. The pipeline is notable for how it divides responsibilities: the SFT stage handles deployment compliance while the RL stage optimizes search behavior, a separation that matters as Perplexity aims to streamline workflows for enterprises.
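At a high level, the two stages compose as a pipeline: first fit the policy to curated demonstrations, then continue training on rollouts sampled from the current policy against a scalar reward. Below is a minimal, self-contained sketch of that shape; the names (TinyPolicy, sft_stage, rl_stage, search_reward) and the toy REINFORCE-style update are illustrative assumptions, not Perplexity’s actual implementation.

```python
"""Toy sketch of a two-stage post-training pipeline: SFT, then on-policy RL.
Everything here is a stand-in; real systems would train a transformer policy."""
import random

class TinyPolicy:
    """Stand-in for an LLM policy with a single toy parameter."""
    def __init__(self):
        self.bias = 0.0  # nudged by both training stages

    def generate(self, query: str) -> str:
        # Toy generation: cited answers become likelier as bias grows.
        cited = random.random() < 0.5 + self.bias
        return f"answer to {query!r}" + (" [1]" if cited else "")

    def update(self, signal: float, lr: float = 0.05):
        # Clamp so the citation probability stays in [0, 1].
        self.bias = min(0.5, max(-0.5, self.bias + lr * signal))

def sft_stage(policy, demonstrations):
    """Stage 1: supervised fine-tuning on curated, guardrail-compliant data."""
    for _, target in demonstrations:
        # Push the policy toward the behavior shown in the demonstrations.
        policy.update(+1.0 if "[1]" in target else -1.0)

def rl_stage(policy, queries, reward_fn, steps=200):
    """Stage 2: on-policy RL; rollouts come from the *current* policy."""
    for _ in range(steps):
        query = random.choice(queries)
        rollout = policy.generate(query)          # on-policy sample
        policy.update(reward_fn(query, rollout))  # REINFORCE-style nudge

def search_reward(query, answer):
    """Toy reward: +1 for a cited answer, -1 otherwise."""
    return 1.0 if "[1]" in answer else -1.0

if __name__ == "__main__":
    policy = TinyPolicy()
    sft_stage(policy, [("who won?", "answer [1]"), ("when?", "answer [1]")])
    rl_stage(policy, ["who won?", "when?"], search_reward)
    print(policy.generate("who won?"))
```

The point of keeping the second stage on-policy is that rollouts always come from the model currently being trained, so the reward signal reflects the behavior that will actually ship rather than a stale snapshot.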

Qwen: Qwen is Alibaba’s open-source large language model series, featuring efficient mixture-of-experts architectures and multimodal capabilities, with recent releases such as Qwen3.5 and Qwen3.6 emphasizing agentic reasoning and vision integration. These models support long context windows and tool use, making them well suited to agentic applications. Perplexity’s research uses Qwen3-family models, such as Qwen3.5-397B-A17B, as starting points for its SFT+RL pipeline to enhance search accuracy, citation quality, and efficiency.
Alibaba: Alibaba, via Alibaba Cloud, is a major provider of cloud-computing infrastructure and generative AI tooling, and develops and hosts the Qwen LLM series. Recent advances in its AI portfolio center on scalable, open-weight models for enterprise intelligence and agentic tasks. Alibaba’s Qwen3.5 models serve as the base for Perplexity’s post-training experiments, enabling cost-effective search agents that compete with proprietary alternatives.
Perplexity: Perplexity AI operates a leading AI-powered answer engine that combines real-time web search with multiple large language models to deliver accurate, cited responses and support complex workflows. The company recently introduced Perplexity Computer, an agent-orchestration system that automates enterprise tasks across apps like Gmail and Salesforce. In its new research, Perplexity details a two-stage post-training pipeline (SFT for guardrails, then on-policy RL for search optimization) applied to Qwen base models to produce superior search-augmented agents; a sketch of how the RL stage’s optimization targets might be combined follows below.
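Since the research names three optimization targets, search accuracy, citation quality, and efficiency, one plausible way to drive the RL stage is to collapse them into a single scalar reward. The sketch below is a hedged illustration: composite_reward, its 0.5/0.4 weights, and the two-tool-call budget are assumptions for exposition, not details from Perplexity’s work.

```python
def composite_reward(answer: str, gold_facts: set[str],
                     cited_urls: set[str], retrieved_urls: list[str],
                     num_tool_calls: int) -> float:
    """Hypothetical scalar reward combining search accuracy, citation
    quality, and efficiency. Weights and signal definitions are assumed."""
    # Accuracy: fraction of gold facts actually stated in the answer.
    accuracy = sum(f.lower() in answer.lower() for f in gold_facts) \
        / max(len(gold_facts), 1)

    # Citation quality: cited sources must come from what was retrieved.
    grounded = cited_urls & set(retrieved_urls)
    citation = len(grounded) / max(len(cited_urls), 1) if cited_urls else 0.0

    # Efficiency: penalize search calls beyond a small (assumed) budget.
    efficiency = -0.1 * max(num_tool_calls - 2, 0)

    return 0.5 * accuracy + 0.4 * citation + efficiency

if __name__ == "__main__":
    r = composite_reward(
        answer="Paris is the capital of France. [1]",
        gold_facts={"Paris"},
        cited_urls={"https://example.com/a"},
        retrieved_urls=["https://example.com/a", "https://example.com/b"],
        num_tool_calls=1,
    )
    print(f"reward = {r:.2f}")  # 0.5*1.0 + 0.4*1.0 + 0.0 = 0.90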
