Perplexity serves Qwen3 235B models on NVIDIA GB200 racks

Perplexity AI has announced advances in serving its post-trained Qwen3 235B models on NVIDIA’s GB200 NVL72 Blackwell racks, which deliver significant performance improvements for high-throughput inference on large mixture-of-experts (MoE) models. The GB200 architecture, with enhanced tensor cores and a rack-scale NVLink interconnect, enables greater parallelism and efficiency. As one data point, all-reduce latency dropped from 586.1µs on previous H200 systems to 313.3µs on the new GB200 setup, a reduction of about 47% (roughly 1.87x faster). The move is in line with an industry trend toward MoE architectures for more efficient serving, and it helps Perplexity handle its traffic while reducing costs.
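The size of the all-reduce improvement can be checked with a few lines of arithmetic. The sketch below uses only the two latency figures quoted above; the function and constant names are illustrative and do not come from Perplexity's post.

```python
# All-reduce latencies quoted in the summary, in microseconds.
H200_ALLREDUCE_US = 586.1
GB200_ALLREDUCE_US = 313.3


def latency_change(before_us: float, after_us: float) -> tuple[float, float]:
    """Return (speedup factor, fractional reduction) for a latency change."""
    if before_us <= 0 or after_us <= 0:
        raise ValueError("latencies must be positive")
    speedup = before_us / after_us
    reduction = (before_us - after_us) / before_us
    return speedup, reduction


speedup, reduction = latency_change(H200_ALLREDUCE_US, GB200_ALLREDUCE_US)
print(f"speedup: {speedup:.2f}x, reduction: {reduction:.1%}")
```

This covers a single collective operation only. It says nothing about end-to-end request latency, which also depends on compute, batching, and network hops outside the NVLink domain.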

ARM: ARM develops licensable processor architectures emphasizing energy efficiency and scalability for servers and mobile devices. The NVIDIA Grace CPUs in GB200 NVL72 racks are built on the ARM architecture and support hosting large models alongside Blackwell GPUs.
Grace: NVIDIA Grace is an ARM-based CPU designed for high-performance data center and AI applications. It forms the CPU backbone of GB200 NVL72 racks, where 36 Grace CPUs pair with 72 Blackwell GPUs to deliver the memory and compute resources needed for trillion-parameter MoE models.
SHARP: SHARP is NVIDIA’s Scalable Hierarchical Aggregation and Reduction Protocol integrated into NVLink switches to accelerate collective operations like all-reduce. Perplexity’s implementation uses SHARP to lower latencies in attention layers and expert-parallel reductions during Qwen3 inference on GB200 platforms.
NVIDIA: NVIDIA is a leading designer of GPUs and AI accelerators powering data center computing for training and inference workloads. In this deployment, Perplexity uses NVIDIA GB200 NVL72 Blackwell racks, which combine Grace CPUs and Blackwell GPUs, to serve Qwen3 235B models, achieving better performance than on earlier H200 systems through rack-scale interconnects and tensor core enhancements.
NVLink: NVLink is NVIDIA’s high-speed direct GPU-to-GPU interconnect technology that supports massive bandwidth for multi-GPU communication. In GB200 NVL72 racks, NVLink creates a rack-scale domain connecting 72 GPUs, enabling a degree of parallelism for MoE model prefill and decode operations that was unattainable on prior Hopper systems.
ConnectX-7: ConnectX-7 is NVIDIA’s smart network adapter supporting InfiniBand and Ethernet for high-throughput data center networking with in-network computing features. In the GB200 NVL72 setup, ConnectX-7 handles InfiniBand communication between prefill and decode nodes as well as across racks.
Qwen3 235B: Qwen3 235B is a flagship mixture-of-experts large language model from Alibaba’s Qwen series, noted for strong performance in coding, math, and reasoning tasks. Perplexity serves post-trained versions of this model on NVIDIA Blackwell hardware to handle a portion of its production traffic using specialized inference techniques like prefill-decode disaggregation.
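
The all-reduce collective named in the SHARP entry can be illustrated with a small simulation. This is a minimal sketch of the operation's semantics only: every rank contributes a buffer and every rank ends up with the element-wise sum. It is plain Python rather than NVIDIA's or Perplexity's code, and the function name and the split into aggregate and broadcast steps are assumptions chosen for illustration.

```python
from typing import List


def all_reduce_sum(rank_buffers: List[List[float]]) -> List[List[float]]:
    """Simulate a sum all-reduce across ranks (hypothetical helper).

    Each inner list is the partial result held by one GPU rank. After the
    collective, every rank holds the same element-wise sum.
    """
    if not rank_buffers:
        return []
    length = len(rank_buffers[0])
    if any(len(buf) != length for buf in rank_buffers):
        raise ValueError("all ranks must contribute buffers of equal length")
    # Aggregate step: with in-network reduction this sum would be computed
    # inside the switch instead of on the GPUs themselves.
    total = [sum(values) for values in zip(*rank_buffers)]
    # Broadcast step: each rank receives its own copy of the reduced buffer.
    return [list(total) for _ in rank_buffers]


# Four ranks, each holding a partial output of a tensor-parallel layer.
partials = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
reduced = all_reduce_sum(partials)
print(reduced[0])  # [16.0, 20.0]
```

In a tensor-parallel or expert-parallel layer, a reduction like this typically sits on the critical path of every decoding step, which is why its latency is a figure worth reporting.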

MoE Optimization: Industry shifts toward mixture-of-experts architectures leverage hardware like Blackwell for disaggregated prefill and decode to boost serving efficiency.
Blackwell Inference: NVIDIA Blackwell GB200 NVL72 excels in high-throughput inference for large MoE models due to rack-scale NVLink and advanced tensor cores.
Perplexity Deployment: Perplexity AI uses Blackwell for real-time search scaling with MoE models.