Perplexity has introduced its Runtime-Optimized Serving Engine (ROSE) to improve the deployment of a range of AI models, including models with a trillion parameters. Integrating CuTeDSL into ROSE speeds up the development of specialized GPU kernels optimized for NVIDIA Hopper and Blackwell GPUs. The move to CuTeDSL also eases the compilation bottlenecks of CUDA C++, which makes debugging and iteration faster, and it allows separate optimizations for the prefill and decode phases to improve performance and efficiency across Perplexity’s products.
Qwen: Qwen is Alibaba Cloud’s open-source family of large language models excelling in multilingual tasks, coding, math, and agentic reasoning, with recent releases like Qwen 3.6-Plus. Perplexity customizes and deploys Qwen models, including Qwen3 variants, for its products. ROSE uses CuTeDSL kernels specifically tuned for Qwen’s QK norm and MoE layers.
ROSE: ROSE is Perplexity’s in-house Runtime-Optimized Serving Engine, designed for flexible deployment of diverse AI models across research and production. It handles request scheduling, inter-node communication, and execution pipelining for LLMs and embedding models. In a recent blog post, Perplexity highlights CuTeDSL’s role in extending ROSE’s kernel library with prefill/decode specialization and MoE routing.
Sonar: Sonar is Perplexity’s API for generating web-grounded AI responses with citations, streaming, and customizable search options, compatible with OpenAI formats. It relies on models served by the ROSE inference engine. Perplexity’s products, including Sonar, benefit from ROSE’s CuTeDSL-optimized kernels for efficient inference.
NVIDIA: NVIDIA develops GPUs and architectures optimized for AI workloads, including the Hopper architecture with Transformer Engine and the newer Blackwell platform for supercomputing-scale inference. Its software includes the CUTLASS kernel library and Triton Inference Server for model deployment. Perplexity runs ROSE on NVIDIA Hopper and Blackwell GPUs, leveraging CuTeDSL for peak kernel performance.
Search: Search refers to Perplexity’s core AI search API delivering accurate, real-time answers powered by custom transformer models. It is served via the ROSE engine alongside other APIs. The integration of CuTeDSL into ROSE improves performance for Perplexity’s Search functionality on advanced NVIDIA GPUs.
CuTeDSL: CuTeDSL is a Python-based domain-specific language from NVIDIA’s CUTLASS library for writing high-performance GPU kernels using CuTe layout algebra and MLIR for just-in-time compilation to PTX. It enables fine-grained control over hardware primitives similar to CUTLASS C++ but with faster iteration and debugging. Perplexity uses CuTeDSL in ROSE to specialize kernels for inference operations like QK norm, RMS norm, and MoE dispatch across various configurations.
Embeddings: Embeddings is Perplexity’s API for generating vector embeddings from models hosted in ROSE for applications like ranking and retrieval. ROSE’s embedding engines handle online batching optimized by CuTeDSL kernels. This supports Perplexity’s broader product ecosystem with high-performance inference.
Perplexity: Perplexity AI is an AI-powered answer engine offering real-time, cited responses via web, apps, and APIs like Sonar. It builds and hosts custom models in-house on NVIDIA GPUs using its ROSE inference engine. The company has integrated CuTeDSL into ROSE to accelerate development of specialized GPU kernels for optimal performance on Hopper and Blackwell architectures.
Triton Inference Server: Triton Inference Server, now part of NVIDIA Dynamo-Triton, is an open-source platform for deploying and scaling AI inference models from multiple frameworks like TensorRT and PyTorch. It standardizes model serving across cloud and edge. Perplexity’s ROSE maintains compatibility with Triton interfaces while extending custom optimizations via CuTeDSL.
Runtime-Optimized Serving Engine: Runtime-Optimized Serving Engine, known as ROSE, is Perplexity’s custom inference engine for serving models from small embeddings to trillion-parameter LLMs with features like batching, KV cache management, and custom layers. It supports NVIDIA Triton-compatible interfaces and powers Perplexity’s APIs. ROSE integrates CuTeDSL to build optimized GPU kernels tailored for Hopper and Blackwell hardware.
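As a reference point for the normalization operations named in the CuTeDSL entry above, the following is a minimal pure-Python sketch of RMS norm and of QK norm understood as RMS norm applied per attention head to queries and keys. It is not Perplexity's kernel and does not use CuTeDSL; the function names, the `eps` default, and the per-head interpretation of QK norm are assumptions made for illustration.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Reference RMS norm for one vector: x / sqrt(mean(x^2) + eps) * weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def qk_norm(q_heads, k_heads, q_weight, k_weight, eps=1e-6):
    """Apply RMS norm independently to every query head and key head.

    q_heads and k_heads are lists of per-head vectors (hypothetical layout);
    q_weight and k_weight are learned scales shared across heads.
    """
    q_out = [rms_norm(h, q_weight, eps) for h in q_heads]
    k_out = [rms_norm(h, k_weight, eps) for h in k_heads]
    return q_out, k_out


if __name__ == "__main__":
    q, k = qk_norm([[3.0, 4.0]], [[1.0, 1.0]], [1.0, 1.0], [1.0, 1.0])
    print(q, k)
```

A production kernel would fuse these steps and process many heads and tokens in parallel on the GPU; the sketch only pins down the arithmetic that such a kernel has to reproduce.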
Model Deployment: Perplexity post-trains Qwen models with SFT and RL pipelines to boost search accuracy and efficiency in products powered by ROSE.
Developer Experience: Switching from CUDA C++ to CuTeDSL reduces the compile-time bottleneck caused by template expansion, enabling faster debugging and iteration in ROSE kernel development.
Inference Optimization: CuTeDSL allows Perplexity to specialize kernels at compile-time for prefill and decode phases, improving latency and throughput on Hopper and Blackwell GPUs.
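The idea of specializing kernels at compile time for prefill and decode can be sketched in plain Python. This is an illustration of the general pattern only, not ROSE or CuTeDSL code: `KernelConfig`, `choose_config`, `build_kernel`, and the tile sizes are all hypothetical, and the cached closure stands in for a JIT compile step. It assumes prefill processes many prompt tokens at once while decode processes one new token per step.

```python
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class KernelConfig:
    phase: str        # "prefill" or "decode"
    head_dim: int
    tile_tokens: int  # tokens handled per tile (hypothetical tuning knob)


def choose_config(phase, head_dim):
    """Pick a configuration per phase; the tile sizes are illustrative guesses."""
    if phase == "prefill":
        return KernelConfig(phase, head_dim, tile_tokens=128)
    if phase == "decode":
        return KernelConfig(phase, head_dim, tile_tokens=1)
    raise ValueError(f"unknown phase: {phase}")


@lru_cache(maxsize=None)
def build_kernel(config):
    """Stand-in for JIT compilation: config values are fixed inside the closure.

    The cache means each distinct configuration is "compiled" only once.
    """
    tile = config.tile_tokens

    def kernel(tokens):
        # Split the token list into tiles of the size fixed at build time.
        return [tokens[i:i + tile] for i in range(0, len(tokens), tile)]

    return kernel


if __name__ == "__main__":
    prefill = build_kernel(choose_config("prefill", 128))
    decode = build_kernel(choose_config("decode", 128))
    print(len(prefill(list(range(300)))), len(decode(list(range(5)))))
```

Fixing values such as the tile size at build time, rather than passing them at run time, is what lets a real compiler unroll loops and choose instructions for each phase separately.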
