Nature Medicine study finds general-purpose LLMs outperform medical AI

A recent study published in Nature Medicine reports that general-purpose large language models (LLMs) outperformed dedicated medical AI products on clinical tasks evaluated by physicians. The research compared frontier models such as GPT-5.2, Gemini 3.1 Pro, and Claude Opus 4.6 against dedicated medical tools such as OpenEvidence and UpToDate Expert AI. Across 100 physician questions, clinicians preferred the frontier models' answers for their completeness and clarity. The finding is consistent with a broader trend of healthcare professionals turning to conversational AI for support across a range of clinical scenarios.

GPT-5.2: GPT-5.2 is OpenAI’s frontier large language model series, offered in multiple variants and optimized for professional knowledge work, reasoning, and long-context understanding. Released in late 2025, it excels in complex problem-solving and agentic tasks. It was one of the general-purpose models tested in the Nature Medicine study that outperformed specialized medical AI tools on clinical exam questions and physician queries.
OpenEvidence: OpenEvidence is an AI-powered medical search engine and clinical decision support platform designed for healthcare professionals, serving as an official AI partner to major journals including The New England Journal of Medicine and JAMA. It delivers evidence-based answers drawn from peer-reviewed literature to support clinical decisions at the point of care. In the Nature Medicine study, it represented a dedicated medical AI product evaluated against general-purpose frontier models on physician-reviewed tasks.
Gemini 3.1 Pro: Gemini 3.1 Pro is Google’s upgraded multimodal AI model focused on advanced reasoning, tool use, and reliable performance in agentic and complex workflows. It builds on prior versions with enhancements for factuality and multi-step execution across domains. The study evaluated it as a leading general-purpose LLM that clinicians preferred over dedicated medical AI products for real-world clinical use.
Claude Opus 4.6: Claude Opus 4.6 is Anthropic’s flagship Opus-tier model known for strong agentic planning, coding capabilities, long-context handling, and multidisciplinary reasoning. Released in early 2026, it demonstrates leading performance on evaluations involving complex tasks and information synthesis. It served as one of the frontier general LLMs in the Nature Medicine comparison that surpassed specialized medical AIs on blinded clinician preferences.
UpToDate Expert AI: UpToDate Expert AI is a generative AI clinical reasoning tool from Wolters Kluwer that provides conversational, evidence-backed answers grounded in the curated content of the UpToDate platform. It assists clinicians with differential diagnoses and real-world decision support through natural language queries. The Nature Medicine study directly compared it to general frontier LLMs, where it was outperformed on completeness and clarity in live clinical questions.

AI Model Comparison: Recent evaluations of AI tools in medicine show general-purpose frontier models outperforming narrowly trained medical systems on nuanced clinical reasoning.
Physician AI Adoption: Healthcare professionals increasingly rely on conversational AI for point-of-care support across diverse clinical scenarios.