OpenAI detects accidental CoT grading in models, finds no monitorability loss

METR: METR is an AI evaluation organization performing independent assessments of frontier model capabilities and risks, including reviews of safety reports. They provided feedback on an early draft of OpenAI’s post investigating accidental CoT grading during RL training.
OpenAI: OpenAI is a frontier AI laboratory that develops large language models including the GPT series and prioritizes preserving chain-of-thought monitorability as a safety mechanism during training and deployment. In this incident, OpenAI identified accidental chain-of-thought grading in reinforcement learning runs for models like GPT-5.4 Thinking, GPT-5.4 Instant, and various mini models, conducted analyses finding no clear monitorability degradation, and introduced automated detection systems to prevent recurrence. They shared an early draft of their analysis with external reviewers to enhance transparency.
Alexa Pan: Alexa Pan is a researcher at Redwood Research specializing in AI safety evaluations and reviews. She substantially contributed to Redwood Research’s detailed review of OpenAI’s investigation into accidental CoT grading.
Apollo Research: Apollo Research is an AI safety lab focused on safety evaluations, forecasting, and mechanistic interpretability for frontier models. They offered feedback on OpenAI’s draft analysis of unintended CoT grading incidents.
Ryan Greenblatt: Ryan Greenblatt is a prominent AI alignment researcher at Redwood Research, known for work on scalable oversight and training procedures. He substantially informed Redwood Research’s commentary on OpenAI’s CoT grading analysis.
Redwood Research: Redwood Research is an independent AI safety organization conducting empirical research and external reviews to mitigate catastrophic risks from advanced AI. They reviewed OpenAI’s analysis of accidental CoT grading, agreeing with the conclusion that it did not substantially degrade monitorability while noting residual uncertainties and advocating for stronger organizational practices. Their report highlights the need for external accountability in AI safety claims.

Stress Testing: Experiments in realistic RL pipelines show that CoT-directed pressures can steer reasoning properties like confidence but struggle to induce generalized obfuscation under limited coverage.
CoT Grading Risks: Accidental grading of chains of thought during reinforcement learning can incentivize models to produce misleading reasoning traces, potentially compromising monitorability for misalignment detection.
Industry Precedents: Anthropic encountered similar accidental CoT exposure in training runs for models including Mythos, Opus 4.6, and Sonnet 4.6.