FrontierMath: Tiers 1–4 (v2) goes live after audit fixes errors in 42% of problems

FrontierMath: Tiers 1–4 (v2) has launched following an independent audit that corrected significant errors in 42% of the problems, and scores rose across the board as a result. GPT-5.5 currently leads Tiers 1–3 at 85%, and Google’s AI co-mathematician leads Tier 4 at 76%. The revision began in April after OpenAI reported unexpected errors during its internal review; the resulting improvements were funded by OpenAI, which retains exclusive access to about 80% of the dataset. As the benchmark nears saturation, FrontierMath will shift its focus toward open research problems, while the corrected Tiers 1–4 are positioned as a reliable tool for math evaluation in the AI field.

Epoch: Epoch is the organization behind the FrontierMath benchmark. It withholds a portion of the dataset from OpenAI, conducted the independent audit using AI flagging and mathematician review, and maintains the public leaderboard with updated model scores.
OpenAI: OpenAI is an AI research and deployment company. It funded the development of FrontierMath Tiers 1–4 and conducted an internal review that identified errors, prompting the subsequent audit and v2 release.
FrontierMath: FrontierMath is a benchmark project focused on advanced mathematical problems drawn from real research. It released an updated Tiers 1–4 (v2) version after an independent audit corrected errors in many problems. OpenAI funded much of its development and maintains exclusive access to a large portion of the dataset.

```json
{
  "Future Direction": "The project is focusing on open research problems as the current dataset nears saturation for advanced models.",
  "Model Evaluation": "Leading AI models are assessed on the updated FrontierMath benchmark, with scores updated for notable recent Claude models.",
  "Benchmark Improvement": "FrontierMath Tiers 1–4 v2 has been enhanced through audit and error correction for improved reliability as a math evaluation tool."
}
```