Researchers at Baidu have developed a method for applying reinforcement learning (RL) to open-ended tasks such as writing and subjective question answering, where outputs lack a single correct answer. By reformulating evaluation to ask which of two responses is better, rather than whether a single response is correct, they obtained a verifiable training signal and significantly improved reasoning on their benchmarks, gaining an average of 3.29 points over a baseline method. The authors note that this approach does not fully solve open-ended evaluation, but it demonstrates that structured pairwise comparisons can strengthen reasoning, an insight that may carry over to broader alignment settings.
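The core idea of converting open-ended evaluation into a verifiable pairwise choice can be sketched as follows. This is a minimal illustration, not Baidu's actual implementation: the `judge` comparator and the toy length-based judge are hypothetical stand-ins for a learned preference model.

```python
# Hedged sketch: turn an open-ended judgment ("is this response good?")
# into a verifiable binary choice ("which of two responses is better?"),
# yielding a 0/1 reward usable for RL.

def pairwise_reward(judge, prompt, response, reference):
    """Return 1.0 if the judge prefers `response` over `reference`, else 0.0.

    `judge(prompt, a, b)` is a hypothetical comparator returning "a" or "b".
    """
    return 1.0 if judge(prompt, response, reference) == "a" else 0.0

# Toy judge: prefers the longer answer (a stand-in for a real preference model).
def toy_judge(prompt, a, b):
    return "a" if len(a) >= len(b) else "b"

prompt = "Write a short poem about autumn."
candidate = "Leaves drift on cold wind; lanterns glow in early dark."
baseline = "Autumn is nice."

print(pairwise_reward(toy_judge, prompt, candidate, baseline))  # → 1.0
```

Because the reward is a discrete preference rather than a free-form quality score, it can be checked and aggregated like any verifiable signal, which is what makes it compatible with standard RL training loops.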
