Researchers at Baidu have developed a method for applying reinforcement learning (RL) to open-ended tasks such as writing and subjective question answering, where outputs lack a single correct answer. By reformulating evaluation to ask which of two responses is better, rather than whether a single response is correct, they obtained a verifiable training signal and significantly improved reasoning on their benchmarks, gaining an average of 3.29 points over a baseline method. The authors note that this approach does not fully solve open-ended evaluation, but it demonstrates that structured pairwise comparisons can strengthen reasoning, an insight that may carry over to broader alignment settings.
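The core idea of converting open-ended evaluation into a verifiable pairwise choice can be sketched as follows. This is a minimal illustration, not Baidu's actual implementation: the `judge` comparator and the toy length-based judge are hypothetical stand-ins for a learned preference model.

```python
# Hedged sketch: turn an open-ended judgment ("is this response good?")
# into a verifiable binary choice ("which of two responses is better?"),
# yielding a 0/1 reward usable for RL.

def pairwise_reward(judge, prompt, response, reference):
    """Return 1.0 if the judge prefers `response` over `reference`, else 0.0.

    `judge(prompt, a, b)` is a hypothetical comparator returning "a" or "b".
    """
    return 1.0 if judge(prompt, response, reference) == "a" else 0.0

# Toy judge: prefers the longer answer (a stand-in for a real preference model).
def toy_judge(prompt, a, b):
    return "a" if len(a) >= len(b) else "b"

prompt = "Write a short poem about autumn."
candidate = "Leaves drift on cold wind; lanterns glow in early dark."
baseline = "Autumn is nice."

print(pairwise_reward(toy_judge, prompt, candidate, baseline))  # → 1.0
```

Because the reward is a discrete preference rather than a free-form quality score, it can be checked and aggregated like any verifiable signal, which is what makes it compatible with standard RL training loops.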
