Microsoft study finds AI assistants corrupt about 25% of document content during extended editing

A new Microsoft Research preprint finds that current AI assistants tend to corrupt document content during extended editing tasks, with frontier models damaging about 25% of document content on average. Reliable delegated work depends on a document surviving many successive edits intact, and the models tested struggled to maintain it. The study evaluated 19 models across 52 professional domains using the DELEGATE-52 benchmark. Failures were often significant errors that compounded over time, particularly with larger files and distracting context, which undermines the usefulness of these tools in real-world applications.

arXiv: arXiv is an open-access online repository for scholarly preprints, primarily in physics, mathematics, computer science, and related disciplines, that lets researchers share work quickly before peer review. The Microsoft Research paper ‘LLMs Corrupt Your Documents When You Delegate,’ which tests AI models on long editing sequences, was recently posted there. arXiv is preparing to transition into an independent nonprofit organization in July 2026.
Microsoft: Microsoft operates Microsoft Research, a division dedicated to advancing AI technologies in areas like natural language processing and human-computer interaction. Researchers from Microsoft recently released a preprint evaluating large language models’ reliability in multi-step document editing workflows across diverse professional domains. The study reveals persistent challenges in maintaining document integrity during delegated AI tasks, highlighting limitations in current AI assistants for extended real-world applications.

```json
{
  "Tool Limitations": "Agentic tool use does not prevent document corruption, which is worsened by larger files and irrelevant additional documents.",
  "Error Accumulation": "AI models cause infrequent but significant document corruptions that build up silently over multiple edits, reducing trustworthiness in delegated tasks.",
  "Benchmark Introduction": "DELEGATE-52 benchmark evaluates AI assistants through reversible editing tasks simulating extensive workflows in diverse professional fields."
}
```
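
The idea of a reversible editing task, and of damage that builds up silently across rounds, can be illustrated with a minimal Python sketch. This is not the DELEGATE-52 implementation: the function names, the toy editors, and the use of a `difflib` similarity ratio as a stand-in integrity score are all assumptions made for illustration. Each round applies an edit and then its intended inverse, so a faithful editor should return the original document unchanged.

```python
import difflib


def similarity(original: str, result: str) -> float:
    """Return a 0..1 similarity ratio between two texts (1.0 means identical)."""
    return difflib.SequenceMatcher(None, original, result).ratio()


def run_workflow(document: str, edit_pairs) -> list[float]:
    """Apply each (forward, backward) edit pair in turn and record how
    closely the document still matches the original after every round trip."""
    scores = []
    current = document
    for forward, backward in edit_pairs:
        current = backward(forward(current))
        scores.append(similarity(document, current))
    return scores


def drop_last_line(text: str) -> str:
    """Hypothetical faulty 'undo' step that silently loses the last line."""
    lines = text.splitlines(keepends=True)
    return "".join(lines[:-1])


# Toy document and editors, invented for this example.
doc = "alpha\nbeta\ngamma\ndelta\n"

# A faithful pair: the backward edit exactly reverses the forward edit.
faithful = (
    lambda t: t.replace("alpha", "ALPHA"),
    lambda t: t.replace("ALPHA", "alpha"),
)

# A lossy pair: the forward edit changes nothing, the "undo" drops content.
lossy = (lambda t: t, drop_last_line)

scores = run_workflow(doc, [faithful, lossy, faithful, lossy])
print(scores)
```

In this toy run the score stays at 1.0 while only the faithful pair has been applied, then falls after each lossy round and never recovers, because later faithful rounds preserve the damaged state rather than repair it. That mirrors the pattern the summary describes: infrequent errors whose effect persists and compounds over a long workflow.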