Notion AI has unveiled its latest feature, Custom Agents, which automates recurring team tasks using triggers such as schedules, Slack messages, and email integrations. The feature is the result of significant iterative work, including multiple rebuilds that addressed short context windows and unreliable models. The focus on agent usefulness is evident in specialized evaluations, including Notion’s Last Exam, reflecting the company’s commitment to building effective AI solutions for knowledge work.
Notion: Notion is an AI-powered workspace platform that integrates notes, databases, and collaboration tools for knowledge workers. Its Token Town team develops advanced AI capabilities, including Custom Agents, which automate team workflows through external integrations such as Gmail. In a recent podcast, Token Town leaders detailed the iterative rebuilds of Notion AI that prioritized agent usefulness and anticipated future model advancements.
Simon Last: Simon Last is a co-founder of Notion who leads engineering efforts on its AI features. He oversees the Token Town team responsible for building robust agent systems and shares insights on hackathons, security integration, and self-verifying coding agents. In the latest Latent Space interview, he outlines the history of Notion AI rebuilds and envisions agent-driven software factories.
Sarah Sachs: Sarah Sachs is the AI engineering lead at Notion, with prior experience at Google and Robinhood. She manages low-ego AI teams focused on evals, retrieval systems, and MCP versus CLI trade-offs. Recently, she discussed Token Town’s approach to agent harnesses and meeting notes as workflow automation starters in a deep-dive podcast.
Custom Agents: Enable proactive automation of recurring team tasks through triggers like schedules, Slack, and email integrations.
Evaluation Focus: Emphasizes agent usefulness through specialized evals such as Notion’s Last Exam, supported by a dedicated Model Behavior Engineer role.
Development Iterations: Multiple rebuilds addressed short context windows, unreliable models, and lack of tool-calling standards before achieving reliable performance.
