Story Summary
A key metric for assessing AI performance is the length of tasks agents can autonomously complete, measured by the time human professionals require. Current frontier models show near 100% success on tasks taking humans under 4 minutes, but less than 10% success on tasks exceeding 4 hours. Task length is thus a strong predictor of an agent's probability of success.
Analysis of historical data reveals that the length of tasks state-of-the-art models can complete with 50% reliability has been increasing exponentially over the past six years, doubling approximately every seven months. Extrapolating this trend suggests that within a decade, AI agents could independently complete a significant fraction of software tasks currently requiring human days or weeks. This exponential progress helps reconcile observed superhuman performance on narrow benchmarks with limited real-world utility in complex, multi-step projects. The research emphasizes using task length as a forecasting tool for understanding and preparing for future AI capabilities.
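The decade-scale extrapolation follows directly from the doubling claim: a quantity that doubles every seven months grows by a factor of 2^(120/7), roughly 145,000x, over ten years. A minimal sketch of that arithmetic, where the starting horizon of 60 minutes is an illustrative assumption (only the ~7-month doubling period comes from the summary above):

```python
def horizon_minutes(months_elapsed, h0_minutes=60.0, doubling_months=7.0):
    """Extrapolate the 50%-reliability task time horizon.

    h0_minutes is a hypothetical starting horizon; doubling_months
    is the ~7-month doubling period reported in the summary.
    """
    return h0_minutes * 2 ** (months_elapsed / doubling_months)

# One doubling period doubles the horizon.
assert horizon_minutes(7) == 2 * horizon_minutes(0)

# Over ten years (120 months) the horizon multiplies by ~1.45e5:
# a 60-minute horizon would extrapolate to weeks of human work.
growth = 2 ** (120 / 7)
```

Under these assumptions, an hour-long horizon today would extrapolate to tasks taking humans on the order of weeks, which is the "days or weeks" regime the summary describes.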