Story Summary
Last updated: 21 hours ago
A head-to-head comparison demonstrated a significant performance gap between two language model versions playing a long-horizon game. The superior model won the game without any losses, while the predecessor lagged substantially, becoming trapped in specific areas like a lighthouse for thousands of turns.
Key advantages of the advanced model included superior spatial awareness, effective use of map markers as obstacles, and improvised multitasking to work around harness limitations. It also showed better long-term planning, such as executing a complex stalling strategy to win the final boss battle.
Weaknesses persisted in both models, notably unverified initial assumptions leading to extensive wasted time, as seen in a puzzle where one model ignored vital NPC hints. The advanced model also exhibited brittle tool use and a tendency to focus on single goals.
Efficiency metrics showed the newer model required significantly fewer turns and tokens to achieve milestones, projecting a much faster overall completion time. Future work aims to test performance in more vision-dependent scenarios.