AI Delivery Brief · Atlassian · Evaluation
Longer reasoning needs a different definition of quality.
Atlassian’s Rovo Long Horizon work shows why longer agentic reasoning cannot be judged only by final output. Delivery teams need to evaluate the reasoning path, tool choices, traceability and recovery behaviour.
What changed
Atlassian described how the Rovo team rebuilt its AI orchestration with Long Horizon, replacing specialised subagents with a single reasoning loop that supports much longer sequences of iterations and tool calls.
Why it matters
Longer reasoning can improve complex work, but it also makes quality harder to inspect. A good answer may hide a poor route, weak evidence or unnecessary cost. For delivery teams, evaluation needs to cover how the agent reached the answer, not just whether the final response looks right.
What delivery leaders should do next
- Add reasoning trace review to agent acceptance criteria.
- Track tool calls, latency, cost and recovery attempts.
- Separate output quality from process quality.
- Define and apply evaluation criteria before scaling agents into business workflows.
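
The checklist above can be sketched as a small acceptance check that scores the output and the process separately. This is a minimal illustration, not Atlassian's or Rovo's implementation: the `ToolCall` and `AgentRun` records, the `accept` function and every threshold are hypothetical names and values chosen for the example, and the output score is assumed to come from a separate human or automated review.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """One tool invocation recorded in the agent's trace (hypothetical schema)."""
    name: str
    latency_s: float
    cost_usd: float
    ok: bool = True          # False if the call failed
    is_retry: bool = False   # True if the call is a recovery attempt after a failure


@dataclass
class AgentRun:
    """One agent run: a reviewed output score plus the recorded tool calls."""
    output_score: float      # 0..1, assumed to come from a separate output review
    tool_calls: list[ToolCall] = field(default_factory=list)


def process_metrics(run: AgentRun) -> dict:
    """Summarise tool calls, latency, cost, failures and recovery attempts."""
    calls = run.tool_calls
    return {
        "tool_calls": len(calls),
        "latency_s": sum(c.latency_s for c in calls),
        "cost_usd": sum(c.cost_usd for c in calls),
        "failures": sum(1 for c in calls if not c.ok),
        "recovery_attempts": sum(1 for c in calls if c.is_retry),
    }


def accept(run: AgentRun, *, min_output_score: float = 0.8,
           max_tool_calls: int = 20, max_cost_usd: float = 1.0,
           max_latency_s: float = 60.0) -> dict:
    """Judge output quality and process quality separately, then combine them."""
    m = process_metrics(run)
    output_ok = run.output_score >= min_output_score
    process_ok = (
        m["tool_calls"] <= max_tool_calls
        and m["cost_usd"] <= max_cost_usd
        and m["latency_s"] <= max_latency_s
    )
    return {
        "output_ok": output_ok,
        "process_ok": process_ok,
        "accepted": output_ok and process_ok,
        "metrics": m,
    }


# Illustrative run: the answer scores well, but the route exceeds the cost budget.
run = AgentRun(
    output_score=0.9,
    tool_calls=[
        ToolCall("search", latency_s=1.5, cost_usd=0.25),
        ToolCall("fetch", latency_s=2.0, cost_usd=0.5, ok=False),
        ToolCall("fetch", latency_s=2.5, cost_usd=0.5, is_retry=True),
    ],
)
result = accept(run, max_cost_usd=1.0)
print(result)
```

Keeping `output_ok` and `process_ok` as separate fields means a run with a good answer and a poor route is reported as exactly that, rather than being averaged into a single score. A trace review step could then focus on the runs where the two disagree.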

