AI can write code. But can it maintain it over time?
That’s the question a new paper from Alibaba researchers sets out to answer.
They built SWE‑CI, a benchmark that tests AI agents on real‑world code evolution, not just one‑off fixes.
Here’s what makes it different:
- 100 real Python codebases from 68 GitHub repos
- Each spans ~233 days of development
- ~71 commits per project on average
Instead of fixing a bug once, agents enter a continuous integration loop.
They must update code iteratively, adapt to new requirements, and keep everything working without breaking what’s already there.
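A minimal sketch of what such an evaluation loop could look like. Everything here is illustrative: the function names (`evaluate_agent`, `apply_agent_patch`, `run_tests`) and the toy agent are assumptions, not the paper's actual harness.

```python
# Hypothetical sketch of a continuous-integration evaluation loop.
# All names here are illustrative, not SWE-CI's actual API.

def evaluate_agent(requirements, run_tests, apply_agent_patch):
    """Replay a project's evolving requirements, asking the agent to
    adapt the codebase at each step, and record whether tests still pass."""
    results = []
    codebase = {}
    for requirement in requirements:
        codebase = apply_agent_patch(codebase, requirement)
        results.append(run_tests(codebase))  # must keep earlier work intact
    return results

# Toy stand-ins: an "agent" that satisfies each requirement,
# and a test runner that checks every requirement so far still holds.
def toy_patch(code, req):
    code = dict(code)
    code[req] = True
    return code

def toy_tests(code):
    return all(code.values())

print(evaluate_agent(["feat-1", "feat-2", "fix-3"], toy_tests, toy_patch))
# prints [True, True, True]
```

The key difference from one-off benchmarks: the pass/fail record is a sequence, not a single bit, so a later failure counts against the agent even if the first fix was correct.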
This shifts the focus:
- From passing tests once → to sustaining code quality over time
- From static correctness → to long-term maintainability
They even introduced a new metric: EvoScore. It rewards stability in later iterations and penalizes regressions as the code evolves.
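To make the idea concrete, here is one plausible way such a metric could be shaped: weight later iterations more heavily and subtract a penalty for each pass→fail regression. This is a guessed illustration of the concept, not the paper's actual EvoScore formula.

```python
# Hypothetical sketch of an EvoScore-style metric.
# NOT the paper's formula: weights and penalty are invented for illustration.

def evo_score(passes, regression_penalty=0.5):
    """passes: list of booleans, one per iteration (True = tests pass).
    Later iterations get larger weights; regressions are penalized."""
    n = len(passes)
    weights = [(i + 1) / n for i in range(n)]  # later iterations count more
    score = sum(w for w, p in zip(weights, passes) if p) / sum(weights)
    # A regression is a pass followed immediately by a fail.
    regressions = sum(1 for a, b in zip(passes, passes[1:]) if a and not b)
    return max(0.0, score - regression_penalty * regressions / n)

print(round(evo_score([True, True, True]), 3))   # stable run: 1.0
print(round(evo_score([True, False, True]), 3))  # regression hurts: 0.5
```

Under a scheme like this, two agents with the same total pass count can score very differently depending on *when* they failed, which is exactly the stability-over-time behavior the metric is meant to capture.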
They tested 18 AI coding agents.
The results tell a very different story from existing one-off benchmarks.
Most models can write code just fine. Almost all of them struggle to maintain it over time.