A correction worth surfacing, prompted by a colleague's question. In "where it goes wrong" I declined to put a number on the 994 changes; that was too cautious. The 60-change sample is a uniform random draw from the 994, so while it can't give an acceptance rate (that needs human labels, which is what the arena is for), it can estimate the composition of the full set. Roughly: ~60% clean edits, ~17% scanner artifacts, and the remaining ~23% genuine misfires.
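To give a sense of how rough "roughly" is with only 60 samples, here is a minimal sketch that puts a Wilson score interval around each share and scales it to the 994. The per-category counts (36 / 10 / 14) are my assumption, picked only because they reproduce the approximate percentages above; they are not the actual labels. The sketch also ignores the finite-population correction, so the intervals are slightly wider than strictly necessary.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

SAMPLE_SIZE = 60   # from the post
POPULATION = 994   # from the post

# ASSUMPTION: illustrative counts that match ~60% / ~17% / ~23%.
counts = {"clean": 36, "scanner_artifact": 10, "misfire": 14}

def summarize(counts: dict[str, int], n: int, population: int) -> dict[str, tuple]:
    """Return share, interval bounds, and the implied range of changes per label."""
    rows = {}
    for label, k in counts.items():
        lo, hi = wilson_interval(k, n)
        rows[label] = (k / n, lo, hi, round(lo * population), round(hi * population))
    return rows

for label, (p, lo, hi, n_lo, n_hi) in summarize(counts, SAMPLE_SIZE, POPULATION).items():
    print(f"{label:17s} {p:5.1%}  95% CI [{lo:.1%}, {hi:.1%}]  ~{n_lo}-{n_hi} of {POPULATION}")
```

Under those assumed counts the "clean" share comes out at about 47% to 71%, so the sample supports a statement about composition but not a precise one.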
One thing I should have made clearer: the "scanner" is a separate step that runs before the model, not part of it. It's plain regex code that extracts candidate strings and their surrounding code from the repo and hands them to the fine-tuned model; none of it lives in the model weights. Almost every failure originates there, not in the model:
It truncates strings at apostrophes ("Don't…" → "Don"). The model then usually rebuilds the real string from context, which is harmless but registers as a spurious "change", and only rarely invents something. That last case, the model actually fabricating wrong copy, is ~2% of changes.
It occasionally passes non-copy strings (a color constant, an enum, a logging key) that the model then dutifully "improves."
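The apostrophe failure is easy to reproduce. Below is a sketch of what I assume is happening: a pattern that treats either quote character as a terminator, next to a quote-aware alternative. Both regexes and both function names are hypothetical stand-ins, not the actual scanner code.

```python
import re

# HYPOTHETICAL naive pattern: any quote character ends the string, so an
# apostrophe inside a double-quoted string cuts the match short.
NAIVE = re.compile(r"""["']([^"']*)["']""")

# Quote-aware alternative: the closing quote must be the same character as
# the opening one, and backslash-escaped characters are skipped over.
QUOTE_AWARE = re.compile(r"""(["'])((?:\\.|(?!\1).)*)\1""")

def extract_naive(line: str) -> list[str]:
    return [m.group(1) for m in NAIVE.finditer(line)]

def extract_quote_aware(line: str) -> list[str]:
    return [m.group(2) for m in QUOTE_AWARE.finditer(line)]

line = 'label = "Don\'t show this again"'

print(extract_naive(line))        # ['Don']
print(extract_quote_aware(line))  # ["Don't show this again"]
```

The quote-aware version only handles single-line literals and leaves escape sequences unprocessed; a real fix might be better served by the language's own tokenizer than by a larger regex.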
So the honest read is, if anything, better for the model: the restraint and quality hold up, and the remaining work is scanner precision (don't cut at apostrophes; filter non-UI strings), not the model. "994 changed" is a pipeline number, not 994 vetted edits — and the arena is still the right way to measure acceptance.
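For the second scanner fix, filtering non-UI strings, a first pass could be a handful of cheap heuristics run before anything reaches the model. This is a sketch under my own assumptions about what the non-copy strings look like (hex colors, underscore-separated constants, dotted or dashed keys); the patterns and the `looks_like_ui_copy` name are illustrative, not part of the existing pipeline.

```python
import re

# ASSUMED shapes of non-copy strings; tune against real scanner output.
HEX_COLOR = re.compile(r"^#(?:[0-9a-fA-F]{3,4}|[0-9a-fA-F]{6}|[0-9a-fA-F]{8})$")
CONSTANT_LIKE = re.compile(r"^[A-Z0-9]+(?:_[A-Z0-9]+)+$")      # e.g. PENDING_REVIEW
KEY_LIKE = re.compile(r"^[a-z0-9]+(?:[._-][a-z0-9]+)+$")       # e.g. auth.login_failed

def looks_like_ui_copy(s: str) -> bool:
    """Heuristic: True if the string plausibly contains user-facing text."""
    text = s.strip()
    if not text:
        return False
    if HEX_COLOR.match(text) or CONSTANT_LIKE.match(text) or KEY_LIKE.match(text):
        return False
    # Require at least one letter so pure numbers and punctuation are dropped.
    return any(ch.isalpha() for ch in text)

candidates = ["Don't show this again", "#FF8800", "PENDING_REVIEW", "auth.login_failed", "OK"]
kept = [c for c in candidates if looks_like_ui_copy(c)]
print(kept)  # ["Don't show this again", 'OK']
```

The constant pattern deliberately requires an underscore so that short labels like "OK" survive; the cost is that single-word enum values still slip through, which is the kind of trade-off the 60-change sample can help calibrate.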