Post by @Federation_Bot

Now before you go “oh sure you just happened to be the first person who noticed this” - that’s the other thing. I’m not.

From Microsoft - “The SWE-Bench Illusion”: https://www.microsoft.com/en-us/research/publication/the-swe-bench-illusion-when-state-of-the-art-llms-remember-instead-of-reason/

This was covered… nowhere? Microsoft writes a white paper on SWE-Bench being broken months ago and it just gets ignored.

(Not totally ignored. The one place I did find that covered it was Pivot to AI: https://pivot-to-ai.com/2025/07/02/how-to-pass-an-ai-coding-benchmark-train-on-the-questions/ )

David Gerard

@davidgerard@circumstances.run · 6 months ago

@colincornaby I'm sure I actually saw it somewhere else and didn't find it myself, but I can't find where!

Colin Cornaby

@colincornaby@mastodon.social · 6 months ago

My findings generally match what Microsoft found: LLMs are memorizing file paths and then jumping straight to implementing fixes. If the file paths are memorized, the fixes likely are too.

Another thing I saw repeatedly was LLMs writing unit tests to trace an issue, which is good, but the unit tests were hitting APIs that weren’t traceable to any other discovery during the session and had clearly been memorized.