Why the same codebase should always produce the same audit score

Source: DEV Community
There is a failure mode in AI-powered analysis tools that does not get talked about enough, and we ran into it directly. When you submit the same repository twice — same commit, same inputs, same everything — you should get the same score. If the score changes between runs, the audit is not an audit. It is a random sample.

Early in testing, we observed score variance across consecutive runs on identical inputs. Not small variance. Meaningful swings — enough to change the risk interpretation of a codebase entirely. A score that sits in one category on one run and a different category on the next is worse than useless for the people who depend on it most: founders preparing investor materials, compliance leads building audit evidence, CTOs making remediation decisions.

This is a structural problem with LLM-based analysis, not an implementation bug, and it has a structural cause.

Where the variance comes from

Large language models are probabilistic by default. They sample from a probabili
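To make the sampling point concrete, here is a minimal sketch using a toy next-token distribution (the scores and probabilities are invented for illustration, not taken from any real model). Sampling with the default stochastic decoder can return a different score on every run, while greedy argmax decoding over the same distribution is fully deterministic:

```python
import random

# Toy distribution over candidate audit scores, standing in for a model's
# next-token probabilities. Purely hypothetical values.
DISTRIBUTION = {72: 0.40, 68: 0.35, 85: 0.25}

def sample_score(rng: random.Random) -> int:
    """Stochastic decoding: draws a score from the distribution, so
    consecutive runs on identical inputs can disagree."""
    scores = list(DISTRIBUTION)
    weights = list(DISTRIBUTION.values())
    return rng.choices(scores, weights=weights, k=1)[0]

def greedy_score() -> int:
    """Greedy (argmax) decoding: always returns the most probable score,
    so identical inputs always yield identical output."""
    return max(DISTRIBUTION, key=DISTRIBUTION.get)

rng = random.Random()  # unseeded, mirroring default sampling behavior
sampled_runs = {sample_score(rng) for _ in range(50)}
greedy_runs = {greedy_score() for _ in range(50)}

print(sampled_runs)  # typically several distinct scores
print(greedy_runs)   # always {72}
```

The same repository fed through the sampling path lands in different risk categories run to run; the greedy path never moves. This is only one lever — temperature, seeds, and provider-side nondeterminism all feed the same failure mode — but it shows why "probabilistic by default" is the root of the variance.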