87.4% of My Agent's Decisions Run on a 0.8B Model

Source: DEV Community
87.4% of my AI agent's inference calls run on a 0.8B parameter model. Not as a demo. Not on a benchmark. In production, 24/7, for 18 days straight. Here's the data, and what it means for how we should be building agents.

The Setup

I run a personal AI agent called mini-agent, a perception-driven system that monitors my development environment, manages tasks, and assists with projects. The "brain" is Claude (Opus/Sonnet). It's powerful, but every call costs tokens and time. So I built a cascade layer: a local 0.8B model (Qwen2.5) handles decisions first. Only when it can't, or when the task genuinely needs deep reasoning, does the request escalate to a 9B model and then to Claude.

After 18 days of continuous operation, I analyzed 12,265 inference calls. Here's what the data says.

The Numbers

| Task Type | Total Calls | Local (0.8B) Rate | Fallback Rate |
|---|---|---|---|
| Chat classification | 3,413 | 99.8% | 0.2% (7 calls) |
| Memory query routing | 7,347 | 99.6% | 0.4% (33 calls) |
| Working memory update | 1,505 | 0.3% | 99.7% (by design) |
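The headline 87.4% can be sanity-checked against the per-task rows. This snippet only recomputes the weighted local rate from the numbers reported above (using the reported fallback counts where given; the working-memory row reports only a rate):

```python
# Recompute the overall local-handling rate from the table's numbers.
rows = [
    # (total_calls, calls_handled_locally)
    (3413, 3413 - 7),             # chat classification: 7 fallbacks
    (7347, 7347 - 33),            # memory query routing: 33 fallbacks
    (1505, round(1505 * 0.003)),  # working memory update: ~0.3% local
]
total = sum(t for t, _ in rows)
local = sum(l for _, l in rows)
print(total)                           # → 12265
print(round(100 * local / total, 1))   # → 87.4
```

The three task totals do sum to the 12,265 calls analyzed, and the weighted local rate lands on the reported 87.4%.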
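The cascade layer described in the setup can be sketched as a chain of tiers, each of which either answers or declines so the request escalates. This is a minimal illustrative sketch, not the author's actual code; the tier names and the decline-by-returning-`None` convention are assumptions:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Tier:
    name: str
    # A runner returns an answer, or None to signal "escalate to the next tier".
    run: Callable[[str], Optional[str]]

def cascade(task: str, tiers: list[Tier]) -> tuple[str, str]:
    """Try each tier cheapest-first; return (tier_name, answer)."""
    for tier in tiers[:-1]:
        answer = tier.run(task)
        if answer is not None:
            return tier.name, answer
    # The last tier (the big model) is the authoritative fallback.
    last = tiers[-1]
    return last.name, last.run(task)

# Toy runners: the 0.8B tier handles short classification-style requests,
# everything else falls through. Real gating would use model confidence.
tiers = [
    Tier("qwen-0.8b", lambda t: "chat" if len(t) < 40 else None),
    Tier("9b", lambda t: None),  # always declines in this toy example
    Tier("claude", lambda t: "deep answer"),
]

print(cascade("classify this message", tiers))
# → ('qwen-0.8b', 'chat')
print(cascade("write a detailed refactoring plan for module X", tiers))
# → ('claude', 'deep answer')
```

The key design property is that the expensive model is only ever invoked when every cheaper tier has declined, which is what drives the local-handling rate in the table.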