For bounded orchestration decisions, the right model is often the smallest one that can pass a focused validation loop.
Most of the visible work in an engineering agent happens after it starts touching code: reading files, proposing changes, running tests, and opening PRs. The less visible cost is the orchestration work around that: deciding what context to fetch, which tool to call, and where the work should happen.
Repo inference is one of those steps. When Charlie receives a task, he often needs to decide which customer GitHub repository the task is actually about. The repo-inference step examines the customer’s repo inventory and selects the primary repo for the work. That sounds simple until the signal comes from a Linear comment, a Slack thread, a GitHub webhook, or a request that mentions a product feature rather than a repo name.
After the V2 rollout, repo inference became one of our larger orchestrator inference costs. It also ran at roughly 2× the event volume of the next-highest orchestrator inference path. That made it a good place to test a practical question: does this bounded routing step need a larger general model, or can a smaller model pass the validation loop?
The implementation that rolled out on April 27, 2026 moved repo inference from our gpt-5.4 preset to a gpt-5.4-nano preset. The default preset in infer-task-repo.ts changed from:
openai/gpt-5.4-low-reasoning-low-verbosity-priority
To:
openai/gpt-5.4-nano-low-reasoning-low-verbosity-priority
The preset still used low reasoning, low verbosity, and the priority service tier. The main change was the underlying model: gpt-5.4-nano.
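In code, a change like this can be as small as one default string. The following is a minimal sketch, not the contents of infer-task-repo.ts: the constant and function names are hypothetical, and only the two preset strings come from the rollout described above.

```typescript
// Hypothetical sketch: only the two preset strings come from the post.
// The constant names and the override parameter are assumptions.
const PREVIOUS_REPO_INFERENCE_PRESET =
  "openai/gpt-5.4-low-reasoning-low-verbosity-priority";
const DEFAULT_REPO_INFERENCE_PRESET =
  "openai/gpt-5.4-nano-low-reasoning-low-verbosity-priority";

// Resolve the preset for one call. An explicit override wins, which makes it
// easy to run the same fixtures against both presets; blank overrides are
// ignored so an empty config value cannot silently disable the default.
function resolveRepoInferencePreset(override?: string): string {
  const trimmed = override?.trim();
  return trimmed ? trimmed : DEFAULT_REPO_INFERENCE_PRESET;
}
```

Keeping the previous preset addressable through an override is one way to make a rollback or a side-by-side comparison a configuration change rather than a code change.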
We also tightened the repo-choice prompt. The older prompt relied on a looser evidence hierarchy. The updated prompt makes the decision process more explicit: first extract the actionable target from the human-facing request, then use direct repo mentions or mapped repo context, then match to inventory routing hints, package names, service names, and top-level paths. It also clarifies that provider names can be data sources rather than implementation targets. A request to summarize what happened in Linear should not automatically route to the Linear integration repo.
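The precedence in that prompt can be spelled out as ordinary code. The production step is a prompt, not a rule engine, so treat this as an illustration of the evidence order only: every type, field, and function name below is hypothetical, and the substring matching is a deliberate simplification.

```typescript
// Hypothetical shapes: the post does not describe the real inventory schema.
interface RepoEntry {
  repo: string; // e.g. "acme/dashboard"
  routingHints: string[]; // package names, service names, top-level paths
}

interface RoutingSignal {
  actionableTarget: string; // what the human wants changed, extracted first
  directRepoMentions: string[]; // repos named explicitly in the request
  mappedRepo?: string; // repo already mapped to the originating context
}

type EvidenceKind = "direct-mention" | "mapped-context" | "inventory-hint" | "none";

interface RepoDecision {
  repo: string | null;
  evidence: EvidenceKind;
}

function chooseRepo(signal: RoutingSignal, inventory: readonly RepoEntry[]): RepoDecision {
  const known = new Set(inventory.map((entry) => entry.repo));

  // 1. A direct repo mention wins, but only if it exists in the inventory.
  const direct = signal.directRepoMentions.find((mention) => known.has(mention));
  if (direct !== undefined) return { repo: direct, evidence: "direct-mention" };

  // 2. Otherwise fall back to mapped repo context.
  if (signal.mappedRepo !== undefined && known.has(signal.mappedRepo)) {
    return { repo: signal.mappedRepo, evidence: "mapped-context" };
  }

  // 3. Otherwise match routing hints against the actionable target only,
  //    not the whole request, so a provider named as a data source
  //    ("what happened in Linear") does not pull in that provider's repo.
  const target = signal.actionableTarget.toLowerCase();
  const hinted = inventory.find((entry) =>
    entry.routingHints.some((hint) => hint.length > 0 && target.includes(hint.toLowerCase())),
  );
  if (hinted !== undefined) return { repo: hinted.repo, evidence: "inventory-hint" };

  return { repo: null, evidence: "none" };
}
```

With an inventory containing a Linear integration repo hinted by "linear" and a dashboard repo hinted by "dashboard", a request whose actionable target is "add a weekly summary panel to the dashboard" resolves to the dashboard repo even if the original message mentioned Linear as the source of the data.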
Switching models without reducing ambiguity only moves risk around. The smaller model needed a narrower job, not encouragement.
Repo inference is a bounded classification step. The model is not designing an architecture or writing a migration. It is choosing one repo from a finite inventory, with a small amount of structured context and a constrained output.
That makes it a good candidate for a smaller model, but only if the step is validated directly. We ran a focused harness over repo-inference fixtures and iterated on the prompt based on failures. The final harness iteration passed all 9 fixtures twice in a row: 9/9 + 9/9, meeting the requested two-run 100% accuracy bar.
That does not prove the model will never choose the wrong repo. It does mean the decision boundary was explicit enough to pass the cases we cared about before rollout. For an orchestrator step like this, the bar is practical: the task is narrow, the output is observable, and failures are cheap enough to catch in review or telemetry.
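A two-run bar like this is simple to encode. The sketch below is an assumed shape for such a harness, not the one described above: the fixture fields, the InferRepo signature, and the function names are hypothetical, and the inference function is synchronous only to keep the example short (a real model call would be awaited).

```typescript
// Hypothetical fixture shape: one request and the repo it should route to.
interface RepoFixture {
  id: string;
  request: string;
  expectedRepo: string;
}

// Stand-in for the model-backed step. Assumption: sync here for brevity.
type InferRepo = (request: string) => string;

interface RunResult {
  passed: number;
  total: number;
  failedIds: string[];
}

// One pass over every fixture, recording which ones routed to the wrong repo.
function runFixtures(fixtures: readonly RepoFixture[], infer: InferRepo): RunResult {
  const failedIds = fixtures
    .filter((fixture) => infer(fixture.request) !== fixture.expectedRepo)
    .map((fixture) => fixture.id);
  return { passed: fixtures.length - failedIds.length, total: fixtures.length, failedIds };
}

// The bar: every fixture must pass on each of `requiredRuns` consecutive
// runs. An empty fixture set never passes, so a misconfigured harness
// cannot report a vacuous 100%.
function passesBar(
  fixtures: readonly RepoFixture[],
  infer: InferRepo,
  requiredRuns = 2,
): boolean {
  if (fixtures.length === 0) return false;
  for (let run = 0; run < requiredRuns; run++) {
    if (runFixtures(fixtures, infer).failedIds.length > 0) return false;
  }
  return true;
}
```

Requiring consecutive clean runs rather than a single one is a cheap guard against a nondeterministic model passing once by luck; the failedIds list is what drives the next prompt iteration.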
We compared a corrected 22-hour pre/post window around the cutover. The exact cutover minute varies depending on whether you use the cost analysis timestamp, the PR merge time, or the first observed nano production call, so the useful claim is the before/after economics, not the minute-by-minute boundary.
| Metric | Before | After | Change |
|---|---|---|---|
| Calls | 14,891 | 15,709 | +5.5% |
| Total cost | $639.17 | $65.06 | −89.8% |
| Cost per call | $0.0429 | $0.00414 | −90.4% |
| Relative cost per call | 1× | ~0.096× | ~10.4× cheaper |
| Estimated savings over post window | — | ~$574 | — |
| Annualized estimate | — | ~$229k/year | If traffic mix and volume hold |
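
The derived rows in the table follow from the four measured values (calls and total cost, before and after). A short sketch of that arithmetic, assuming both windows are 22 hours and that annualizing means scaling the window savings to 8,760 hours; the type and variable names are illustrative:

```typescript
interface CostWindow {
  calls: number;
  totalCostUsd: number;
  hours: number;
}

// Measured values from the table; the 22-hour length is from the text.
const preWindow: CostWindow = { calls: 14891, totalCostUsd: 639.17, hours: 22 };
const postWindow: CostWindow = { calls: 15709, totalCostUsd: 65.06, hours: 22 };

const costPerCall = (w: CostWindow): number => w.totalCostUsd / w.calls;
const pctChange = (from: number, to: number): number => ((to - from) / from) * 100;

const callsChangePct = pctChange(preWindow.calls, postWindow.calls); // about +5.5
const totalCostChangePct = pctChange(preWindow.totalCostUsd, postWindow.totalCostUsd); // about -89.8
const perCallChangePct = pctChange(costPerCall(preWindow), costPerCall(postWindow)); // about -90.4
const timesCheaper = costPerCall(preWindow) / costPerCall(postWindow); // about 10.4

// Savings are the plain difference in window totals, not adjusted for the
// slightly higher post-cutover call volume.
const windowSavingsUsd = preWindow.totalCostUsd - postWindow.totalCostUsd; // about 574

// Annualize by scaling the window to a year. This only holds if traffic
// mix and volume stay the same, as the table notes.
const HOURS_PER_YEAR = 24 * 365;
const annualizedSavingsUsd = windowSavingsUsd * (HOURS_PER_YEAR / postWindow.hours); // about 229k
```

Because the savings figure is not volume-adjusted, it slightly understates what the old preset would have cost at post-cutover volume.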
The corrected nano rate card used for the analysis was $0.20 / $0.02 / $1.25 per 1M tokens for input, cached input, and output.
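Applying a rate card like this is a per-category multiplication. The sketch below uses the three rates quoted above; the type names, the function name, and the token counts in the usage note are hypothetical, and it does not model any service-tier pricing adjustment.

```typescript
// USD per 1M tokens, by token category.
interface RateCardUsdPerMillion {
  input: number; // uncached input tokens
  cachedInput: number;
  output: number;
}

interface TokenUsage {
  inputTokens: number; // uncached input only
  cachedInputTokens: number;
  outputTokens: number;
}

interface CostBreakdownUsd {
  input: number;
  cachedInput: number;
  output: number;
  total: number;
}

// Rates from the analysis: $0.20 / $0.02 / $1.25 per 1M tokens.
const NANO_RATE_CARD: RateCardUsdPerMillion = { input: 0.2, cachedInput: 0.02, output: 1.25 };

// Cost per category plus the total, so the dominant category is visible.
function costBreakdownUsd(usage: TokenUsage, rates: RateCardUsdPerMillion): CostBreakdownUsd {
  const input = (usage.inputTokens * rates.input) / 1_000_000;
  const cachedInput = (usage.cachedInputTokens * rates.cachedInput) / 1_000_000;
  const output = (usage.outputTokens * rates.output) / 1_000_000;
  return { input, cachedInput, output, total: input + cachedInput + output };
}
```

For a made-up usage of 1M uncached input, 4M cached input, and 0.5M output tokens, this yields roughly $0.20, $0.08, and $0.63 respectively. Returning the breakdown rather than only the total makes it easy to see which category to optimize next.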
There is one useful nuance in the token mix. Post-cutover output tokens were similar to pre-cutover output tokens: about 1.84M after versus 1.98M before. After the switch, output tokens dominated the remaining cost. That means the next round of savings would likely come less from trimming input and more from keeping the output contract tight.
Latency improved, but the claim is narrower than the cost claim. Direct repo-inference LLM calls were about 10.1% faster at p50 and 7.1% faster on average. p95 was effectively flat, and p99 was worse in the sample because of outliers. We should not describe this as a full routing-phase or end-to-end tail-latency improvement. It was a modest latency win for the direct LLM call and a much larger cost win.
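The reason p50 and p99 can move in opposite directions is visible in how percentiles are computed. A minimal sketch using the nearest-rank method, which is one common definition and not necessarily the one our telemetry uses; the function and type names are illustrative:

```typescript
// Nearest-rank percentile over an ascending-sorted sample.
function percentile(sortedAsc: readonly number[], p: number): number {
  if (sortedAsc.length === 0) throw new Error("empty sample");
  if (p < 0 || p > 100) throw new RangeError("p must be in [0, 100]");
  const rank = Math.max(1, Math.ceil((p / 100) * sortedAsc.length));
  const value = sortedAsc[rank - 1];
  if (value === undefined) throw new Error("rank out of range");
  return value;
}

interface LatencySummary {
  mean: number;
  p50: number;
  p95: number;
  p99: number;
}

// Summarize one window of per-call latencies in milliseconds.
function summarizeLatency(latenciesMs: readonly number[]): LatencySummary {
  const sorted = latenciesMs.slice().sort((a, b) => a - b);
  const mean = sorted.reduce((sum, ms) => sum + ms, 0) / sorted.length;
  return {
    mean,
    p50: percentile(sorted, 50),
    p95: percentile(sorted, 95),
    p99: percentile(sorted, 99),
  };
}

// Percent change between windows; negative means faster after the cutover.
function latencyChangePct(beforeMs: number, afterMs: number): number {
  return ((afterMs - beforeMs) / beforeMs) * 100;
}
```

A handful of slow outliers in the post-cutover sample is enough to push p99 up even when the typical call is faster, which is why each statistic is reported separately instead of being collapsed into one latency claim.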
The useful rule is: use the smallest model that reliably solves a bounded, validated orchestration step.
For repo inference, that meant five things:
This is a pattern we expect to reuse. Agent systems run many small orchestration steps: classify, route, select, summarize, decide whether to fetch more context. Some need stronger models. Many do not. The work is separating those cases instead of treating every inference call as if it needs the same model.
In this case, the repo-inference task was narrow enough, observable enough, and validated enough to move down the model stack. The result was roughly 90% lower cost per call, with call volume slightly up and quality checks passing before rollout.
That is the kind of optimization that compounds when there are thousands of small orchestration calls per day. The model got smaller; the boundary around it got sharper. That was the important part.