90% cheaper repo inference with gpt-5.4 nano

Most of the visible work in an engineering agent happens after it starts touching code: reading files, proposing changes, running tests, and opening PRs. The less visible cost is the orchestration work around that: deciding what context to fetch, which tool to call, and where the work should happen.

Repo inference is one of those steps. When Charlie receives a task, he often needs to decide which customer GitHub repository the task is actually about. The repo-inference step examines the customer’s repo inventory and selects the primary repo for the work. That sounds simple until the signal comes from a Linear comment, a Slack thread, a GitHub webhook, or a request that mentions a product feature rather than a repo name.

After the V2 rollout, repo inference became one of our larger orchestrator inference costs. It also ran at roughly 2× the event volume of the next-highest orchestrator inference path. That made it a good place to test a practical question: does this bounded routing step need a larger general model, or can a smaller model pass the validation loop?

What changed

The implementation that rolled out on April 27, 2026 moved repo inference from our gpt-5.4 preset to a gpt-5.4-nano preset. The default preset in infer-task-repo.ts changed from:

openai/gpt-5.4-low-reasoning-low-verbosity-priority

to:

openai/gpt-5.4-nano-low-reasoning-low-verbosity-priority

The preset still used low reasoning, low verbosity, and the priority service tier. The main change was the underlying model: gpt-5.4-nano.

We also tightened the repo-choice prompt. The older prompt relied on a looser evidence hierarchy. The updated prompt makes the decision process more explicit: first extract the actionable target from the human-facing request, then use direct repo mentions or mapped repo context, then match to inventory routing hints, package names, service names, and top-level paths. It also clarifies that provider names can be data sources rather than implementation targets. A request to summarize what happened in Linear should not automatically route to the Linear integration repo.
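
The ordering described above can be sketched as a small deterministic function. This is only an illustration of the evidence hierarchy, not the prompt or the code in infer-task-repo.ts: every type, field, and repo name below is hypothetical, and the substring matching is a deliberate simplification of what the model does.

```typescript
// Hypothetical shapes; none of these names come from the real infer-task-repo.ts.
interface RepoEntry {
  name: string;             // e.g. "acme/storefront"
  routingHints: string[];   // product or feature words that map to this repo
  packageNames: string[];
  serviceNames: string[];
  topLevelPaths: string[];
}

interface RepoRequest {
  actionableTarget: string; // step 1 output: what the human wants changed
  mappedRepo?: string;      // repo already attached by upstream context, if any
  dataSources: string[];    // providers to read from; deliberately never matched
}

type Evidence = "direct-mention" | "mapped-context" | "inventory-match" | "none";

interface Decision {
  repo: string | null;
  evidence: Evidence;
}

function chooseRepo(req: RepoRequest, inventory: readonly RepoEntry[]): Decision {
  const target = req.actionableTarget.toLowerCase();
  const mentions = (token: string): boolean =>
    token.length > 0 && target.includes(token.toLowerCase());

  // 2a. A direct repo mention (full name, or the part after the owner) wins.
  for (const entry of inventory) {
    const shortName = entry.name.split("/").pop() ?? entry.name;
    if (mentions(entry.name) || mentions(shortName)) {
      return { repo: entry.name, evidence: "direct-mention" };
    }
  }

  // 2b. Otherwise trust mapped repo context, if it names a known repo.
  const mapped = req.mappedRepo;
  if (mapped !== undefined && inventory.some((e) => e.name === mapped)) {
    return { repo: mapped, evidence: "mapped-context" };
  }

  // 3. Otherwise match the target against inventory metadata.
  for (const entry of inventory) {
    const tokens = [
      ...entry.routingHints,
      ...entry.packageNames,
      ...entry.serviceNames,
      ...entry.topLevelPaths,
    ];
    if (tokens.some((t) => mentions(t))) {
      return { repo: entry.name, evidence: "inventory-match" };
    }
  }

  // No usable evidence: report that instead of guessing from a provider name.
  return { repo: null, evidence: "none" };
}

// Made-up inventory for illustration only.
const demoInventory: RepoEntry[] = [
  {
    name: "acme/storefront",
    routingHints: ["checkout", "cart"],
    packageNames: ["@acme/storefront"],
    serviceNames: [],
    topLevelPaths: ["apps/storefront"],
  },
  {
    name: "acme/linear-integration",
    routingHints: ["linear"],
    packageNames: ["@acme/linear-sync"],
    serviceNames: ["linear-sync"],
    topLevelPaths: [],
  },
];

// Linear is only where the request came from, so it lives in dataSources
// and never competes with the actionable target.
const decision = chooseRepo(
  { actionableTarget: "Fix rounding in the checkout total", dataSources: ["linear"] },
  demoInventory,
);
console.log(decision.repo, decision.evidence); // acme/storefront inventory-match
```

Keeping the provider name in a separate field is what stops a "summarize what happened in Linear" style request from matching the Linear integration repo by accident.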

Switching models without reducing ambiguity only moves risk around. The smaller model needed a narrower job, not encouragement.

Why this was safe enough to try

Repo inference is a bounded classification step. The model is not designing an architecture or writing a migration. It is choosing one repo from a finite inventory, with a small amount of structured context and a constrained output.
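
One way to keep the output constrained is to reject anything that is not a member of the inventory. The sketch below assumes the model returns a JSON object with a single `repo` field; that contract, the function name, and the repo names are assumptions for illustration, not the actual output schema.

```typescript
// Hypothetical output contract: {"repo": "<owner>/<name>"} and nothing else.
type RepoChoice =
  | { ok: true; repo: string }
  | { ok: false; reason: string };

function parseRepoChoice(raw: string, inventory: readonly string[]): RepoChoice {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { ok: false, reason: "output is not valid JSON" };
  }

  if (typeof parsed !== "object" || parsed === null) {
    return { ok: false, reason: "output is not a JSON object" };
  }

  const repo = (parsed as { repo?: unknown }).repo;
  if (typeof repo !== "string") {
    return { ok: false, reason: "missing string field: repo" };
  }

  // The model may only pick from the finite inventory it was shown.
  if (!inventory.includes(repo)) {
    return { ok: false, reason: `repo not in inventory: ${repo}` };
  }

  return { ok: true, repo };
}

const knownRepos = ["acme/web", "acme/api"];
const choice = parseRepoChoice('{"repo":"acme/api"}', knownRepos);
console.log(choice.ok ? choice.repo : choice.reason);
```

A rejected choice is a visible failure rather than a silent wrong route, which is part of what makes the step cheap to monitor.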

That makes it a good candidate for a smaller model, but only if the step is validated directly. We ran a focused harness over repo-inference fixtures and iterated on the prompt based on failures. The final harness iteration passed all 9 fixtures twice in a row: 9/9 + 9/9, meeting the requested two-run 100% accuracy bar.
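
The two-run bar can be expressed as a check over scored runs. This is a minimal sketch with hypothetical names, not the harness itself; it assumes each run yields one predicted repo per fixture, and it leaves the model calls out so the scoring logic stays synchronous.

```typescript
// Hypothetical fixture and result shapes.
interface Fixture {
  id: string;
  expectedRepo: string;
}

interface RunResult {
  passed: number;
  total: number;
  failedIds: string[];
}

// Score one run: a missing or wrong prediction counts as a failure.
function scoreRun(
  fixtures: readonly Fixture[],
  predictions: ReadonlyMap<string, string>,
): RunResult {
  const failedIds = fixtures
    .filter((f) => predictions.get(f.id) !== f.expectedRepo)
    .map((f) => f.id);
  return {
    passed: fixtures.length - failedIds.length,
    total: fixtures.length,
    failedIds,
  };
}

// The bar: the most recent N runs must each pass every fixture.
function meetsBar(runs: readonly RunResult[], requiredConsecutive = 2): boolean {
  if (runs.length < requiredConsecutive) return false;
  return runs
    .slice(-requiredConsecutive)
    .every((r) => r.total > 0 && r.passed === r.total);
}
```

Under this definition a history ending in 9/9 followed by 9/9 meets the bar, while a single clean run after a failing one does not. The failed IDs from each run are what drive the next prompt iteration.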

That does not prove the model will never choose the wrong repo. It does mean the decision boundary was explicit enough to pass the cases we cared about before rollout. For an orchestrator step like this, the bar is practical: the task is narrow, the output is observable, and failures are cheap enough to catch in review or telemetry.

Results

We compared a corrected 22-hour pre/post window around the cutover. The exact cutover minute varies depending on whether you use the cost analysis timestamp, the PR merge time, or the first observed nano production call, so the useful claim is the before/after economics, not the minute-by-minute boundary.

Metric	Before	After	Change
Calls	14,891	15,709	+5%
Total cost	$639.17	$65.06	−89.8%
Cost per call	$0.0429	$0.00414	−90.4%
Relative cost per call	1×	~0.096×	~10.4× cheaper
Estimated savings over post window	—	~$574	—
Annualized estimate	—	~$229k/year	If traffic mix and volume hold
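
The derived rows follow from the call counts and totals in the table. The sketch below reproduces them; the annualization is a straight-line extrapolation from the 22-hour window, which is an assumption about the method that happens to land on the same figure.

```typescript
interface CostWindow {
  calls: number;
  totalCostUsd: number;
}

// Raw figures from the table above.
const pre: CostWindow = { calls: 14891, totalCostUsd: 639.17 };
const post: CostWindow = { calls: 15709, totalCostUsd: 65.06 };
const WINDOW_HOURS = 22;

const costPerCall = (w: CostWindow): number => w.totalCostUsd / w.calls;
const pctChange = (from: number, to: number): number => ((to - from) / from) * 100;

const perCallPre = costPerCall(pre);    // ≈ 0.0429
const perCallPost = costPerCall(post);  // ≈ 0.00414

const callsChangePct = pctChange(pre.calls, post.calls);               // ≈ +5.5
const totalChangePct = pctChange(pre.totalCostUsd, post.totalCostUsd); // ≈ -89.8
const perCallChangePct = pctChange(perCallPre, perCallPost);           // ≈ -90.4
const relativeCost = perCallPost / perCallPre;                         // ≈ 0.096
const timesCheaper = perCallPre / perCallPost;                         // ≈ 10.4

// Savings over the post window, then scaled to a year (assumed method).
const savingsUsd = pre.totalCostUsd - post.totalCostUsd;               // ≈ 574.11
const annualizedUsd = (savingsUsd / WINDOW_HOURS) * 24 * 365;          // ≈ 228,600

console.log({
  perCallPre: perCallPre.toFixed(4),
  perCallPost: perCallPost.toFixed(5),
  callsChangePct: callsChangePct.toFixed(1),
  totalChangePct: totalChangePct.toFixed(1),
  perCallChangePct: perCallChangePct.toFixed(1),
  relativeCost: relativeCost.toFixed(3),
  timesCheaper: timesCheaper.toFixed(1),
  savingsUsd: savingsUsd.toFixed(2),
  annualizedUsd: Math.round(annualizedUsd),
});
```

Because call volume rose slightly, the per-call drop is a little larger than the total-cost drop.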

The corrected nano rate card used for the analysis was $0.20 / $0.02 / $1.25 per 1M tokens for input, cached input, and output.

There is one useful nuance in the token mix. Post-cutover output tokens were similar to pre-cutover output tokens: about 1.84M after versus 1.98M before. After the switch, output tokens dominated the remaining cost. That means the next round of savings would likely come less from trimming input and more from keeping the output contract tight.

Latency improved, but the claim is narrower than the cost claim. Direct repo-inference LLM calls were about 10.1% faster at p50 and 7.1% faster on average. p95 was effectively flat, and p99 was worse in the sample because of outliers. We should not describe this as a full routing-phase or end-to-end tail-latency improvement. It was a modest latency win for the direct LLM call and a much larger cost win.
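
To see how the median can improve while the tail gets worse, here is a small sketch using the nearest-rank percentile definition. The latency samples are made up purely for illustration and are not the production measurements; the function names are also hypothetical.

```typescript
// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are at or below it.
function percentile(samples: readonly number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

function mean(samples: readonly number[]): number {
  if (samples.length === 0) throw new Error("no samples");
  return samples.reduce((sum, x) => sum + x, 0) / samples.length;
}

// Illustrative latencies in milliseconds (invented, not measured).
const latencyBefore = [400, 420, 450, 480, 500, 520, 560, 600, 900, 1200];
const latencyAfter = [360, 380, 400, 430, 450, 470, 500, 560, 900, 1500];

console.log("p50", percentile(latencyBefore, 50), "->", percentile(latencyAfter, 50));
console.log("p99", percentile(latencyBefore, 99), "->", percentile(latencyAfter, 99));
console.log("mean", mean(latencyBefore), "->", mean(latencyAfter));
```

In this invented sample the typical call gets faster, yet a single slow outlier is enough to push p99 the other way. That is why the median and the tail are reported separately.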

What we learned

The useful rule is: use the smallest model that reliably solves a bounded, validated orchestration step.

For repo inference, that meant five things:

Bound the decision. Repo inference picks from a known inventory. There is no open-ended generation problem hiding inside the task.
Make the prompt procedural. Smaller models benefit from clear ordering: what evidence matters first, what counts as a tie-breaker, and which tempting signals should be ignored.
Validate against fixtures. Cost improvements are only useful if accuracy holds on the cases that matter.
Measure the right surface. Cost per call was the cleanest result. Direct-call latency improved modestly. End-to-end routing latency was not proven by this analysis.
Keep the output small. Once input gets cheap, verbose outputs can become the main remaining cost.

This is a pattern we expect to reuse. Agent systems run many small orchestration steps: classify, route, select, summarize, decide whether to fetch more context. Some need stronger models. Many do not. The work is separating those cases instead of treating every inference call like it needs the same tool.

In this case, the repo-inference task was narrow enough, observable enough, and validated enough to move down the model stack. The result was roughly 90% lower cost per call, with call volume slightly up and quality checks passing before rollout.

That is the kind of optimization that compounds when there are thousands of small orchestration calls per day. The model got smaller; the boundary around it got sharper. That was the important part.
