Don’t ask if it works. Ask for proof.

AI coding agents will usually answer correctness questions with yes. Ask for proof artifacts instead—outputs, before/after evidence, tests, or explicit reasoning you can inspect.


There’s a pattern most of us fall into when working with AI coding agents. You ask the agent to make a change, it makes the change, and then you ask: “Will this work if some rows have null emails?” or “Does this handle concurrent requests?” The agent says yes. It always says yes.

This isn’t a bug in the model. It’s a bug in the question. You’re asking the LLM to evaluate its own work using the same context it used to produce it. That’s like asking the developer who wrote the code to also be the sole reviewer—except worse, because LLMs have a well-documented tendency to agree with the framing of your question.

“Does this look right?” is a classification question. The model pattern-matches against what “right” looks like and tells you what you want to hear. “Show me proof it works” is a generation question. It produces an artifact you can actually inspect. The fix is simple: stop asking yes/no questions about correctness and ask for evidence instead.


Not all proof is equal, but almost any proof is better than “yes, that should work.”

What proof looks like

Execution proof is the strongest. The agent runs something and shows you the output. Test suites pass. Builds succeed. A curl request returns what you expect. The LLM can hallucinate an explanation, but it can’t fake the terminal output of a command it actually ran.

Before/after proof is nearly as strong. The agent shows you the state of something before the change and after. A failing test that now passes. An API response that changed. Error logs that disappeared. This gives you a diff of behavior, not just a diff of code.

Constructed verification is a step down but still valuable. The agent writes a new test or validation script to check its own work. The catch: it wrote both the code and the check, so it can have the same blind spot in both. But at least you can read the test and decide if it’s checking the right thing.

Reasoned explanation is where most people currently live, and it’s the weakest form. The agent traces through the logic and explains why it’s correct. This is still just the LLM talking—but forcing it to show its reasoning is better than a bare “yes.”

If you work every day with a coding agent, the goal is to push your interactions toward the top of this list.

In practice

Instead of: “Will this migration work if some rows have null emails?”

Ask: “Seed a test database with rows where email is null, run the migration, and show me the output.”
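The kind of script you’d expect back looks something like this. It’s a minimal sketch using an in-memory SQLite database and a made-up backfill migration (the table, columns, and sentinel value are all illustrative, not from any real schema); the point is the printed before/after evidence.

```python
import sqlite3

# Seed an in-memory database with rows where email is NULL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [("a@example.com",), (None,), (None,)],
)

before = conn.execute(
    "SELECT COUNT(*) FROM users WHERE email IS NULL"
).fetchone()[0]

# The hypothetical "migration" under test: backfill NULL emails with a sentinel.
conn.execute("UPDATE users SET email = 'unknown@invalid' WHERE email IS NULL")

after = conn.execute(
    "SELECT COUNT(*) FROM users WHERE email IS NULL"
).fetchone()[0]

print(f"NULL emails before: {before}, after: {after}")
```

Whatever the real migration does, the shape of the proof is the same: seed the bad state, run the change, print counts you can check.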

Instead of: “Is this actually fixing the N+1 query?”

Ask: “Add query logging, hit the endpoint, and show me the query count before and after your change.”
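A sketch of what that proof might look like, using SQLite’s statement trace hook as a stand-in for real query logging (the schema and queries are invented for illustration). The N+1 version issues one query per author; the fix collapses them into a single JOIN, and the logged counts show the difference.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE books (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT)"
)
conn.executemany("INSERT INTO authors (name) VALUES (?)", [("A",), ("B",), ("C",)])
conn.executemany(
    "INSERT INTO books (author_id, title) VALUES (?, ?)",
    [(1, "x"), (2, "y"), (3, "z")],
)

queries = []
conn.set_trace_callback(queries.append)  # log every statement SQLite executes

# N+1 version: one query for authors, then one more per author.
authors = conn.execute("SELECT id, name FROM authors").fetchall()
for author_id, _name in authors:
    conn.execute(
        "SELECT title FROM books WHERE author_id = ?", (author_id,)
    ).fetchall()
n_plus_one_count = len(queries)

queries.clear()
# Fixed version: a single JOIN.
conn.execute(
    "SELECT a.name, b.title FROM authors a JOIN books b ON b.author_id = a.id"
).fetchall()
joined_count = len(queries)

print(f"queries before: {n_plus_one_count}, after: {joined_count}")
```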

Instead of: “Will the transaction roll back if the third insert fails?”

Ask: “Force the third insert to fail, then query the database and show me the first two rows are absent.”
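In SQLite terms, that demonstration might look like this sketch: a deliberately invalid third insert inside one transaction, followed by a query proving the first two rows never landed. The table and the NOT NULL trick are illustrative stand-ins for whatever failure you’d actually force.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

# Run three inserts in one transaction; the third violates NOT NULL on purpose.
try:
    with conn:  # commits on success, rolls back on exception
        conn.execute("INSERT INTO items (name) VALUES ('first')")
        conn.execute("INSERT INTO items (name) VALUES ('second')")
        conn.execute("INSERT INTO items (name) VALUES (NULL)")  # forced failure
except sqlite3.IntegrityError as exc:
    print(f"third insert failed as expected: {exc}")

rows = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
print(f"rows after rollback: {rows}")
```

A row count of zero is the artifact: the rollback happened, and you didn’t have to take anyone’s word for it.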

Instead of: “Does the rate limiter reset after the window expires?”

Ask: “Hit the rate limit, wait for the window to pass, hit it again, and show me it allows the request.”
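The same idea, sketched against a toy fixed-window limiter (a stand-in for whatever limiter the agent actually wrote): exhaust the limit, wait out the window, and print what was allowed when.

```python
import time

class FixedWindowLimiter:
    """Minimal stand-in for the limiter under test: N requests per window."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.window_start = time.monotonic()
        self.count = 0

    def allow(self):
        now = time.monotonic()
        if now - self.window_start >= self.window:
            # Window expired: start a fresh one.
            self.window_start = now
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False

limiter = FixedWindowLimiter(limit=2, window_seconds=0.2)

first_window = [limiter.allow() for _ in range(3)]  # third request is blocked
time.sleep(0.25)                                    # wait out the window
after_reset = limiter.allow()                       # should be allowed again

print(f"within window: {first_window}, after reset: {after_reset}")
```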

The pattern is the same every time. Replace the yes/no question with a request to demonstrate the behavior. End your prompt with a request for proof, not a question about correctness.

When proof is hard to get

This works best when your agent can actually execute code—run tests, hit endpoints, query databases. If you’re copying code out of ChatGPT and pasting it into your editor, half of this doesn’t apply. It also gets harder when you don’t have good dev environments.

If you can’t spin up a test database, run migrations against sample data, or hit a staging API, the agent can’t demonstrate much. Investing in your dev environment—local Docker setups, seed data, test infrastructure—pays dividends here because it gives your agent (and you) a place to prove things work. At Charlie, we’ve found that the teams getting the most out of agents are almost always the ones with strong dev environments that agents can operate in freely.

Some things are genuinely hard to prove: UI rendering, subjective code quality, complex distributed system behavior. For those, reasoned explanation might be the best you get. That’s fine. The point isn’t that every interaction needs a proof artifact. It’s that you should default to asking for proof and fall back to explanation only when you have to.

The one-sentence version

When your agent finishes a task, don’t ask “does this work?” Ask it to show you that it works. You’ll catch more bugs, waste less time on false confidence, and build a habit that scales as you give agents more autonomy.

Appendix: More examples

Here are approaches we’ve found useful for the trickier edge cases that come up when working on complex applications.

Cron job concurrency: Instead of “does this handle overlapping runs?”—ask the agent to show the locking mechanism, simulate two concurrent invocations, and prove only one executes.
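A simulation of that proof might look like this sketch, which stands in a non-blocking in-process lock for whatever real locking mechanism (advisory database lock, Redis lock, etc.) the job uses. Two overlapping “runs” race for the lock; only the first executes.

```python
import threading
import time

lock = threading.Lock()
executions = []

def job(name):
    # Non-blocking acquire: an overlapping run gives up instead of waiting.
    if lock.acquire(blocking=False):
        try:
            executions.append(name)
            time.sleep(0.3)  # simulate work while the lock is held
        finally:
            lock.release()
    else:
        print(f"{name}: skipped, previous run still holds the lock")

# Start the second invocation while the first is still mid-run.
t1 = threading.Thread(target=job, args=("run-1",))
t2 = threading.Thread(target=job, args=("run-2",))
t1.start()
time.sleep(0.05)
t2.start()
t1.join()
t2.join()

print(f"executions: {executions}")
```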

CSV edge cases: Instead of “does the export handle commas and newlines in fields?”—ask the agent to generate a CSV with adversarial data (commas, newlines, quotes), show you the raw output, then re-import and prove the round-trip is clean.
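That round-trip proof is easy to sketch with Python’s standard csv module (the field values here are invented adversarial data):

```python
import csv
import io

# Adversarial rows: commas, newlines, and quotes inside fields.
rows = [
    ["id", "note"],
    ["1", "contains, a comma"],
    ["2", "contains\na newline"],
    ["3", 'contains "quotes"'],
]

# Export: the csv writer quotes and escapes fields that need it.
buf = io.StringIO()
csv.writer(buf).writerows(rows)
raw = buf.getvalue()
print(raw)  # inspect the raw output: the tricky fields arrive quoted

# Re-import and prove the round-trip is lossless.
reread = list(csv.reader(io.StringIO(raw)))
print("round-trip clean:", reread == rows)
```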

CI/CD failure handling: Instead of “will the workflow fail if tests fail?”—ask the agent to push a branch with a deliberately failing test and show you the workflow result.

Promise.allSettled error handling: Instead of “am I handling rejected promises correctly?”—ask the agent to test with a mix of resolving and rejecting promises and show the output for each case.

Migration preconditions: Instead of “will this unique constraint migration fail if duplicates exist?”—ask the agent to query the current database for duplicates and show you the count before you run anything.
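The precondition query itself is small; here’s a sketch against an in-memory SQLite table with invented data. A nonzero count means the UNIQUE constraint migration will fail, and you learn that before touching anything.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [("a@example.com",), ("b@example.com",), ("a@example.com",)],
)

# Precondition check: count emails that appear more than once.
dupes = conn.execute(
    """
    SELECT COUNT(*) FROM (
        SELECT email FROM users GROUP BY email HAVING COUNT(*) > 1
    )
    """
).fetchone()[0]

print(f"emails with duplicates: {dupes}")
```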

Invite flow edge cases: Instead of “does the invite work if the email already has an account?”—ask the agent to create the conflicting state, trigger the invite, and show you exactly what happens.