Build to Think, Not only to Ship

Implementation was never the expensive phase. It was the irreversibility gate. When AI makes building cheap, it becomes a thinking tool - and process must evolve to match.

Implementation was never the expensive phase of software. It was the irreversibility gate. Every design doc, architecture review, and RFC sign-off existed because the moment you built something, unwinding it cost more than getting it right upfront. Process was a hedge against premature commitment, not against the cost of typing.

That constraint is dissolving. AI code generation lets you produce five implementations and throw four away. The interesting question is not "how do we build faster" but what happens when building becomes cheap enough to think with.

Most teams haven't updated their process to match. They're using AI to accelerate a single-variant funnel: plan one approach, generate it faster, ship the first thing that works. This produces slop at velocity, and the evidence is starting to pile up.

The Single-Variant Trap

Enterprise data from Apiiro (2025) across Fortune 50 repositories shows AI-assisted developers commit code at 3-4x the rate of peers while architectural design flaws rose 153%. The code that gets easier is the code that was already easy. The code that gets harder is the code requiring reasoning about system boundaries.

The Bun runtime's rewrite out of Zig illustrates the mechanism. Developers leaned on AI to translate the code rapidly. The agent produced output that compiled and looked functionally complete. But reviewers found the AI had scattered thousands of "unsafe" overrides throughout the project to force compilation, entirely bypassing the target language's safety systems.

This is the pattern worth naming: the unthought thought. In a hand-translation, the compiler would have blocked the engineer at each unsafe boundary, forcing the question: "How do I actually structure this data to be safe?" That reasoning occurred as a byproduct of slow implementation. The code arrived without the negative space of reasoning that used to surround hand-written implementations. Not wrong reasoning, but the complete absence of reasoning that would have happened naturally. No one decided against the alternatives because no one was forced to consider them. The structural decision never got made.

An Anthropic RCT (Shen & Tamkin, 2026) quantifies this: AI-assisted developers scored 17% lower on comprehension tests covering code they'd written minutes earlier, with the largest gap on debugging questions. The interaction patterns that preserved understanding all required active cognitive engagement - asking conceptual questions, requesting explanations, coding independently then verifying. Delegation produced speed but not reasoning.

And speed alone doesn't clear the pipeline. METR's 2025 RCT found AI tools caused a 19% slowdown for experienced open-source developers (a 2026 follow-up showed the effect shifting toward genuine speedup, but self-reported productivity gains dramatically overstated measured ones in both rounds). Faster implementation with unchanged evaluation capacity just moves the bottleneck from the IDE to the PR review backlog to the staging environment.[1]

[1] Tschand et al. (2025) call this "the Feedback Loop Crisis" across software, hardware, and chip design: the generator is increasingly fast while the evaluator remains slow, expensive, noisy, or incomplete. The bottleneck is never generation. It is closing the loop quickly enough that learning is feasible.

Building as Thinking

Agile already knew the answer: spikes. Timeboxed experiments where you build to learn, not to ship. The issue was never that teams didn't know building-to-learn beats speculating-to-commit. It was that building was too expensive to use as a default thinking tool.

That cost constraint is gone. The threshold shifts from "spike only when uncertainty justifies the cost" to "spike whenever the right approach isn't obvious within a short discussion." When a design discussion stalls, stop debating and start building. Build both approaches. Test them. Let evidence replace assumptions.

This isn't a new idea - engineers have always needed to know what "better" means. What's new is that you used to discover it as a side effect of slow implementation. The compiler blocked you, the test failed, the integration broke, and in the process of fixing it you learned what actually mattered. Now that building is fast, that incidental discovery vanishes. You have to write the criteria down first, or you'll drown in options without knowing which one wins.

Spec what "good" means before you build. What does "better" mean for this specific decision? What load profile, what failure modes, what behavioral differences will you compare against?

A useful set of criteria for a notification system:

  • 10,000 notifications/second sustained, 50,000 burst, on three c5.xlarge instances
  • P99 delivery latency under 800ms sustained, under 2s at burst
  • No message loss during single-node failure (verified by kill-node-during-burst test)
  • Causal per-user ordering (same conversation in order), no global ordering required

And for a harder architectural question, "event sourcing or CRUD for our order system?":

  • Can we reconstruct order state at any historical point without a separate audit system?
  • What's the query latency for "current status of order X" at 2M active orders?
  • How many lines of code in the write path vs. read path? (Proxy for ongoing maintenance burden)
  • Can a new engineer add a "refund requested" state without understanding the full event history?

Neither set of criteria specifies how to build. Both make it possible to compare two implementations mechanically. Criteria catch what you thought to measure; they miss what you didn't. But that's still better than the alternative. Vague requirements like "scalable and reliable" never helped anyone decide. The criteria become the mechanism ensuring you know what you're optimizing for before you drown in options.
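One way to make "compare mechanically" concrete is to encode the notification criteria as data and run every variant's measurements through the same check. The sketch below is illustrative only: the `Criterion` type, the metric keys, and both sets of measurements are hypothetical, chosen to mirror the thresholds listed above rather than taken from a real benchmark.

```python
import operator
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Criterion:
    name: str                                # human-readable label
    metric: str                              # key expected in a variant's results (hypothetical names)
    compare: Callable[[float, float], bool]  # how measured value relates to threshold
    threshold: float


# Thresholds copied from the notification-system criteria in the text.
CRITERIA = [
    Criterion("sustained throughput (msg/s)", "sustained_rate", operator.ge, 10_000),
    Criterion("burst throughput (msg/s)", "burst_rate", operator.ge, 50_000),
    Criterion("P99 latency sustained (ms)", "p99_sustained_ms", operator.lt, 800),
    Criterion("P99 latency burst (ms)", "p99_burst_ms", operator.lt, 2_000),
    Criterion("messages lost in kill-node test", "lost_on_node_kill", operator.eq, 0),
]


def evaluate(results: dict[str, float]) -> dict[str, bool]:
    """Return pass/fail per criterion; a missing metric counts as a failure."""
    report = {}
    for c in CRITERIA:
        value = results.get(c.metric)
        report[c.name] = value is not None and c.compare(value, c.threshold)
    return report


# Hypothetical measurements, invented for illustration only.
push = {"sustained_rate": 12_000, "burst_rate": 48_000,
        "p99_sustained_ms": 410, "p99_burst_ms": 1_700, "lost_on_node_kill": 0}
pull = {"sustained_rate": 11_000, "burst_rate": 35_000,
        "p99_sustained_ms": 760, "p99_burst_ms": 2_400, "lost_on_node_kill": 212}

for label, results in (("push", push), ("pull", pull)):
    failed = [name for name, ok in evaluate(results).items() if not ok]
    print(label, "fails:", failed or "nothing")
```

The point of the structure is that the thresholds live in one place and are applied identically to every variant, so the argument is about the numbers rather than about who measured what.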

The Tournament in Practice

A single spike answers a binary question. Multiple parallel spikes become something richer: a tournament that evolves ideas through building rather than selecting a winner.

Concretely: a team working on notification delivery. Two engineers each produce a working implementation with AI assistance. One builds a push model with fan-out queues. The other builds a pull model with polling. Both run against the same load benchmark defined upfront.

The push model takes longer than expected because the AI-generated fan-out logic assumes at-most-once delivery semantics that contradict the no-message-loss requirement. The pull model appears "done" quickly but silently drops messages above 35k/second under burst. Neither failure would have surfaced from reading the code or debating architecture in a design doc.
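Silent drops and ordering bugs only show up if the shared benchmark checks for them explicitly. Here is a minimal sketch of such a check, assuming each message carries a per-user sequence number assigned by the sender. The `Message` type and `check_delivery` function are hypothetical, and "per-user ordering" is simplified here to "sequence numbers increase within each user's stream".

```python
from typing import NamedTuple


class Message(NamedTuple):
    user_id: str
    seq: int  # per-user sequence number assigned at send time (assumption)


def check_delivery(sent: list[Message], delivered: list[Message]) -> dict:
    """Compare what was sent with what arrived.

    Reports messages that never arrived and messages that arrived after a
    later message for the same user. Duplicates are also reported as out of
    order. Interleaving between different users is allowed, since no global
    ordering is required.
    """
    lost = sorted(set(sent) - set(delivered))

    last_seq: dict[str, int] = {}
    out_of_order: list[Message] = []
    for msg in delivered:
        prev = last_seq.get(msg.user_id)
        if prev is not None and msg.seq <= prev:
            out_of_order.append(msg)
        else:
            last_seq[msg.user_id] = msg.seq

    return {"lost": lost, "out_of_order": out_of_order}


# Tiny illustrative run with made-up traffic.
sent = [Message("a", 1), Message("b", 1), Message("a", 2),
        Message("b", 2), Message("a", 3)]
delivered = [Message("b", 1), Message("a", 1), Message("a", 3), Message("a", 2)]

result = check_delivery(sent, delivered)
print(result)
```

In this made-up run, user b's second message never arrives and user a's second message arrives after the third. Both variants would be fed through the same check, which is what makes a failure like the pull model's dropped messages visible at all.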

Neither wins outright. The push model handles burst well but creates ordering problems. The pull model degrades gracefully but adds latency. So we take the delivery topology from the push variant (A), graft the backpressure mechanism from the pull variant (B), and discover that A's data structures assume synchronous acknowledgment while B's backpressure requires async buffering. The recombination requires real restructuring, not a simple prompt. The result borrows insights from both but shares less code with either than expected. A discarded variant that reveals a constraint isn't waste. It's signal.

Build, compare, learn what actually matters, recombine, build again. Each round gets sharper because it's informed by the failures of the last. We couldn't have learned that the push model saturated the queue from a design doc. We couldn't have learned it from a single spike either. It emerged from comparison. And the criteria evolve between rounds - round one reveals dimensions you didn't think to measure, so round two's evaluation gets sharper.

This also restores the deliberation that single-variant AI work destroys. Comparing two implementations forces reasoning about why they differ, what assumptions each encodes, which trade-offs matter. The unthought thoughts return, not through slow implementation, but through structured comparison.

When Not to Tournament

This approach has a real cost: engineer-time on parallel implementations, shared benchmark infrastructure, and a recombination step that requires judgment. It's not free, and it's not always warranted.

Do tournament when: The decision locks in. When interfaces, data models, or architectural patterns will be load-bearing for the next year of work, the exploration cost is cheap insurance against living with the wrong choice. The tournament earns its cost when reversibility is low: data model choices, protocol designs, architectural boundaries that downstream code will calcify around.

Don't tournament when: The decision is easily reversible in production. Feature flags and progressive rollout handle uncertainty well for user-facing behavior where production telemetry is the best evaluator. If you can ship variant A, measure it, and swap to variant B with no data migration, just ship.

Don't tournament when: The variant space is artificial. Two engineers using the same LLM with similar prompts will converge toward similar solutions - AI has systematic biases that produce convergent architectures. Effective exploration requires human-directed differentiation: different decomposition assumptions, different foundational trade-offs. If the "variants" are cosmetically different implementations of the same idea, you've spent the time learning nothing.

Don't tournament when: You lack the evaluation infrastructure. Multiple variants without a shared benchmark just multiply the review burden. The prerequisite is criteria you can evaluate mechanically, not just taste-based "which code looks cleaner."

The Coordination Problem

The obvious organizational objection: what prevents the final decision from becoming a political fight dressed up as a technical evaluation?

Three things help. First, the criteria are written before anyone builds. The definition of "better" shouldn't be chosen after seeing the results. This doesn't eliminate politics, but it makes post-hoc rationalization harder. Second, the comparison should be as mechanical as possible: benchmark numbers, behavioral diffs, not subjective code review. "Variant A sustains 48k/s, variant B drops at 35k" is harder to argue with than "Variant A is more elegant." Third, the expectation should be recombination, not winner-take-all. When engineers know the likely outcome is "we'll take the topology from yours and the buffering from mine," the dynamic shifts from competition to collaborative exploration.
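"Written before anyone builds" is easier to hold people to if the criteria are fingerprinted at kickoff. This is one possible sketch, not a prescribed process: the criteria dictionary, its key names, and the `fingerprint` and `verify_unchanged` helpers are all hypothetical. It hashes a canonical JSON form of the criteria so that any later edit to a threshold is detectable.

```python
import hashlib
import json


def fingerprint(criteria: dict) -> str:
    """SHA-256 of a canonical JSON encoding (sorted keys, no extra whitespace)."""
    canonical = json.dumps(criteria, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_unchanged(criteria: dict, recorded: str) -> bool:
    """True if the criteria still match the fingerprint recorded at kickoff."""
    return fingerprint(criteria) == recorded


# Hypothetical key names; thresholds mirror the notification example in the text.
criteria = {
    "sustained_rate_min": 10_000,
    "burst_rate_min": 50_000,
    "p99_sustained_ms_max": 800,
    "p99_burst_ms_max": 2_000,
    "lost_on_node_kill_max": 0,
}

# Recorded before any variant exists, for example in the kickoff notes.
recorded = fingerprint(criteria)

# At evaluation time: a quietly loosened threshold no longer matches.
loosened = {**criteria, "p99_burst_ms_max": 3_000}
print(verify_unchanged(criteria, recorded), verify_unchanged(loosened, recorded))
```

Criteria can still change between rounds, as they should. The fingerprint only ensures the change is a visible decision rather than a quiet adjustment after the results are in.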

This won't work in every team culture. Teams with strong individual ownership norms or engineer evaluation tied to "their" code shipping will resist. The approach requires a culture where building something that gets recombined or discarded is valued as contribution, not failure. That's a real organizational constraint, one that can't be solved with process alone.

The Shift

Code was never primarily expensive in dollars. It was expensive in commitment: the moment decisions froze into something hard to unwind.

The old process wasn't broken. It was rational given the cost of implementation. When generation gets cheap, decisions stop freezing at implementation. They freeze at integration, where code connects to production systems, real data, and real users. The scarce skill shifts from production to judgment.

When the next design discussion stalls, write the criteria. Split the approaches. Build both. Compare against what you defined before you started. Keep what the evidence supports. Let the artifacts argue instead of the people.