On Building with LLMs

Large language models are genuinely different from every prior category of software primitive. They’re non-deterministic, can’t be unit tested in any traditional sense, and their failure modes are subtle in ways that the failure modes of databases and APIs never were.

After a year of building production systems on top of them, here’s what I think matters.

The reliability problem is real but tractable

The first instinct when an LLM fails is to blame the model. Usually that’s wrong. In my experience, 80% of reliability failures are prompt failures — ambiguous instructions, missing context, or under-specified output formats.

The discipline of writing prompts is closer to writing technical specifications than it is to writing code. You need to be precise about:

  • What the model knows and doesn’t know
  • What format you expect the output in
  • What to do when the input is unexpected
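As a concrete illustration, here is a minimal sketch of what a prompt written like a specification might look like. The task (invoice field extraction) and the function name are hypothetical examples, not from the post; the point is that each of the three bullets above gets an explicit clause.

```python
# A prompt built like a technical spec: explicit context boundary,
# exact output format, and a defined behavior for unexpected input.
# The invoice-extraction task is a hypothetical example.

def build_extraction_prompt(document_text: str) -> str:
    return "\n".join([
        "You are extracting fields from an invoice.",
        "",
        # What the model knows and doesn't know:
        "Context: you only see the text below. Do not assume access",
        "to other documents, prior messages, or external data.",
        "",
        # What format the output must take:
        "Return a JSON object with exactly these keys:",
        '  "vendor" (string), "total" (number), "due_date" (YYYY-MM-DD).',
        "",
        # What to do when the input is unexpected:
        "If the text is not an invoice, return",
        '{"error": "not_an_invoice"}. If a field is absent, set it to null.',
        "",
        "Invoice text:",
        document_text,
    ])

prompt = build_extraction_prompt("ACME Corp. Total: $120.00. Due 2024-05-01.")
```

Nothing here is model-specific: the discipline is in making the three bullets explicit, not in any particular API.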

Evals before vibes

The single highest leverage thing you can do is build an evaluation suite before you ship anything. Not unit tests — evaluations. A set of real inputs with expected outputs, where you can measure regression.

Without evals you’re flying blind. Every prompt change feels like it either fixed everything or broke everything. You can’t tell which.
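A minimal eval harness can be sketched in a few lines: a fixed set of real inputs with expected outputs, scored so that prompt changes can be compared run over run. Here `classify` is a trivial stub standing in for whatever LLM call is under test; the eval set and labels are hypothetical.

```python
# Minimal eval harness sketch: real inputs paired with expected
# outputs, scored as a pass rate you can track across prompt changes.

EVAL_SET = [
    ("refund my order", "refund"),
    ("where is my package", "shipping"),
    ("cancel my subscription", "cancellation"),
]

def classify(text: str) -> str:
    # Stub for the real LLM call (hypothetical); in practice this
    # would send the input through your prompt and parse the reply.
    keywords = {"refund": "refund", "package": "shipping", "cancel": "cancellation"}
    for word, label in keywords.items():
        if word in text:
            return label
    return "other"

def run_evals(fn, eval_set):
    results = [(inp, expected, fn(inp)) for inp, expected in eval_set]
    passed = sum(1 for _, expected, got in results if expected == got)
    return passed / len(results), results

score, results = run_evals(classify, EVAL_SET)
```

Rerun the same harness after every prompt change: the score tells you whether the change was a fix or a regression, instead of leaving you to guess.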

The right abstraction level

Most teams I’ve seen make one of two mistakes: they either use LLMs where simpler tools would work, or they don’t use them where they’d be transformative.

A useful heuristic: LLMs are best at tasks where the space of valid outputs is large and the cost of a wrong answer is low. They’re worst at tasks with small valid output spaces and high cost of error.

Use them accordingly.