Verifying using Haskell in the Age of AI LLMs collapse the cost of unclever code, but not the cost of being wrong. Abstraction doesn't die — it migrates from amortizing human labor to defining the surface where verification happens. The harness enforces contracts at runtime. Haskell, refinement types, and QuickCheck were always about making the contract load-bearing and the implementation cheap. LLMs just made the implementation even cheaper, which makes the contract even more load-bearing. --- # Stanford Talk — Plain-Language Study Sheet (T-3 days) ## What the talk is actually about - LLMs write code fast. But fast doesn't mean right. - So you don't trust the LLM alone. You build a structure around it that catches its mistakes. - That structure is called a **harness**. The mistake-catchers are called **contracts**. - Whole talk in one breath: *the LLM is a fast guesser. You build the thing that checks its guesses.* ## Two big ideas - **Idea 1: contracts at the boundaries.** Every place where one part of your system hands work to another, you put a check. The LLM's output isn't trusted until something else verifies it. - **Idea 2: data beats plan.** Don't plan a lot, build a lot. Run small experiments fast and let what actually happens teach you. The harness is what makes the experiments cheap to run. ## Lines to memorize word-for-word - **Opening:** *"I'm an engineer who lived in Haskell for years, not a mathematician. The formal background shaped my taste; I don't claim it as a credential."* - This is humility insurance. It tells the room you know what you don't know, so they don't try to test you on it. - **Marcus line:** *"The company with the most information about what their own model can and can't do chose not to trust it alone for the use case that matters most to them."* - Translation: Anthropic built Claude. They know better than anyone what it can do. And even they didn't let it run unsupervised on their flagship coding product. - **The Litmus:** *"Did it move a signal, or did it move a document? Executing according to plan, or planning to execute?"* - Translation: did your work change something real, or did it just produce more writing about what you might do? Real progress moves a signal. Fake progress moves a document. - **The big sentence (use once, ideally near the end):** *"The neurosymbolic pattern was right for 25 years and uneconomical for 25 years. LLMs didn't change the architecture; they changed the price."* - Translation: putting checkers around AI is an old idea. People didn't do it because it was expensive. LLMs made it cheap. That's all that's new. ## Your strongest receipt — memorize these numbers - 5.5 weeks. 54 commits. 5,000 lines of methodology docs. 23 meetings = 23 hours. 78% of effort on planning, not building. - You called a freeze on March 30. You broke the freeze yourself within 4 days — 17 more commits. - **Why this matters on stage:** you're going to argue people should be disciplined about not over-planning. The audience needs to know you've personally failed at this. Otherwise you sound preachy. The fact that you can name the exact failure makes the whole rest of the talk credible. ## Three things you've actually built — name them - **drex-claude-skill** — your operations setup. It's the thing that ran the receipt above. Method tooling for running your company. - **zinegen** — the actual product. A Python program that generates magazines from member submissions. - **ryvm** — a search engine you built before Drex, in Haskell, using LiquidHaskell to formally verify parts of it. This is your "I've actually done formal methods" receipt. ## What "contract" means — be ready to define it on the spot - A contract has three parts: - **A boundary** — somewhere two pieces of your system meet. - **A predicate** — a rule that has to be true at that boundary. ("This number is positive." "This text isn't empty.") - **A checker** — something automatic that decides if the rule holds. Not a human reviewing it. Code. - **If you don't have all three, you don't have a contract — you have a wish.** - Contracts come in different strengths. Be ready to say which kind you mean: - **Strongest:** type-level checks proven by an SMT solver (your ryvm work). - **Strong:** code that runs at runtime and rejects bad output (your zinegen contrast linter). - **Weaker:** schema validation on JSON (the rest of zinegen). - **Not a contract at all:** "please don't do X" written in a prompt. That's a hope. Don't call it a contract. - One-liner if asked: *"A predicate at a boundary plus a deterministic checker that decides it. The LLM doesn't get to be the judge of its own output — that's the whole point."* ## The structure of the argument — three steps in order - **Step 1: think about your system as pure functions composing together.** - "Pure" means: no hidden state, no surprises, the same input always gives the same output. - Why it matters: if your pieces are pure, you can swap them, test them in isolation, and reason about them. - **Step 2: put contracts at the seams between the pieces.** - The LLM is the one piece that *can't* be made pure. So you put a checker at the boundary around it. - **Step 3: the contracts have to actually be verified by something deterministic.** - A contract no one checks is just a comment. - **The asymmetry to point out:** in normal code, you don't need heavy verification at every seam, because both sides are predictable. The LLM seam is special. That's the *one* place that needs the full apparatus. ## The deeper inversion (this is the smart-sounding insight) - For 50 years: writing code was cheap (humans), checking code was expensive. So most teams skipped the checking. - LLMs flip this: - Writing code is now nearly free. - Checking code is now the dominant cost. - And — separately — LLMs also make it cheap to write the *thing the checker needs* (specs, types, schemas). - **The line:** *"LLMs and formal methods have a complementary failure mode. Formal methods were too expensive because writing the specs was a brutal manual tax. LLMs are too unreliable to ship alone because they fluently generate plausible-but-wrong output. Each one fixes the other. The LLM writes the spec the human couldn't be bothered to write. The verifier catches the LLM bullshitting."* ## SMT solvers — what they are, in plain language - An SMT solver is a program that answers yes/no logic questions. Very fast, very narrow. - It does **not** "walk through" or "explain" anything. You hand it a math statement, it says yes, no, or "I gave up." That's all. - Liquid Haskell works by translating your refinement type contracts into SMT-shaped questions and asking Z3 (the SMT solver). You never see the proof — just the verdict. - The technical caveat to memorize: Liquid Haskell stays inside a logic dialect called **QF-LIA + uninterpreted functions** — that's the slice of math the SMT solver can crush quickly. Step outside that slice and you're in undecidable territory. - **Tight stage line:** *"The typechecker turns refinement obligations into formulas in a decidable logic; the SMT solver discharges them as a black-box decision procedure."* - **Don't confuse SMT with Coq/Lean.** Coq/Lean are *proof assistants* — they walk through proofs step by step with the user. SMT is push-button. Different tools, different jobs. ## The Sakkas paper — UCSD, ICSE 2025 - **Authors:** Sakkas, Sahu, Ong, Jhala. (Jhala is the LiquidHaskell guy. Has been since the 2010s.) - **The system has two names:** "XO" in early conference materials, "LHC" in the final published version. Either is fine; you can say either. - **What it does, plainly:** writing refinement type annotations by hand is brutal. So they built a system where an LLM proposes the annotations, and LiquidHaskell checks them. If the check fails, the LLM tries again. Loop until it passes or gives up. - **Why this is the perfect citation for your talk:** refinement types had a 25-year adoption problem because the annotation writing was too tedious. LLMs are *exactly* good at tedious structured writing. The thing that wasn't worth doing manually becomes worth doing when the boring part is free. - **The bigger story — Sakkas's whole dissertation:** he built three of these tools, all the same shape, three different problems: - **Rite** — fixes OCaml type errors. - **Seq2Parse** — fixes Python syntax errors. - **LHC** — writes refinement type annotations. - **Why this matters more than just one paper:** one lab applied the same architecture three times in three different domains. That's not a fluke — it's a method. Lets you say *"the proposer/checker pattern works across compilers, parsers, and verifiers."* ## The Mell paper — UPenn, OOPSLA 2025 - **What they argue:** if you fix everything except the input, an LLM call is essentially a pure function. Same input → same output. So you can compose LLM calls like math. - **The catch you have to say out loud:** that's only true at the *core*. Once you add randomness in sampling, tool calls, or memory across turns, it's not pure anymore. Real systems have those. So purity is a *design choice* you make, not something LLMs come with. - If someone says "but my LLM is non-deterministic" — agree, then point out: that's the effect layer, composed on top of a pure core. Don't fight it. ## Neurosymbolic — the word that's everywhere right now - It just means: combining neural networks (LLMs) with symbolic systems (logic, type checkers, parsers). - **The history:** - Wave 1: symbolic AI (the 1960s expert systems guys — McCarthy, Minsky). - Wave 2: neural networks alone. - Wave 3: hybrids. Neurosymbolic. - **Marcus has been arguing for Wave 3 since 2020.** He's been saying neural alone isn't enough — you need symbolic structure for real reasoning. - **The architecture is old.** What's new: LLMs got good enough at the boring half (annotations, formatting, structured output) that hybrids became affordable. - **There's a six-type taxonomy by Henry Kautz.** You only need three: - **Neural-then-symbolic** — LLM produces, checker verifies. *(This is your harness. This is LHC.)* - **Symbolic-around-neural** — a deterministic program that calls the LLM as a subroutine. *(Marcus says Anthropic's Claude Code is this.)* - **Neural-around-symbolic** — LLM that calls solvers/checkers as tools. - **Your harness is mostly the first one at the seams, with bits of the second at the orchestration level.** Both are flavors of the same pattern. ## Marcus — handle with care - He claims to have read leaked Claude Code source. He says it's a several-thousand-line deterministic program that wraps the LLM. - **You cite his analysis, not the leak itself.** That distinction matters if anyone challenges the ethics. - The strength of his argument: it's revealed preference. Anthropic, of all companies, didn't trust their own model unsupervised. - **Don't do his "I told you so" voice.** He's been waiting 25 years for vindication. You haven't. You just want to say *here's how to ship reliable software now.* - **Don't say "scale-invariant."** Two examples that agree with you isn't a pattern; it's a coincidence with extra steps. Say *"this shows up wherever being wrong is expensive"* instead. - **If asked about the ethics of citing the leak:** *"I cite Marcus's analysis, not the artifact. The leak is already in public discourse. The structural point either holds in code anyone can see publicly — Sakkas, my own zinegen — or it doesn't. The leak is corroboration if you accept it, footnote if you don't."* ## What to do with your formal background - **Refinement types / LiquidHaskell:** USE. You actually used it on ryvm. This is your strongest formal-methods receipt. - **Lambda calculus / purity:** USE CAREFULLY. The point you care about is composition, not the formalism itself. Always say the caveat about effect layers. - **Category theory:** SKIP. Stanford has actual category theorists in the building. If you wing it, you'll get caught. - **Why this is a trap:** if you say "categories are defined by their structure" or "we prove things about categories" — those sentences are *almost* right but not quite. A category is defined by objects + morphisms + composition + identity. Not "structure" in the loose sense. - **If someone explicitly asks if you mean category theory:** *"Category theory shaped my taste — the lesson I take is that interesting reasoning happens at the morphisms, the typed transformations between things, not at the things themselves. But I'm making the engineering claim, not the categorical one. I'd be making the same claim if I'd never seen a commutative diagram."* - That's it. Then stop talking. Don't elaborate. ## Q&A — the one rule - **Don't defend the claim wider than you made it.** - Most pushback will be "you're overgeneralizing." - Almost every right answer starts with: *"Agreed, the claim is narrower than that — here's the boundary."* - Concession is strength. Stanford rooms reward people who know what they don't know. ## Q&A — answers ready to fire - **"Isn't this just neurosymbolic AI?"** → Yes at category level — specifically the proposer/checker shape. Marcus has advocated the umbrella for 25 years. My claim is narrower: which contracts you can enforce cheaply, which verifier earns its place, how the harness amortizes structure. - **"Isn't this just dependent types?"** → Liquid Haskell deliberately stays in a logic the solver can crush. That's the difference. It's a decidability cut, not a category distinction. - **"Aren't you reinventing failed formal methods?"** → Annotation cost was the biggest barrier. LLMs collapse exactly that. Sakkas is the proof. - **"What if Anthropic ships something that obsoletes the zine generator?"** → The moat isn't the generator. It's the community of users and the taste they share. A model release ships features; it can't ship trust. *(This question is actually for Chielo, but be ready.)* - **"You said the freeze didn't hold — why should we trust your methodology?"** → Because the method flagged the drift before I could see it. A method you only listen to when it agrees with you isn't a method. - **"Have you measured productivity gains?"** → No. Won't pretend I have. What I can say: the failure mode shifted — from "did I write this correctly" to "does my contract say what I meant." That's a different problem and arguably a better one. - **"What does this approach cover and what doesn't it?"** → Covers structural stuff: schema compliance, property tests, measurable constraints like WCAG contrast. Doesn't cover: semantic accuracy, taste, truth, values. For those, the harness is at best a forcing function for honesty — it makes me name what I can't check. - **"Does this scale past one developer?"** → No, deliberately. The harness lets a small team punch above its weight, but it's not a substitute for hiring. That's a real limitation. - **"Isn't 'boring code' a situational claim, not a universal one?"** → Yes. Boring wins for CRUD, glue, integration — most product code. Loses for novel ML, distributed systems, safety-critical. Drex is the former. - **"Isn't this just Lean Startup?"** → No. Lean's MVP asks "will people pay?" My version asks "did this experiment move a real signal in the system I'm building?" Different commitment. - **"Is it ethical to cite a leaked source file?"** → I cite the analysis, not the artifact. The structural point holds in public code regardless. Footnote if you don't accept the leak. ## Things to never say - Don't say maintenance "isn't a thing." It moves; it doesn't disappear. Cost migrates from writing maintainable code to verifying generated code. - Don't say "RAG is dead." Say "the boundary moved." - Don't claim your harness is a new invention. Claim it's a *specific implementation* of an old idea. - Don't invoke category theory. - Don't say "scale-invariant." - Don't sound like Marcus. He sounds vindicated; you sound practical. - Don't say "lean into what LLMs are bad at." It sounds like a TED talk. Use the economics framing — *the cost of one half collapsed.* ## Pre-talk checklist (3 days, 2 days, 1 day, day-of) - **3 days out (today):** decide whether to do the contracts.lhs spike. If yes, 4-hour timebox tomorrow. If no, use the contrast linter as the live-product receipt. - **2 days out:** read Cheung & Seshia (Berkeley TR, UCB/EECS-2025-174). It's a different paper from Sakkas — it argues testing isn't enough for LLM-translated code; you need formal compositional reasoning. - **1 day out:** rehearse the compressed pitch, the Litmus, and the economics inversion until they come out flat under nerves. Stop adding new material. - **Day of:** drill the receipts cold. Numbers, system names, dates. The high-altitude stuff is decoration; the receipts are the load-bearing wall. ## The single most important rule - The talk's strength is the *bottom* of the abstraction ladder: ryvm, zinegen, the contrast linter, 78%, the freeze that didn't hold. - The clever sentences at the *top* of the ladder (lambda calculus, the economics inversion, category theory) are seasoning. One or two land well; more starts to look like showing off. - **You only get to keep one high-altitude sentence.** Make it the economics one: *"LLMs didn't change the architecture; they changed the price."* Drop the others into the "if asked" bucket. - Stanford rewards receipts. Stop reaching up. Drill down.