URI:
        _______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                             on Gopher (inofficial)
  HTML Visit Hacker News on the Web
       
       
       COMMENT PAGE FOR:
  HTML   Formal methods and the future of programming
       
       
        brainless wrote 8 hours 43 min ago:
        I am not a language nerd but I keep on experimenting in my own ways to
        use the type system to generate code that is more reliable.
        
        I build a coding agent specifically for small models, which makes
        everything harder. I started this chat with Claude to build the next
        step: [1] - how can I go from PRD to a typed representation of the
        business logic.
        
        The I started building as per [2] . The praxis crate: [3] and a sample
        Todo app: [4] Generating unit tests for the library functions of any
        project would be done via a separate agent than the one coding the
        functions. And then use tree-sitter to statically check code to PRD
        (provenance graph).
        
        Again, not a language nerd, just enjoying chasing a goal.
        
  HTML  [1]: https://claude.ai/share/4264e5f6-b334-426c-afe4-904d233ef946
  HTML  [2]: https://github.com/brainless/nocodo/blob/feature/praxis_agent_...
  HTML  [3]: https://github.com/brainless/nocodo/tree/feature/praxis_agent_...
  HTML  [4]: https://github.com/brainless/nocodo_example_todo_app
       
        blueblazin wrote 11 hours 5 min ago:
        As someone with a bit of interest in programming languages (design and
        implementation) this was really interesting to me:
        
        > For most people who work on programming languages, the easy part is
        coming up with new and better ideas about how to make programming
        better. The hard part is convincing anyone to actually use those ideas
        for real work.
        
        Totally agree, there's only so much strangeness you can introduce in a
        new language[1] regardless of benefit.
        
        But AI agents should not feel much resistance to radically new ideas in
        PL design. I've been thinking for a while now about how PL design will
        evolve post agentic AI. I think it will be very interesting to see what
        new ideas we can come up with to improve programming languages when we
        worry much less about adoption.
        
  HTML  [1]: https://steveklabnik.com/writing/the-language-strangeness-budg...
       
        seqradev wrote 13 hours 14 min ago:
        We are working toward applying formal methods from the application
        security testing side, but I believe the same kind of approach can be
        applied to business logic verification as well. For that, we are using
        the taint analysis technique — a fairly well-established formal
        methods approach, but still not widely applied in the field because of
        the complexity of dataflows in real codebases.
        
        Scaling formal methods beyond AST pattern matching and some simple type
        checking turns out to be a really hard task! It took years of research
        and development to reach the point where taint analysis enables us to
        trace interprocedural dataflows in real codebases in minutes and find
        deeply hidden vulnerabilities.
        
        If this sounds interesting to you, take a look at our project:
        
  HTML  [1]: https://github.com/seqra/opentaint
       
        smasher164 wrote 19 hours 9 min ago:
        I think it's a good time to learn Lean. It positions itself as a proof
        assistant that's also good at practical programming. I'm not sure how
        mature the ecosystem is for the latter.
       
        UncleEntity wrote 19 hours 30 min ago:
        I was playing around with stuff trying to get Claude produce a JavaCard
        VM with the idea that the VM was hand written from the spec with a
        separate, independently produced, spec file used to generate tests for
        ESBMC to verify so an identical bug would have to exist in both to make
        it through. Worked out pretty well, found a few bugs in both projects,
        but my poor laptop can't handle the full 32-bit space so that part
        never got the full verification -- 16-bit, rock solid though.
        
        Then I really got serious about the yak shaving and, well, am probably
        in need of an intervention as I don't get Claude to write a VM but to
        make the tools to generate a VM from an assortment of DSL and that has
        snowballed a bit as I really liked shaving yaks before the daffy robot
        revolution.
        
        --edit--
        
        Almost forgot, I tried that with the little less dodgy banned Claude
        and the wasm standard it wrote a python script to parse the spec pdf,
        the official bytecode implementation in OCaml (or whatever) and
        generate a TOML file (Claude loves the TOML) to generate the type
        headers and for cross-referencing in the other tools. Was so impressed
        I just let it go on its merry way and it did the deed.
       
        zisa-security wrote 20 hours 3 min ago:
        Interesting read. Formal methods seem to matter most when the
        architectural complexity grows faster than human intuition can track.
       
        rzmmm wrote 21 hours 53 min ago:
        He mentions the seL4 microkernel. The specification is written in
        Isabelle, and it's relatively complex: PDF [1] The bottleneck seems to
        be that clearly it's critical to be certain about the spec.
        
        Maybe not for kernels, but I can see use for cryptographic algorithms,
        etc?
        
  HTML  [1]: https://sel4.systems/Info/Docs/seL4-spec.pdf
       
        cadamsdotcom wrote 23 hours 24 min ago:
        I’ve had huge success with TDD and nothing more formal than that;
        test frameworks are great for verifying “business correctness”. But
        they can also verify other aspects - eg. asserting that 3 queries ran
        instead of 103 for a request, to prove absence of N+1 queries.
        
        Really any technique that lets the agent create its own verification is
        ideal, as it makes verification scale.
       
        winwang wrote 23 hours 45 min ago:
        Love this. I've shifted in the past few months to using highly
        expressive types in Scala 3 to have types carry more and more
        compile-time proofs (without macros, though a couple are warranted).
        Not only does it help with agentic test "sprawl", it seems to prevent
        agents from falling into lower-quality modes of operation. One of the
        more annoying things I've been preventing is what I call "noun
        accretion", where agents try and make a new monomorphic type for
        everything, instead of clearly genericizing when sensible.
        My bet is on formal-method-shaped tooling (including languages with
        strong type systems) to accelerate decent-quality agentic coding.
        
        When I say "highly expressive types", I mean techniques I'd likely not
        want to ship in a typical codebase, unless the team was already on the
        typelevel programming bandwagon (i.e. having HKT and type functions
        being basic blocks already, not weird). Agents are better at "math"
        than basically most devs (even category-theory-pilled ones), at least
        in terms of knowledge. Better yet, they are decent at pedagogy,
        especially when considering they have "infinite" patience for dialogue.
        
        In a more personal setting, I've had Codex Lean-ify a couple of my
        hobby proofs, it was extremely easy. Note: not saying it did this 100%
        "correctly" (gotta learn more Lean 4 to check more thoroughly), but it
        also seems to check for classic proof gotchas by default.
        Very excited for the future of formal methods.
       
        addaon wrote 1 day ago:
        I've been playing with related ideas recently, and can talk about them
        at length... but one thing that's surprised me is just how effective
        frontier models (ChatGPT-5.5 in particular) are at completing certain
        manual proofs in the Roqc (né Coq) proof assistant. The proofs aren't
        always pretty, but ChatGPT can often prove something in minutes and 10
        - 100 iterations that would take me, a human who has limited but
        non-zero proof assistant experience but significant domain experience
        in the lemmas being proven, much much longer.
        
        The reason I think this is relevant and interesting in the context of
        the (short) article is because it challenges basic assumptions about
        how certain formal method tooling works. Frama-C, Ada/SPARK, etc are
        focused on generating proof obligations that can be automatically
        discharged by CVC5, Z3, etc; with a relatively painful fallback of
        manually proving the obligation in Rocq. The most common workflow seems
        to be to discover that an obligation is "hard" (not automatically
        discharged), and to restructure the set of visible lemmas and
        assertions at the obligation generation point in the code to try to
        make it "easy"; or even restructuring the code to try to make it easy.
        Which, in a world where Roqc proofs are doubly expensive (first because
        they're a challenge for a human to write; second because they tend to
        require quite a bit of maintenance churn as the code and proof around
        them evolves), makes sense. But if Roqc proofs are "automatically
        discharged with AI in the loop", this cost delta goes away -- and it
        becomes possible to think about writing proofs the same we (usually)
        write code, with human readability and clarity as the first goal, and
        [compiler|prover] optimization a distant second.
        
        Which I find quite exciting, although I haven't fully digested the
        implications yet.
       
          rirze wrote 16 hours 38 min ago:
          This mirrors my own experience, except I went with Lean. It was more
          relevant to the features I wanted to build.
          
          Exciting times ahead.
       
          zahlman wrote 20 hours 25 min ago:
          > how effective frontier models (ChatGPT-5.5 in particular) are at
          completing certain manual proofs in the Roqc (né Coq) proof
          assistant. The proofs aren't always pretty, but ChatGPT can often
          prove something in minutes and 10 - 100 iterations that would take
          me, a human who has limited but non-zero proof assistant experience
          but significant domain experience in the lemmas being proven, much
          much longer.
          
          ... How do you know that the proofs are themselves correct?
       
            red75prime wrote 12 hours 1 min ago:
            Proof checkers fuel the AI hype by outputting "valid" for a
            hallucinated text. /s
       
            addaon wrote 19 hours 45 min ago:
            With the proof checker.
       
              bobkb wrote 5 hours 25 min ago:
              I assume your idea is, if the spec and the proof is verified the
              code generated is good enough as well ?
       
                addaon wrote 2 hours 30 min ago:
                Today, I write the code. It’s trivial and takes a lot less
                time than writing the spec, and since I’m using conventional
                tooling for WCET and stack sizing it’s nice to get those
                right up front. The LLMs sometimes tweak the code slightly for
                provability, but this is usually either direct operator
                replacement (shift with multiplication, and with modulus, etc)
                or factoring out a block to a function to tie a contract onto
                it, both of which I trust my compiler to undo (simple
                arithmetic operations and inlining, respectively) with zero to
                minimal impact on the generated binary.
       
          bluGill wrote 1 day ago:
          How often will AI look at something that can't be proven because you
          change requirements and decide to change the code and your
          requirements to make the proof easier though? I haven't played with
          proofs at all, but I do know that occasionally when I say, this test
          failed after making a change, AI will just change the test instead of
          making the code pass both the old test and the new test which is most
          often my intent.
       
            addaon wrote 23 hours 22 min ago:
            That's up to the harness. Right now, my harness doesn't have any
            tools that would allow the agent to change contracts. Also, agents
            working on a Rocq proof don't have write access to the C+ACSL code;
            they can review it, and propose structural changes to the
            coordinator, but all they can do is iterate on the Roqc proof or
            give up. The weakest part of the harness today is that the agents
            that do C+ACSL work can modify both C and ACSL, even though they're
            usually run with the explicit intent to not change the semantics of
            the C code (although it's totally within bounds, and often
            required, for them to change e.g. `foo & 3` to `foo % 4`) -- so
            there's weakness here, but it's bounded, and since the contracts
            live in header files that they can never write, it's worked so far.
            It's a starting point, at least; I certainly wouldn't say this is a
            mature, or even good, tool chain, but still learning.
       
        jp0001 wrote 1 day ago:
        Formal methods is like a plan. Everyone has one until they are punched
        in the face (real world requirements and trust boundaries).
       
          angry_octet wrote 21 hours 29 min ago:
          Incredibly uninformed comment. The formally verified part is the high
          confidence component that is the anvil to hammer out bugs in the
          unverified components.
       
          Paracompact wrote 1 day ago:
          Formal methods are not meant to replace trust in a system. They are
          meant to minimize the surface area of trust. To not understand and
          advertise what surface area still exists is foolhardy, and mistakes
          logic for magic.
       
        JacobAsmuth wrote 1 day ago:
        Sounds like Jane Street would be interested in the Lean framework I've
        been working on for formally verified frontends.
        
  HTML  [1]: https://github.com/JacobAsmuth/qed
       
        Animats wrote 1 day ago:
        I used to do proof of correctness work, decades ago.[1] We had more
        proof automation than many of the later systems. The easy stuff was
        solved by the first SAT solver, the Oppen-Nelson simplifier. The harder
        stuff used the Boyer-Moore prover, which uses heuristics and previous
        lemmas to guide the theorem prover. The Boyer-Moore prover had to be
        helped along by suggesting lemmas, which it would try to prove and
        which would then be used for later proofs. That was the tough human
        job.
        
        Later systems seemed to have less automation. What tended to go wrong
        with formal methods was that the people doing them were too into the
        formalism. I was working on this for a commercial project that wanted
        bug-free code, not as an academic. The academic projects tended to get
        too clever. They had the mathematician's bias of wanting a terse
        notation and not much case analysis. That's not a good solution. You
        want lots of automated grinding and don't want to need much insight.
        The clever people kept inventing new logics - modal logic, temporal
        logic, etc, - to avoid verbosity in paper and pencil proof work. That
        wasn't really all that helpful.
        The basic truth of this business is that most of the theorems are
        rather banal.
        
        As the Jane Street people point out, there's a big advantage in having
        control of the language. You want the verification statements
        integrated into the programming language. In many systems, they're
        embedded in comments, in a different syntax than the programming
        language, or even in completely separate files. This adds unnecessary
        work. It's good to see them pushing this forward.
        
        We were doing this too early, over 40 years ago. It took about 45
        minutes of compute back then to build up number theory from a cold
        start with the Boyer-Moore prover. Now it takes less than a second.
        
  HTML  [1]: https://archive.org/details/manualzilla-id-5928072/page/n3/mod...
       
          sgt101 wrote 14 hours 27 min ago:
          Human interpretebility and human construction is going to be really
          important for formal specifications.
          
          My fear is that we will get inscrutable maths that will be guarded by
          tiny coteries of devoties. The different inscrutable notations will
          be mutually incompatible, understanding one will not really help with
          the others. Most people will simply write english statements that
          cannot be verified properly.
          
          Worse even, people will get machines to formalise their statements,
          and will not understand the formalism or the proofs, but will take
          part in the theatre of verification and act shocked when everything
          explodes.
       
          ebiederm wrote 19 hours 36 min ago:
          Have you looked at ADA Spark?
          
          If you have does it match your intuition of how things should be
          done?
          
          I am slowly working on something where I hope to integrate such a
          capability for the things that type systems can't handle quickly.
          
          So I would be interested in perspectives of people who have been down
          this route before.
       
            Animats wrote 14 hours 55 min ago:
            I've generally liked classic approaches which had entry and exit
            conditions, and loop invariants, all written using the same syntax
            and operators as the program. The compiler has to know what to
            ignore, of course. The compiler should syntax and type check all
            the proof information, even if it can't verify it. It's important
            to avoid an impedance mismatch between the proof system and the
            programming language. If programmers are constantly translating
            between two notations, customer resistance will be high.
            
            If you need to write functions as scaffolding for proofs, they
            should be written in the programming language. You might need some
            operators that aren't actually runnable, such as FORALL, but the
            syntax should be that of the programming language.  If your specs
            look like
            
                Γ, P || Q, P ==> R, Q ==> R ⊢ R
            
            in a language with none of those operators, you're doing it wrong.
            That's the disease of falling in love with the formalism.
            
            The user should not have to tell the proof system things the
            compiler already knows. Whatever overloading and aliasing rules the
            language enforces should be known and handled in verification, too.
            We worked off the syntax tree produced by a compiler front end
            modified to handle the verification statements, not the raw source
            code. Ada Spark has different aliasing rules than Ada, for example.
            
            One big problem is that object oriented programming came and went
            in popularity. You really want object invariants. You need some way
            to attach an invariant to a data structure. You also need a clear
            boundary where control enters and exits the objects.
            If the language doesn't have objects, you struggle with trying to
            express object type invariants in the proof language. You don't get
            the boundaries as part of the programming language.
            
            A useful interface between the prover and the program is simply to
            use
            
                assert(A);
                assert(B);
            
            in the middle of code
            to encapsulate complex proof goals. A should be provable from the
            preceding code using a SAT solver. B, when assumed true, should
            provide enough info for a SAT solver to proceed from that point.
            Now you need to prove
            
                A implies B
            
            by external means. That creates a well defined problem which can be
            given to something AI-like, backed by a proof checker, to chew on.
            
            Ada Spark has some features that violate these rules. Also, it's
            hard to find anything about Ada Spark newer than ten years old.
            
            There's a fair amount of interest in verifying Rust. There's been
            some progress using Dafny. But "Dafny is not Rust. Using Dafny
            requires porting algorithms of interest from Rust to Dafny. This
            port can miss details and introduce errors." There's the impedance
            mismatch again. Verus looks more promising. It is more Rust-like in
            syntax, they get the SAT solver/AI prover distinction,
            and there's active development.
       
              rramadass wrote 4 hours 4 min ago:
              Have you looked at "Verified Design-By-Contract"? See the paper
              Safe Object-Oriented Software: The Verified Design-By-Contract
              Paradigm by David Crocker here - [1] David Crocker's Verification
              Blog -
              
  HTML        [1]: https://www.eschertech.com/products/verified_dbc.php
  HTML        [2]: https://critical.eschertech.com/
       
          nextos wrote 21 hours 32 min ago:
          I've worked in formal methods for quite a long time, and I disagree a
          bit with your statement that new logics are not helpful. Industrial
          logics are really practical and allow you to write all sorts of
          sophisticated properties that your system should satisfy in a very
          succinct way. Logic is to computer science and software engineering
          what calculus is to physics and mechanical or civil engineering [1,
          2]. Things like LTL or, more recently, separation logic, have been
          incredible breakthroughs.
          
          TLA+, which has gained quite a lot of popularity, is a testament to
          that. Model checking is eminently practical. The exciting thing now
          is that heavier formal methods, in particular theorem proving, might
          become cheap enough to use in regular systems software. Writing
          formal specifications for functions and getting them synthesized and
          proven correct by some SAT/SMT, theorem prover & LLM hybrid may
          become the norm in the not-too-distant future. [1] On the Unusual
          Effectiveness of Logic in Computer Science. [1] [2] From
          Philosophical to Industrial Logics.
          
  HTML    [1]: https://www.cs.rice.edu/~vardi/papers/aaas99.jsl.pdf
  HTML    [2]: https://www.cs.rice.edu/~vardi/papers/icla09.pdf
       
            jmalicki wrote 3 hours 5 min ago:
            "Logic is to computer science and software engineering what
            calculus is to ... mechanical or civil engineering"
            
            Things professionals rarely use in practice?
       
            arka2147483647 wrote 12 hours 11 min ago:
            > Industrial logics are really practical and allow you to write all
            sorts of sophisticated properties that your system should satisfy
            in a very succinct way.
            
            It sounds like what you think as positives are exactly the things
            parent comment thinks as negatives.
       
              ajb wrote 6 hours 44 min ago:
              There's a big difference between wanting a succinct proof and
              wanting a succinct statement of the requirements.  The easier it
              is to state the requirements, the more likely you have stated
              them correctly. Whereas succinct proof is not relevant for
              industrial purposes, as long as the proover has a small trusted
              kernel, it makes no difference to the reliability.
       
        Syzygies wrote 1 day ago:
        I nearly adopted OCaml for my current math research, in no small part
        because of Jane Street contributions. I instead chose Lean 4, as LLMs
        became able to help code in Lean. I give up a factor of two in
        performance (a gap likely to close) for what reads to me as the
        clearest expression of each algorithm. Lean is a beautiful language
        that makes OCaml read like Old English.
        
        Lean is designed to be optionally verified; proof is its primary (but
        not only) application. It seems an impedance mismatch for Jane Street
        to explore this direction without also migrating to Lean 4.
       
          derdi wrote 7 hours 32 min ago:
          You are praising Lean for optional verification. They want to add
          optional verification to OCaml. Where do you see an impedance
          mismatch? Just in the fact that you don't find OCaml pretty enough?
       
        yawaramin wrote 1 day ago:
        Quis custodiet ipsos custodes?
       
          Animats wrote 1 day ago:
          The proof checker.
          
          You verify a proof system by having it produce a proof trace:
          
          - Step 3185: Apply rule that a * b = b * a to "length * width" in the
          current formula.
          
          Then you run a proof trace checker which applies each transformation
          in sequence and checks that the expected result is obtained.
          
          Provers are complicated, but proof trace checkers are really dumb.
       
            zahlman wrote 20 hours 22 min ago:
            But don't you still have to logically connect the validity of the
            proof to desirability of the output?
       
        reinitctxoffset wrote 1 day ago:
        I caught the religion on using types in conjunction with LLMs about
        eighteen months ago, but I only really got serious about `lean4` like
        six to eight months ago and now I wouldn't even consider using AI
        assist in software work with a `CIC` proof substrate that has practical
        C/C++ (and therefore, everything) FFI.
        
        We've banned everything from JSON to Python, rewritten `nix` to have a
        compiler, and almost everything we write is not only property tested
        and multiply fuzzed to within an inch of it's life, but we have proofs
        in `lean4` that at a minimum drive property tests via `.olean` linkage,
        and when we have the bandwidth we prove exhaustiveness over the domain
        and property test that.
        
        We skip the whole C++/Rust thing because all of the fast stuff is
        generated from `lean4` and so it doesn't really matter (C++ has
        advantages when bugs in it are actually bugs in `lean4` code, but you
        could go either way).
        
        This is a big departure, and I certainly don't blame anyone who is
        skeptical: "ban JSON and Python wtf?!?!", but we've done millions of
        lines (checked) of this stuff and AI + formal systems is a dramatically
        bigger leap than no AI -> AI and Python. The latter in our experience
        is not monotone in progress, the former is almost always monotone in
        progress.
        
        And you can do wild shit, this is a formal proof of the polyhedral
        tensor calculus that is modeled by things like ISL and CuTe, and using
        that we can get swizzling and tiling using `mdspan` in C++23 on device
        and prove it's right (up to some L'Hopital arugment about the coverage
        which this example doesn't demonstrate well: [1] That in turn, well, it
        goes real fast. On the first try.
        
  HTML  [1]: https://github.com/b7r6/mdspan-cute
  HTML  [2]: https://straylight.software/assets/lambda-hierarchy.svg
       
          gottlobflegel wrote 8 hours 12 min ago:
          Listing Agda and Idris 2 under CIC makes your lambda hierarchy
          diagram misleading at best.
       
          skybrian wrote 1 day ago:
          Ok but what software have you built and why is it useful?
       
            reinitctxoffset wrote 1 day ago:
            This is all pre-release but a few highlights:
            
            - `continuity` is a `lean4` metaprogramming system that we use all
            kinds of ways but the real meat is it allows formal specification
            of codecs and state machines in ways that make security and
            performance properties proof amenable, the key trick here is to
            limit parser power to just what you need and no more. a very cool
            thing is that we can add targets to it, so when we do Zig for
            example, Zig will immediately get proven correct and frontier
            performance support for dozens of protocols that are not all mature
            right now.
            
            - `libevring-cpp` (bound up into `libevring-hs` and `libevring-rs`)
            is a Trinity-inspired deterministic event replay system that
            replaces anything you would have done with `libuv` or whatever, and
            it's wired into `io_uring` (we're stuck with `kqueue` on darwin,
            eh). it interfaces with `continuity` machines and codecs (which are
            generated very carefully for the hardware they run on) and we have
            yet to find a way of measuring such programs where it doesn't
            resoundingly shatter the performance of any other asynchronous
            programming primitive in any language. i'm sure the community will
            prove us wrong when we release it, but it's real fast. and you get
            much stronger guarantees than in most such systems (Trinity
            derived, so if you can repro a bug, you can walk the event trace
            until it's sitting there in GDB and shit)
            
            - `hyperconsole-cpp` (and `hyperconsole-hs` and `hyperconsole-rs`)
            is the TUI library idea taken to some deranged extreme on
            performance and supports everything in `notcurses`, it's pretty
            wild: [1] - straylight-nix is a complete rewrite of nix that fixes
            thousands of bugs, hundreds of them security adjacent and dozens of
            them we only talk to vendors about. it's daemonless, dramatically
            faster, ground-up WASM-targeting compiler with a formal grammer,
            uses an extremely fast LSM-based store (it can read the legacy
            store but we don't write it) that fixes all the problems in
            floating CA, IFD is too cheap to care about, and recursive nix is
            no longer and issue (see daemonless). it supports tearing
            derivations into `REAPI` actions that you can feed to your friendly
            native NativeLink or whatever, which just goes through them like a
            woodchipper. KVM-based sandbox with snapshot and restore, really
            opens the world up on what your builders can be.
            
            - `slide` is the reference implementation of a family of protocols
            called `SIGIL`: `SIGIL-LLM` is a binary encoding for LLM data that
            resets on ambiguity and drops the average bytes on the wire from
            e.g. OpenRouter to your harness from ~hundreds to ~1.5 per token,
            `SIGIL-API` is a bijection on OpenAPI 3.1.0 and AsyncAPI 3.0.0 that
            gives comparable improvements, and `SIGIL-SH` is such an encoding
            for a sensible subset of bash. this does about 1.5 billion tokens
            per second on a laptop and never emits partial frames, so you don't
            get speculative execution rollback problems in your harness that
            tilt agents off.
            
            - `// WEAPON //` is an adversarial, vendor-skeptical, full-take
            surveilance agemt harness built on `hyperconsole-cpp` and `SIGIL`
            (so, you could absorb the entire token output of OpenRouter on one
            machine if you wanted at least on the wire and in the terminal,
            clearly the bottleneck rapidly becomes whatever the agents are
            calling, but it's `zmq4` transport underneath `SIGIL` so it's also
            trivial to full-take all of your data for fine tuning or whatever
            you want it for into e.g. `parquet` on R2. `// WEAPON //` does a
            bunch of stuff: the tool call surface is heavily optimized for
            AST-level edits that miss dramatically less, we intercept and
            manipulate shell commands (slice off the stupid `| tail -n5` that
            keeps the droid in a loop not seeing the error, pre-emptively
            ground using heuristics that have been tuned (defeats the search
            flinch), and always recovers from any stall, or nag box, or
            anything else that would serve as an unannounced rate limit, it's
            fine if vendors rate limit but they need to put it in their ToS. it
            has a bunch of other primitives, agents run in real KVM sandboxes
            and speculate out as wide as you want to pay the tokens for. we
            hyper-manage things like the cache breakpoint geometry of Anthropic
            so e.g. Opus rarely misses in cache and always hands off edits to
            specialized tool use models. it's pretty extreme the difference in
            outcomes relative to all this React jank.
            
            - `s4` is a general-purpose compiler from most any pytorch 2.0
            model to `myelin`-level performance on NVIDIA (we only support
            NVIDIA Blackwell at the moment, that might change) and it's never
            worse than `myelin` because if we don't out-tune it we invoke it,
            but we out tune it a lot because we've proven a lot of
            decideability theorems about tiling and scheduling on both 1CTA and
            2CTA, so we can often arrive at a finite, enumerable set of
            schedule/tile choices. `myelin` mops up the random garbage around
            the big GEMMs just like in TRT-LLM.
            
            - `sigil-trtllm` is inspired by TensorRT-LLM-Edge but designed from
            the ground up around Mellanox/ConnectX and in particular GPUDirect,
            so it can stream `SIGIL-LLM` tokens directly onto the wire whereas
            something like Dynamo is usually traversing both Python and NATS,
            which is super weird to us. this uses the `s4` compiler very
            heavily.
            
            - `straylight-cas` is a geographically distributed content
            addressable store backed by any R2-compatible (so most any
            S3-compatible too) object store with multi-level LSM and extreme
            performance memtables, optimistic hinted handoff over `zmq4` to
            other geographies, and a really simple operational story, this is
            kind of the thing that powers the product surface.
            
            ... which is the thing i'm less ready to talk about because it's
            supposed to be a surprise.
            
  HTML      [1]: https://youtu.be/YqgEtpJ8tGI
       
              rirze wrote 16 hours 36 min ago:
              I am super interesting in `straylight-nix`. Even a blog post on
              what you did to improve it would be elucidating.
       
              skybrian wrote 1 day ago:
              It sounds impressive but also rather obscure. Do you have users?
       
                reinitctxoffset wrote 1 day ago:
                No we're pre-alpha, I guess most people would call it stealth
                but it's more like, not quite done and we don't want to waste
                people's time because our entire thesis is around correct
                outcomes in AI systems at a level that permits their use in
                outcome-critical regimes (we sometimes call it "insurable" AI
                as a north star, would LLoyds of London or Swiss Re stand
                behind someone who was writing policies on this?).
                
                Now a bunch of that is development tooling that copes with
                agent-scale software development, and a lot of that might
                become product surface, so we have a lot of usage denominated
                by like, bytes and agent hours and stuff because we build this
                stack in itself, but that's somewhat orthogonal to the north
                star vision.
                
                We'll make sure to give the HN community the opportunity to see
                this stuff as early as anyone does if people find it
                interesting, most of the above will be open source fairly soon.
                Don't know if it'll make the front page, but the product will
                be called `ORBITAL`, so if you see that floating around that's
                us.
                
                Appreciate the interest!
       
        brap wrote 1 day ago:
        Whenever I read about formal specs it always seems to me like “write
        the same tests just in a different way”, or worse, “write the same
        implementation but in a different way”.
        
        I guess doing things twice can help catch errors, but I fail to see
        what’s so special about formal specs if they can suffer from the
        exact same bugs as the tests/implementation.
        
        I guess the root of the problem is if you want to formally prove that a
        program does something, you have to be very specific (heh) about what
        that something is, at which point you are basically just writing
        tests/implementation all over again.
        
        I have been looking into this on and off for years now, and I keep
        feeling like I’m missing something, but I don’t know what it is.
        
        Can anyone enlighten me?
       
          rendaw wrote 13 hours 3 min ago:
          The same with unit tests, they're best suited for problems where the
          specification is much simpler than the implementation. A sorting
          algorithm can be long and complicated, but a unit test just needs to
          present unordered input and assert that the output matches ordered
          output. A sorting algorithm is also well suited for formal
          verification.
          
          If you're rewriting the implementation, I think it's probably not a
          good use for unit tests or formal verification.
          
          As another commenter said, unlike unit tests which cover a specific
          case, and where you need multiple tests for edge cases and both
          positive and negative results, formal verification will cover all
          cases.
       
          majormajor wrote 17 hours 47 min ago:
          The powerful difference is between specific testing and exhaustive
          proving. If you fail to think of a test for an edge case, whoops. If
          you have a good model, you can know that it exists—and fix
          it—before you ship.
          
          This is particularly valuable IMO in
          concurrent/distributed/multi-threaded situations where
          race-conditions and deadlocks are extremely hard to test and reason
          about. "Can A, B, C happen in the order 'A, C, B'" types of things.
          
          I have a rough hypothesis that maturity in this space looks something
          like:
          
          1) LLMs will allow much faster learning and usage of formal methods,
          even if initially just for "do it a second time" post-hoc
          verification
          
          2) From there you move towards automation of LLM-checking "hey, the
          implementation code changed, does it look like it broke the model" -
          though this is still not deterministic, but it's likely a lot better
          than a human review requirement  of something that only changed
          infrequently would've been
          
          3) And then the real jump will be taking tooling for "write just the
          formal spec, let the implementation get generated" to the next level.
          There's various
          mostly-not-fully-baked-for-what-most-companies-would-want projects
          already out there for codegen like this, what if the LLM tools can
          accelerate the year-or-two of manual work it would take to make one
          of them fit your needs?
       
          mhh__ wrote 20 hours 6 min ago:
          Lamport phrases this as "Thinking doesn't guarantee we will be right;
          not thinking guarantees we will be wrong" - specifications are for
          reasoning about systems in the deepest sense
       
          jcranmer wrote 22 hours 33 min ago:
          There's a famous quote from Dijkstra: "Program testing can be used to
          show the presence of bugs, but never to show their absence." The flaw
          of testing is that it can only test the behaviors that you think
          might be problematic. To actually reach into the category of
          proactively fixing behaviors that you didn't know could go wrong, you
          have to reach for more exotic tools in the toolkit. Fuzz testing is a
          start down this path; formal verification is a different start down
          this path.
          
          Obviously, the quality of these sorts of tools is dependent on your
          ability to write a formal model that is comprehensive in allowing
          behavior you want to be acceptable and disallowing behavior you want
          to be unacceptable, and we're still surprisingly far from that in
          many fields.
       
          chadgpt3 wrote 23 hours 59 min ago:
          For trivial functions they can be. Function: isPrime. Impl: loop to
          sqrt(n) and check divisibility. Spec: returns true if nothing
          divides. Impl: return true at the end if we didn't return false.
          
          It's not exactly the same because your spec says nothing divides, not
          just up to sqrt(n). It's definitely not a test if you do it right.
       
          IshKebab wrote 1 day ago:
          That can definitely be a problem, and I have definitely written
          formal specs (for hardware in SVA) where I'm like "I'm just
          implementing this again". And actually that's a totally valid formal
          verification task to do - formal equivalence checking of an
          implementation against a model (which is just another
          implementation). I can feel (and sometimes is) tautological.
          
          However you usually try to not do that. If possible (and it usually
          is possible) your formal spec will be either:
          
          1. Another implementation, but a much simpler one. For example you
          can formally verify the equivalence of a pipelined dual issue CPU,
          with a combinatorial single cycle model.
          
          2. More general properties that much be true about the implemention,
          rather than exactly what the implementation must be. The classic
          example is compression: decompress(compress(x)) == x.
       
          comonoid wrote 1 day ago:
          The book "Program = Proof" by Samuel Mimram starts with a formula
          which is true for all n below
          
          n = 15 341 178 777 673 149 429 167 740 440 969 249 338 310 889
          
          I don't think you can catch it with any test suite.
       
            pdhborges wrote 1 day ago:
            If you have an int32 or less you can!
       
              jaggederest wrote 21 hours 7 min ago:
              And you'll rapidly return to proofs when your "function input" is
              something like a sequence of, say, ieee floating point numbers
              coming over the wire of possibly unbounded length. State machines
              with proofs that all the cases are handled are great.
       
          jaggederest wrote 1 day ago:
          It's more than just "write the same tests just in a different way",
          because each test is largely independent, or grows to unmanageably
          large, and you have to do the work of figuring out the completeness
          of your test suite by e.g. branch coverage or other relatively crude
          overlap methods.
          
          Statically proven type systems let you do this in a way that each
          contributing piece is checked in advance against all the other
          pieces, guaranteeing not just "this test passes" but, in combination,
          "all these tests create a reasonable, coherent whole", and it
          disallows incoherent models from being implemented in the code. An
          example of this is like Rust's match, which requires complete
          coverage of all the possible branches, but writ large across an
          entire codebase.
          
          You're right that it does nothing for you if you make a fundamental
          logic or theory error, it can't catch that kind of error, but you'd
          be surprised how much more robust it is than "merely" e.g. Haskell's
          type system + ad hoc functional testing + property testing - which I
          would consider a quite strong system overall, but formally proven
          e.g. cryptography is a much higher bar.
       
          y1n0 wrote 1 day ago:
          I don’t write software, but in asic and fpga design, formal method
          specifications enable the use of things like SAT solvers to determine
          if the underlying test article meets the specification.
          
          You specify properties of the design and the tooling tests the design
          in a variety of ways to see if it can violate those properties.
          
          Let’s say you have a system that controls turn signals.  You can
          specify properties that say things like lanes that cross each other
          can’t both have a green light at the same time or even within some
          period of time of each other.
          
          The tooling can exhaustively check the design for this and surface
          code traces that violate it.
       
          jlebar wrote 1 day ago:
          > Whenever I read about formal specs it always seems to me like
          “write the same tests just in a different way”, or worse,
          “write the same implementation but in a different way”. [...] Can
          anyone enlighten me?
          
          A big difference is that formal methods allow you to use the "for
          all" quantifier.
          
          For example, you might write a unit test that says "foo('abc')
          returns a string with no trailing whitespace".
          
          But with formal methods, you can prove that "for any input x, foo(x)
          returns a string with no trailing whitespace".
          
          This is a trivial example, but you could imagine something more
          complicated, such as "for any program P, compile(P) has the same
          behavior as P".
          
          Of course, you have to define what "has the same behavior as" means!
       
            parthdesai wrote 22 hours 37 min ago:
            > But with formal methods, you can prove that "for any input x,
            foo(x) returns a string with no trailing whitespace".
            
            Isn't that essentially property testing?
       
              akoboldfrying wrote 21 hours 15 min ago:
              Unless you literally try every possible combination of inputs
              (which is usually infeasible), property testing can't give you
              mathematical guarantees about correctness. You can think of it as
              a halfway house between classic testing and formal verification:
              
              Classic testing: A human comes up with some concrete example
              inputs for which they know the "right answers" (corresponding
              outputs). They write code that runs the code under test, gets its
              actual outputs, and compares them to the desired outputs.
              
              Property testing: A human comes up with a precise way of randomly
              generating concrete example (input, desired output) pairs. They
              write some code to describe how to generate the pairs, often
              using a declarative DSL that describes only constraints on the
              inputs and outputs, with the understanding that anything not
              expressly forbidden is permitted, like "The input can be any list
              of between 0 and 100 integers each between -500 and 500" and
              "Every integer in the input must appear the same number of times
              in the output". They then write some more code (often a single
              line) to ask the computer to use this "spec" to randomly
              generate, say, 1000 such pairs, or as many pairs as can be
              checked in 1s. The computer generates the pairs itself, runs the
              code under test on each input and and checks its output matches
              the desired output.
              
              Formal verification: A human comes up with a spec that typically
              describes conditions that must hold for all (input, output)
              pairs. This may look very similar to, or even exactly like the
              DSL used for property testing, though in general there are other
              conditions that can be expressed that cannot be checked with
              property testing even in principle -- for example, checking that
              the program always eventually terminates. The main difference is
              that the code under test is never actually run; instead, the
              computer analyses the source code itself to attempt
              mathematically prove that the stated conditions hold. How to
              actually accomplish this is a field of active research, but one
              basic approach is called "symbolic execution". To greatly
              simplify, if we forget about loops and conditionals for a moment,
              the idea is that we can write down things we know must be true
              after each statement executes, based on the things we knew must
              be true before it executed. So for example if x is a variable
              initially containing any integer (and we ignore overflow) then
              after the line
              
                  x = x * x
              
              runs, we know that x >= 0. To handle conditionals like
              
                  if x > 50:
                    x = 42
              
                  something_afterwards(x)
              
              the prover "forks" into two cases: One in which we know for
              certain that x > 50, one in which we know for certain that x <=
              50. At the end of the if statement it then has the task of
              recombining what is known about the two cases. In this example,
              the first case lets us conclude that x = 42 by the end, while the
              second case lets us conclude that x <= 50 by the end, so it could
              conclude that x <= 50 either way by the time execution reaches
              something_afterwards(x). Handling loops is trickier but generally
              involves looking for invariants.
       
              Agingcoder wrote 21 hours 52 min ago:
              Formal methods allow you to prove that it works for all inputs,
              and not just for the small subset that will be sampled by
              property testing
              
              It’s a proof, not a successful experiment.
       
              Jtsummers wrote 21 hours 56 min ago:
              Property testing is stochastic, which may be fine, but only gives
              you a statistically (hopefully) high chance of discovering a
              problem. If you use something like SPARK/Ada, you can actually
              embed a proof in the code so that you actually know that the code
              is correct (for what you've proven). PBT scales better than
              embedding proofs, though, and is highly effective in practice
              along with fuzzing.
       
            brap wrote 1 day ago:
            >Of course, you have to define what "has the same behavior as"
            means
            
            And that's really my issue, for example when you define "has no
            trailing whitespace", you are basically writing a piece of the
            implementation. Cover all behaviors, and you have basically
            re-implemented the function, no?
            
            In other words, if I have the full formal spec of f(), isn't that
            the same thing as having f()?
       
              nemo1618 wrote 21 hours 24 min ago:
              I think the key is that, while you may think you have a full
              formal spec of f(), you actually do not. You have a program
              written in some language, and that language has its own spec, and
              the language is compiled to asm which has its own spec, and the
              asm executes on an architecture that has its own spec, and so on.
              
              So when you write a function like:
              
                func hypot(x, y):
                  return sqrt(x*x + y*y)
              
              You might think you have "fully specified" hypot, but this is far
              from true! You have said nothing about what registers will be
              used, for example. This is not a problem; quite the opposite.
              It's the whole point of using high-level languages: they let you
              focus on what you care about. A spec is just a program in a
              very-high-level language.
       
                ____tom____ wrote 17 hours 23 min ago:
                Exactly.
                
                You can prove that f(a,b) always returns a+b.
                
                And then you overflow the int on that machine.
                
                Oops.
       
                  Joker_vD wrote 13 hours 38 min ago:
                  Well, ideally you proof should've failed because your axioms
                  should be aware that arithmetic operators in the programming
                  language of your choice are operating modulo some large power
                  of two, and thus don't produce correct results for some
                  particular inputs.
                  
                  Admittedly, this one peeves a lot. Remember when Java's
                  binary search and mergesort (and implementations in many
                  other languages' standard libraries) turned out to have a bug
                  of this kind [0]? Admittedly, the proof was informal, but if
                  you are trying to prove some properties of a program X
                  written in a certain programming language Y, you can't "just
                  assume" that the semantics of Y are different from what they
                  are, right? The integers do overflow in Java, that's
                  explicitly stated in the Java's spec... and that means that a
                  lot of even the most simplistic code has some very
                  non-obvious correctness preconditions, which most of the
                  times, I believe, are simply hoped to be true.
                  
                  [0]
                  
  HTML            [1]: https://research.google/blog/extra-extra-read-all-ab...
       
                  codebje wrote 14 hours 30 min ago:
                  I am not sure that the perspective you have taken is the same
                  as what I understood from the parent post; what I took from
                  it is that things like registers, memory locations, ways to
                  implement square root, and so on, are all _implementation
                  choices_ that are not important properties of the
                  specification. You specify that the hypotenuse is the square
                  root of the sum of squares, but whether square root is
                  implemented using Newtonian approximation or a fast inverse
                  square root is irrelevant; whether your first argument is
                  passed in a register or on the stack is irrelevant.
                  
                  Often, things like resource usage are not specified: running
                  time, memory consumption, etc, aren't relevant enough to
                  appear in a behavioural specification.
                  
                  If your spec says "f(a, b) returns a + b", but it's just a
                  high-level document you can use to help guide your
                  implementation, integer overflow is just one of many ways
                  your implementation might be inconsistent with the
                  specification. It's still likely that the existence of a
                  formal specification you reference during implementation
                  means that more edge cases have been considered ahead of time
                  than if you just had an informal spec.
                  
                  If, on the other hand, you prove it but it turns out not to
                  be true (ie, you overflow integers), your proof is wrong. If
                  a machine verified your proof and gave you a big thumbs up,
                  your machine verification is wrong.
                  
                  If, in Idris, I write "f : (a : Nat) -> (b : Nat) -> (c : Nat
                  * c = a + b)", then I cannot compile an implementation for
                  which I can't show a proof that the result is _always_ the
                  addition of a and b, for all a and b, unbounded by anything
                  but the resources at hand with which to run the program. An
                  implementation subject to integer overflow won't compile.
                  
                  Or, I could write "f : (a : Bits32) -> (b : Bits32) -> (c :
                  Bits32 * c = a + b)" and implement something where , but then
                  modulo arithmetic on overflow is _part of the specification_,
                  because "+" in there is doing the heavy lifting of being
                  defined as addition modulo 2*32 already; by specification, 4
                  billion plus 4 billion is ~3.7 billion.
       
                    ____tom____ wrote 13 hours 39 min ago:
                    Perhaps I have misunderstood or was unclear.
                    
                    Everything the parent comment mentioned were implementation
                    details that did not affect the correctness of the code.
                    
                    I just wanted to point out that there are implementation
                    details that DO affect the correctness of the code.
                    
                    And, of course programs need to run on multiple
                    architectures. So it's hard to do what people seemed to be
                    talking about in this thread and verify code just from the
                    source code.
                    
                    If you have the luxury of proving the correctness of the
                    CPU, compiler and OS, that should be a big win. Otherwise,
                    it seems to just be another type of testing. Still useful,
                    but calling it verified or proven seems a bit much.
                    
                    From my perspective, it seems more to be writing another,
                    more complicated program, with more opportunities for bugs,
                    and seeing if the results agree
       
              chadgpt3 wrote 23 hours 56 min ago:
              Non trailing whitespace means the string doesn't end with a
              space. But foo is a function that converts an AST to a string,
              that's totally different. Or it's a function that loops until \0
              and changes all spaces to +
              
              The spec should be a summary of what the impl is supposed to do.
              You'd want more than just doesn't end with whitespace of course.
       
                ____tom____ wrote 17 hours 28 min ago:
                > Non trailing whitespace means the string doesn't end with a
                space
                
                No. And this is a great example of the problems with
                specifications. You still have to write a spec. And this, too,
                is subject to bugs.
                
                What's wrong with the statement above is there are 17 space
                characters in Unicode and another eight whitespace characters,
                like newline.
                
                If you try to verify that something ends in whitespace, you
                have to make sure you have the right definition.
                
                Not picking on parent poster! It's just a great example of the
                fact that you can verify, but if what you are verifying is
                wrong, it doesn't help.
       
                  Joker_vD wrote 5 hours 28 min ago:
                  > there are 17 space characters in Unicode and another eight
                  whitespace characters, like newline.
                  
                  And of course, those 25 characters don't include ZERO WIDTH
                  {SPACE,NON-JOINER,JOINER,NON-BREAKING SPACE} and WORD JOINER,
                  which gives you yet another 5 arguably "it's kinda space,
                  right" codepoints which definitely should not be trailing in
                  any reasonable text string.
       
                    ____tom____ wrote 2 hours 58 min ago:
                    Unicode never ceases to amaze me.
       
              bluGill wrote 1 day ago:
              > In other words, if I have the full formal spec of f(), isn't
              that the same thing as having f()?
              
              In some cases, however quite often, the spec is much simpler. For
              instance, it's easy to say that after running sort on some list,
              that the result is sorted. However, it is very hard to come up
              with the algorithm to do that from the specification. Sometimes
              that is even a point. Bubble sort, quick sort, tim sort, we can
              go on and on. There's a huge number of different sorts that
              computer science have discovered over the years. They all should
              have the same result and so you should be able to prove they do
              the same thing. However, in the real world there are often
              reasons you would prefer one to another despite all meeting the
              same spec.
       
                gsnedders wrote 21 hours 19 min ago:
                Another obvious example are cryptographic hash functions: if
                you have a function f(s) = h, you can trivially specify a
                function inverse_f(h) = s st f(s) = h, and if you can infer a
                non-brute-force algorithm for that, you’ve just inferred a
                cryptographical weakness!
       
                brap wrote 23 hours 59 min ago:
                This makes a lot of sense, thank you!
       
                pydry wrote 1 day ago:
                I did some of this stuff in college and what bugged me was that
                the spec actually ended up more complex than the code and it
                had bugs.
                
                That was a long time ago and they said that formal methods were
                the future back then, too.
       
                  chadgpt3 wrote 23 hours 55 min ago:
                  It's possible for that to happen but probably means either
                  the function is too trivial or you're missing some
                  abstraction in the spec
       
                    pydry wrote 13 hours 41 min ago:
                    It's also possible it's not the future of programming.
       
                      chadgpt3 wrote 13 min ago:
                      It never was. It's useful but extremely slow.
       
        spenrose wrote 1 day ago:
        As usual, this paean to deductive reasoning (“formal methods”)
        leaves out its fundamental limit: how closely do the postulates and
        definitions fit the domain they purport to map? (“In theory, there is
        no difference between theory and practice. In practice ...”) My guess
        is that Jane Street maintains large code bodies where the mapping is
        1:1, because the purpose of the code is to implement a deterministic
        algorithm. Many other coders work in such areas. But millions of us
        don’t: most UIs, most exploratory work, etc.
        
        There is a movement parallel to formal methods to define acceptance
        criteria at high resolution but not logico-mathematically, which at
        least grapples with the mapping problem but can’t resolve it where
        the map isn’t the territory, which is most places. Has Google’s
        results page, with its extremely evolved internal optimization
        frameworks really hit an optimum? Could that prototype you whipped up
        to capture a hazy idea have better illustrated it? These questions are
        best answered by looking outside the system to what the system serves.
       
          sn9 wrote 22 hours 0 min ago:
          Most UIs in practice boil down to state machines which are extremely
          amenable to formal verification.
          
          Hillel Wayne's writing is a good starting place to learn more:
          
  HTML    [1]: https://www.hillelwayne.com/formally-specifying-uis/
       
          petra wrote 1 day ago:
          Google's results page isn't well defined. But probably 90%+ of the
          code below it is well defined.
          
          And in some cases, where the result isnt well defined, it can be
          learned, so it's not about sitting and thinking what would make
          sense.
       
          benlivengood wrote 1 day ago:
          Formal methods are precisely for the domains where the semantics are
          well-defined.  Logical circuits (a lot of CPU components get formal
          verification), kernels, protocols, parsers, compilers, cryptography,
          security frameworks, concurrency primitives, etc. all benefit a lot
          from verification.
       
        pjmlp wrote 1 day ago:
        I am already seeing the uptake on Spec-Driven Development as the
        Rational Unified Process revenge.
       
        awinter-py wrote 1 day ago:
        See previous kleppmann post [1] , and yes, obviously anything that you
        can put in the typesystem or the linter, you should weigh doing so.
        
        Hopefully we get more ergonomic ways to do this? Like of the tools
        listed in the post, dafny + iris are the closest to being industrial I
        think. And amzn S3 has a history of TLA use in-house I think. But we
        probably haven't seen the typescript in this space yet, a zero cost
        abstraction that drops into existing tools, and people genuinely prefer
        it to the old way.
        
        (And custom linters are also still pretty bad to write. Like
        golangci-lint is a painful codebase, haven't tried semgrep but the
        rules engine seemed intimidating. I've yet to use an AST API that I
        liked)
        
  HTML  [1]: https://martin.kleppmann.com/2025/12/08/ai-formal-verification...
       
        mark4 wrote 1 day ago:
        Relevant keynote I wrote about
        
        Formal methods won, now what?
        
  HTML  [1]: https://muratbuffalo.blogspot.com/2026/04/bugbash26-keynote.ht...
       
        xvilka wrote 1 day ago:
        seL4 despite being formally verified contained gnarly bugs, like [1].
        Thus, it's not a silver bullet as some people think. Yes, it improves
        the quality, but only one of the aspects of hardening software, like
        ASAN, SAST, fuzzing, strict languages, and so on.
        
  HTML  [1]: https://github.com/seL4/seL4/issues/1199
       
          cayley_graph wrote 1 day ago:
          > Since it is the unverified SMP config of the kernel
          
          I don't disagree with your point (formal verification does not rid
          you of all bugs), but this is not the subject of the linked issue.
          This was a bug in an unverified path.
       
        bobkb wrote 1 day ago:
        I have been testing formal verification methods with multiple products.
        It will be great to also understand more about what’s tried and how
        it was done. For example attempting to verify the spec is what I have
        been trying to implement.
       
        jdw64 wrote 1 day ago:
        In other words, because GEN AI a lot of code, the idea is to shift
        human value toward verification. Sometimes I think about what
        programming really is. In fact, learning programming itself is a huge
        challenge for a non English speaker like me. I have to rely on machine
        translation to understand English documents that have no translation.
        The materials in my language are about 5 to 6 years behind.
        
        Now, since it's impossible to code review the tens of thousands of
        lines of code that AI produces, I see discussions about establishing an
        absolute rule like mathematical proofs. Reading this reminds me of
        Rust's borrow checker. In fact, after writing in Rust a few times, it
        often leads to bad practices where people use tricks to avoid the
        borrow checker.
        
        Actually, when mathematical rigor goes too far, humans tend to find
        ways around it. An undereducated programmer like me is especially prone
        to that.
        
        Looking back at this kind of attempt, it will probably result in
        writing code only for specific formalized answers. If it becomes that
        standardized, I'm not sure it will be able to respond to human needs.
        
        I think these defensive programming attempts are fine, but I want to do
        offensive programming (I coined that term). You take risks, but you fix
        things quickly and ship. Believing that over time, it will become good
        enough. Of course, for established industries where accuracy matters
        and the scope of work is well defined, like Jane Street, the approach
        in this article is correct. In other words, because there is enough
        data to adequately model the market's demands
        
        But for social losers like me trying to make money, constantly moving
        from place to place looking for gold mines, these kinds of
        methodologies seem like a luxury.Established businesses with mature
        modeling need highly educated and specialized personnel to continuously
        optimize. But I know that realistically, I can't keep up with that
        demand. So I look for places where modeling is unstructured, but I'm
        not sure if I can use this approach even then.
       
          closeparen wrote 36 min ago:
          >Now, since it's impossible to code review the tens of thousands of
          lines of code that AI produces
          
          I would simply not ask the AI to write me more code until I'm
          satisfied with the code it's already written! But it is true there's
          more of an opportunity to let the thing rip if you can give the
          harness the ability to meaningfully verify its own work.
       
          rramadass wrote 3 hours 32 min ago:
          Your ideas about Formal Methods are not clear nor quite correct.
          
          Read first the paper On Formal Methods Thinking in Computer Science
          Education to understand the different levels of practice available.
          Here is a previous comment of mine which explains and links to the
          paper - [1] Carroll Morgan just published his Formal Methods,
          Informally: How to Write Programs That Work which teaches you how to
          think in a formal method manner before you start using the
          heavyweight tools - [2] Also read Understanding Formal Methods by
          Jean-Francois Monin to get an overview of some of the tools and the
          concepts/mathematics embodied in those tools.
          
          With just the basic ideas from the above viz. Set Theory, Predicate
          Calculus, learning to think of a Program as a trajectory through a
          state space, you can start enforcing the  trajectory using simple
          asserts(i.e. predicates) for preconditions/postconditions/invariants.
          Now because of Curry-Howard Isomorphism ( [3] ) you can treat
          "propositions/formulae as types" and thus exploit the type system
          itself as constraints enforcing the above trajectory.
          
          Once the above is grokked, you can move on to more complex logics
          (eg. FOL/HOL/Temporal/Separation etc.) and learn about
          tools/methodologies which implement them (eg. Alloy/B-Method/TLA+
          etc.).
          
          Finally, with AI tools, the threshold for the practice of formal
          methods has dramatically come down. This enables one to do Formal
          Specification and Verification with guaranteed traceability for
          AI-generated code which IMO is a necessity.
          
  HTML    [1]: https://news.ycombinator.com/item?id=46298961
  HTML    [2]: https://www.cambridge.org/highereducation/books/formal-metho...
  HTML    [3]: https://en.wikipedia.org/wiki/Curry%E2%80%93Howard_correspon...
       
          teiferer wrote 14 hours 53 min ago:
          > I have to rely on machine translation to understand English
          documents that have no translation. The materials in my language are
          about 5 to 6 years behind.
          
          Pretty much off topic, but I strongly recommend you learn English. It
          takes a little bit of effort, but getting rid of that constant
          translation overhead will be an enormous boost for you.
       
            curuinor wrote 4 hours 38 min ago:
            guy's korean, looks like? korean <-> english is the hardest
            language pair of two industrialized-nation languages, almost.
       
          altmanaltman wrote 15 hours 37 min ago:
          I don't understand the point of your comment here. Yes of course
          JaneStreet published that since its relevant to them. Yes, it doesn't
          generally apply to all programming. How is you being a "social loser"
          or not following these practices in your career relevant to this?
       
          orochimaaru wrote 19 hours 10 min ago:
          You have to see it from janestreet's perspective. They're an HFT and
          trading high volume (millions if not 10's of millions) of stock &
          instruments. There is no "fix". By the time you understand what's
          wrong you've lost billions.
          
          So yeah - offensive may work in non-critical areas.
          
          Fwiw - you already use defensive everywhere. Python, Java, etc. come
          with garbage collectors. It's verified that the code is executing
          your intent.
          
          I was wondering when we would start seeing formal verification. It
          makes sense that we would go from worrying about implementation
          details to a scientific/mathematical description of the problems.
       
            coderenegade wrote 17 hours 58 min ago:
            >Fwiw - you already use defensive everywhere. Python, Java, etc.
            come with garbage collectors. It's verified that the code is
            executing your intent.
            
            Sort of. Garbage collectors can be fallible too, especially where
            release optimization is used.
       
          ngruhn wrote 23 hours 28 min ago:
          > I want to do offensive programming (I coined that term). You take
          risks, but you fix things quickly and ship.
          
          Nice, I like the term too. But the paradigm is absolutely status quo
          in the industry. The thing is: with Gen AI the cost of "defensive
          programming" has gone way down, while the cost of (human)
          verification has gone way up. On the other hand, formal methods make
          verification cheap but come with massive implementation overhead
          (writing specs, types, proofs, and generally bending the
          implementation into a rigid framework). But Gen AI can automate all
          that laborious work. It's a match made in heaven.
       
            jdw64 wrote 23 hours 3 min ago:
            You're right. I think the point we need to discuss here is, to be
            clear, dividing the contexts in which defensive programming is
            strong and where it should be aggressive.
            
            I think the overhead of implementation is enormous, but I believe
            AI can write it. However, before even reaching the 'implementation'
            stage, that is, at the planning stage, sufficient data must be
            collected for the implementation to be safe.
            
            In that respect, I think Jane Street already has enough data and
            modeling capabilities. However, I think it's a bit difficult to say
            whether this approach will take shape in many other domains.
            
            In that sense, I also think that the reason many industries are
            doing this kind of fast deployment and experimental tooling might
            be a preparatory step for that kind of modeling. Have a good day
            Thanks for the good comment. It helped me think more sharply as
            well
       
              majormajor wrote 18 hours 29 min ago:
              > I think the overhead of implementation is enormous, but I
              believe AI can write it. However, before even reaching the
              'implementation' stage, that is, at the planning stage,
              sufficient data must be collected for the implementation to be
              safe.
              
              I don't really follow here, I think once you know what you want
              the system to do, you could start to model it formally. What data
              do you think needs collection for planning here? Is it just for
              performance profiling of whatever algorithms are decided on? But
              that seems equally as relevant as if you do your initial
              implementation straight-up in code w/o any formal approach.
       
                jdw64 wrote 17 hours 19 min ago:
                Formal verification is a mathematical proof and requires
                axioms. However, if real world field data is not collected, you
                are not comparing theory to reality but forcing reality to fit
                the theory, like the bed of Procrustes.
                
                This means the object of verification is not reality but the
                consistency between specification and implementation, and
                incorrect formal verification only refines a wrong worldview.
                
                A company with sufficient capital can turn a theory into
                reality through marketing and other means. But for small
                businesses or in markets where new competitive logic is
                introduced due to the lifespan of an industry, this can be a
                fatal problem.
                
                For example, consider the stock market. If you use the Buffett
                indicator to invest right now, the market looks overheated. But
                other indicators suggest positive prospects. We cannot know
                whether the market logic of the AI era aligns with the past or
                is different, but I believe there are cases where modeling
                changes due to real world shifts.
                
                In reality, when converting the real world into mathematics or
                physics, information loss is inevitable. In this process,
                science may later advance and become more precise, or a value
                once considered important may turn out to be wrong.
                
                In other words, something may be correct within a given set of
                axioms, but those axioms themselves may change depending on the
                context. The history of computer standards has shown this as
                well(ASCII -> Unicode)[To explain this point, ASCII works
                perfectly in English speaking countries, but in my country it
                creates incorrect encoding.]
                
                Programmer's ability may come down to distinguishing between
                what changes little and lasts long, and what changes quickly
                
                After writing the replt, I realized I was being too critical of
                formal verification. I think Jane Street's argument holds for
                invariants inside code. However, I am cautious about whether
                that logic can be extended to other fields. I haven't studied
                deeply enough, so my thinking may be shallow. Please understand
       
          stephenlf wrote 1 day ago:
          > these kinds of methodologies seem like a luxury
          
          Absolutely. The article acknowledges this. Jane Street is pretty
          uniquely equipped to benefit from this.
       
        eddiepete wrote 1 day ago:
        I wonder how formal methods can help us move faster with GenAI.
        
        Is it that they can help write formulas faster? That they can help
        ensure formulas match the system they're modeling faster? If the
        problem you think formal methods will help with is sloppy code, isn't
        the verification code going to be sloppy as well, unless some (not
        sloppy) intelligence carefully confirms that the specification matches
        the target system, which was the labor that previously made formal
        methods too expensive? I guess I don't understand how the argument
        works if code was previously less sloppy and verification was too
        expensive, and now code is more sloppy, and there's more of it, but
        somehow the sloppy intelligence will make verification move fast enough
        to make it worthwhile.
        
        Unless we have some non-sloppy intelligence that's less of a bottleneck
        on verification than humans, how are we in a better place?
        
        Maybe it's that investing that huge amount of labor of verification by
        human experts is now worth it because so much code will be produced
        that uses the verification systems that the investment will now pay
        off. But that requires creating pretty general verification systems,
        such as type system verification or something (which is what they seem
        to be aimed at), rather than individually verifying software systems
        like the micro-kernel example.
        
        In other words, maybe the play is to invest in reusable verification
        systems that can be run tons of times on new code and systems. If so,
        it's surprising that this wasn't always the strategy.
       
          jsenn wrote 1 day ago:
          > isn't the verification code going to be sloppy as well
          
          The beauty of formal methods is it doesn't matter if your proof is
          sloppy. As long as it passes verification, it is correct. And unlike
          in pure math, the proof that a software system is correct is usually
          a huge mess of special cases, loop invariants, proofs by induction,
          and boilerplate that requires a large amount of human labour while
          providing no insight.
          
          Proofs are also brittle: a tiny change in the code can force you to
          throw your proof away and start from scratch.
          
          To me, the exciting thing about formal methods in the LLM era is it
          allows humans to offload the difficult and tedious work of writing
          proofs to a computer. Taken to an extreme, the human could live
          entirely in the world of a formal specification, and the LLM could
          generate 100% of the code. The code may be a mess, but if the system
          proves it satisfies the spec then it can't be wrong.
       
            pron wrote 1 day ago:
            The problem is that generating either code or proofs with LLMs is
            very expensive, and generating good proofs (I don't mean elegant, I
            mean proving the most important properties) is probably not very
            fast, either. Reducing the verification time of a program from 100
            years to 10 years or the cost from $1bn to $100m is still not
            practical enough to become truly mainstream.
            
            Things can be improved when people help guide and focus the LLMs,
            but these people still need to be formal methods experts.
       
            odyssey7 wrote 1 day ago:
            So, formal methods produce runnable systems, but communication
            remains the challenge.
            
            If a formal spec is messy, then it's a proof of ... what, exactly?
            
            A formal specification that bridges tech and product, that lets
            non-technical contributors read and discuss all the logical
            nuances, directly as operational code, at product's level of
            abstraction of interest, would transform a lot.
            
            It's no longer a challenge to create code, it's a challenge to
            create business requirements and translate them into systems.
       
              rzmmm wrote 21 hours 45 min ago:
              The spec and proof are separate. In this blog article he mentions
              seL4 formal verification, where they state that the spec was 4900
              lines of Isabelle and the proof was 200K lines. Obviously human
              has to understand the spec deeply.
       
                jaggederest wrote 20 hours 47 min ago:
                There's an information theoretic aspect about generating a
                proof which is essentially not human readable from 4900 lines
                of spec. I wonder how much additional signal they're getting
                out beyond what's in that 4900 lines, and what's the percentage
                of noise in the 200k lines of proof?
       
          jnwatson wrote 1 day ago:
          The general idea is that formal methods are self-verifying, up to a
          point.    Sloppy formal methods simply won't prove.
          
          The point up to which formal methods stops is: do the formally
          encoded requirements actually encode what I want to be true?
          
          One can make the argument that the requirements is a much smaller
          surface to verify than that of the entire program.
          
          The counter argument is that figuring out what you want the program
          to do has always been the hardest part of programming, and that
          programming in itself is a journey to discover latent requirements.
       
            Veserv wrote 1 day ago:
            > One can make the argument that the requirements is a much smaller
            surface to verify than that of the entire program.
            
            This argument is unfortunately empirically false for any program of
            any meaningful complexity (>1000 lines, probably even as low as
            >100 lines ignoring well-defined algorithms and data structures)
            using current formal methods.
            
            Complete formal specifications are usually multiple times larger
            than the corresponding source code and encode esoteric propertys
            necessary for the proof, but which are largely even more
            impenetrable than a undocumented codebase.
            
            So, it is both harder to figure out if you encoded the desired
            requirements and it is more complex. Your only advantage is
            confidence that what you wrote down is proven.
       
              akoboldfrying wrote 12 hours 29 min ago:
              > Complete formal specifications are usually multiple times
              larger than the corresponding source code and encode esoteric
              propertys necessary for the proof, but which are largely even
              more impenetrable than a undocumented codebase.
              
              Is it possible that the spec could be factorised into something
              higher-level and more modular? If not, can you give a flavour of
              the type of unavoidable esoteric detail? (One area I can see lots
              of complications is when dealing with versioned data, especially
              in multiple interacting systems -- naively, correctness needs to
              be proven for every combination of versions, even if some are
              never seen.)
       
       
   DIR <- back to front page