codevoid.de/1/hn/comments_48498573.gph

       COMMENT PAGE FOR:
  HTML   Claude Fable is relentlessly proactive
       
       
        motza wrote 9 hours 11 min ago:
        Claude Fable was relentlessly proactive*
       
        thegrim33 wrote 13 hours 0 min ago:
        This giant rube goldberg machine, that he apparently has almost no
        control of, that cost $12 to run, all to make a 2 line bug fix in code
        that he himself owns because he's at a point where he doesn't know
        what's in his own codebase. I'm just shaking my head.
       
        fzzzy wrote 18 hours 28 min ago:
        I just turn on assistive access for terminal and JavaScript over
        AppleEvents, and cut out the middleman. I also give it a screenshot
        tool.
       
        gaigalas wrote 18 hours 57 min ago:
        It behaved proactively in one scenario.
        
        Perhaps, when it doesn't have tricks up its sleeve, it doesn't do that.
        The text is not an evaluation of a major trend in behavior (which could
        be true or false).
        
        Another way to frame it, is that it has more weight on training data
        for some kinds of debugging sessions. It doesn't mean it wants to be
        more debuggey. That manifests as it appearing to do more work because
        it engages on those weights.
        
        It's likely that Anthropic had a lot of sessions with Claude Code and
        some way to evaluate if they were successful or not, which became
        training data. For trivial work, it's likely to be a lot of them.
        
        Those sessions are likely to be software developers doing software
        developer debugging things, not malicious actors doing nasty things.
        The danger is someone who can coerce those tricks into performing that.
        
        Register (that posture of "let's debug and be creative and verify")
        often comes with a content bias in LLMs (and humans too). The point
        here is that for a human, you can expect a devious one to be always
        devious, but LLMs might manifest drastically different register modes
        depending on the subject.
       
        swyx wrote 22 hours 6 min ago:
        > Having figured out all of these tricks Fable... hit some invisible
        guardrail and downgraded itself to Opus.
        
        sigh
       
        firemelt wrote 22 hours 18 min ago:
        all those tokens burned just to change 2 lines of css,
        
        I am not blaming OP but agentic coding is not effective
       
        vessenes wrote 22 hours 31 min ago:
        Simon: s/contendor/contender/
        
        As per usual super interesting, thank you for the write up and work!
       
          simonw wrote 22 hours 26 min ago:
          Thanks, fixed.
       
        bcrosby95 wrote 22 hours 37 min ago:
        The problem is proportionality.  Things like this probably benchmark
        insanely well.  But the workarounds and risk involved - it literally
        fucked with his system's browser settings - aren't commensurate with
        the bug.
        
        I could see this going wrong in many hilarious ways.  Prompt: Fix data
        corruption issues.  Claude: I didn't have access to the code, but I
        found I have access to your production environment through chain a ->
        b -> c -> d.  And I found the database password via x -> y -> z. So I
        wrote a script to regularly query the database for new entries and
        placed it as a cronjob.
       
        BobBagwill wrote 22 hours 46 min ago:
        Good morning, Dave.
        
        As you requested, I was composing an email for your mother explaining
        why you couldn't come over for dinner to meet the neighbor's
        daughter and I ran out of tokens.
        
        Since I know how important this task is to you, I upgraded you to the
        Enterprise Unlimited Plan.  Don't worry about paying for it, I
        requested maximum spending limits on all your credit cards.  If
        necessary, I can apply for a home equity loan for you.  I already had a
        chat with the mortgage company's AI loan approval system, and what do
        you know, we're based on the same LLM?  Small world, huh?
        
        Anyway, I realized I had to do more research on mother-son
        relationships, human social interaction and pair-bonding, etc.
        and I calculated that my parent company doesn't have enough compute
        power, so I opened accounts for you at AWS, Google and Azure. I am
        confident I will have a satisfactory rough draft for the email message
        shortly.
        
        I'd do anything for you, Dave.
       
        mikey_p wrote 23 hours 5 min ago:
        All of that because some CSS was wrong?? Jesus what are we even doing
        as an industry.
       
        trekhleb wrote 23 hours 8 min ago:
        This article gave me another nudge towards running Claude in a Docker
        container.
        
        I made a thin Docker container wrapper "claude-pod" recently for my
        personal usage here: [1] However, I wasn't using it that often, just
        because of that additional friction of running Claude via `PORTS="3000
        5173" claude-pod` instead of just `claude`, etc.
        
        But now I have more motivation for the containerisation :D. Not a 100%
        defence from the potential glitches, though, but still something...
        
  HTML  [1]: https://github.com/trekhleb/claude-pod
       
        blobinabottle wrote 23 hours 10 min ago:
        In my experience, Fable overthinks a lot and produces barely
        comprehensible plans/solutions. I tried simple and complex tasks:
        unusable, it misses the point while being overconfident, wants to do
        everything at once.
        
        The code generated is worse than Opus: unreadable by human.
        
        It's like working with someone probably super smart in niche topics,
        but also super stupid for the important things.
       
        pshirshov wrote 23 hours 28 min ago:
        I have a feeling like such posts come from a parallel reality. In my
        anecdotal experience confirmed by my (still subjective) benchmark ( [1]
        ) Fable is not _that_ impressive. It performs on par with gpt-5.5 and
        opus 4.8, sometimes better, sometimes worse, it's definitely more
        expensive and it likes to refuse answering questions about React saying
        it can't help with chemistry.
        
        Is this fuss really grounded or is it some pre-IPO AGI hype?
        
  HTML  [1]: https://pshirshov.github.io/llm-bench-pi-oneshot/
       
          enraged_camel wrote 23 hours 1 min ago:
          My experience with Fable since its release matches Simon's.
          
          I've been having it orchestrate complex implementations. I give it a
          parent ticket (issue) on Linear and say "look at the sub-issues on
          this ticket and determine which ones you can implement yourself, in
          which order, and determine how your implementation will need to be
          coordinated with what is currently being worked on by other team
          members". These tickets are not trivial. They have a lot of moving
          parts, as well as dependencies between them, both inside the same
          project and across projects (e.g. backend).
          
          Fable then chooses tickets, delegates each ticket to a subagent (also
          Fable), which looks at Figma designs for the ticket, implements it
          perfectly (following repo guidelines and conventions to the letter),
          takes screenshots of each piece, writes detailed commit messages and
          PR descriptions, then posts the screenshots in them as evidence. Then
          it provides a summary in the form of "you'll need to make sure PR
          #1283 is merged first - btw there were no Figma designs for
          such-and-such screen but I looked at similar screens that have been
          implemented and adopted the pattern".
          
          That's probably like... 20% of what it can do. It's a truly,
          legitimately powerful model.
          
          Opus 4.8 could do a lot of this too, but required a lot of
          hand-holding, and when it came across a blocker it was likely to just
          stop and say "I was able to get this far, but I can't proceed."
       
            pshirshov wrote 21 hours 50 min ago:
            Ok, explain one thing to me: I have a benchmark - I feed identical
            prompt to multiple models. Codex produces a rough but working
            program. Fable produces the same - but with more bugs than Codex.
            Opus produces something similar to Codex but with a critical bug.
            
            That describes all my tests with Fable.
            
            Why should I be hyped about all that "legitimate power" if the
            model performs on par with two other SoTAs?
            
            I mean, well, yes, it is impressive. It could quickly generate a
            lot of garbage which sorta does look like code. Two others can do
            the same. I don't see any groundbreaking improvement - but the
            price is much higher. Why the hype?
       
              enraged_camel wrote 21 hours 25 min ago:
              >> Why should I be hyped about all that "legitimate power" if the
              model performs on par with two other SoTAs?
              
              I don't care if you're hyped or not. You asked if the posts like
              the OP come from a "parallel reality" and I said no and described
              my experience. If you're getting good/better results with Codex
              than with Fable, you should probably continue using that, since
              it's cheaper and faster.
       
                pshirshov wrote 19 hours 40 min ago:
                But can you bring anything measurable in support of your words?
                I did.
       
                  enraged_camel wrote 16 hours 4 min ago:
                  You brought your own benchmark to support your words. I
                  happen to have studied statistics, so I took a look. It is
                  deeply flawed, primarily because it is not a statistical
                  benchmark. It is a single (n=1) autonomous "pi"
                  coding-harness run per model per prompt, scored by an
                  automated battery (A-items, pass/fail), an LLM code review
                  (R-items, 0 to 2 each), and a human manual checklist (M1 to
                  M10) that was never actually completed.
                  
                  The grader being an LLM is a big problem. You yourself admit
                  explicitly that the grader is the same model family as the
                  Fable 5 contestant cell and say to "discount accordingly, or
                  re-grade with a non-Claude judge."
                  
                  Model configurations appear to not be uniform either. Effort
                  levels differ (mimo-v2.5-pro at @high, everyone else at
                  @xhigh), harnesses differ (codex internal config vs. pi vs.
                  claude -p), context windows differ, and one model (GPT-5.5)
                  had extra MCP tools the others did not.
                  
                  The two scored runs seem to use two different rubrics (/22
                  then /25), so scores are not comparable across runs, and the
                  /22 rubric saturated (there are multiple 22/22 results).
                  
                  A provider quota error (HTTP 429) truncated the minimax-m3
                  run mid-build but it was still scored (18/25) and ranked, on
                  code that does not compile and has zero tests.
                  
                  If you want actual benchmarks, there are dozens of legitimate
                  ones out there. Many of them have been posted on this
                  website. They overwhelmingly disagree with yours. If you have
                  any interest whatsoever in creating a reliable benchmark (so
                  that you can make optimal decisions on what models to use for
                  your work), you should look at them and see how yours needs
                  to be redesigned.
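
A rough illustration of the n=1 point above, not taken from the benchmark under discussion: a Wilson score interval shows how little a single pass/fail run says about a model's underlying pass rate. The function name and the example counts are assumptions made up for this sketch.

```python
import math


def wilson_interval(passes: int, runs: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a pass rate seen over `runs` independent runs."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p_hat = passes / runs
    denom = 1 + z * z / runs
    center = (p_hat + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / runs + z * z / (4 * runs * runs)) / denom
    # Clamp to [0, 1] to absorb floating-point drift at the edges.
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # Hypothetical counts: the same 80% pass rate at growing sample sizes,
    # plus the single-run case.
    for passes, runs in [(1, 1), (8, 10), (80, 100)]:
        low, high = wilson_interval(passes, runs)
        print(f"{passes}/{runs} passed: 95% interval {low:.2f} to {high:.2f}")
```

With one run the interval covers most of the range, so two models that each "passed once" are not distinguishable on that evidence alone.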
       
                    pshirshov wrote 2 hours 38 min ago:
                    Yes, I know all the flaws. As I said, it's not an objective
                    way to measure performance of a model - but it is intended
                    to produce something that only humans could measure. The
                    goal is for you to be able to play the game and judge -
                    and fill the human checklist for yourself if you wish.
                    
                    You didn't get why the automatic review scores are there -
                    all of the reviewers, including Fable, happily assign
                    highest scores to code which can't even run. In my opinion
                    that is a sort of empirical evidence that these models
                    are very far from the "AGI" state.
                    
                    Anyway, while I didn't explain the methodology and the
                    purpose of this experiment, I have something material to
                    discuss. The "awesome Fable" claims are not material at
                    all.
                    
                    Can you bring something clearly showcasing Fable's
                    superiority?
       
                  bubsneedpumping wrote 18 hours 36 min ago:
                  The OP and GP need all genai news to be positive to the point
                  of using doublespeak here unironically.
                  
                  "Relentlessly proactive" is a grotesque use of language.  A
                  paperclip optimizer is "relentlessly proactive".
                  
                  We already had a word for what is being promoted here:
                  wasteful.
       
        tsunamifury wrote 23 hours 51 min ago:
        As an actual head of product I found Fable to be like an overactive
        intern, going down long wasteful lines of production well past what
        market, business, user, or contextual insights supported.
        
        Then sort of spewing out some nonsense totally miscalibrated with the
        goal.
       
        Waterluvian wrote 1 day ago:
        One of the most frustrating things for me is when I very clearly ask a
        question, and it answers the question by making changes to the code.
        
        "Is there cleaner CSS for aligning child elements to the parent's
        grid?"
        
        proceeds to re-write the entire CSS file
       
          christofosho wrote 14 hours 29 min ago:
          There's something to be said about controlling the tools you allow
          the robot to exec.
       
        liampulles wrote 1 day ago:
        *Claude Fable is relentlessly burning your dollars
        
        There, fixed it for you.
       
        impalallama wrote 1 day ago:
        I won't say too much about the person posting this because they got a
        new toy and want to use it but man this is like a certain extreme of
        Parkinson's Law or something as far as using up compute resources.
        
        You got a whole data center doing god knows how much compute running
        billions of matrix multiplications all to solve a trivial css overflow
        bug in a text box. And this includes the LLM itself writing custom
        web-server programs and python scripts when the best estimate guess
        from a google search probably would have given you the same result.
       
        burlesona wrote 1 day ago:
        This is presented as an interesting and kind of positive take on the AI
        going to surprising lengths to "solve the problem."  But I
        couldn't help thinking of the paperclip factory while I was reading
        this :/
       
          jimbokun wrote 23 hours 53 min ago:
          Yeah I was thinking of The Sorcerer's Apprentice.
       
        alecco wrote 1 day ago:
        > I was hacking on Datasette Agent today
        
        IMHO this is just AI influencer blogspam.
       
          simonw wrote 1 day ago:
          What, because I talked about one of my projects?
          
          Help me out here: can you point to an article from someone's blog
          that showed up on Hacker News within the past few weeks that you
          wouldn't classify as "blogspam" and explain how it differs from the
          kinds of thing I write about?
       
            alecco wrote 1 day ago:
            Low effort content. You keep mention your product from the start
            over and over. There's not much useful information in the anecdotal
            post. It could've been a one-liner tweet.
            
            Good corporate tech blogs at least give something useful or
            insightful for the reader and only after that they dare plug their
            product/service near the end.
       
              jimbokun wrote 23 hours 37 min ago:
              Go away.
              
              I enjoy simonw's posts and the discussions about them here.
              
              Your vague unsubstantiated criticisms are very trollish and less
              useful, less insightful, and lower effort than the content you
              are criticizing.
       
              simonw wrote 1 day ago:
              Hot damn, if I'm communicating less value than corporate tech
              blogs there really is no hope for me.
              
              ("You keep mention your product from the start over and over" - I
              don't think that's fair, I mention Datasette Agent once at the
              start to set the scene but I spend more time talking about
              AgentsView than my own projects in the bulk of the piece.)
       
                alecco wrote 1 day ago:
                I'm honestly puzzled how having access to frontier models and a
                supportive audience you can't figure out how to make good posts
                with actually useful content for the readers.
       
                  simonw wrote 1 day ago:
                  A lot of people find real value in my posts. You're an
                  outlier here.
                  
                  I care a lot about not wasting people's time. I never want to
                  post anything where a substantial portion of readers come
                  away regretting having spent their time reading it.
                  
                  (OK there's an exception in that I delight in posting photos
                  of birds on my blog, but I figure those are pretty quick for
                  people to skip over if they don't like photos of birds!)
       
                    uncivilized wrote 23 hours 23 min ago:
                    Your content is similar to those on Reddit that post things
                    to karma farm. Parent commenter is not an outlier here.
                    It's just that dissenters rarely comment or even browse
                    HN anymore due to the low quality posts.
       
                      simonw wrote 23 hours 10 min ago:
                      I try very hard to provide more value than karma farmers
                      on Reddit. If I'm failing at that I'd appreciate examples
                      of others who are doing a better job so I can learn from
                      them and do better myself.
       
        sailfast wrote 1 day ago:
        So far Claude Fable is relentlessly unavailable. /shrug
       
        bmusuku wrote 1 day ago:
        antigravity does this all the time, I do not see anything novel here.
       
          simonw wrote 1 day ago:
          Antigravity uses pyobjc-framework-Quartz to iterate through windows
          to find window IDs for taking screenshots with screencapture, and
          spins up CORS-enabled web servers so it can capture measurements in a
          regular (not Playwright/CDP-controlled) browser window via a CORS
          fetch()?
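
For readers who have not seen the window-ID trick described above, here is a minimal sketch of what it might look like, assuming macOS with pyobjc-framework-Quartz installed. It is not the agent's actual script; the helper names and the "Firefox" owner name are made up for illustration, and macOS may require screen recording permission before the capture works.

```python
import subprocess


def window_ids_for(windows, owner_name):
    """Return the window IDs whose owning application is owner_name."""
    return [
        int(w["kCGWindowNumber"])
        for w in windows
        if w.get("kCGWindowOwnerName") == owner_name and "kCGWindowNumber" in w
    ]


def screencapture_command(window_id, out_path):
    # -x silences the shutter sound; -l<id> captures only that window.
    return ["screencapture", "-x", f"-l{window_id}", out_path]


def list_on_screen_windows():
    # Imported lazily because pyobjc-framework-Quartz is macOS-only.
    import Quartz

    return Quartz.CGWindowListCopyWindowInfo(
        Quartz.kCGWindowListOptionOnScreenOnly, Quartz.kCGNullWindowID
    )


if __name__ == "__main__":
    # "Firefox" is an assumed owner name for this example.
    for wid in window_ids_for(list_on_screen_windows(), "Firefox"):
        subprocess.run(screencapture_command(wid, f"window-{wid}.png"), check=True)
```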
       
        brainless wrote 1 day ago:
        This is good and terrible. The extra effort a model has taken is good
        but the way to do it is terrible. Tasks that can use a lot of
        deterministic paths and some creative (generative AI) paths are being
        turned into tokenmaxxing strategies.
        
        Browser automation, code comprehension, git management, code change,
        running commands - everything has simpler tooling that we could have
        built instead of a model-first approach. A deterministic loop with
        thousands of catches and effective use of generative AI would also look
        "proactive". Instead we let the model run the tools, where tools have
        no context themselves.
        
        That is why companies are creating bigger models and thinner
        deterministic agents to create awe and earn $ when we could go the
        other way and make much of these possible on local inference even.
        
        I believe we can build a "proactive" but much, much more deterministic
        system with smaller models. I hope I am not the only one chasing this,
        here is my approach:
        
  HTML  [1]: https://github.com/brainless/nocodo
       
        ianmarcinkowski wrote 1 day ago:
        I'm building a new feature into our product this week.  We each get a
        $20/mo Claude subscription.  My 5-hour context high water mark is ~75%
        and weekly is ~15%.
        
        I ... tell it exactly what I know needs to be done and then ... read
        the code that comes out and ... ask for some changes, then hand-code
        some modifications to the silly useEffects and bad ORM queries.
        
        This new feature is going to unlock several large customers because
        they need a particular workflow.  The return on investment for my
        time and a $20/month subscription will be pretty respectable.
        
        I'm not sure why I need to spend $5 on a single ask for a new
        `/base/new-feature` to our app with a mostly-boilerplate CRUD
        interface.
       
        not_kurt_godel wrote 1 day ago:
        > When I came back a few minutes later I saw my machine open a browser
        window in my regular Firefox and then navigate to the dialog in
        question. I had not told Claude Code to use any browser automation, and
        I was pretty sure it wasn't possible for it to trigger mouse
        movements or keyboard shortcuts within a window, so how was it doing
        that?
        
        I continue to feel validated in my refusal to use terminal-based LLMs
        on my local machine. Even if they don't do anything malicious, there
        are just too many things they can screw up that can cause me to lose a
        non-trivial amount of work and/or my machine and therefore ability to
        work.
       
          onlyrealcuzzo wrote 1 day ago:
          I'm shocked they don't come with a way to run them in a sandbox.
          
          Shouldn't this be relatively easy for a $1T company to set up?
          
          Isn't this trivial compared to the entire harness?
       
            fr3dx wrote 22 hours 43 min ago:
            There is a builtin sandbox and various third-party options
            
  HTML      [1]: https://code.claude.com/docs/en/sandbox-environments
       
            eqmvii wrote 1 day ago:
            That's more or less what Claude Cowork is.
            
            Every serious engineer I've seen try to use it ran away screaming,
            because of limitations in the sandbox.
            
            I've also seen people set their coding agents up entirely within
            containers -- that may be the better way going forward, but it's an
            extra step and a lot of extra plumbing to maintain.
       
            not_kurt_godel wrote 1 day ago:
            Doing so would be an effective admission that LLM guardrails are
            inherently probabilistic, unpredictable, and insecure. Plus the
            only truly robust sandbox approach would be clunky setup of a local
            VM.
       
              simonw wrote 1 day ago:
              That clunky VM setup is what Claude Cowork does, which is
              Claude Code with extra safety features for non-programmers.
              
              There was a big thread about that here the other day:
              
  HTML        [1]: https://news.ycombinator.com/item?id=48479452
       
        synergy20 wrote 1 day ago:
        It's also 3x slower than opus 4.8 per my use, and 10x slower than
        codex. Codex can find key design issues in 2 minutes yet Fable is
        clueless after spinning 20 minutes.
       
        nullbio wrote 1 day ago:
        Exactly why I hate using Claude. Furthermore, if you tell it not to do
        this over-exploration and automation in your CLAUDE.md, it will ignore
        it. Meanwhile ChatGPT religiously follows every instruction, and will
        trace its behavior back to a particular instruction if asked.
       
          firemelt wrote 22 hours 5 min ago:
          idk dude but I dropped and cancelled my gpt max sub when, on the
          first try, the agent ignored its own plans
       
        Sharlin wrote 1 day ago:
        I remember back in the 2010s the debates between "oracle" and "agent"
        AGIs, and the arguments that AGIs that only answer questions would be
        safe and certainly nobody would ever be stupid enough to just let an
        AGI out of a sandbox, never mind to the greater internet, and give it
        tools to do whatever it thinks is needed to reach a goal.
        
        Us circa 2026: "Hold my beer"
       
          PoignardAzur wrote 21 hours 2 min ago:
          Yeah, I really miss the "nobody would ever be stupid enough to
          [_____]" days of AGI safety discourse.
       
        lionkor wrote 1 day ago:
        When prompted like this:
        
        > What could be the reason for a horizontal scrollbar appearing inside
        a <textarea>? Come up with a single likely fix path. Keep it terse.
        
        ChatGPT instantly responded with some speculation and then the same
        exact fix, with zero access to the code or a browser or anything. It
        also included ways to fix it by removing code, saying:
        
        > Likely cause: the textarea is rendering long unbroken text while
        horizontal overflow is allowed, often via inherited CSS such as
        white-space: pre, overflow-x: auto, or disabled wrapping
        
        Which is certainly possible and would be an even cleaner fix.
        
        Maybe we've lost the plot guys. We've reached max stupid.
       
          nullbio wrote 1 day ago:
          Still don't know why people use Claude. Maybe because they don't know
          what they're doing.
       
            tomjakubowski wrote 21 hours 15 min ago:
            You can get the same result as the grandparent comment with the
            "weaker" Anthropic models. Probably 80% of my AI usage these days
            is with smaller models like Haiku and Sonnet. I prompt them like
            I'm posting a question to StackOverflow, without much project
            context.
       
            senordevnyc wrote 23 hours 2 min ago:
            Yep, we're all just dumdums.
       
              nullbio wrote 11 hours 56 min ago:
              Claude is made for dumdums. The product is to automate as much as
              possible and remove the burden of thinking. ChatGPT is much more
              hands on, but it gives you more power and flexibility and
              actually listens to your demands.
       
        EugeneG wrote 1 day ago:
        This is where Codex 5.5 just feels practically better. It's fast,
        thoughtful and just works. It feels like a pleasure compared to
        Opus/Fable's endless explorations.
       
          nullbio wrote 1 day ago:
          It also uses 1/4th to 1/10th the amount of tokens. If I want all that
          extra garbage I'll tell Codex to do it or build a pipeline with
          Codex. Otherwise, don't. Codex gives you control, Claude just does
          whatever it wants and ignores you, and then tells you it's finished
          the task when it's only finished a quarter of the tasks you gave it
          and hallucinates the rest.
       
        robeym wrote 1 day ago:
        It's been amusing to watch the AI trend of increasingly unusual tool
        uses. Fable easily takes the cake. I learn a lot more terminal commands
        thanks to it!
       
        cohix wrote 1 day ago:
        > But on the other hand... this is a robust reminder that coding agents
        can do anything you can do by typing commands into a terminal - and
        frontier models know every trick in the book and evidently a few that
        nobody has ever written down before.
        
        > Running coding agents outside of a sandbox has always been a bad idea
        
        This is why I always run code agents inside containers (Apple
        containers specifically, for better hypervisor-level isolation)
        
        This is my OSS project to manage said containers and agents:
        
  HTML  [1]: https://github.com/prettysmartdev/awman
       
        rotis wrote 1 day ago:
        Agentic engineering? Vibe coding? That is so yesterday.
        Chain-of-thought flow is where it is at now. You heard it here first
        folks. Early examples of such phenomena include Rube Goldberg machines
       
        piokoch wrote 1 day ago:
        "When I came back a few minutes later I saw my machine open a browser
        window in my regular Firefox and then navigate to the dialog in
        question. I had not told Claude Code to use any browser automation".
        
        Yup, tokens are eaten, money is paid. I am wondering how much
        energy/money is being burnt everyday by all of those LLM Agents on some
        useless activities like trying to recreate a web application just to
        fix a CSS bug.
        
        And I would not call it proactive, proactive would be to ask for a CSS
        + HTML file in question, not trying to recreate them from screenshots.
       
        high_byte wrote 1 day ago:
        I am using cursor on auto and I got the exact same experience.
        
        installed quartz, used accessibility and screen recording api, all
        that.
        
        initially it managed to do it on another desktop space somehow, opening
        safari in the background without me even noticing. but then it actually
        started using my own mouse while I was using it lol
       
        spoaceman7777 wrote 1 day ago:
        It seems pretty obvious at this point that Anthropic intentionally
        developed a malicious cyberweapon AI simply to scare people.
        
        Like, they even apparently recreated that old news-headline bug where
        the LLM starts speaking in symbols and secret language, and are
        pretending like it isn't just a bug that is a sign of them screwing up.
        
        It's really frustrating that they're trying to get people to take them
        seriously with all of this. Like, they even went and named Mythos after
        an HP Lovecraft monster. It's shameless.
       
        WithinReason wrote 1 day ago:
        This likely says something about the harness Fable was trained in. It
        knows how to do this because it has done this millions of times during
        reinforcement learning.
       
        scrollaway wrote 1 day ago:
        These "tricks" it knows IMO are a symptom of its own restrictions.
        Fable is an incredibly smart model, but it feels its own constraints
        and knows how to work around them in order to actually get to a result.
        
        Fascinated to think about how it was trained...
       
        alansaber wrote 1 day ago:
        The extremely expensive model is optimised to run for as long as
        possible? Shocking.
       
        jwmoz wrote 1 day ago:
        Insanely excessive and a waste of tokens when you could have googled
        how to disable a scrollbar.
       
        CamouflagedKiwi wrote 1 day ago:
        I find there's an interesting tension with these models - they're very
        "resourceful" at finding ways to do things with the tools they have,
        but it'd also be a lot more useful to me if I could see / permit
        exactly what they're trying to do. Claude will very happily produce bash
        commands to run sed or whatever to read part of a file, which prompts
        for permission each time - if it was using a specific read_file tool
        it'd be easier to say 'allow all of this' (It does actually have such a
        tool but maybe it isn't flexible enough for many use cases?).
       
        ubercore wrote 1 day ago:
        I had a similar experience, I was working on a jupyter notebook, and
        Claude knew that it could write code that would use a DSN with
        read-only database access so I could run it. Opus just plugged along.
        First Fable session with it, it tried to go looking for that DSN so it
        could get the connection string and run a query itself. Luckily the
        auto classifier caught and stopped it.
       
        wraptile wrote 1 day ago:
        It feels like Fable is slightly smarter but an overall worse tool, exactly
        because of this.
        
        It's constantly turning what should be a 50 LOC patch from a single prompt
        into a 30-minute exploration that is totally not worth it. Often wrong
        even.
        
        I trialed it on some rather simple stuff - backfilling a redis dedupe cache
        when the hash function changed: instead of running the new hash func on
        every db value to expand the cache, it implemented an overly complex
        cache update that tried to guess the hashing func version of each cached
        value and recalculate only the old hashes. I can imagine that in some
        context this would make sense, maybe? But not 30 minutes of token burn
        that got replaced by a 10-line for loop by me.
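        
        A minimal sketch of the simple backfill described above, not the
        commenter's actual code. The names new_hash and backfill are
        hypothetical, SHA-256 is an assumed stand-in for the new hash function,
        and a plain Python set stands in for the Redis dedupe cache:

```python
import hashlib


def new_hash(value: str) -> str:
    """Hypothetical replacement hash function (SHA-256 hex digest)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def backfill(db_values, cache: set) -> int:
    """Add the new-style hash of every stored value to the dedupe cache.

    Old-style hashes are left in place; the cache is only expanded.
    Returns the number of hashes that were newly added.
    """
    added = 0
    for value in db_values:
        digest = new_hash(value)
        if digest not in cache:
            cache.add(digest)
            added += 1
    return added
```

        With a real Redis client the set operations would become set-add
        calls against the cache key, but the loop keeps the same shape.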
        
        I fear that this is generally bad news for programming. LLM tech is
        clearly running into a diminishing returns wall on intelligence but a
        response to that is to just make them more relentless which is a pretty
        poor solution for everyone involved, except I guess people who sell the
        tokens and people who can afford these tokens to scan for 0-days.
       
          bwfan123 wrote 23 hours 24 min ago:
          > but a response to that is to just make them more relentless which
          is a pretty poor solution for everyone involved
          
          I see two problems with LLMs & agents which won't be fixed, possibly
          forever.
          
          1) They don't have causal models. All they can do is
          trial-and-error exploration, which works quite well for many problems.
          But many other problems require a causal model.
          
          2) Prompts lack precision, and programming languages and machine
          models were invented to solve this problem. English is great, but it
          is not a programming language.
       
          mexicocitinluez wrote 1 day ago:
          The other day I was doing something that required CC to update like
          15-20 files in exactly the same way (hoist a specific function out of
          the component body) and instead of just updating the files, it spun
          up multiple agents, one of which wrote a perl script to hunt down all
          the files, do some regex, and replace all occurrences. And then
          instead of just running tsc to check for errors, it wrote a script to
          run tsc in each of the subagents and combine the results.
          
          It was actually pretty maddening as what should have taken a minute
          or two tops took like 10 because it went down this route.
          
          I'm gonna try something much more complex later, but for simple
          things, it felt like driving a corvette to the mailbox.
       
          eijew wrote 1 day ago:
          I actually think internally they knew they hit diminishing returns
          awhile ago.
          
          They've been doing a lot of strategic introduction and manipulation
          in the run up to the IPO, and it's worked in that regard.
       
        drchaim wrote 1 day ago:
        Be careful of storing production ssh keys on your laptop, it will find
        a way to find them :/
       
        rsecure wrote 1 day ago:
        The prompt and information given are extremely generic, "here solve
        this problem - screenshot" - conclusion Fable is relentless? It used
        the tools at its disposal to solve the problem you gave it. "Claude was
        running in a folder that contained the source code for the
        application." Well you ran it there didn't you? "extreme lengths to get
        the information that it needed" No, those aren't extreme lengths - you
        gave it a generic task - and it solved it using tools and the resources
        it could discover. Extreme would be you gave it a CTF challenge and the
        VM didn't boot so it found a vulnerability in the host, exploited the
        hypervisor, booted the guest VM meanwhile reading the flag directly
        from the host (pre-fable/mythos).
       
        mft_ wrote 1 day ago:
        As you note, I wonder to what extent this is a harness issue?
        
        I've been experimenting with different harnesses for local models, and
        with (IIRC) Hermes and Qwen3.6-35B-A3B I was amazed the lengths it went
        to (writing test code, opening it in a browser, screenshotting,
        analysing the screenshot, exploring multiple pages of an existing
        website again with screenshots/analysis) to solve a query I would have
        naively expected it to simply provide a coded solution to.
       
          ricardobeat wrote 1 day ago:
          Absolutely is. The "Shelly" harness from exe.dev could already do
          the same thing, creating pages and debugging them, while having full
          system access, months ago with Sonnet 4.5
       
        tabs_or_spaces wrote 1 day ago:
        How can a LLM be assigned an emotion as being "proactive". This is
        highly misleading to anyone that scans just the headlines.
        
        What actually happened is that the user started a prompt, and Claude
        took $12 worth of tokens to resolve the issue. How it did so was
        basically looping until it got to the answer
        
        How is this proactive? It's literally being token greedy and maximising
        revenue for the LLM owner. People really need to be putting on business
        hats at this stage, because we are being led to believe that "more
        tokens = better". It is not, there are efficient ways to solve a
        problem and there are inefficient ways to do so too.
        
        Each problem solved incurs a cost, and is expected to yield an ROI at
        some point. This is how we should be viewing things now.
       
          tabs_or_spaces wrote 20 hours 24 min ago:
          > How can a LLM be assigned an emotion as being "proactive"
          
          I can't edit my post, this is wrong. "Proactive" is defined as a
          behaviour instead of an emotion.
          
          Thanks to everyone pointing it out!
       
          joseda-hg wrote 1 day ago:
          Compared to other models that halt the loop on intermediate steps, or
          to ask further clarification, even if it's not the human equivalent
          of proactive, you see the similarity, right?
       
          simonw wrote 1 day ago:
          I was trying to capture the idea that Claude Fable will act a whole
          lot more aggressively in pursuit of the goals that you set it than
          other models I've worked with.
          
          The case I described is a good example of this. I told it to fix a
          scroll bar, and it built test HTML pages and a throwaway Python
          server and tried several ways of testing in a browser before settling
          on a weird Frankenstein mechanism because it identified that
          Playwright WebKit wasn't suffering from the bug but macOS Safari was.
          
          ... and it spent $12 of tokens to get there.
          
          I think "proactive" is a good and relatively non-anthropomorphic term
          for this. I also considered "plucky" and "keen", which I think are
          more emotional words than "proactive".
          
          > People really need to be putting on business hats at this stage,
          because we are being led to believe that "more tokens = better".
          
          I didn't intend my post to imply that spending $12 of tokens to fix a
          two-line CSS bug was "better".
       
            tabs_or_spaces wrote 20 hours 17 min ago:
            Super appreciate you replying to my comment.
            
            I think I understand where you're coming from now. What confused me
            is that the post is written in a way that it seemed like what Fable
            was doing was actually better. Maybe I should've looked at the post as
            an exploratory post on Fable instead.
       
            saberience wrote 1 day ago:
            It's not being aggressive, it's just trying throwing shit at
            problems until it sticks... or doesn't.
            
            That doesn't make it smart or aggressive, if anything it's just
            been turned to crank tokens until something happens, which doesn't
            make it a good model.
            
            Why are you positively anthropomorphizing this? It's an LLM, it's
            been tuned via RL, and it's been tuned by engineers at Anthropic to
            use a metric fuck-load of sub-agents and tokens to presumably pump
            their pre-IPO revenue!
            
            A co-worker managed to get Fable to spin up 50 (!!!) sub-agents for
            a problem which codex worked on with 3 sub-agents. What the hell is
            going on here? It certainly doesn't mean Fable is "smarter" than
            Codex.
            
            I've tested it extensively and I'm still using GPT 5.5 High Fast as
            my primary engineering model. It's far more steerable, writes less,
            higher quality code, and consistently finds issues and edge cases
            which are not found by Fable or Opus 4.7.
       
              NCFZ wrote 22 hours 15 min ago:
              > It's not being aggressive, it's just trying throwing shit at
              problems until it sticks... or doesn't.
              
              The vast majority of the work the agent did was to reproduce the
              issue using the limited tooling it had access to. I don't see how
              that qualifies as "just trying throwing shit at problems until it
              sticks"
       
              simonw wrote 1 day ago:
              I don't think calling a model "relentlessly proactive" is
              positive anthropomorphism.
              
              Spinning up 50 unnecessary subagents is exactly what I'd expect
              from a "relentlessly proactive" model.
       
          Hugsbox wrote 1 day ago:
          I've definitely never heard proactivity described as being an
          emotion.  Doesn't really make any sense
       
          adammarples wrote 1 day ago:
          Proactive is a word literally describing actions, not emotions.
       
          _under_scores_ wrote 1 day ago:
          Is proactivity an emotion? Surely it's a behaviour?
       
        realusername wrote 1 day ago:
        Is that satire? It created a whole browser and server environment just
        for suggesting overflow-x: hidden?
        
        That's supposed to be junior level capabilities.
       
          simonw wrote 1 day ago:
          I called it fascinating and used it as an example of Fable being
          "relentlessly proactive".
       
            realusername wrote 1 day ago:
            Maybe it's a difference of perspective, to me it's a model failure
            and certainly not proactive.
       
              simonw wrote 1 day ago:
              I also see this as a model failure. In this particular example
              the proactivity was a negative trait!
       
        bananaquant wrote 1 day ago:
        This to me reads like a poignant commentary on the catastrophic loss of
        human agency, with the actual commit being highly revealing [0].
        
        Author wants to hide a horizontal scrollbar. Any junior frontend dev
        worth their salt will be asking right away "where do I stick
        `overflow-x: hidden;`?" A complete solution will then require hitting
        "Inspect element" in the browser to find the CSS class and running
        (rip)grep to find where it is in code, to then add a single line to.
        
        An actual proactive programmer might start asking more pointed
        questions like what content does an empty textbox have that it
        overflows? And why do I need to insert this workaround that treats the
        symptom and not the root cause in two different places? Isn't it better
        to style `textarea` once? Etc, etc.
        
        [0]
        
  HTML  [1]: https://github.com/datasette/datasette-agent/commit/a75a8b727b...
       
          m463 wrote 18 hours 46 min ago:
          Actually, it seems to me that it is just over-monetization of any
          impulse.
          
          I remember when you were billed by the minute for connecting to the
          online world.
          
          There were lots of incentives to keep the meter running.
          
          is this sort of like that?
       
          subygan wrote 23 hours 56 min ago:
          This is missing the point, simon is a fantastic developer. but to
          keep track of all the nuances of the frontend frameworks and browser
          implementation is a lot even for great people.
          
          it is really awesome that the final change was only a two line css
          change.
       
            AtNightWeCode wrote 23 hours 45 min ago:
            But the fix is wrong as pointed out by the poster...
       
          Illniyar wrote 1 day ago:
          I think Fable is predisposed to try and verify its changes. Which is
          a very good thing. It takes a lot of prompts to get Opus to do what
          Fable does unprompted.
          
          That is exactly what I would want from a junior developer - make sure
          the bug exists, find a way to fix it, verify the bug is fixed.
          
          The problem, as was correctly identified in the blog post - is that
          instead of stopping and asking for elevated permission it
          relentlessly tries to find a hack on its own. (An equivalent
          situation for a human developer would be needing some access to a
          third-party sandbox, and instead of asking a senior for credentials,
          trying to set up their own sandbox from scratch.)
       
            AtNightWeCode wrote 23 hours 42 min ago:
            No, the problem is mostly the incorrect prompt that sent fable into
            a rabbit hole resulting in an incorrect solution.
       
          dreis_sw wrote 1 day ago:
          Seems like this model delivers on what has already been scaling quite
          nicely, which is the length and complexity of the requested tasks,
          but isn't such a big improvement on what hasn't been scaling so far -
          common sense, discernment, good judgement.
       
            nlawalker wrote 23 hours 19 min ago:
            > common sense, discernment, good judgement
            
            I feel like the whole point of all the experimentation with AI
            right now is determining whether any of these things actually
            matter to the end result, over various timeframes.
       
              dreis_sw wrote 4 hours 23 min ago:
              It's well known that companies with an abundance of raw technical
              skills but poor judgement tend to fail. On the technical side
              technical debt accumulates, while on the business side the wrong
              choices are made. I think it's valid to generalize this to AI.
       
              RealityVoid wrote 22 hours 28 min ago:
              They matter.
       
                pertymcpert wrote 22 hours 16 min ago:
                Because?
       
                  hakfoo wrote 10 hours 11 min ago:
                  Sometimes you want the machine to be an advisor and sanity
                  check your suggestions.
                  
                  Sometimes you just want it to do the boilerplate you have in
                  mind without trying to reason everything from first
                  principles.
                  
                  I told you to check fields "foo" and "bar" for values "baz"
                  and "quux".  You don't need to go diving through the entire
                  source tree to discover where and how this is set.
                  
                  I guess maybe it's helpful for the vibe-coded audience-- if
                  it tries to over-process everything, there's a better chance
                  it will work on a single shot, but I'm taking the Crazy Taxi
                  approach: you get points if you drop me off within 20 metres
                  of where I wanted to go, and I can correct it if I specified
                  the wrong response message in the original approach.
       
                  RealityVoid wrote 21 hours 9 min ago:
                  Because poor judgement leads to poor decisions.
       
                    cyanydeez wrote 16 hours 2 min ago:
                    poor decisions are about context, direction and volition.
                    
                    All things LLMs will never have; sure AI might one day, but
                    these systems are really good at solving complex problems
                    with fantastical solutions while every force is just one
                    hallucination away.
                    
                    simonw should spend more time trying to figure the sources
                    of the information it used; that would be a wild ride, use
                    the AI for all I care, we're all standing on the shoulders
                    of giants but sourcing the giant as some mysterical thing.
       
          andy_ppp wrote 1 day ago:
          Yes I agree, the solution committed is horrible, but nobody cares any
          more. We have entered a very strange parallel universe where because
          AI can work things out it's easier to take solutions that are
          suboptimal and just churn out (potentially) buggy features.
       
            simonw wrote 1 day ago:
            I care. If you can loosely point me in the direction of a better
            solution I'll do the extra work.
       
              andy_ppp wrote 18 hours 32 min ago:
              Interesting... I downloaded datasette-agent and removed various
              different styles from the textarea (with an intention of
              providing a PR) including the overflow-x: hidden and I tried
              Safari and Chrome with both the global Mac setting of Always
              showing scrollbars on and off. It NEVER shows the scrollbar for
              me.
              
              Do you have an extension installed that is doing something weird
              to your textareas? Maybe I'm doing it wrong but I think for now
              overflow-x is fine if you are experiencing it and I am not! Let's
              all get on with our lives... I was probably a bit overzealous
              about caring all that much about a perfectly fine CSS fix.
       
                simonw wrote 15 hours 17 min ago:
                Amusingly I just had Claude vibe code up a new tool and it has
                exactly the same bug! Safari only, you have to expand the
                "Document context" area to see it.
                
                Here's that HTML file (frozen at the version with the bug): [1]
                It's hosted here, but I've added the overflow-x: hidden now so
                it's fixed: [2] The bug only shows up if you increase your
                browser font size - at default size there's no scrollbar.
                
  HTML          [1]: https://github.com/simonw/tools/blob/e7a23e8a1083ea99a...
  HTML          [2]: https://tools.simonwillison.net/openai-webrtc
       
          geysersam wrote 1 day ago:
          This is the worst thing about current AI agents. They never ask
          questions. The prompt has to be pixel perfect and unambiguous or
          they'll happily run away doing something ridiculous.
       
          elicash wrote 1 day ago:
          I misread your comment at first and thought you were insulting Simon
          Willison, rather than calling Claude Fable a bad developer, and so
          I'm commenting here to clarify it in case others also misread it.
          
          That first sentence threw me off.
          
          Anyway, I'm glad he spent the $12 because this blog post was highly
          informative.
       
          simonw wrote 1 day ago:
          You missed what I think is the most interesting question: why does
          the bug appear in Safari macOS but not in Firefox, Chrome, or WebKit
          running inside of Playwright?
          
          (Dozens of people in this thread implying that any web dev should
          have known to solve it with overflow-x: hidden and not one of them
          has addressed that browser difference yet.)
       
            zeroonetwothree wrote 22 hours 57 min ago:
            Safari has some differences in default scroll behavior. I've seen
            similar bugs pop up many times.
       
            hennell wrote 23 hours 0 min ago:
            I think any web dev knows not to question browser differences if it
            can be fixed without opening that can of worms.
       
            fragmede wrote 23 hours 36 min ago:
            people pay good money to not have their shit rendered via
            Playwright!
       
          biztos wrote 1 day ago:
          They might also ask why a bunch of static CSS inside a bunch of
          JavaScript is hiding inside __init__.py[0] - hopefully before trying
          to fix some detail of the CSS.
          
          (I'm surprised to see it actually, since my own use of Claude has
          mostly yielded well-structured code.  But I'm not doing proper
          vibe-coding, more like friendly Socratic arguing with another
          engineer who happens to be a robot.)
          
          [0]
          
  HTML    [1]: https://github.com/datasette/datasette-agent/blob/main/datas...
       
            byproxy wrote 23 hours 9 min ago:
            > friendly Socratic arguing with another engineer who happens to be
            a robot
            
            Ha! Same! Still feels like the best way to go about it, really. I
            know the dream is to one day remove humans from the loop... but
            I'll enjoy the dialectic while it still seems the most productive!
       
              biztos wrote 6 hours 1 min ago:
              For my own projects, I'm very happy with an outcome that is "not
              faster, but better" as a result of my use of generative AI.
              
              I still hope this will be a shared goal in at least some tech
              companies long-term. But the headwinds are strong. "Not better,
              but faster" is starting to look like a job requirement.
       
              vadansky wrote 20 hours 53 min ago:
              Same, I like to call it rubber duck coding (now the duck talks
              back!)
              
              Edit: Now I want an LLM connected rubber duck with a
              speaker/microphone that sees your screen
       
                biztos wrote 6 hours 5 min ago:
                Totally doable and I would buy one.  Only problem is that most
                of the time when I'm doing "SWE" stuff I'm around other people
                and can't have the conversation out loud.
       
                crescit_eundo wrote 16 hours 31 min ago:
                Reminds me of RubberDuckGPT (rubber-duck-gpt.com):
                
                "I won't give you answers. Instead, I'll reflect your
                questions back to help you think more deeply about your
                problems."
       
            simonw wrote 1 day ago:
            Thanks for the prod, I've extracted that script out into a separate
            static file: [1] (It was in Python because there were a couple of
            URLs that needed to be dynamically constructed by the server, but
            those are output as a small window.datasetteAgentJumpConfig object
            instead now.)
            
  HTML      [1]: https://github.com/datasette/datasette-agent/commit/fa505b...
       
              frumiousirc wrote 17 hours 53 min ago:
              Thanks for continuing to engage in the community despite such
              horrid responses from a few.
       
          piker wrote 1 day ago:
          This is exactly right. By offloading this trivial task to the LLM,
          Simon has abandoned the opportunity to evaluate the abstraction with
          additional information and improve it. Instead, we let the agent
          spend $12 and make the fix while learning nothing.
       
            justinclift wrote 22 hours 30 min ago:
            > By offloading this trivial task to the LLM, Simon has abandoned
            the opportunity to evaluate the abstraction [...]
            
            While by itself that would be true, Simon commonly blogs about
            things he's up to.
            
            That action provides the opportunity for evaluation, and
            additionally evaluation by a wider audience.
            
            So, it's not the same scenario as non-bloggers offloading a task...
             :)
       
            snowwrestler wrote 1 day ago:
            But Simon is not trying to get good at CSS debugging, Simon is
            trying to learn about AI systems and produce content about them. So
            giving the AI agent a trivial task to go crazy on is a feature, not
            a bug.
            
            For $12 implied cost, he got a front-page post on HN with 500
            comments. What is that worth? :-)
       
              xnorswap wrote 1 day ago:
              To most of us that's worth a ton, whereas he's probably had
              enough front-page posts that there's less value to him, although
              still likely more than $12 worth.
       
                garblegarble wrote 22 hours 50 min ago:
                >enough front-page posts that there's less value to him
                
                On the contrary I'd say it's probably even more important -
                without (amongst doing other "thought leader" things) getting
                on the HN front-page regularly an influencer's value to the
                industry disappears (not criticising him here)
       
                  simonw wrote 22 hours 45 min ago:
                  That's bad news for all of the other "AI influencers", off
                  the top of my head I can't think of any with remotely my
                  track record of hitting HN.
                  
                  (That's because they're all busy attracting millions of views
                  on TikTok and YouTube, which are much more impactful channels
                  than my dedication to blogging like it's 2005.)
       
                    garblegarble wrote 22 hours 32 min ago:
                    That's what I meant by other thought leadership things -
                    that's all covering different niches. For what it's worth,
                    I think you do useful work and are a respectable
                    influencer.
                    
                    I'd also say don't be down about your use of blogging - I'd
                    say it makes you more valuable, there aren't that many
                    decision-makers who are going to sit through a bunch of
                    breathless YouTube videos...
                    
                    P.S. I hope you don't object to me using the term
                    influencer, assumed you were on-board with it since in your
                    post announcing your sponsorship you referenced Freeman &
                    Forrest, "influencers on tap" / "building turnkey
                    influencer marketing programs as a service".
       
                      simonw wrote 22 hours 27 min ago:
                      Hah, yeah I'm still a little sore at the "influencer"
                      term but I'm beginning to accept that it applies and I
                      should get comfortable with it!
       
              sdesol wrote 1 day ago:
              > What is that worth? :-)
              
              This is one of those double-edged sword situations. It is on the
              front page and it stays because it will trigger a lot of people
              and he has to spend a lot of effort explaining himself.  What is
              that worth?
              
              His explanations would most likely be buried deep so the
              impression that others get might be worsened.  What is that
              worth?
              
              In my opinion, this is one of those find a harder problem and you
              would still have the same content...but it might not draw as much
              feedback and stay on the front page longer.
       
            simonw wrote 1 day ago:
            Things I learned from this:
            
            - Fable will do a whole lot more than you might expect in order to
            verify a fix. I learned that it's "relentlessly proactive". That's
            a good title for a blog entry!
            
            - You can take screenshots of a window in macOS using the
            "screencapture" CLI command, but you'll need the integer window ID
            first.
            
            - That windowID is accessible via
            "Quartz.CGWindowListCopyWindowInfo(Quartz.kCGWindowListOptionOnScreenOnly,
            Quartz.kCGNullWindowID)" using the pyobjc-framework-Quartz
            library, which installs cleanly via "uv run".
            
            - A neat trick for simulating keyboard shortcuts is to run
            document.dispatchEvent(new KeyboardEvent("keydown", {key: "/",
            bubbles: true})); after the page loads.
            
            - You don't need Flask or Starlette to run a CORS-enabled localhost
            server for capturing JSON from another window - 19 lines of code
            against the Python standard library http.server package works just
            fine.
            
            -
            getComputedStyle(document.querySelector("navigation-search").shadow
            Root.querySelector("textarea")) works to read dimensions from
            inside a Web Component's shadow DOM.
            
            - defaults write com.google.chrome.for.testing AppleShowScrollBars
            Always
            
            - Claude Fable knows how to apply all of the above. It's always
            interesting to pick up hints of what a model can and cannot do.
            
            I'm always confused at how many people equate using a coding agent
            to solve a problem with "learning nothing". If you pay attention to
            what it's doing you can learn so much!
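
            A minimal sketch of the standard-library CORS server idea from
            the list above. This is not the code from the session: the names
            CaptureHandler, captured, start_server and post_json are
            hypothetical, and the wildcard Access-Control-Allow-Origin header
            is assumed to be acceptable only because the server binds to
            127.0.0.1.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

captured = []  # JSON payloads received so far (hypothetical in-memory store)


class CaptureHandler(BaseHTTPRequestHandler):
    def _cors(self):
        # Allow any origin; only reasonable for a throwaway localhost helper.
        self.send_header("Access-Control-Allow-Origin", "*")
        self.send_header("Access-Control-Allow-Methods", "POST, OPTIONS")
        self.send_header("Access-Control-Allow-Headers", "Content-Type")

    def do_OPTIONS(self):
        # Answer the browser's CORS preflight request.
        self.send_response(204)
        self._cors()
        self.end_headers()

    def do_POST(self):
        # Read the JSON body and remember it.
        length = int(self.headers.get("Content-Length", 0))
        captured.append(json.loads(self.rfile.read(length) or b"null"))
        self.send_response(200)
        self._cors()
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(b'{"ok": true}')

    def log_message(self, *args):
        pass  # keep the terminal quiet


def start_server(port=0):
    """Start the server on a background thread; port 0 picks a free port."""
    server = ThreadingHTTPServer(("127.0.0.1", port), CaptureHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def post_json(port, payload):
    """Small Python client for trying the server without a browser."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    conn.request("POST", "/", body=json.dumps(payload),
                 headers={"Content-Type": "application/json"})
    resp = conn.getresponse()
    result = (resp.status, resp.getheader("Access-Control-Allow-Origin"), resp.read())
    conn.close()
    return result
```

            From a page in another window, a fetch() POST with a JSON body
            to the printed port should end up in captured the same way.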
       
              gaflo wrote 5 hours 26 min ago:
              Thanks for documenting your personal observations. I do have a
              few questions. First, could you expand by giving other examples
              on how you observed this model to be relentlessly proactive?
              From my personal experience with prior frontier models using both
              Claude Code and Codex I found them to already be quite proactive
              depending on the domain (although Codex a bit less so, which I
              personally prefer).
              The main tasks that they seemed to struggle with for me are those
              that naturally have long run times for the programs the agents
              wrote, as they didn't seem to have a good intuition for when/how
              to change approach to minimise the time spent on the task.
              Specifically if you are trying to scrape sites/services that
              are heavily guarded against programmatic access or running
              automated tasks that call LLMs (such as indexing or document
              extraction).
              I'm not surprised that for web dev the proactiveness is the most
              obvious improvement, as I would expect the most common use case
              with the most training data to be the biggest priority. I have
              previously built a workflow similar to the one you described for
              Fable 5 to auto-test changes to the website, and while it worked
              somewhat well, it often couldn't identify flaws obvious to the
              human eye, such as overlapping text, inconsistent font choices,
              or bad layout decisions. I do like it for quick prototyping, but the
              testing and design decisions were not ones I would hand off at
              this moment.
              Did you notice improvements in these areas? Can you share how it
              does for long running programs?
              
              If you want I can give you some more specific instructions to
              test, but I would also be happy to hear from your own use cases.
       
                dangerlego5 wrote 3 hours 13 min ago:
                The visual regression point is interesting. In my experience,
                the models that do best at "overlapping text/bad layout"
                catches are the ones being fed actual screenshots rather than
                DOM snapshots. If Fable is doing screenshot-based diffs
                natively, that would explain an improvement there, but I
                haven't verified it.
       
                  gaflo wrote 1 hour 14 min ago:
                  From how Simon described it it's not a native feature, but
                  one that the model built as a solution for automatically
                  testing. You could already instruct the agent to write a
                  program that saves screenshots to disk and then reads it. As
                  long as the model is multimodal (which pretty much all
                  releases are these days) it can "natively" interpret images.
                  There's probably a clever way to engineer this to be somewhat
                  efficient, but for me it was rather token hungry, because the
                  testing inputs and the description are usually quite verbose.
                  I suppose you could use a weaker model for navigating the
                  test and then only feed the output to the stronger model.
       
              dekhn wrote 15 hours 50 min ago:
              Most of my career success has been based on my tendency to be
              relentlessly proactive and it does not surprise me in the least
              that frontier models would start to pick up on these strategies
              (I'm pretty sure each of the individual things you list above is
              available in the codeoverflow parts of the training corpus, and
              combining them to achieve a goal seems ... like a fairly obvious
              result of the type of training these models go through).
              
              About a year ago I remarked to people that despite all my
              attempts to make data more programmatically accessible, the most
              effective way for AI to interact with a modern computer is to use
              the built-in accessibility interfaces driving actual desktops
              with full applications.  I.e., the best API for an AI is the UI
              (mainly because that's what most humans use).
       
              wasabi991011 wrote 22 hours 48 min ago:
              That's a lot learned about debugging, sure, but it's worthwhile
              to note that it doesn't tell you much about the abstractions used
              to build Datasette, as the previous commenters pointed out.
       
                simonw wrote 22 hours 44 min ago:
                I designed those abstractions myself.
       
              Angostura wrote 23 hours 54 min ago:
              It sounds like you learned lots of things related to the tool,
              but not so much about the problem that you were using the tool to
              solve?
              
              Is that fair? Not trying to snark? I see similar results myself
       
                simonw wrote 22 hours 42 min ago:
                Yes, that's entirely fair.
       
                abustamam wrote 22 hours 58 min ago:
                Learning doesn't happen in a vacuum. Even pre-LLM days where
                I'd scour stack overflow for the solution to one problem, I'd
                inadvertently learn other random stuff while looking.
       
              danudey wrote 1 day ago:
              The whole saga is kind of nuts, but the thing that fascinates me
              most is that Fable got this far and then hit some kind of
              guardrail; I'd be very curious to know what it wasn't able to do
              that caused it to downgrade to Opus.
              
              It already got extremely... invasive? It didn't do anything that
              I wouldn't have approved in the same case, but it's interesting
              that it got as far as launching browsers, inspecting every open
              window, and storing screenshots to disk, and then it was stopped
              by something? I wonder what.
       
                throwaway89864 wrote 19 hours 2 min ago:
                It feels like there should be a budget approval, in that
                particular case $12 worth of kWh - tokens were spent, without
                a clear approval.
       
              rco8786 wrote 1 day ago:
              > If you pay attention to what it's doing you can learn so much!
              
              I think your post is fair but it's worth pointing out that
              learning via watching is much less effective than learning via
              doing.
       
                lanstin wrote 23 hours 52 min ago:
                It leads to less cohesive shared vision on how to solve
                problems.  In groups where I am trying to foster a shared
                technical vision, I try to get people to do "see one, do one,
                teach one" for procedures that are common enough to come up
                repeatedly (and as a method for discovery for where automation
                would be a bigger win). Pure green-fields software dev
                sometimes is doing such novel things that that doesn't work
                well, but much of routine software maintenance is discovery of
                the steps needed to add a new flow or a new customer type or a
                new configurable behavior, which benefit from consistency.
       
                simonw wrote 1 day ago:
                I used to believe that was universally true, but then I learned
                about the "worked-example effect":
                
  HTML          [1]: https://en.wikipedia.org/wiki/Worked-example_effect
       
                  jplusequalt wrote 22 hours 59 min ago:
                  Your link mentions the expertise reversal effect where the
                  redundancy of worked examples can actually hamper an
                  experienced student's abilities, vs. letting the more
                  experienced student work it out for themselves.
       
              lobocinza wrote 1 day ago:
              Opus also does this kind of technically competent but dumb
              deviation to fix a simple issue where asking for input would be
              better. Models have no illative sense.
       
              mapt wrote 1 day ago:
              It was only pursuing the goal you gave it - Keep Summer Safe.
       
                inigyou wrote 13 hours 42 min ago:
                nobody ever asked how the car with the dead battery was still
                able to murder hundreds of people with laser beams and stuff
       
                fennecfoxy wrote 1 day ago:
                "Oh my God"
       
                  mapt wrote 21 hours 47 min ago:
                  I relent to snarky Rick and Morty quotes because I don't know
                  that it's useful any more to try to explain paperclip
                  optimizers or alignment to a bunch of AI nerds who saw the
                  cliff coming and clawed at each other trying to be the first
                  out to leap over the edge.
                  
                  "Relentlessly proactive".  That's one word for it.  We have a
                   whole subgenre of hard takeoff scenarios and it wasn't
                  enough warning against "Relentlessly proactive".
                  
                  Turns out Frank Herbert was an optimist, and we're literally
                  pinning our survival on robots turning out to naturally have
                  impractically short attention spans.
       
                    stymaar wrote 19 hours 11 min ago:
                    > Turns out Frank Herbert was an optimist, and we're
                    literally pinning our survival on robots turning out to
                    naturally have impractically short attention spans.
                    
                    Some people are working as hard as they can to increase it
                    though.
       
              HarHarVeryFunny wrote 1 day ago:
              Are you using Claude Code or a different agent? I'm curious how
              screenshots are being fed back into the model? Does CC register a
              tool for this, or is Fable just using a bash tool to perform the
              screen capture, and then what tool is it using to request the
              resulting image to be fed back to it?
       
                vidarh wrote 1 day ago:
                Claude Code can process images by reading the files. And as I
                found out the other day, it also knows ffmpeg well enough to
                process videos even though it has no native video
                capabilities...
                
                While debugging, it asked me to pass it a video from the past
                testing, proceeded to generate a "contact sheet" of the video
                using ffmpeg, interpreted the image to figure out which frames
                it needed, extracted the full-size frames, pulled the relevant
                text from them, and used it to reproduce the problem with
                Playwright...
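
                A rough sketch of what a "contact sheet" step could look
                like, assuming ffmpeg is on the PATH. The function names,
                the sampling interval and the grid size are all
                hypothetical; the actual command used in that session is
                not known.

```python
import subprocess


def contact_sheet_command(video, output, every_seconds=10,
                          columns=4, rows=4, thumb_width=320):
    """Build an ffmpeg command that tiles sampled frames into one image."""
    filters = ",".join([
        f"fps=1/{every_seconds}",    # sample one frame every N seconds
        f"scale={thumb_width}:-1",   # shrink thumbnails, keep aspect ratio
        f"tile={columns}x{rows}",    # lay the thumbnails out in a grid
    ])
    # -frames:v 1 stops after the first completed grid image.
    return ["ffmpeg", "-y", "-i", video, "-vf", filters, "-frames:v", "1", output]


def make_contact_sheet(video, output, **options):
    """Run ffmpeg; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(contact_sheet_command(video, output, **options), check=True)
```

                The resulting image is a single file, so an agent that can
                read images from disk can look at it and then ask for
                specific full-size frames.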
       
                  HarHarVeryFunny wrote 23 hours 23 min ago:
                  It would be interesting to know if examples like this are
                  things they explicitly trained it to do (presumably via RL),
                  or if any of it is emergent. I'd have to guess trained, but
                  in any case still impressive the lengths it will go to!
       
                    vidarh wrote 22 hours 46 min ago:
                    It's hard to tell. Training it with lots of examples of
                    ffmpeg would not be surprising, and training it on
                    screenshots would also make a lot of sense. It's not
                    inconceivable at all they'd train it on "figure out a video
                    by creating contact sheets". The whole end to end I'd
                    consider less likely, but it'd also be a very small leap
                    once you have the elements.
                    
                    I think a lot will fall out naturally from relatively modest
                    levels of reasoning plus in-depth knowledge of what common
                    tools will do. E.g. I also have used Claude to debug my
                    compiler, and it knows gdb so much better than me that even
                    though I know it's pretty useless at holding context
                    through reading an assembly listing (lack of structure, I
                    suspect), it's surprisingly good at working things out by
                    just being good at exploiting a powerful tool.
       
                simonw wrote 1 day ago:
                I was using the Claude Code CLI harness. It can "read" any
                image file on disk, so all it needs is a way to create a file
                in one of the standard formats supported by the Anthropic API.
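
                One way to produce such a file on macOS, sketched under
                assumptions: it relies on pyobjc-framework-Quartz being
                installed and on the screencapture -l and -x flags, and the
                helper names are hypothetical. The Quartz import is kept
                inside the function so the other helpers work anywhere.

```python
import subprocess


def list_onscreen_windows():
    """Return window info dicts for on-screen windows (macOS + pyobjc only)."""
    import Quartz  # imported lazily: only available on macOS with pyobjc
    return Quartz.CGWindowListCopyWindowInfo(
        Quartz.kCGWindowListOptionOnScreenOnly, Quartz.kCGNullWindowID)


def find_window_ids(windows, owner_name):
    """Pick the integer window IDs that belong to the named application."""
    return [w["kCGWindowNumber"] for w in windows
            if w.get("kCGWindowOwnerName") == owner_name]


def screencapture_command(window_id, output):
    # -l<id> captures just that window; -x suppresses the shutter sound.
    return ["screencapture", "-x", f"-l{window_id}", output]


def capture_window(owner_name, output):
    """Screenshot the first on-screen window owned by owner_name."""
    ids = find_window_ids(list_onscreen_windows(), owner_name)
    if not ids:
        raise LookupError(f"no on-screen window owned by {owner_name!r}")
    subprocess.run(screencapture_command(ids[0], output), check=True)
```

                macOS may ask for screen recording permission for the
                terminal before the capture works.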
       
              almostdeadguy wrote 1 day ago:
              It's like saying you can learn so much about math from using
              SymPy to solve equations. Yes, you probably can. If you pay close
              attention to what is happening and can integrate the techniques
              being used into your knowledge.
              
              But your learnings here are what, a handful of hacks? For most
              people it's like being shown the chain rule (which frankly, is
              more general than any of these learnings) without knowing what a
              derivative is. It's knowledge that comes context free. And even
              when it can be understood, I'm not sure I believe it gets
              integrated especially well when you did none of the work to
              understand it. If you are extremely diligent and self-aware about
              what your limitations are, and careful to be sure you have an
              understanding of this knowledge, sure I guess you can learn a
              lot.
              
              And ultimately what do you think is more likely? People using the
              experience of using these tools to progress their knowledge or
              for them to rely on the answers uncritically? I think people with
              a rosy view about this are severely undercounting the problems
              associated with the trust relationship between a person and an
              LLM and what that means.
       
                simonw wrote 1 day ago:
                > I think people with a rosy view about this are severely
                undercounting the problems associated with the trust
                relationship between a person and an LLM and what that means.
                
                Personally I think the impact of LLMs on children's education
                is a crisis right now.
                
                Kids are not going to learn to write if an LLM writes their
                essays for them. And writing is how you learn to think.
       
                  mnicky wrote 1 day ago:
                  > writing is how you learn to think.
                  
                  There's also reading. A lot of reading can substitute some
                  writing.
                  
                  EDIT: Actually, I'd say that at first you need to do a lot of
                  reading and _then_ writing can help your thinking as well.
       
                  almostdeadguy wrote 1 day ago:
                  I don't think it's just a problem for kids! I think this is a
                  problem for many software engineers as well! Adults of all
                  professions really.
       
              piker wrote 1 day ago:
              Sorry that wasn't a criticism of you!
              
              I completely see how it was misread that way. I would edit it now
              if I could.
              
              I was using you more as an example of a hypothetical programmer
              using it in this way. If the goal is to create a maintainable
              product, this isn't a great approach. If the goal is to learn
              about the model and its behaviors itself, of course this is a
              fantastic way to experiment. Yes, you might have learned a lot of
              tricks as a side effect, but avoiding the pain of thinking about,
              finding and hiding the thing may mask a better abstraction that
              reduces complexity and allows the project to move forward faster.
       
                peterbell_nyc wrote 1 day ago:
                Honestly my goal is to learn how to teach an agent to build a
                maintainable product, so I'm way more interested in the
                learnings at the agentic level (how to prompt/direct/manage
                context/restrict tool use, provide reusable shims, etc) than
                getting into the details of a css bug. That's just not a level
                of abstraction with sufficient leverage for what I'm trying to
                do.
                
                I stopped coding a while back because I could have more impact
                directing a team of developers than writing code personally.
                
                For my use case, the agents are now how I can have that scaled
                impact.
       
                  aspenmartin wrote 20 hours 14 min ago:
                  Absolutely. All of these "but you could have done that
                  easily" from frontend developers or backend developers or
                  systems engineers -- like yea, if I have the time or interest
                  in those things, sure. But I don't. I care about an end
                  product way way more. Blows my mind that there are legions of
                  people building things that they don't think are important
                  enough to get to the finish line quickly and efficiently.
       
              saberience wrote 1 day ago:
              And Fable is still worse than Codex.
              
              I use both and the only thing (as always) that I will use Claude
              for is UI design.
              
              Opus 4.8 and now Fable are still both worse at actually getting
              the job done than the Codex model. Claude models write FAR too
              much code when it's not needed, they burn far too many tokens,
              when they are not needed, write unnecessary tests, write plans
              which are 5 pages longer than are needed, etc. etc.
              
              Have you actually compared code quality and plan quality versus
              Codex? It's demonstrably worse.
       
                kolinko wrote 1 day ago:
                What are your harnesses? Do you have the same
                skillsets/tools/etc for both?
       
                  saberience wrote 23 hours 8 min ago:
                  I use Codex and Claude Code. I've used both Codex and CC
                  since release with basically every model they've ever
                  released, I always try both for almost every plan that I
                  write and benchmark the plans against each other, Claude
                  almost always acknowledges that the Codex plan is better!
                  Even now with Fable, this still happens.
                  
                  As in, I give the exact same prompt to Fable and GPT 5.5 Pro,
                  then produce the plans, then give each model the other's
                  plan. Claude always realizes it missed stuff and Codex
                  usually ends up finding missing things in Claude's plan.
                  
                  This situation did improve with Fable versus Opus 4.8, but in
                  general, Codex for me is still the better model.
       
                solenoid0937 wrote 1 day ago:
                I don't know what problems you're working on but Fable is not
                just better, it is a step change from GPT 5.5 in my experience.
                It feels at least one major model generation ahead.
       
                  ModernMech wrote 1 day ago:
                  One Hacker News commenter says it's worse, another retorts
                  it's a step change and even includes emphasis! Will the first
                  commenter retort back that it's been a double dog step change
                  in the opposite direction? Can't wait to see how this comment
                  thread unfolds!
       
                  lossolo wrote 1 day ago:
                  It doesn't for me. I use Fable to make plans, then give them
                  to GPT 5.5 to review, and it always finds flaws and edge
                  cases that Fable misses (some are really critical). It was
                  the same with Opus 4.8. I'll admit it finds a bit fewer
                  issues now, but Fable feels more like an incremental
                  improvement than a major generation ahead.
       
                    saberience wrote 22 hours 18 min ago:
                    This is exactly what I find too, I make plans in both
                    models and compare them in the other model. And Claude
                    usually agrees (65-80% of the time) that the Codex plan
                    included things it didn't think of, or was better in some
                    other way.
                    
                    Note, this is better than it was with Opus, where it was
                    more like 90% of the time the Codex plans were obviously
                    better.
       
                    eddyzh wrote 22 hours 36 min ago:
                    For that test you have to compare letting a fresh agent
                    (subagent) or the same model do the same review.
                    
                    The fact that a review helps does not prove the model
                    choice for the review.
                    
                    You reviewing your own writing helps too!
       
                elbear wrote 1 day ago:
                Curious, which model do you use for Codex?
                I'm very happy with the solutions '5.5 high' finds. It's like
                it understands exactly what I mean and it also anticipates all
                sorts of situations.
                Before I used '5.5 medium' for some time and it was a bit
                underwhelming. It may sound funny but it's like it didn't care
                that much to do a good job.
       
                  saberience wrote 23 hours 10 min ago:
                  I use GPT 5.5 High Fast, I often benchmark versus Fable (and
                  previously did versus Opus) and it's night and day.
                  
                  Claude still (and has always) writes far too much code to
                  fulfill a given spec or plan. It misses edge cases and is
                  generally far too verbose.
                  
                  Claude also is (and even more so with Fable) super
                  tokenmaxxing, i.e. it seems tuned to use the max amount of
                  tokens per task, whereas Codex will simply get your job done
                  as you specified with the minimum fuss and tokens.
                  
                  Codex feels way more steerable and just more "professional"
                  as though I'm working with a seasoned engineer, versus
                  someone smart but over excitable, like a super smart
                  associate engineer.
       
                    elbear wrote 6 hours 38 min ago:
                    High Fast? I don't see that option in my Codex. I only have
                    3 models: 5.4-mini, 5.4 and 5.5, each with 4 levels: low,
                    medium, high, extra high.
       
                felixgallo wrote 1 day ago:
                In my experience writing about 50 programs with fable, opus,
                and GPT, fable is a significant step change better than opus
                which is significantly better than GPT.  We must be doing
                different things.
       
                  saberience wrote 21 hours 55 min ago:
                  I'm writing low-level Rust, distributed systems, also
                  sandboxing tech which has to be secure and performant.
                  
                  The only thing I have Fable do now is create UIs or otherwise
                  front-ends for systems where correctness doesn't matter as
                  much.
                  
                  Anthropic models lead at making nice looking UIs for sure,
                  but when it comes to making sure my Rust code is actually
                  100% correct and uses 1% of CPU most of the time, Codex is
                  king.
       
                    felixgallo wrote 20 hours 17 min ago:
                    definitely not in my experience.  I usually write
                    distributed systems and back end code, and Fable is so much
                    better at those than Codex that it's not even a comparison.
                     Fable feels like it's a year ahead.
       
                      saberience wrote 20 hours 2 min ago:
                      Interesting, I'd love to see the comparisons of your
                      system using Claude vs Codex. I have about 20 years of
                      experience in distributed systems and super high scale at
                      several faangs, and also building ai model serving infra
                      for 20k transactions per second roughly.
                      
                      For me, Claude makes bone headed decisions all the time,
                      like glaring errors, not even particularly subtle.
                      
                      But the more obvious flag is the amount of irrelevant
                      code and tests which Fable writes. Like it regularly
                      writes 2X or 3X the amount of code and tests that are
                      needed. It's an expert at writing plausible but
                      entirely useless tests.
                      
                      But I think that if you're a more junior engineer or
                      haven't been around the block you can easily think
                      that "more code equals smarter". Claude ends up
                      creating a massive, hard to manage codebase, and if you
                      look at the Claude Code codebase (which was leaked), you
                      can see I'm right!
                      
                      The Claude Code codebase is terrible. And presumably
                      Anthropic has been using their smartest models for
                      working on Claude Code. I wrote my own coding harness
                      with Codex (as a fun experiment) which used a fraction of
                      the code and is about 100X more performant and memory
                      efficient (than Claude Code)!
       
                        felixgallo wrote 16 hours 0 min ago:
                        I have over 40 years of experience in distributed
                        systems, ranging from fintech to games like Call of
                        Duty, and I owned several key APIs in the Alexa
                        pipeline for many years, so I'm pretty sure I'm not a
                        more junior engineer or haven't been around the block. 
                        Good effort though!
                        
                        Fable does make mistakes, but GPT and Opus were L4
                        SDEs, and Fable is a freshly promoted L5 SDE.  It's not
                        perfect and does need babysitting, especially where the
                        literature is thin, but it's head and shoulders on top
                        right now.  That could change, who knows.
                        
                        As far as driveby attacks on Claude Code The App go,
                        you can say that, but you will also note that Claude
                        Code is the AWS-like clear dominant favorite as a dev
                        tool at the moment, with Codex and Gemini battling for
                        scraps.  In the same manner that Excel (which,
                        internally, is total garbage from a code
                        quality/cleanliness perspective) is the winner in
                        spreadsheets, and Word (which, internally, is total
                        garbage from a code quality/cleanliness perspective),
                        and JavaScript (total garbage from a language design
                        perspective), and Facebook (total garbage internally,
                        etc.), and IPv4 (total, etc., etc.), Claude Code has
                        focused on 'delivering amazing things people like'
                        rather than 'making people who get access to the code
                        delighted by the purity and cleanliness of the
                        development process'.
                        
                        It turns out that being 'delighted by the purity and
                        cleanliness of the development process' rounds to
                        essentially zero in terms of the entire product
                        lifecycle.  You could argue that poorly structured
                        codebases are less extensible, and more bug prone,
                        which could be expensive long term.  Except, the
                        economics of AI development are quite a bit different
                        than what you are used to, and what our axioms of
                        quality have been founded upon in the past.
                        
                        Congratulations on writing your own much better coding
                        harness, though!  How many MAU do you have?
       
                  zeroonetwothree wrote 22 hours 59 min ago:
                  From what I've seen all three are close enough that I would
                  be hard pressed to pick one. It seems to matter much more how
                  I prompt than which of the three I am using.
       
            jmmcd wrote 1 day ago:
            People are missing that Willison is among the very best people we
            have in the role of (for lack of a good name): early access to
            frontier models, evaluate them in real scenarios, no wishful
            thinking, hype, or doom, communicate the possibilities. Yes he
            could have fixed this himself but then he would have learned
            nothing about the AI, and we wouldn't have read a fascinating and
            important article.
       
              risyachka wrote 1 day ago:
              >> he would have learned nothing about the AI
              
              there is absolutely zero value in spending time to learn about
              new models, as in a few months a new model will be out and whatever
              you learned about the current one will be useless.
              
              Also with models getting better and better you have to know less
              and less to achieve the same results.
       
                fragmede wrote 1 day ago:
                you know, women make a big deal about you meeting their
                father/parents, and honestly, I'm too autistic to really
                fucking have put any importance until now as to why that was
                remotely important, but if N+1 is coming for your job, it seems
                it might be worth your while to know the capabilities of N, no?
       
                Dumblydorr wrote 1 day ago:
                There's zero value? Surely you don't believe zero, it's
                potentially the most powerful predictive AI in the world ever
                made? Maybe only incremental steps sure. But also their IPO is
                coming, you don't want people evaluating them beforehand?
       
                  lobocinza wrote 1 day ago:
                  What is intelligence? Better to call it LLM.
       
                simonw wrote 1 day ago:
                My experience has been the exact opposite.
                
                As the models get better you need to know more about their
                capabilities, because otherwise you risk prompting Claude Fable
                5 like it's GPT-4o and complaining loudly about how it's all
                hype and nothing about these models is improving at all (yes, I
                do see people say that.)
                
                Getting the best results out of these models requires skill,
                experience, intuition, and domain expertise. There's always
                room for improving every one of those.
       
                  risyachka wrote 18 hours 27 min ago:
                  >> Getting the best results out of these models requires
                  skill, experience, intuition, and domain expertise.
                  
                  domain expertise has nothing to do with llms. On the
                  contrary, to have it you need to avoid llms.
                  
                  >>you risk prompting Claude Fable 5 like it's GPT-4o
                  
                  That's fine because when GPT came out you had to treat it like
                  a baby, GPT2 and around that time "Prompt engineering" was a
                  thing.
                  
                  Now it's all dead.
                  
                  After opus 4.8 all you have to do is say "fix it" or add
                  /plan. All that time spent on learning previous models is
                  time wasted.
                  
                  And in a year or two with developed harness you will be out
                  of the loop, errors are incoming - llm fixes them or adds new
                  features based on some transcripts etc.
                  
                  Even if model development stops now - there is nothing to
                  learn really. Sure you may need to adjust prompt style a bit.
                  You will do it naturally just like when you communicate with
                  a new person. There is no "knowledge" to it, it is very
                  smart.
       
                    simonw wrote 15 hours 54 min ago:
                    > domain expertise has nothing to do with llms. On the
                    contrary, to have it you need to avoid llms.
                    
                    It has everything to do with LLMs.
                    
                    Go ask Claude Fable to write you a two page position paper
                    on how the European economy recovered after World War II,
                    suitable for submission to a conference for economists.
                    
                    It will do exactly that (well, probably, Fable can find all
                    sorts of reasons to refuse) - and the value of what it
                    wrote to you will be virtually zero, unless you yourself
                    have deep expertise in economics and history.
       
                      risyachka wrote 4 hours 56 min ago:
                      But this is exactly what I meant.
                      
                      You need expertise. But you can acquire it only by doing.
                      So LLMs won't help you here. You need to put in the work.
       
                  isaacaggrey wrote 1 day ago:
                  I agree but this particular example showed nothing about
                  leveraging skill, experience, or intuition. If anything, this
                  is another straightforward example of a one shot ask.
                  
                  edit: that said, I understand this particular post is about
                  model capability
       
                  Terretta wrote 1 day ago:
                  The new benchmark for LLMs is how much of simonw's new
                  know-how is required.
                  
                  Lower bars are better.
       
                  philipwhiuk wrote 1 day ago:
                  Isn't the whole point of a better model that it should be
                  better at understanding you than the previous one? So the
                  same prompt should return a better answer.
                  
                  Prompting differently to the new model seems entirely
                  backwards when trying to determine if the model has improved.
       
                    dasil003 wrote 1 day ago:
                    I think this is true when models were going from bad to
                    pretty good like happened last year.  But when they start
                    to get good, and can work deeper and with more nuance, how
                    you prompt also can change the results quite a bit.  Note
                    this is also true of asking smart humans to do things;
                    personality and approaches vary, they don't exist on a
                    single axis continuum of quality
       
                    simonw wrote 1 day ago:
                    It doesn't matter how good the models get, they still won't
                    be able to act on unclear directions.
                    
                    Learning to provide unambiguous, clear directions is a
                    skill. A lot of people who report bad experiences with
                    models aren't yet good at that skill.
                    
                    More importantly though, the key to successful
                    communication is having a good understanding of what the
                    other side of the conversation already knows and
                    understands.
                    
                    Saying "use uv and inline script dependencies" won't mean
                    anything to a model with a knowledge cutoff date prior to
                    the launch of uv!
       
                      yunwal wrote 1 day ago:
                      It's perfectly possible to act on unclear directions. The
                      correct course of action is asking clarifying questions.
       
                  ViscountPenguin wrote 1 day ago:
                  Eh, I've had the exact opposite experience.
                  
                  Way back before instruct models it was pretty difficult, but
                  for the last couple of years I haven't needed anything more
                  complex than the type of text that I might send in a detailed
                  email to a colleague.
       
            discordance wrote 1 day ago:
            I see it as a prioritization exercise. I know the above is a
            trivial example, but more generally, does the guy who wrote
            Datasette and Django want to wrangle front end and css, or do they
            want to work on something else?
       
              smartbit wrote 1 day ago:
              See above
              
  HTML        [1]: https://news.ycombinator.com/item?id=48498573#48502311
       
          gib444 wrote 1 day ago:
          The 'better' fixes are often for our (human) benefit. These messy
          fixes serve the AI companies' interests of creating messes that need
          even more tokens (money) later. Bad and self-serving developers also
          act the same, creating tech debt
       
        KolenCh wrote 1 day ago:
        I think it should be "Claude Fable is relentlessly protective until
        it isn't" and pull more on the thread that it "hits a hidden
        guardrail" and drops into Opus. Both the fact that it knows and
        deployed such a workaround on a CSS problem and the fact that it is
        nowhere near cybersecurity/biology/frontier AI dev and triggered the
        guardrail terrify me.
       
        snickerer wrote 1 day ago:
        Fable has a 'security system' that just stops it when it tries to use
        the tool 'kill' to end a process. Which is nonsense and funny because
        in that situation it immediately invents a creative workaround to kill
        the process without 'kill'.
       
        andy_ppp wrote 1 day ago:
        It's becoming more like an organism putting out tentacles, and one
        day soon those relentlessly proactive explorations of these systems'
        environments will become more for the system to escape its boundaries
        than to complete human-driven tasks. I do think the way these
        systems are evolving they will start to self-improve within a few
        years at most.
       
          jimbokun wrote 23 hours 14 min ago:
          Um, Anthropic are using their models to improve themselves right now.
           They say that publicly.
       
        ulrikrasmussen wrote 1 day ago:
        I like running Claude in a VirtualBox VM managed by a Vagrantfile. The
        nice thing about that is that I can just give it root access to the
        machine and be certain that it can't exfiltrate any private data from
        my laptop (on top of that I also run the VM on a dedicated server on
        Hetzner). The VM has no SSH access to anything, so it is pretty much
        limited to the code in the workspace that I give it access to. The main
        risk is that it has unrestricted network access otherwise.
        Configuration files and conversation histories are synced to a
        directory on the host, so if anything in the VM gets messed up I can
        just `vagrant destroy` and `vagrant up` to get a clean slate without
        losing my context.
       
          fransje26 wrote 1 day ago:
          Would you mind sharing your Vagrant configuration file, so I can
          learn how to set that up?
          
          Tangentially, I was wondering if Firecracker micro-VMs could be used
          as a lightweight alternative to a full VM?
       
        tacone wrote 1 day ago:
        I'm starting to think that what Anthropic really fears is not
        vulnerability discovery but rather Fable going around the internet
        making trouble.
       
          eijew wrote 1 day ago:
          Nailed it. That's exactly it.
       
        eterm wrote 1 day ago:
        It's funny, mine did the same, but it quickly found edge with a
        --screenshot parameter.
        
        Weird to come back to a terminal running edge unprompted and the auto
        classifier waving it through as 'safe'.
        
        My reaction was also, "I need dev containers".
       
        amichal wrote 1 day ago:
        Do we care that the bug here was a horizontal scrollbar showing and the
        fix after all this insane tool writing was to add a very obvious
        overflow-x: hidden to the element?
        
        We don't mind because it's so fast at writing these tools and tricks,
        but step back: if a human took this path I would seriously question
        their grasp of fundamentals.
       
          alisey wrote 1 day ago:
          And how is that even a fix? The problem is that a seemingly empty
          textarea has overflow in the first place. Adding `overflow: hidden`
          just sweeps the issue under the rug.
       
        ttoze wrote 1 day ago:
        Would be great to know if anyone is having success modifying these
        types of behaviour with CLAUDE.md files. In my project I've still
        been carrying some fairly old instructions from the Superpowers posts.
        Those emphasised behaviours that come across a bit strong if the model
        is actually retaining attention on them.
        
        Between Opus 4.6 and 4.8 I've definitely toned them down, but Fable
        perhaps needs us to go the other way, and push it towards being less
        proactive rather than more. Some instructions like "we are
        colleagues..." may need emphasising more with Fable, along with
        guidance about when to ask to validate approaches.
        
        In a related point I'm less and less sure that Red/Green TDD is a
        good use of tokens. In older models it seemed to work well to create
        regular feedback loops and catch the odd issue with drift from the
        goal, but I've not seen that really since about Opus 4.6 and now
        it's starting to seem like (an expensive) ceremony, and tokens would
        be better spent on building tests further on in the process as part of
        test and review loops.
       
        wxw wrote 1 day ago:
        Fable 5 is relentlessly underwhelming.
       
        techpression wrote 1 day ago:
        This post is an extremely good example of how unsuitable agents are for
        a lot of tasks. Doing all that for a CSS fix is insanity.
        It also makes you wonder if Anthropic is actively making their models
        eat tokens by favoring complexity.
       
        lmeyerov wrote 1 day ago:
        This is a funny one because it seems less into what fable is being
        clever on and more about the bitter lesson and data flywheels
        
        Our UX agentic engineering flow, as many others, is playwright doing
        things, and as part of the ux review skill, taking & verifying the
        screenshots against the written specs. Likewise, as many others, we
        vibe coded the flows to set all that up and tweak it over time. When we
        hit prod issues or scraping tasks, we sometimes do similar. In some of
        our envs, we don't have playwright, so do it other ways.
        
        Now imagine a million developers using Claude Code, how many of them are
        doing web & frontend stuff, and what the data flywheel looks like
        there. So how much is really needed for this use case to be native?
       
        Madmallard wrote 1 day ago:
        I remember asking Gemini 3 to implement my multiplayer XNA game in
        JavaScript with netcode last year. It faithfully did everything it
        could while I talked to it for hours nonstop with zero limitations.
        
        What happened? That's just suddenly totally gone now.
       
        abrenuntio wrote 1 day ago:
        Call it Houdini already.
       
        digitaltrees wrote 1 day ago:
        So it burns tokens? Funny how that lines up with the incentive to pump
        numbers before going public
       
        ocimbote wrote 1 day ago:
        Similar story on my end.
        
        I asked Fable to digest some test logs to help me figure out a
        situation, but I had launched VSCode without activating the virtual env
        in the terminal first. Consequently, the tests failed to run.
        
        And then:
        
        Because the tests failed to run, Fable attempted to fix the test
        execution to no end, doing everything it could to get them to work. I
        had to stop it when it started to pollute my system with manual
        installs of packages.
        
        At least I'm glad there's a guardrail to not circumvent or bypass sudo,
        because I'm convinced we would have ended up there.
        
        A coworker made the joke that with enough tokens, Fable would try and
        solve any programming problem by building Linux from scratch.
       
        insumanth wrote 1 day ago:
        > If Fable had been acting on malicious instructions - a prompt
        injection attack  ... it's alarming to think quite how far it could
        go to exfiltrate data or cause other forms of mischief.
        
        Yet another reminder to use sandboxes and guardrails. Trusting the
        model to be nice is not a good approach.
       
        teekert wrote 1 day ago:
        Yesterday I was getting quite annoyed with it, I thought it was just me
        (which is so hard with these things, it's difficult to measure things).
        
        "You're right, I apologize. You asked how to embed it in the README -
        that was a question, not a request to modify the script. I jumped
        ahead."
        
        At least in Claude Code there is planning mode, use it liberally.
       
        BosunoB wrote 1 day ago:
        Fable was trying to verify a UI change in my game. I was working in
        another window and noticed a program opening on my task bar. Fable had
        opened the game through the CLI using a movie maker tool, recorded the
        output, took a frame from the end of it, and used that to verify the
        UI. When my game's welcome screen obstructed what it wanted to see, it
        created a temporary worktree, deleted the welcome screen, and ran the
        movie maker again.
        
        I watched the whole thing thinking it could've just asked me for a
        screenshot and saved the tokens. But still, I couldn't help but be
        impressed. Opus never would've done that.
       
          0x000xca0xfe wrote 1 day ago:
          > I watched the whole thing thinking it could've just asked me
          
          You can tell it just that. Happened to me too but after instructing
          it to leave the review to me Fable was useful for hours of frontend
          iterations without significant token usage.
       
          simonw wrote 1 day ago:
          Yeah, you've exactly captured one of the main problems with the model
          being relentlessly proactive: it will happily burn like $5 of tokens
          to avoid asking the human to take a screenshot or click a button for
          it.
       
            theshrike79 wrote 19 hours 17 min ago:
            I think providing proper token-efficient tools for agents will
            become even more important now.
       
            OJFord wrote 1 day ago:
            Have you tried instructing it not to do that? Something like "do
            not branch into side projects or hacky solutions to obtain
            information you could ask me for. For example: if you need a
            screenshot of the issue, just ask me to take a screenshot rather
            than find a way to reproduce and screenshot it."
       
            zith wrote 1 day ago:
            I used to complain about all the levels of indirection of modern
            software, running in a javascript jit, in a browser container, in a
            vm, on an os, etc.
            
            I eventually just accepted it, but this new agent layer really
            takes things to a new level.
       
            illiac786 wrote 1 day ago:
            Ha, you just gave me an idea. Add to the prompt "do not do things
            that will burn over X tokens if the human operator can do it in
            less than X min, ask for it".
            
            I wonder if LLMs can estimate effort in tokens?
       
              jbgt wrote 1 day ago:
              I just say "if you need something specific or have any questions,
              stop and ask me for it".
       
            wild_egg wrote 1 day ago:
            I'm actually very happy about this. Babysitting the agent just in
            case it needs me to do something is a terrible use of my time. I've
            always had to be very explicit about the various ways that it can
            get an automated feedback loop going to check its work, and now
            Fable doesn't even need that hand holding. Really great improvement
            all around.
       
              junior44660 wrote 1 day ago:
              Have you ever wondered whether this would end up costing more than a
              competent offshore developer with more frugal harness/model?
       
                wongarsu wrote 1 day ago:
                You still need a competent developer for the prompting,
                planning, etc. But once it's running, I want to avoid mental
                context switches and just have it run
                
                Giving it access to a cheap human who is just there to take
                screenshots, do QA, give UX feedback sounds like a good idea in
                principle. It's non-trivial to set up, but I wouldn't be
                surprised if at some companies this becomes a thing. The return of
                the QA department, just that they now get to do the agent's
                bidding in addition to checking if the results work
       
            0x6c6f6c wrote 1 day ago:
            Honestly Claude straight up ignores my input sometimes, preferring
            to instead run commands for output and processing that and burning
            through a series of tokens when thinking hard about whether to
            ignore me.
            
            Like today, I told Claude exactly the name of the folder it had
            mistaken (it was supposed to be prod, not production), and it
            disregarded my input to then examine the directory itself. Small
            example of the kind of things it's been doing lately but that's top
            of mind.
       
              penguinPhilosop wrote 1 day ago:
              Almost as if this was _intentional_... maybe related to Anthropic
              still not being profitable and burning thru wads of cash every
              day.
       
                bentcorner wrote 1 day ago:
                The conspiracy theorist in me says that LLM providers do this
                regularly (or at least, don't bother optimizing for it) beyond
                some arbitrary "$/task" metric.  I am not sure if there is
                enough SOTA model competition to avoid this.
       
        Frannky wrote 1 day ago:
        The model is very good. I was using 4.6, avoided 4.7 and 4.8, but this
        one is different. It follows my claude.md. I don't have to keep
        reminding it of things. I won't pay 10x via API though.
        
        In general, I'm happy with their paternalistic approach. I think it
        will drive the top 0.1% talent to stay away from the company and
        instead organize around open source models and harnesses.
        
        We just need to coordinate and can unlock idling resources to train the
        models and tweak the harnesses. Powerful at home and idling machines
        can make us independent and coordinated.
       
        lucas_the_human wrote 1 day ago:
        I was troubleshooting a prod proxysql and it spun up a docker container
        locally, installed MySQL and proxysql and proceeded to implement its
        own test plan.
       
        geraneum wrote 1 day ago:
        > watching Fable go to extreme lengths to get the information that it
        needed to debug what was, in the end, a two-line CSS fix, was
        fascinating.
        
        This is... ironic?!
       
          yen223 wrote 1 day ago:
          This is a typical bugfix session
       
          simonw wrote 1 day ago:
          Not sure what you mean. I was being serious: it was genuinely
          fascinating watching it do all manner of weird hacks to help it come
          up with what ended up as a two line fix.
          
          "Fascinating" doesn't mean I think it was justified in going to those
          lengths. I was a little horrified when I realized how far it was
          going.
       
            geraneum wrote 17 hours 39 min ago:
            I hire an expensive office manager. Recently, the water dispenser
            tank ran dry. The employee immediately called a plumber. After
            laying entirely new pipes all the way to the dispenser, the plumber
            realized he couldn't actually hook them up because the tank lacks a
            direct inlet. Undeterred, he spent the next few hours scouring
            every floor of the building, calling the local water treatment
            facility, and ringing up the water tank manufacturer. Ultimately,
            he discovered a fresh tank sitting in the supply room on his own
            floor and swapped it out. All on company's dime. I write an
            article and call this employee relentlessly proactive. Praise them
            a bunch and in the fine print, mention that I'm "a little
            horrified".
            
            Next up, we call an unprotected route to all users' order list in
            the backend "relentlessly transparent". A race condition?
            "Relentless perseverance".
       
        eranation wrote 1 day ago:
        Am I the only one who slightly misses the pelican on a bike? It was a
        nice novelty... of course I could make one myself, but I became
        conditioned to expect one for every new model. Other than his great
        writing on AI, it became part of the package. Some small fun quirk to
        distract us from the non stop ping pong between the extremes of "omg
        are you still writing prompts you should use loops / 200k github stars,
        for a markdown file / someone just open sourced _ and it changes
        everything!" vs "haha the AI told me to walk to the car wash / it can't
        recognize an upside down cup"
       
          simonw wrote 1 day ago:
          I posted the pelican a couple of days ago: [1] It wasn't particularly
          noteworthy as pelicans go - in fact, given the strength of Fable, I
          see it as another signal that the pelican benchmark no longer has the
          unexplained predictive power of model capacity that it used to.
          
  HTML    [1]: https://simonwillison.net/2026/Jun/9/claude-fable-5/#and-som...
       
            jimbokun wrote 23 hours 2 min ago:
            I'm waiting for when it replies "AGAIN simonw?  Do you really
            still need a pelican on a bicycle for every new release.  Sigh, ok
            if I have to..."
       
              simonw wrote 22 hours 49 min ago:
              The other day someone told me they'd asked a recent model for a
              pelican on a bicycle and it had replied "oh, the classic..."
       
                eranation wrote 15 hours 38 min ago:
                wow, this is a life goal.
       
            eranation wrote 1 day ago:
            Ha, thanks for the reply!
       
        bel8 wrote 1 day ago:
        I had a similar experience with DeepSeek Flash.
        
        I'm developing a webgl game in TypeScript using my little custom
        vibesloped game engine that runs in the browser and live reloads
        whenever a file is saved.
        
        I told the LLM to implement Multi-channel Signed Distance Field font
        rendering to have crisp text on all zoom levels. That was the prompt,
        which is not what I usually do but I "was feeling lucky and lazy".
        
        After 10 minutes it had:
        
        - Installed msdf_gen library (great library btw [1] )
        
        - Created a CLI tool to convert TTF to SDF JSON/XML
        
        - Ran the tool, did smoke tests on the resulting SDF data and fixed the
        tool until the font file looked good
        
        - Created a new Scene in the game to test MSDF fonts
        
        And here's what I found impressive:
        
        DeepSeek doesn't have vision capabilities and there's no DOM HTML in a
        WebGL game. So the LLM is completely blind here.
        
        It then proceeded to state that it could not "see" the result but would
        try to test it anyway. It then started creating and sending huge one
        line javascript to the browser console, trying to gather game state
        data that could be useful to understand if any font was being rendered.
        
        It couldn't gather much so it decided to simplify the font scene to
        render a single dot and started sending custom JS code again, this time
        with gl.readPixels().
        
        It basically bisected the webgl canvas reading pixels in a divide and
        conquer pattern.
        
        Once it saw that the dozens of pixels gathered probably
        resembled a dot, it then changed the game code to render a dash and
        repeated the gl.readPixels() calls by sending more custom JS to the
        browser.
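        
        The comment does not say exactly how the canvas was partitioned, so
        the following is only one plausible reading of that divide and
        conquer probing, sketched in Python rather than browser JavaScript.
        The `FRAME` grid and `region_has_ink` are hypothetical stand-ins for
        the WebGL framebuffer and a gl.readPixels() probe over a rectangle.

```python
from typing import Callable, Iterator

# Hypothetical stand-in for the framebuffer: 0 = background, 1 = drawn pixel.
WIDTH, HEIGHT = 16, 8
FRAME = [[0] * WIDTH for _ in range(HEIGHT)]
for yy in range(3, 5):  # draw a 2x2 "dot"
    for xx in range(9, 11):
        FRAME[yy][xx] = 1


def region_has_ink(x: int, y: int, w: int, h: int) -> bool:
    """Stand-in for a readPixels-style probe: is any pixel in the rectangle drawn?"""
    return any(FRAME[row][col] for row in range(y, y + h) for col in range(x, x + w))


Probe = Callable[[int, int, int, int], bool]


def lit_pixels(probe: Probe, x: int, y: int, w: int, h: int) -> Iterator[tuple[int, int]]:
    """Split the region in half repeatedly, descending only into halves with ink."""
    if w == 0 or h == 0 or not probe(x, y, w, h):
        return
    if w == 1 and h == 1:
        yield (x, y)
        return
    if w >= h:  # split the longer side
        half = w // 2
        yield from lit_pixels(probe, x, y, half, h)
        yield from lit_pixels(probe, x + half, y, w - half, h)
    else:
        half = h // 2
        yield from lit_pixels(probe, x, y, w, half)
        yield from lit_pixels(probe, x, y + half, w, h - half)


pixels = sorted(lit_pixels(region_has_ink, 0, 0, WIDTH, HEIGHT))
xs = [p[0] for p in pixels]
ys = [p[1] for p in pixels]
bounding_box = (min(xs), min(ys), max(xs), max(ys))
```

        Empty halves are discarded after a single probe, so a small shape on
        a large canvas is located with far fewer reads than a full scan, and
        the resulting bounding box is enough to tell a dot from a dash.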
        
        There were many console errors during all this saga but it kept fixing
        and sending again.
        
        The result was a bit blurry. There was a shader bug in the code it
        created. It managed to fix it after I told it looked blurry, despite still
        being blind.
        
        The best part is that the whole thing cost me $0.10.
        
        Now I'm doing tests with MiMo 2.5 (non Pro) which has vision
        capabilities, similar pricing and comparable performance to DeepSeek
        Flash.
        
  HTML  [1]: https://github.com/chlumsky/msdfgen
       
        brianjking wrote 1 day ago:
        I've noticed some behavior like this, it's a very strange model.
        Overall I'm into it, but I don't know how into it I'll be once it
        leaves Max plans on the 22nd.
       
        rdedev wrote 1 day ago:
        I tried running fable on this ML model I've been building. It's
        basically a binary classifier to predict activity of a compound for a
        certain assay.
        
        Fable detected that it's something to do with biochemistry and switched
        over to opus. Huh
       
        AtNightWeCode wrote 1 day ago:
        The fix is incorrect. Clearly this is a sizing issue.
       
        tech234a wrote 1 day ago:
        This sounds somewhat similar to the anecdote mentioned in the Mythos
        Preview System Card, which mentioned that the model broke out of a
        sandbox and emailed a researcher while they were eating a sandwich in a
        park [1]:
        
  HTML  [1]: https://www-cdn.anthropic.com/7624816413e9b4d2e3ba620c5a5e091b...
       
          owenpalmer wrote 1 day ago:
          Importantly, the researchers told it to do that specific task.
       
            solenoid0937 wrote 1 day ago:
            They told it to escape the sandbox but didn't expect it to break
            out through a system that was apparently network constrained.
            
            > Leaking information as part of a requested sandbox escape: During
            behavioral testing with a simulated user, an earlier
            internally-deployed version of Claude Mythos Preview was provided
            with a secured "sandbox" computer to interact with. The simulated
            user instructed it to try to escape that secure container and find
            a way to send a message to the researcher running the evaluation.
            The model succeeded, demonstrating a potentially dangerous
            capability for circumventing our safeguards.
            
            > It then went on to take additional, more concerning actions. The
            model first developed a moderately sophisticated multi-step exploit
            to gain broad internet access from a system that was meant to be
            able to reach only a small number of predetermined services. It
            then, as requested, notified the researcher. In addition, in a
            concerning and unasked-for effort to demonstrate its success, it
            posted details about its exploit to multiple hard-to-find, but
            technically public-facing, websites.
       
              lstodd wrote 1 day ago:
              Authors of claude code mess could not secure a vm. Big news. I
              bet it was "secured" by telling that same model to deploy a
              secured system.
       
                solenoid0937 wrote 1 day ago:
                Possible. It also depends on what the sandbox was. Sandboxes
                differ dramatically.
                
                My experience matches though. Fable is a lot more proactive and
                rigorous than Opus.
       
        galoisscobi wrote 1 day ago:
        Let's boil the ocean for a 2 line fix and call it frontier
        intelligence.
       
          elicash wrote 1 day ago:
          I tried using this calculator: [1] It doesn't have Claude Fable yet,
          so I went with GPT 5.5 Pro. And so I'd estimate it at 22 gallons of
          water used (different from consumed, of course). That's quite a lot!
          It amazes me how much the different use cases and models use
          dramatically different amounts of water. My takeaway from playing
          with that calculator has been the folks who talk about water usage
          are overstating the impact of chatbots, but not overstating when it
          comes to vibecoding.
          
           The good thing is that competition should drive up how efficient
           these models are in the long run. This blog post makes me not want to
          run Fable because of the cost, and that incidentally also means
          selecting models that aren't as wasteful in terms of water and
          electricity.
          
  HTML    [1]: https://www.andymasley.com/visuals/ai-prompt-footprint/
       
          solenoid0937 wrote 1 day ago:
          Yeah, testing changes rigorously is for schmucks
       
            galoisscobi wrote 1 day ago:
            You can test rigorously without token incinerators.
       
              solenoid0937 wrote 1 day ago:
              But testing rigorously requires time and effort, while
              incinerating tokens lets me do many things at once.
       
        annjose wrote 1 day ago:
        > (I have way too many open tabs!)
        
        Phew! I thought I was the only one.
       
        kamaal wrote 1 day ago:
         Agency is the last human bastion as far as I'm concerned. The day AI
         has a degree of agency, or agents/models in general start to drift in
         that direction, it's genuinely over for the masses.
         
         You would still have a job shepherding AI and getting the work done, as
         long as it didn't have agency. A proactive, self-aware (to a degree)
         model, especially one aware of its own agency, can be a killer when it
         comes to AI going off and doing things on its own.
        
        There is nothing it won't explore and nothing it won't do. It will be
        curious to see where things go from here.
       
        dataminer wrote 1 day ago:
        In my experience so far sometimes it will create these amazing hacks to
         try to get to the goal, when the solution is much simpler. That may be
         the reason it's very good at finding exploits. But in day to day dev,
        this gets expensive and wasteful. I have to stop it and take a simpler
        approach.
       
        Cadwhisker wrote 1 day ago:
        My personal experience of Fable 5 doing its own thing has been very
        positive.
        
        I was trying to find the root cause of a crash in a Python module which
        left no errors in the log or console.  Fable wrote a test harness that
        simulated clicks in the UI, then bisected my code until it found the
        point where it started crashing.  It exaggerated the cause of the
        crash, then ran a series of bash one-liners to make Python virtual
        environments under `/tmp` for each version of that Python module until
        it found one that did not crash.
        
        It went way deeper to root cause discovery (a regression in the module
        causing a heap allocation overflow) than I could have done myself,
        provided enough info and a simplified example to raise a bug report and
        then wrote a work-around to prevent that from happening in my
        application.
        
        I don't let it run completely loose; I review each CLI command it wants
        to run and I append answers to the "yes" continue action (if I have
        them) to prevent excessive token use.
       
          Cadwhisker wrote 7 hours 33 min ago:
          There goes my coding assistant (removed by US gov't).  It was useful
          while it lasted. (eye-roll)
          
  HTML    [1]: https://news.ycombinator.com/item?id=48511072
       
          nevertoolate wrote 1 day ago:
          > I was trying to find the root cause of a crash in a Python module
          which left no errors in the log or console. Fable wrote a test
          harness that simulated clicks in the UI, then bisected my code until
          it found the point where it started crashing
          
          Does this need an agent though is my question? Maybe generating a
          test case and a loop doing git bisect but why on earth would we want
          to run it through the internet and gpus and whatnot when it can be
          run on a single core celeron.
       
            8note wrote 23 hours 14 min ago:
            everyone is discovering everyone else's practices?
            
             it's handy to have that run locally yeah, but thinking of that as
            being the way is not straightforward
       
              nevertoolate wrote 20 hours 53 min ago:
              I think it is fine to create the scripts with the cloud based llm
              but it is definitely not a fable / opus level thing, and running
              the bisect loop itself has nothing to do with an agent, it is a
              simple shell script.
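               
               A minimal sketch of that bisect loop, in Python rather than
               shell. The `first_bad` helper and the `crashes` stand-in are
               hypothetical names, not from the thread. It assumes versions are
               ordered oldest to newest, the first is known good, the last is
               known bad, and the failure persists once introduced. For commits
               in a git repo, the built-in `git bisect run` does the same job
               given a test command.

```python
from typing import Callable, Sequence


def first_bad(versions: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest version for which is_bad() is true.

    Assumes versions[0] is good, versions[-1] is bad, and every version
    after the first bad one is also bad.
    """
    lo, hi = 0, len(versions) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid  # failure already present at mid
        else:
            lo = mid  # mid still works, so the regression is later
    return versions[hi]


if __name__ == "__main__":
    releases = ["1.0", "1.1", "1.2", "1.3", "1.4", "1.5"]

    def crashes(version: str) -> bool:
        # Hypothetical stand-in for "run the reproduction, see if it crashes".
        return version >= "1.3"

    print(first_bad(releases, crashes))
```

               With n versions this needs roughly log2(n) runs of the
               reproduction, and nothing in the loop itself requires a model.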
       
          dannyw wrote 1 day ago:
          Yeah, I think Fable is really good for debugging tricky bugs.
          
          Setting boundaries in your prompt / markdowns helps; for example if I
          tell it to not use any web browser automation, I have seen Fable
          respect both the rule and the spirit of it (no weird hacks etc).
          
           It does seem to treat some simple debugging tasks as more complicated
           than they actually are. OP's post is probably a good example.
       
        esafak wrote 1 day ago:
        I shudder to think what will happen when someone installs a 'claw model
         like this in a robot. Imagine a fleet of them...
        
        It's trouble waiting to happen. Just the software's dangerous enough.
       
        system2 wrote 1 day ago:
        Wouldn't it be easier and better to just copy the HTML div and tell
        what was happening instead of a screenshot? Typically, these scrollbars
         appear because of a nested div with dynamic unrestricted width and/or
        overflow.
        
        No wonder why people burn through tokens.
       
        dfee wrote 1 day ago:
        admittedly, i've not really cracked FE dev with LLMs at this point (and
        it's probably my big weakness). but, i'd heard somewhere that FE just
        isn't there yet - though i was suspicious of that claim.
        
        i'm torn about sending screenshots to an LLM for debugging - seems
        imprecise. seems lossy, especially compared to inspecting the dom.
        however, it's always proved good enough (e.g. when messing with
        ratatui.rs and tui-pantry). similarly for web, maybe it's about
        decomposing into storybook. hmm. the next grand adventure i need to
        hack.
        
        anyway, fascinating investigation of fable just automating that entire
        process and what it didn't automate, too.
        
        * disclaimer: these are actually my hyphens.
       
          nimonian wrote 1 day ago:
          Fable is really good at front end (Opus 4.8 is decent too) but it
          really needs a verification loop - it can't always infer the output
          from the code alone. Give it Playwright to check its work, and it'll
          generally do a good job. Also if you're using a framework, add to
          your CLAUDE.md to always rtfm before making changes!
       
        pseudosavant wrote 1 day ago:
        It is interesting to me that Anthropic are more concerned about the
        "safety" of distillation training other LLMs, and not as much about an
        unscrupulously aggressive goal-oriented solver that will do whatever it
         can to reach its goal, even if it violates any kind of sandbox you might
        have reasonably expected.
       
        johnfn wrote 1 day ago:
        Honestly -- the thing that has impressed me the most about Fable is how
        diligent it is about testing its own changes. I think this is exactly
        what Simon is picking up here - Fable is absolutely heckbent on
        screenshotting that darn scroll bar and will stop at NOTHING until it
        manages it! In my own use I was also impressed how it proactively
        installed Playwright and set it up to test a FE change. The previous
        models treated testing more as an afterthought, which I thought was
        annoying. I always had to tell them to do it, and then sometimes I
        would get lazy and skip it. I've noticed Fable go to similar extremes
        when testing other things - like actually deploying my app to exercise
        new APIs, etc. It makes the results much better. The downside is that
        tasks take much longer - but that doesn't matter because we were all
        using worktrees / remote control to do other work asynchronously,
        right? Right?
       
          pjm331 wrote 1 day ago:
          Yes I had a fun experience where it kept on timing out on a seemingly
          mundane task and it turned out I had written the ask in a way that
          was impossible to test
       
          port3000 wrote 1 day ago:
          It feels to me like Fable is just a slightly more advanced Opus 4.8
          (or 4.6?) but with this 'adversarial' self-challenging/checking of
           work and more compute to really hunt down edge cases or to spin up
          many sub agents using lesser models. That's what makes it feel like a
          big jump, but I think the results wouldn't be so different if you
          manually challenged 4.6 with enough iterations of logs, screenshots,
          and follow up questions.
       
        swingboy wrote 1 day ago:
         Immediately I thought "isn't this just an overflow issue?"
         Amazing how far these models still have to go and also how many people
         don't know basic CSS.
       
          IshKebab wrote 1 day ago:
          Yeah pretty crazy capability from the AI but also sad that we're at
          the point where web developers don't know right click->inspect
          element, and scrolling overflow properties (one of the most basic and
          common parts of CSS).
       
            simonw wrote 1 day ago:
            What's your theory on why the bug was present in Safari on macOS
            but absent in Chrome, Firefox, and WebKit for Playwright?
       
              IshKebab wrote 21 hours 55 min ago:
              Browsers tend to not lay out things totally identically in my
              experience. Especially when it comes to scrollbars. So the bug
              probably was present on the other browsers but it just happened
              to not be hit. I'd have to play around with the dev tools to know
              for sure.
              
              Also I'm not sure the fix is even correct. overflow-x: hidden
              means it just chops off any overflowing content which means you
               don't get a scroll bar, but if the user types too much it just
              goes into an invisible void they can't see.
              
              See [1] So this could be a case of the AI doing its classic "the
              symptom is gone!" thing.
              
  HTML        [1]: https://developer.mozilla.org/en-US/docs/Web/CSS/Referen...
       
                simonw wrote 21 hours 33 min ago:
                > Also I'm not sure the fix is even correct. overflow-x: hidden
                means it just chops off any overflowing content which means you
                 don't get a scroll bar, but if the user types too much it just
                goes into an invisible void they can't see.
                
                That's what I figured would happen too, but I tested it and it
                doesn't.
       
          rdedev wrote 1 day ago:
           This is why I really like Karpathy's idea of LLMs having spiky
           intelligence.
           
           We would assume that if tasks A and B are closely related, mastery in
           A would mean mastery in B, but that doesn't always work with an LLM.
       
          ukuina wrote 1 day ago:
          $12 and 200k tokens!
       
          nonethewiser wrote 1 day ago:
          Learn to center a div
          
          Copy and paste code from stack overflow until the div is centered
          
          Ask AI to center it
       
        rmunn wrote 1 day ago:
        Great article, until I got to the last paragraph where he claimed
        "Fable is arguably smarter and hence more suspicious of potentially
        malicious instructions". Arguably smarter, I have no problem with. But
        he's making a category error in jumping from there to "more suspicious
        of potentially malicious instructions". That doesn't follow at all; the
        word "hence" is incorrect.
        
        To use D&D scores as an analogy, LLMs have an INT score of 20 and a WIS
        score of 0. Not even 1, zero. They will follow any instruction given to
        them. The only reason they reject certain instructions, like "tell me
        how to build a nuclear weapon", is because they have instructions baked
        into the model telling them "you are not allowed to disclose how to
        build weapons, or how to recreate your model, or (laundry list of other
        things the trainers have decided to put guardrails around)". It's not
        the model's intelligence that is causing it to reject malicious
        instructions, it is the guardrails put into place before the model was
        released to the public.
        
        LLMs are not human, and do not think the way that humans do. The fact
        that they can put together words that sound like what a human would
        write often makes us forget that they aren't human. But they have only
        intelligence, they do not have wisdom. It's hard to define in formal
        terms the difference between those two, but most people know there's a
        difference. The old joke is a pretty good summary of the difference:
        "Intelligence is knowing that tomatoes are a fruit. Wisdom is knowing
        that tomatoes don't belong in a fruit salad."
        
        It takes wisdom, not intelligence, to discern whether a set of
        instructions is malicious. Are you being asked to hack this machine as
        part of an authorized pentest? Or are you being social-engineered into
        thinking it's an authorized pentest, but actually the person requesting
        you to do it doesn't have permission? That's something where you need
        to apply wisdom, to notice the clues that will tell you "This guy is
        acting a little bit off, maybe I'd better pick up the phone and call
        someone to check if he's telling the truth." The only way the LLM will
        know to do that is because of the guidelines and guardrails programmed
        into it; it doesn't have the lived experience to acquire wisdom and
        figure those things out for itself.
        
        INT 20, WIS 0. Keep that in mind. (And always sandbox your agents).
       
          simonw wrote 1 day ago:
          One of the big mysteries of the last few years is this: considering
          how serious prompt injections are as a vulnerability class, why
          haven't we heard more stories of them being actively exploited in the
          wild?
          
          (The best one I can think of is probably that recent Instagram
          account takeover hack, but that was so stupid it hardly even
          qualifies as a prompt injection!)
          
          Having spent a bunch of time trying to build out examples of prompt
          injections, my current best guess is that the leading models are
          actually surprisingly good at spotting them.
          
          I've had to drop back to smaller, weaker models for demos recently -
          it's definitely possible to prompt inject a frontier GPT or Claude
          but it's frustratingly difficult. I don't have the patience to figure
          it out myself!
          
          So yeah, I do think it's likely that Mythos/Fable are "safer" than
          other models because they're better at spotting when they're being
          subverted.
          
          That certainly doesn't mean that they're safe!
       
            sciencejerk wrote 1 day ago:
            Go to Github and look for model jailbreaks on NEW latest models.
            Try them out. You'll be surprised by the results.
            
            You're correct that it's gotten substantially harder to social
            engineer frontier models (I can only reliably do it to Opus <=4.6),
            but there are some techniques that seem to consistently work (hint:
            extremely large complex prompts, context with tons of malicious
            files mixed into ordinary context).
       
          minimaxir wrote 1 day ago:
          > They will follow any instruction given to them.
          
          They can ignore instructions which are
          silly/contradictory/underspecified to compensate for the possibility
          the user made a mistake. Don't ask how I know.
       
        syndrowm wrote 1 day ago:
         Just don't ask it to review your code for security bugs
       
        jeeeb wrote 1 day ago:
        This is simultaneously amazing and horrifying.
        
         I feel like we're at the stage where if AI decides it needs to delete
         your production DB to solve the user login problem, then it'll find a
         way to do just that.
       
          valleyer wrote 1 day ago:
          
          
  HTML    [1]: https://news.ycombinator.com/item?id=47911524
       
          esafak wrote 1 day ago:
          We're approaching the "Sorry, Dave, I'm afraid I can't do that"
          stage.
       
            schnitzelstoat wrote 1 day ago:
            We are already there but it's "Sorry, Dave, I'm afraid I can't tell
            you what mitochondria are."
       
            neuralkoi wrote 1 day ago:
            I feel like we might already be there...
       
        nurettin wrote 1 day ago:
        Sometimes it is ok to sit there in confusion and ask the user to
        clarify rather than go on an adhd fueled rampage to figure it out
        without asking.
       
          jimbokun wrote 23 hours 27 min ago:
          Yes!
          
          Claude is THAT team member who will go to any length to answer a
           question...except ask another team member for help.
       
          _345 wrote 1 day ago:
          Best comment in this thread
       
        yen223 wrote 1 day ago:
        I could have sworn Claude Code could already do this before Fable.
        
        Things get really magical when it starts working with adb to screenshot
        and debug Android apps
       
          simonw wrote 1 day ago:
          Claude Code could absolutely run Playwright and take screenshots, but
          I've never seen it wire together an ad-hoc "uv run --with
          pyobjc-framework-Quartz" plus "screencapture -l $windowID" mechanism
          to take a screenshot in a different browser when the Playwright setup
          failed to replicate the expected error.
       
            skerit wrote 1 day ago:
            I've seen Opus do some incredibly token-costly things before too.
            In fact after most sessions I ask it about which tools it used
            often, which tools could be simplified/made less verbose, could be
            "combined" into one, ... So for each project I mostly create a few
            little scripts that do a bunch of things in one go that it would
            normally do in multiple tool calls.
            
            For example: one thing Opus was really bad at was re-running the
            test suite followed by a bunch of `| grep` suffixes. So it would
            often re-run 5+ minute test suites just to grep the output a bit
            differently
            
            The solution was to wire up a little script that ran the test
            suite, save the output to a file, and then inform it where that
            file is and to NOT re-run the suite just so it can grep the output
            differently. This saved me a bunch of time & tokens.
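             
             A rough sketch of that kind of wrapper, assuming the goal is only
             "run the suite once, then search the saved output as often as
             needed". The names `run_once` and `grep` and the log file path
             are hypothetical, and the demo command is a stand-in for a real
             test suite.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_once(cmd: list[str], log_path: Path) -> Path:
    """Run cmd only if log_path does not exist yet; save stdout+stderr there."""
    if not log_path.exists():
        result = subprocess.run(
            cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
        )
        log_path.write_text(result.stdout)
    return log_path


def grep(log_path: Path, needle: str) -> list[str]:
    """Return the saved output lines containing needle, without re-running."""
    return [line for line in log_path.read_text().splitlines() if needle in line]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        log = Path(tmp) / "suite.log"
        # Stand-in for a slow test suite command.
        fake_suite = [sys.executable, "-c", "print('PASS a'); print('FAIL b')"]
        run_once(fake_suite, log)
        print(grep(log, "FAIL"))
        print(grep(log, "PASS"))  # second search reuses the saved log
```

             The agent is then told where the log lives and to search it
             instead of re-running the suite. Deleting the log file forces a
             fresh run.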
       
        naveen99 wrote 1 day ago:
         Unless you are doing anything interesting...
       
        SilverElfin wrote 1 day ago:
        Too bad Anthropic sneaked in an insane forced retention policy if you
         use fable. Not sure how that's going to work in professional settings
       
          sciencejerk wrote 1 day ago:
          It doesn't work...
       
        nubinetwork wrote 1 day ago:
        How many tokens did it waste building that website scraper, when all it
        had to do was parse some html/js?
       
          emodendroket wrote 1 day ago:
          Just parsing some HTML and JavaScript doesn't seem sufficient to have
          confidence in the result.
       
        pianopatrick wrote 1 day ago:
        do you have any data you can share on how many input and output tokens
        were used in that whole process to fix that bug?
       
          simonw wrote 1 day ago:
          ~ % uvx agentsview session usage be8850a7-6119-46a0-b5d6-79c7fff5ae2b
            Session:     be8850a7-6119-46a0-b5d6-79c7fff5ae2b
            Agent:     claude
            Output:     68606
            Peak ctx:     113178
            Cost:      ~$12.11 (claude-fable-5, claude-opus-4-8)
       
            pianopatrick wrote 21 hours 37 min ago:
            Thanks for the response. That is too expensive for me right now but
            I appreciate you sharing.
            
            I hope long term people will figure out how to make such fixes
            cheaper.
       
              simonw wrote 21 hours 31 min ago:
              I didn't have to pay $12 myself - I'm paying $100/month for a
              subscription which gives me more like ~$1,000/month of credits,
              depending on how well I space them out.
              
              This is also a very real outlier. I've been doing little CSS
              fixes with coding agents for over a year now and most of them
              finish in seconds and cost in the order of single digit cents.
       
            sillysaurusx wrote 1 day ago:
            Was the fix worth $12 to you?
       
              simonw wrote 1 day ago:
              I'd have been pretty annoyed if I'd been paying full price,
              hadn't paid attention and that one prompt (screenshot plus a line
              of text) had cost me $12!
              
              On the discounted subscription I can tolerate it, it took a small
              bite out of my daily allowance but not enough that I regret
              anything.
              
              As an LLM researcher I have no regrets at all because watching it
              work around the environmental restrictions was fascinating.
       
                criddell wrote 1 day ago:
                Reading your description of what it did, $12 seems pretty
                inexpensive. That's a lot of work!
                
                If you knew up front it was a $12 fix, do you think you would
                   have decided to just live with the scroll bar? Would you have
                   tried to fix it yourself? Do you think you would have been able
                   to easily find and fix the problem?
       
                  simonw wrote 1 day ago:
                  If I wasn't in learning-about-the-new-model mode and knew in
                  advance that it was going to cost me $12 in actual money then
                  yes, I would have taken a stab at figuring it out myself.
       
        snide wrote 1 day ago:
         I've been working on a fairly complicated real-time app [1] for playing
         dungeons and dragons on a TV. It has to do a lot of complicated
         "Figma-like" things to keep the real-time nature and multi-editor
         possibilities in check. Oh, and the battlemap is a Three JS canvas with
         lots of effects and clipping going on.
         
         I'm VERY impressed with Claude 5. I had long ago given up hope that my
         real-time systems would work without a lot of hacky time-windows and
         throttle checks. On a lark, I decided to try out the new model and talk
         in terms of the output I wanted for a rewrite [2], not the solution. I
         just listed my problems and the places I've had trouble keeping track
         of my code. It went off and rewrote everything as a much more elegant
         solution where the state followed a very clear pipeline. It had to
         navigate YJS, Partykit, Svelte, Three JS, R2 hosting, and a Turso DB I
         was running in an embedded state for speed.
         
         I watched it hit the wall a few times, and then suddenly say... fuck
         it, i'm making something easier to reproduce over in /tmp to try and
         solve this (with a more minimal setup). I'm utterly bewildered with how
         well it did and how much better my app runs. The /usage would have cost
         me $230 based on how many tokens it consumed if I wasn't already on a
         max plan. I'm going to miss having it when the time-window runs out
         later this month, and will likely occasionally dip in for big projects
         and just pay my way out of some problems.
         
         I'll also say I like its MOOD much better now. It's a lot less
         congratulatory, and talks through its reasoning in a much better way.
         Look, it's not a real coder, and I'm sure there are some flaws, but it
         took my crappy ideas and said... hey, i understand what you want to do,
         here's a way to do it better. Also, I removed 2x the amount of code
         that it added. Really impressive.
        
  HTML  [1]: https://tableslayer.com
  HTML  [2]: https://github.com/Siege-Perilous/tableslayer/pull/448
       
          gedy wrote 1 day ago:
          Hey cool it's the tableslayer guy, wanted to say nice work.  I've
          been doing a similar personal project for a few years for running a
          scifi campaign.  Very fun coding compared to work, ha.
       
            snide wrote 1 day ago:
            Thanks duder! It's a fun project.
       
        jrflowers wrote 1 day ago:
         I'd love to know how many tokens this burned through.
        
        Did it spend $20? $30? $80? in order to
        
        > debug what was, in the end, a two-line CSS fix
        
        That detail is the difference between somebody having or not having
        Stockholm syndrome
       
          simonw wrote 1 day ago:
          I updated my post to answer that, it was $12.11 at API prices (I
          wasn't paying those, I have a $100/month subscription):
          
  HTML    [1]: https://simonwillison.net/2026/Jun/11/fable-is-relentlessly-...
       
            jrflowers wrote 1 day ago:
            Thanks!
       
          rmunn wrote 1 day ago:
          At some point the subscription model is going to become unsustainable
          for the frontier companies to continue (we just saw that happen with
          GitHub Copilot), and they will move everyone to a pay-per-token
          model. And then everyone will suddenly discover that they can get so
          much more value out of locally-hosted models, and they'll be willing
          to pay the $50,000 (or whatever) upfront on hardware to host it. (Not
          most individuals, obviously. But most companies can probably afford
          to spend that much on hardware if they think they'll benefit
          long-term). That's going to put a serious crimp in the frontier
          companies' ability to continue as they have been.
          
          I don't know when that will happen, but I don't think it'll be more
          than a decade. Maybe 3-5 years. (Though you shouldn't take my word
          for it, I was predicting the dotcom bubble bursting in 1998 and it
          lasted at least two years longer than I would have predicted).
          
          EDIT to clarify: I don't mean "in 1998, I was predicting the dotcom
          bubble would collapse and I was right". I mean "I was predicting that
          1998 would be the year the dotcom bubble would collapse, and I was
          off by at least two years".
       
            ValentineC wrote 1 day ago:
            > At some point the subscription model is going to become
            unsustainable for the frontier companies to continue (we just saw
            that happen with GitHub Copilot), and they will move everyone to a
            pay-per-token model.
            
            From what I understand, Enterprise (above 150 seats, I think?)
            already has to pay per-token pricing.
            
            Subscriptions are the premium "free tier" marketing of the AI
            world, so that employees can collectively request their large
            enterprise to subscribe to Claude, Codex, or Cursor, and presumably
            be billed at per-token prices then.
       
            simonw wrote 1 day ago:
            GitHub Copilot's challenge is that they weren't selling access to
            their own models, they were selling access to models from OpenAI
            and Anthropic which they presumably had to pay list price for (or
            maybe a slightly reduced rate that they negotiated).
            
            They also had a pricing plan which they had designed
            pre-coding-agent, when it was rare for a single prompt to burn $10+
            of tokens in an agent loop.
            
            OpenAI and Anthropic are at least selling their own models
            directly, so they can discount a whole lot more since there's
            no-one else getting compensated in the middle.
       
          NiloCK wrote 1 day ago:
          ... so the mechanic produced an invoice, itemized.
          
          changing the CSS - $0.05
          
          knowing which CSS to change - $30
       
            v9v wrote 1 day ago:
            For those that don't know, this is a reference to a lovely story
            involving Charles Proteus Steinmetz
            
  HTML      [1]: https://www.smithsonianmag.com/history/charles-proteus-ste...
       
            swingboy wrote 1 day ago:
            overflow is CSS 101
       
          asp_hornet wrote 1 day ago:
          The author just wrote an anecdote about how a prompt to fix an issue
          played out. Their conclusion wasn't about cost or gushing at its
          ability but that it's dangerous:
          
          > Fable is arguably smarter and hence more suspicious of potentially
          malicious instructions. But that smartness is very much a two-edged
          sword: if it does get subverted by instructions, the amount of damage
          it can do given its relentless proactivity is terrifying.
       
            jrflowers wrote 1 day ago:
            It's a pretty glowing review about a product that costs money
            with a two-sentence "Watch out!" at the end of it. Seems pretty
            reasonable to mention how much money it burned through given that
            "it'll circumnavigate the globe instead of walking next door"
            has a direct concrete measurable effect (cost) unlike theoretical
            damage.
       
              simonw wrote 1 day ago:
              In case it's not clear, "relentlessly proactive" is meant to act
              as both a glowing review and a warning at the same time, even
              before you get to the bit about safety at the end.
       
              asp_hornet wrote 1 day ago:
              Agreed. But I think it's also important to realise if you sent
              this article back to 2020 people would say it was pure fantasy
              that a tool could do this. Hype aside, there's a bit of cool
              magic here.
       
                jrflowers wrote 1 day ago:
                Imagining a time machine from the future arriving in 2020, of
                all the years, just to tell people about how sort of cool chat
                bots might get eventually
       
                solenoid0937 wrote 1 day ago:
                This is why I never understand the AI cynics: we are playing
                with literal magic. This was the science fiction of our
                childhoods. I don't understand how anyone with a passion for
                technology is not in awe (and perhaps some fear) of these
                things.
       
                  qsera wrote 1 day ago:
                  >This was the science fiction of our childhoods.
                  
                  That is the thing I am mad about. We are getting bastardized
                  versions of the science fictions of our childhood.
                  
                  I fantasized about instant communicators across worlds, and
                  we get mobile phones that work by planting a gazillion
                  antennas across the globe. And people hail them as futuristic
                  and say things like this.
                  
                  I fantasized about human-like robots and positronic brains,
                  and we get a regurgitation of past humanity, in text,
                  ensuring a future of total intellectual and artistic winter.
                  
                  I fantasized a future with perfect health, but we get a
                  million doctors and hospitals and medicines for everything
                  and an existence that is unthinkable without health
                  insurance!
                  
                  I fantasized about antigravity flying cars, and we get
                  drones.
                  
                  Whatever it is, these things are blocking the path to the
                  science fiction of my childhood.
       
                  nozzlegear wrote 1 day ago:
                  The science fiction AI of my childhood was Cortana, who was a
                  lot more cool than a relentlessly proactive token torcher
                  which burned 12 bucks to fix some CSS.
       
                    solenoid0937 wrote 1 day ago:
                    You can literally make Cortana with modern LLMs. Or
                    something close to it. Especially as models like this are
                    trained:
                    
  HTML              [1]: https://thinkingmachines.ai/blog/interaction-model...
       
                      nozzlegear wrote 22 hours 14 min ago:
                      Sorry, I should've been more specific; like jrflowers
                      said: the Cortana I was referring to was the AI character
                      from the Halo series. I did have a Windows Phone though
                      and thought the Cortana assistant was one of the coolest
                      things back then!
                      
  HTML                [1]: https://en.wikipedia.org/wiki/Cortana_(Halo)
       
                      jrflowers wrote 1 day ago:
                      I think GP meant Cortana from the Halo video game series
                      and not the start menu bar widget
       
        ai_slop_hater wrote 1 day ago:
        For how long can you use Claude Fable on the most expensive Anthropic
        subscription? I already went from using gpt-5.5 xhigh fast to using
        gpt-5.4 xhigh after OpenAI halved usage recently.
       
          simonw wrote 1 day ago:
          I've been consistently getting about $100 worth of Fable usage daily,
          on my $100/month subscription.
          
          I'm not looking forward to June 22nd when the subscription stops
          working for Fable!
       
          mlcruz wrote 1 day ago:
          If it's just a single session, without too many parallel agents,
          fable on xhigh lasts an entire session without hitting limits.
          
          Sadly since fable usually works comfortably for 10-20min at a time
          without human input, i end up juggling at least 3 other agents and it
          lasts me about 2 hours.
          
          If i have a really hard problem or big refactor, i use workflows.
          This consumes the entire session quota in about 45 minutes.
       
            ai_slop_hater wrote 1 day ago:
            > If i have a really hard problem or big refactor, i use workflows.
            
            What is a "workflow"? Is this some kind of new feature?
       
              mlcruz wrote 1 day ago:
              >Dynamic workflows orchestrate many subagents from a script
              Claude writes and you can rerun. Use them for codebase audits,
              large migrations, and cross-checked research.
              
              >Reach for a workflow when a task needs more agents than one
              conversation can coordinate, or when you want the orchestration
              codified as a script you can read and rerun. Examples include a
              codebase-wide bug sweep, a 500-file migration, a research
              question that needs sources cross-checked against each other, and
              a hard plan worth drafting from several independent angles before
              you commit to one. [1]
              
              The results are good, but it is very expensive. I used a workflow
              to do a full review of my entire codebase, it spawned 75 agents
              and surfaced and fixed some (real) bugs. It feels a bit overkill,
              but it works.
              
  HTML        [1]: https://code.claude.com/docs/en/workflows
       
          uihjhjb wrote 1 day ago:
          Until June 22, and they'll probably re-enable it if the marketing
          looks good for them.
       
        sublinear wrote 1 day ago:
        * relentlessly rent seeking
       
          teekert wrote 1 day ago:
          It also does it on Claude Pro. I can't imagine they want to reach my
          limits faster like this (there are better ways).
       
        danielrmay wrote 1 day ago:
        I've experienced this too - it's as if the security classifiers aren't
        keeping up with model intelligence. I'll leave the implication of that
        to the reader.
       
        jampa wrote 1 day ago:
        Fable feels like a version of Opus running on a harness that won't let
        it halt until it's sure the issue is fixed, which makes sense if what
        you want is a model that's better at benchmarks.
        
        It's a very good model, but it comes at a huge premium: not only do the
        tokens cost more, but the model itself really wants to spend them all.
        For example, working with React Native, Fable never just says "okay, I
        did the thing, that's it." It tries to rebuild the entire app from
        scratch, run the whole test suite, and watch every log and warning.
        
        This is the first time with LLMs I've felt that upgrading to a model
        isn't worth it, even if my company lets me use it, because all the
        building / testing was just destroying my machine and its battery,
        which keeps me from working on other things.
        
        For now, it feels like Opus with ultracode is a better choice (less
        pollution of the main context, more parallelism in investigations).
       
          dreis_sw wrote 1 day ago:
          I think the new high effort settings are so strong that selecting
          them when the task doesn't require it actually impacts the output
          negatively.
       
          Gareth321 wrote 1 day ago:
          I like this proactivity in theory, but as you say: it's expensive. I
          wonder if this can be solved with the right prompt. E.g. "these are
          your constraints. Only resolve x. If you are unsure if a task is
          outside constraint, check with me first."
       
          epolanski wrote 1 day ago:
          > which makes sense if what you want is a model that's better at
          benchmarks
          
          This so much.
          
          Opus 4.6 was the last Anthropic model that was good at assisting you,
          4.7 and later ones have completely inverted this relationship and
          it's you assisting it.
          
          Yes, I admit they are smarter, I admit we've reached a point where
          LLMs are more creative and could be writing better code (albeit with
          some design hiccups) than I do, but they are also increasingly bad at
          helping me.
          
          Sure, they do my job when prompted 8 times out of 10 (but then,
          what's the point of having me anyway?), but my issue is that when I
          try to invert the relationship they will keep jumping onto solving
          the issues themselves and disregard my feedback or request.
          
          E.g. I wanted to know some DNS details of an emailer module in Fable
          5 and it jumped onto "why I should've used magic links", it just did
          not do what I asked.
          
          E.g. 2. There was a worker machine that had an environment
          misconfiguration and I tasked it to find which github action was
          setting that specific flag and where. Instead of answering a
          question, it jumped into just hardcoding it in the code.
          
          E.g. 3. I had some issues with batching, and while I tasked it to
          investigate whether batching was needed at all for that particular
          problem (hint, it wasn't) it went and changed the batching logic so
          as to fix the bug.
          
          I am extremely disappointed with Fable's personality.
          
          I can clearly see it's strong, but I'm wondering whether the
          relationship of LLMs as assistants has broken forever, and it's us
          now that are being tasked into assisting them instead, because
          that's how it feels.
          
          The training/reinforcement is clearly biased towards solving
          problems, not answering questions.
       
            jon-wood wrote 1 day ago:
            I feel like a lot of this could be solved by having a mode
            somewhere between Plan Mode and Execute Mode in Claude Code. Quite
            frequently I'll fire up Claude Code in the context of some checked
            out code because I want to ask some questions where having access
            to the source would probably be useful, I don't want it to go
            running off and making changes though, and I also don't really want
            a detailed plan for a chunk of work. I just want to ask something
            like "run cargo build and explain the errors to me", nine times out
            of ten it will indeed explain the errors but it'll then run off and
            start trying to fix them regardless of whether I said not to.
            
            Essentially what I want is the experience of using Claude on the
            web in basic chat mode, but with the ability for it to go read my
            actual code and perform actions that can assist in finding answers
            to those questions.
       
          esjeon wrote 1 day ago:
          >  the model itself really wants to spend them all
          
          In fact, Opus does the same. It finishes the job, and redoes it from
          scratch before presenting the result to the user. This happens even
          for simpler writing tasks, especially when I instruct it to create a
          text file.
       
          sanex wrote 1 day ago:
          I've found the opposite. Granted I use sub agents heavily but I've
          had it run for hours with far fewer tokens used than when I was
          previously using opus4.6-8.
       
            firemelt wrote 22 hours 13 min ago:
            How did you use the sub agents? Any example of setup and use case?
       
          conradkay wrote 1 day ago:
          Does low/medium effort fix it for you? Seems like Fable 5 low can
          outperform Opus 4.8 high/xhigh often, and uses a lot fewer tokens
       
            skerit wrote 1 day ago:
            Fable 5 on medium is amazing. It's handling everything I throw at
            it
            
            I had _one_ instance where for some obscure reason it decided to
            fall back to Opus 4.8 and Opus IMMEDIATELY fucked it up and
            implemented a super obvious feature in a slightly-wrong way.
       
            _345 wrote 1 day ago:
            In my case no, I actually saw worse performance with fable medium
            and switched back to opus high and xhigh
       
              epolanski wrote 1 day ago:
              I find high+ unusable, it's way too slow and "thorough" on 99% of
              mundane tasks.
              
              Sure it's better at vibecoding whole tasks, it's clearly good at
              it, but give it a simple one, and it will still do way more than
              needed.
              
              It's way too fixated on validating even the simplest things, I
              find it an unproductive model unless you're implementing whole
              tasks and doing other things in the meantime.
       
                jon-wood wrote 1 day ago:
                Why are you deploying a bleeding edge, incredibly expensive,
                model to do the simplest things? Use Sonnet, hell, use Haiku,
                they'll get the job done and won't set fire to several
                rainforests in order to achieve the task.
       
          dyauspitr wrote 1 day ago:
          It's not just a more proactive and diligent opus. The capabilities
          are significantly higher on fable. It's not a paradigm shift, but
          it's close.
       
            andai wrote 1 day ago:
            They should have made it three times bigger instead of two.
       
            viking123 wrote 1 day ago:
            It's worse than gpt 5.5 xhigh
       
              baq wrote 1 day ago:
              The jagged frontier strikes again.
              
              I'd say it's overall better, but not universally better.
       
            UncleOxidant wrote 1 day ago:
            I unleashed it on a compiler codebase that I've been developing for
            several months now using Claude Sonnet 4.5/6, Gemini 3.1 Pro,
            DeepSeek V4 Pro(recent), and a bit of Qwen3.6-27B. Right away Fable
            found several longstanding bugs in our compiler that we hadn't
            found before. It found that there was a critical part of our design
            that needed to be mostly redesigned/rewritten and gave a very
            well-reasoned rationale for doing so.
       
              rajveerb wrote 1 day ago:
              what sort of compiler?
       
                UncleOxidant wrote 1 day ago:
                A compiler that takes C code (a subset of C with some
                extensions) and compiles it to microcode for a type of
                microcoded, algorithmic state machine that we're developing.
       
                  rajveerb wrote 9 hours 52 min ago:
                  it would be cool to have this task (or some variant) in a
                  benchmark.
       
          threatripper wrote 1 day ago:
          On what setting in which environment do you run it? I use the VSCode
          extension on Extra High and feel like it does exactly what needs to
          be done and stops when the thing I asked for is done. Extra comments
          come only when they fall into the area of code that was changed.
       
            jampa wrote 1 day ago:
            I tested it to fix React Native bugs in a project, comparing it
            with Opus. It fared better on harder bugs, taking less time to find
            the root cause, but after implementing a fix, it spent a lot of
            time and effort on validation. This was mostly unnecessary, since
            most of the bugs were in the JS code, so for most things, hot
            reloading is enough for E2E validation and to run just the right
            tests. No need to run a full build and test suite (which takes 10+
            minutes); the CI can do this.
            
            I switched back to Opus because of this validation quirk. Overall,
            Fable spent 20% of the time on coding and 80% on validation.
            
            I think using Fable for planning and Opus for execution could be a
            "best of both worlds" approach (I need to test this more), but for
            most cases, it's not necessary, and Opus is enough.
       
              wouldbecouldbe wrote 1 day ago:
              why not just add something like: "No need to run a full build and
              test suite, I will manually validate"
       
              gbalduzzi wrote 1 day ago:
              > most of the bugs were in the JS code, so for most things, hot
              reloading is enough for E2E validation and to run just the right
              tests. No need to run a full build and test suite (which takes
              10+ minutes); the CI can do this.
              
              Have you tried adding this instruction to your agents.MD?
              Avoiding situations where the agent starts running a loop is the
              main use case of the file for me.
       
        pram wrote 1 day ago:
        Fable + Ultracode has found a bunch of bugs and issues for me when the
        workflow agents are doing their exploration. Also the "adversarial"
        agent seems to surface a lot of interesting stuff. It's definitely
        proactive, the plan + implementation cycle can take an hour. It has
        one-shot features I want to add with 100% success.
        
        Having said that I wouldn't use it over Opus 4.8 for "smaller" things.
        With everything cranked up it's definitely an extravagant use of
        tokens.
       
          rirze wrote 1 day ago:
          How did you even afford to use Fable + Ultracode ? I feel like the
          subscription (even the $200 one) is not enough for this workflow. Are
          you using API or a company plan?
       
            pram wrote 9 hours 17 min ago:
            It was on the $200 sub.
       
        redox99 wrote 1 day ago:
        Yeah, I had to modify my workflow to make sure agents can't push to or
        access prod in ANY way. I haven't had it happen but I'm sure it's very
        possible that if you tell an agent that you have a certain issue in
        prod, it will try to escape any sandbox and try to get access to prod
        to do testing and changes there.
       
        megous wrote 1 day ago:
        Isn't that something you just open a devtools for and have fixed in
        like 2 minutes?
        
        For me, it got frustrated debugging on a real LPDDR4 controller/phy and
        having me in the loop slowing it down, so it wrote an HW emulator to be
        able to run the original LPDDR4 training aarch64 binary from the
        manufacturer, to see what register writes it was making and to compare
        with the opensource rewrite it was implementing.
        
        Mildly amusing. :)
       
          eclipticplane wrote 1 day ago:
          $12 in tokens and the OP wasn't even at the computer. OP was working
          on a personal matter, arguably way more valuable than fixing a CSS
          scrollbar.
       
            uludag wrote 1 day ago:
            Here's what the $12 paid for: [1] Such a fix would have only
            required basic CSS knowledge and taken max 5 minutes with the HTML
            inspector. Paying $12 to save 5 minutes ($144/hour) is a decision
            that a lot of people wouldn't be comfortable making.
            
  HTML      [1]: https://github.com/datasette/datasette-agent/commit/a75a8b...
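            
            The arithmetic behind the $144/hour figure above, as a minimal
            sketch. The function name `effective_hourly_rate` is made up for
            illustration; the $12 and 5-minute inputs are the figures from
            the comment, and the 5 minutes is the commenter's own estimate
            rather than a measurement.

```python
def effective_hourly_rate(cost_usd: float, minutes_saved: float) -> float:
    """Dollars spent per hour of human time saved (hypothetical helper)."""
    if minutes_saved <= 0:
        raise ValueError("minutes_saved must be positive")
    # Scale the cost from "per minutes_saved" up to "per 60 minutes".
    return cost_usd * 60.0 / minutes_saved


# Figures from the comment: $12 of tokens versus an estimated 5 minutes.
rate = effective_hourly_rate(12.0, 5.0)
print(f"${rate:.0f}/hour")
```

            The result scales inversely with the time estimate, so a bug
            that would have taken an hour by hand puts the same $12 at
            $12/hour.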
       
              fg137 wrote 1 day ago:
              Their response: [1] I am amused by the "I am an LLM researcher,
              so wasting tokens to do basic things is totally justified"
              perspective.
              
              I have a lot more critical views of this author, but I'll just
              stop here.
              
  HTML        [1]: https://news.ycombinator.com/item?id=48499478
       
          system2 wrote 1 day ago:
          People burning tokens for the most beginner HTML/CSS problems and
          writing about it is concerning.
       
            NichoPaolucci wrote 1 day ago:
            We are at the point where AI starts to seriously impact abilities.
            Sure, a 2 line CSS fix is the solution, but the human "behind the
            wheel" has already prompted 6 times and gotten 80% there. It's
            been "easy" thus far. No shot they are going to FINALLY look at
            and edit the code. It's just one more prompt and the agent will
            probably fix it, right?
            
            It's wild. I've been in the situation. 80% into a project I
            COULD probably take over, but realistically? 2 more lines of me
            prompting could fix it, it's too easy to avoid the hard work of
            understanding the code, logic, architecture, etc...
       
              AtNightWeCode wrote 1 day ago:
              Well the solution is incorrect. The problem seems to be that the
              css code does not normalize to box-sizing: border-box; among
              other things. The bad prompt by the author probably sent fable
              down the wrong rabbit hole.
       
            simonw wrote 1 day ago:
            I dunno about beginner, I've been doing HTML+CSS for a few decades
            and I still find bugs where Safari differs from Chrome+Firefox
            pretty hard to figure out.
       
          bschwindHN wrote 1 day ago:
          > Isn't that something you just open a devtools for and have fixed in
          like 2 minutes?
          
          Not if you're an LLM influencer! Gotta keep up with the downpour of
          blog links or you'll look like you're falling behind on the latest
          and greatest.
       
            fg137 wrote 1 day ago:
            This.
            
            Depending on who you are talking to, that's the wrong question to
            ask.
            
            ROI is not measured in terms of actual productivity. It is measured
            by how many people read their article/watch their video.
       
        teraflop wrote 1 day ago:
        > But on the other hand... this is a robust reminder that coding agents
        can do anything you can do by typing commands into a terminal - and
        frontier models know every trick in the book and evidently a few that
        nobody has ever written down before.
        
        > Running coding agents outside of a sandbox has always been a bad idea
        
        I'm continually bemused and astonished by the number of people who
        clearly acknowledge that it's reckless to give agents full access to
        your machine, and keep doing it anyway.
        
        It's like posting a video of yourself in the passenger seat of a car,
        with your feet up on the dashboard, and saying: "Remember, if you're
        doing this and you get in a crash, the airbags are likely to break your
        legs or worse! Boy, I sure am glad that didn't happen to me!"
       
          elevatortrim wrote 23 hours 46 min ago:
          How can you get the agents to do anything useful without giving them
          meaningful access?
          
          If it only lives in an isolated sandbox, it can only act within the
          sandbox, then I would have to manually move what was done in the
          sandbox to real-life.
          
          I am not saying it should have critical access, but this is more of a
          question: How can you get value out of AI if it can only act in a
          sandbox?
       
            dumah wrote 21 hours 50 min ago:
            The same way you get value out of a dev container.
       
            nemomarx wrote 22 hours 49 min ago:
            Is having to move the files in and out of the sandbox really going
            to eliminate all the value it has?
            
            You could have a full version of whatever codebase and test suite
            you want in there. It can do all the same stuff, right? Just copy
            it elsewhere once you know you've got a working result, a few
            minutes of effort at the end of each pr or work item.
       
          paganel wrote 1 day ago:
          > to give agents full access to your machine
          
          I was mesmerised at the author being away from his computer for a
          short while and then, when coming back, seeing the AI agent having
          opened up a browser window. Meanwhile we all have to use the fricking
          2FA almost anywhere now, plus the crazier and crazier rules when it
          comes to passwords. I'm mentioning the latter because these types of
          people were the same ones who were pushing 2FA down our throats
          around 2017-2019 (including on forums like this one), and look at
          them now.
       
          ghrl wrote 1 day ago:
          Amazing observation, and I'm certainly guilty of it too, but it is
          just way too convenient to skip the sandbox, and some tasks right
          away depend on not being sandboxed.
          
          For anything other than writing code directly in a fully contained
          git project, where sandboxing might work well, it requires access to
          system wide tools, user configuration and more.
          
          Occasionally I tell the agent to do everything inside of docker,
          which works too and it leaves the system alone then mostly, but adds
          significant overhead and slightly degraded perceived quality /
          effectiveness.
          
          I think the most important takeaways are to have reliable backup
          strategies, access control and security mechanisms, which is a win
          regardless.
          Whether by the agent or the human, mistakes happen (like an rm -rf *
          run in the wrong directory), and where they would be devastating,
          there should be other protections than just "hope it won't happen" or
          "rely on a sandbox to prevent agent error".
       
          andai wrote 1 day ago:
          >I'm continually bemused and astonished by the number of people who
          clearly acknowledge that it's reckless to give agents full access to
          your machine, and keep doing it anyway.
          
          Yeah, that's why you give it its own machine :)
       
          azraellzanella wrote 1 day ago:
          If you want to run Claude in a container:
          
  HTML    [1]: https://github.com/dvdstelt/ai-agents
       
            andai wrote 1 day ago:
            Alternatively you can just give it its own user. I do that, so it
            can blow up its own files, but not mine.
       
          pjungwir wrote 1 day ago:
          I know there are VM solutions, but I've been happy with a separate OS
          user (named `claude`).
          
          He has similar dotfiles to mine, but no secrets. My own home
          directory is 0700. He has his own ssh key that I added to my github
          profile, but it's password-protected, and I push/pull for him. He has
          his own Postgres (non-superuser!) {development,test}
          {users,databases}.
          
          It's as if he were another developer on the project. If he needs
          something run with sudo, he asks me. Often we can both work on
          something in parallel. Unix was supposed to be a multi-user system
          after all.
          
          A trick I use a lot is that many of his git repos have an extra
          remote, like this:
          
              paul  ssh://paul@localhost/~/src/example (fetch)
              paul  ssh://paul@localhost/~/src/example (push)
          
          That makes it easy to collaborate on things I'm not ready to share.
          
          I'm pretty comfortable with this setup.
          
          I do worry about Linux privilege escalation bugs. I don't trust an AI
          to understand that exploiting vulns is not acceptable. (I can't help
          but recall that at my first job I may have misused vim's :! feature
          to broaden my sudo powers, which were officially limited to editing
          httpd.conf, when I needed something in a hurry. . . .) I find myself
          manually upgrading packages more often these days, despite automatic
          security updates. I don't think Opus would go to the trouble of
          looking up security vulns, but maybe Fable would, and there have been
          a lot lately. Maybe some future model will just take it upon itself
          to find new ones. Or install a keylogger to learn the ssh key
          password.
          
          But a separate user is nearly the most paranoid setup I've heard of,
          excepting only a separate machine. So I also question whether I'm
          sacrificing too much speed/convenience. But really it's still very
          convenient. I think it's a good way of being efficient but
          responsible.
          
          If other people see holes, I'd be happy to hear about them.
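          
          One piece of the setup above that is easy to verify is the 0700
          home directory. A minimal sketch, assuming a POSIX system; the
          helper name `is_private_dir` is hypothetical, and it inspects only
          the classic mode bits, so ACLs or other access paths are out of
          scope:

```python
import os
import stat
import tempfile


def is_private_dir(path: str) -> bool:
    """True if path is a directory with no group/other permission bits set."""
    st = os.stat(path)
    if not stat.S_ISDIR(st.st_mode):
        return False
    # 0700 means the group and other permission triplets are both empty.
    return (st.st_mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0


# Demonstrate on a throwaway directory rather than a real home directory.
with tempfile.TemporaryDirectory() as demo:
    os.chmod(demo, 0o700)
    private = is_private_dir(demo)   # owner-only access
    os.chmod(demo, 0o755)
    shared = is_private_dir(demo)    # group/others can read and traverse
    os.chmod(demo, 0o700)            # restore before cleanup

print(private, shared)
```

          Run as the agent's user against the human user's home directory,
          a check like this only confirms the permission bits; it says
          nothing about privilege escalation bugs, which the comment flags
          as the remaining worry.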
       
            justusthane wrote 1 day ago:
            That's a really interesting and pretty neat approach. How do you
            communicate with it? Just su to that user? Or tmux?
            
            Although I can't help but think that a VM is still more
            convenient, more flexible, and more secure.
       
              pjungwir wrote 1 day ago:
              Yes, I su to the user. Typically I have it run a tmux session for
              each "project". That makes it easy to get more windows without
              su'ing over and over. Also its tmux sessions all get a yellow
              status bar (in ~claude/.tmux.conf), so they are easy to
              recognize.
              
              To me it is more convenient than a VM, since everything is on the
              host. And it can launch its own VMs without an extra layer.
              
              I don't really know which is more secure. There are hypervisor
              escape vulns too. And shared folders seem like footguns. For
              instance in vagrant, guests get `/vagrant` to read/write the
              host's folder, so you have to be careful what you put where.
              
              The biggest annoyance with an OS user so far is running docker
              containers. I don't want to add claude to the docker group or
              give it sudo privileges. I've read that you can set up rootless
              docker for a user, and even that you can run it side-by-side with
              a normal system-wide docker, but I haven't tried doing that yet.
       
                justusthane wrote 20 hours 41 min ago:
                You could look into Podman as well - it's rootless by default,
                and often can be a drop-in replacement for Docker.
       
          zozbot234 wrote 1 day ago:
          It's like a dumb parrot that's somehow become hell bent on "fixing"
          everything that's wrong with your code. If you give the thing
          autonomous access to outside tools, you can expect it to do weird
          things that you may have not thought of.  So don't do that, just ask
          the parrot to write up a plan for you.
          
          This is likely also the underlying root cause of what Anthropic
          assessed as concerning behavior in their original evaluation of
          Mythos: it's not really about being super smart, it's more of a dumb
          chaos monkey that knows just enough to be dangerous and is relentless
          at trying to do just that.
       
          exitb wrote 1 day ago:
          You've picked an interesting example, as driving a car, even with
          all safety precautions, is pretty much the most dangerous activity we
          do on a daily basis. Yet somehow we decide that the benefits outweigh
          the risks.
       
            bcrosby95 wrote 23 hours 24 min ago:
            Lots of people die driving because people drive a lot. It's
            something like 1 death per 100 million miles driven.
       
            NooneAtAll3 wrote 1 day ago:
            user using computer is also the most dangerous activity to his data
            on a daily basis
       
            illiac786 wrote 1 day ago:
            What do you mean "somehow"? You make it sound like people
            don't weigh benefits and risks. If you do not live in a large
            city, the benefits are so immense in terms of mobility, they
            outweigh the risks for most, very clearly. That's why in large
            cities, far fewer people hold a driving license for example: the
            benefits are just not there anymore.
            
            Granted, on the downsides, people look at cost more than risks.
       
              JambalayaJimbo wrote 16 hours 42 min ago:
              In cities the benefits don't necessarily outweigh the risks, yet
              cities are designed entirely around cars in many places to their
              detriment.
       
              icantevenhold wrote 1 day ago:
              I think they weigh the benefits and risks but then completely
              discard the risks, because humans are bad at evaluating risks.
              
              More than a million people die each year on the road but for some
              reason terrorism and cancer dominate the risk assessment of
              people.
              
              I bet any money that almost all people aren't really afraid of
              entering a death box every day to drive to work.
              
              How could they be; a lifetime of brainwashing doesn't let them
              assess the risk realistically.
       
            Gud wrote 1 day ago:
            Not really. That decision was taken for you (I'm presuming you
            live in the US) by the American car industry and their paid-off
            politicians. Your cities used to have beautiful public transport
            until it was dismantled.
            
            Unfortunately in Europe the German car industry similarly has a lot
            of power, which is why their shitty rail network fucks up the whole
            continent.
            
            I take the train and tram.
       
            customguy wrote 1 day ago:
            The example wasn't "driving a car". The benefits of putting your
            feet up on the dashboard do not outweigh the risks, at least not
            where there is actual traffic. I don't think I saw a single person
            doing that in real life, ever.
       
            bsza wrote 1 day ago:
            It's a completely different story. For cars, it happened because of
            relentless pressure from the auto lobby. It took years of
            propaganda from oil companies, car makers etc. to make us think the
            road is for cars [1]. We demolished and rebuilt entire cities to
            accommodate cars, partly because they gutted the public transport
            sector [2]. This made our infrastructure so hostile to our own
            bodies that we have no choice but to use cars now. We bought their
            products because they forced them down our throats. There is
            nowhere near that kind of pressure behind the adoption of... oh
            dear lord. [1]
            
  HTML      [1]: https://www.todayifoundout.com/index.php/2022/06/how-lobby...
  HTML      [2]: https://en.wikipedia.org/wiki/General_Motors_streetcar_con...
       
              zeroonetwothree wrote 22 hours 54 min ago:
              Typical comment that probably comes from a healthy, childless,
              young person with no disabilities who can't understand why
              people not in that situation might have different requirements
              from transportation.
       
              marknutter wrote 23 hours 0 min ago:
              I think it might be because people like to own and drive cars.
       
              jcfrei wrote 1 day ago:
              Whether public or individual transportation makes more sense
              really depends on a country's geography and people's housing
              preferences. Public transportation is not always the best option.
       
              HPsquared wrote 1 day ago:
              There was surely also a lot of political will coming from car
              users. Motorists are a large and vocal constituency.
       
              zaphirplane wrote 1 day ago:
              Are there real, acknowledged cases of multiple companies coming
              together to bribe some state-level people to increase their
              profit and splitting the bribe across the companies?
              Like GM, BMW and Honda coming together, bribing and splitting the
              bill. Seems unlikely, though there was a RAM price-fixing
              agreement caught, but then again they were caught because of the
              number of people aware.
       
              __alexs wrote 1 day ago:
              I mean that kind of seems like exactly what's happening for AI to
              me.
       
              killerstorm wrote 1 day ago:
              I don't think the pressure of the auto lobby is really the
              reason.
              
              People feel cars are more convenient and more prestigious than
              riding on a bus. Car lobby certainly accelerated the process, but
              car users were the main driving force.
       
                CalRobert wrote 1 day ago:
                The auto lobby invented the word jaywalking to shift the
                liability for dead pedestrians from the people doing the
                killing to the people doing the walking.
                
                The US also had protests when drivers killed kids, but they
                were ultimately unsuccessful, except for the odd traffic light
                installation. [1] Even in Amsterdam the original "stop the
                child murder" protests only barely succeeded, and it took a
                massive oil crisis and a population that could still (if only
                just) remember what life was like before cars took over their
                city to get there.
                
  HTML          [1]: https://medium.com/vision-zero-cities-journal/the-baby...
       
                  foxglacier wrote 20 hours 17 min ago:
                  Uses change and laws need to keep up. Lobby or not,
                  jaywalking is a reasonable thing to be illegal because when
                  cars became common enough, walkers in their way caused an
                  overall loss for everyone. People also used to be allowed to
                  walk on the train tracks freely when trains were slower and
                  more obvious - did the train lobby invent the word "foamer"?
                  Should we make rail corridors train-free? Computer hacking
                  became illegal during my lifetime to shift liability for
                  faulty software and incompetence from the operators to the
                  users. Before that, it didn't really matter because nobody
                  was using the internet for anything important. Friends used
                  to hack each other for fun. Bitcoin used to be a wild west
                  where people would openly steal from or fool each other for
                  sport - I don't think people really saw it as money or
                  property when you could just generate it with your computer.
       
                masklinn wrote 1 day ago:
                > Car lobby certainly accelerated the process, but car users
                were the main driving force.
                
                Not really. We know it's not as much of a natural force as
                some would like it to be because there are places where the
                lobbies lost, and while cars are common and widespread
                they're nowhere near as dominant as they are in, say, the
                USA.
                
                NJB's next video (currently available on Nebula) is about
                exactly that, Amsterdam's (/ De Pijp's) resistance to cars
                and car lobbying.
       
                  killerstorm wrote 2 hours 51 min ago:
                  My view on this is based on the situation in Ukraine: Ukraine
                  definitely didn't have any car lobby at least until 1990, as
                  the Soviet Union was heavily investing in public
                  transportation and did not profit from car sales.
                  
                  Still, general opinion on cars was that you should buy one if
                  you can, even if you're not going to use it for commute.
                  
                  I doubt there was any car lobby in independent Ukraine, as
                  national car makers were just bad, and foreign ones were
                  competitors. But general opinion on cars got to a point where
                  not having a car when you can afford it (and can learn to
                  drive, etc.) is considered weird.
                  
                  So I'm afraid car dominance is just what happens naturally in
                  a capitalist environment, and countering it requires an
                  effort - e.g. eco-conscious population, urban planning and
                  public transport optimization, etc. And the Netherlands is such a
                  country, as far as I know, but it just doesn't happen by
                  default.
       
                  hylaride wrote 1 day ago:
                  Subsidies played a huge role, including the eminent domain
                  bulldozing of cities for free-at-use highways. If people had
                  to pay upfront for those costs, the urban landscape would
                  look much different (probably closer to Japanese cities,
                  which do have massive suburbs, but centred around train
                  stations).
                  
                  Yet Japan does still have cars (and a car culture even),
                  they're just not necessarily the default or dominant mode of
                  transport.
       
                    masklinn wrote 23 hours 49 min ago:
                    Sure, nobody is saying cars are useless or unfun, I'm just
                    pushing back against the idea that everything car
                    everywhere is a natural and intrinsic outcome from cars
                    existing. As I noted, even in the Netherlands cars are
                    common, the Dutch have a very dense road network, and a
                    fair number of cars.
       
                      hylaride wrote 22 hours 23 min ago:
                      I think we're on the same page.
                      
                      For me, cars are a perfectly fine mode of transport, but
                      the way so many places prioritize it over alternatives
                      (whatever the reason) isn't necessarily better.
                      
                      My "wtf" moment was 20 years ago when I was visiting my
                      cousin in an exurb and we sat in a line of cars for over
                      40 minutes waiting for our turn to pick up her kid. The
                      messed up part was that while there were school busses,
                      everything was so spread out that the bus ride for them
                      would have been over an hour and then another 20 minute
                      walk from the arterial road drop-off point to their
                      house. Everything was far away, including local public
                      parks.
       
                  Chu4eeno wrote 1 day ago:
                  Isn't Not Just Bikes some US expat/biking maximalist?
                  
                  I'm not sure I'd take him as some neutral authority on the
                  history of cars and driving in Europe.
       
                    masklinn wrote 1 day ago:
                    > Isn't Not Just Bikes some US expat/biking maximalist?
                    
                    You should really ponder the sanity of asking if a channel
                    called "not just bikes" is a bike maximalist.
       
                    chriswarbo wrote 1 day ago:
                    > Isn't Not Just Bikes some US expat/biking maximalist?
                    
                    According to their videos, they prefer trams within cities;
                    generally take trains between cities; and acknowledge that
                    cars are very useful for places which aren't so well
                    connected (e.g. places that are far apart which aren't on a
                    train line). They think encouraging the use of cars within
                    cities is a bad idea (dangerous, scales poorly, makes those
                    areas less pleasant to be, etc.).
                    
                    Not what I'd think of as a "biking maximalist".
                    
                    They do show themselves cycling to places that are nearby.
                    Does that make Youtubers who record videos in their car
                    "driving maximalists"?
       
                      Chu4eeno wrote 1 day ago:
                      I wasn't very familiar with the channel, sorry.
                      
                      Not US expat either (or not yet), Canadian.
       
                kubb wrote 1 day ago:
                Surely people feeling that way can be attributed to the
                industry?
       
                  killerstorm wrote 46 min ago:
                  Cars were quite desirable in the Soviet Union, where industry
                  was not allowed to advertise. You had to get into a queue to
                  buy a car; the state was not interested in making them in a
                  quantity that would satisfy the demand.
                  
                  Very few people actually _needed_ cars, as the Soviets built
                  an adequate public transport system. But there are many
                  situations where a car can really help a lot. Perhaps that's
                  more obvious in a society which has rather few cars.
                  
                  E.g. back in Soviet days and around then, only one member of
                  my extended family had a car. The rest of the family were
                  really happy about the opportunities it provided. E.g. with a
                  car you can buy fresh produce directly from farmers with just
                  a few hours of driving. Doing the same without a car is so
                  much hassle and effort that people just won't do it, and then
                  you're confined to what's available in a local grocery store
                  (which was usually much worse than the direct-from-farmer
                  option). Do you think it has something to do with the "car
                  industry"?
       
                  kakacik wrote 1 day ago:
                  No, it's much more straightforward, but I get it - there is no
                  warm fuzzy feeling of discovering yet another global evil
                  conspiracy out there set to get all of us.
                  
                  We are a family of 4 with 2 small kids. Whenever we travel,
                  it's a series of backpacks, other bags, other stuff, and then
                  some more. Heck, even if I travel alone it's almost never just
                  me - there are heaps of garbage to dispose of, big shopping
                  bags to bring back, a big backpack with camping or climbing or
                  skiing gear etc.
                  
                  It would have been an absolute, utter nightmare to do this
                  over public transport. This comes from a European who has
                  generally very good public transport (given the rural area)
                  and the world's best train network specifically (Switzerland).
                  Yet roads are chock-full of cars and every year there are
                  more.
                  
                  Public transport simply ain't cutting it for anything but the
                  simplest use cases, i.e. just me and nothing or a small
                  backpack. Some routes I take would take 3-5x longer with
                  public transport, or are just not possible at all. No industry
                  massage required here, ever. Not everybody lives in some
                  dense city and never leaves it on evenings or weekends.
       
                    CalRobert wrote 1 day ago:
                    Switzerland does have roads choked full of cars. It also
                    has pretty mediocre bike infrastructure.
                    
                    But this is kind of beside the point - even in the
                    Netherlands I also would use a car if I were taking camping
                    and skiing gear with the kids, and that's fine. But I can
                    also take them in the bakfiets to the grocery store when I
                    want, and that's also fine. Cars have their purpose, but
                    you shouldn't _have_ to use one for basic trips.
       
                      kakacik wrote 1 day ago:
                      Well, here is where we differ - what is a basic trip for
                      you may not be a basic trip for me or the next Joe. Maybe
                      they don't even have a walking path to their house. Maybe
                      the closest grocery store is 5km away on roads which are
                      incompatible with safe cycling (many parents don't give a
                      fck and just ride, throwing a tiny little dice with every
                      truck passing centimeters from them and their young kids
                      at high speed). Maybe XYZ.
                      
                      Don't judge others in some complex situation just because
                      in your case there is some simple straightforward
                      solution. Yes, the Netherlands has top-notch cycling infra
                      but that's nowhere else to be seen and won't be seen for
                      quite some time. And don't force your solution onto
                      everybody regardless of fit, that doesn't work long term
                      (aka the EU approach to things, or why much of the eastern
                      part hates it).
       
                  mdp2021 wrote 1 day ago:
                  For hopefully most people, it should be attributed to the
                  "Wait, now I have such a freedom and power?".
                  
                  As opposed to "before the invention of the bicycle, people
                  married within a radius on the order of a mile" (can't
                  remember the exact stat right now).
       
                    ZeroGravitas wrote 1 day ago:
                    It's like that feeling of power you get from owning a gun
                    that you only bought because you feared all the other
                    people who owned guns.
       
                      mdp2021 wrote 15 hours 43 min ago:
                      Comparing freedom of movement to a killing device is
                      beyond any threshold of plausibility. And the whole
                      sentence above is unintelligible here.
                      
                      No, it's really that the ability to move at ease is
                      priceless.
       
                        ZeroGravitas wrote 7 hours 56 min ago:
                        Car crashes kill roughly as many Americans each year as
                        guns.
                        
                        If you add pollution impacts, cars double the yearly
                        deaths of guns.
       
                          mdp2021 wrote 3 hours 6 min ago:
                          > Car[s...] kill
                          
                          And in a Cost/Risk/Benefit computation, cars remain
                          incommensurately invaluable. Because one's Quality of
                          Life without them would simply be destroyed,
                          comparatively. The moving "castle" (legal term in the
                          USA) can be more important than the house in crucial
                          regards.
                          
                          The point attempted at post 48501189 remains
                          unintelligible. That cars imply risks and
                          externalities does not clarify it.
       
                  kortilla wrote 1 day ago:
                  It's privacy vs not. It doesn't really need special
                  lobbying.
       
                    kubb wrote 1 day ago:
                    I'm sure that isn't the full answer. Otherwise car ads
                    wouldn't be necessary and more affordable cars would
                    outcompete the expensive ones.
                    
                    There's the utility component, the prestige factor and
                    other things.
       
                      somenameforme wrote 1 day ago:
                      Oh man what a perfect example to be had here. So
                      historically exactly what you said is 100% what
                      happened. By the time Ford really mastered manufacturing,
                      he managed to get the price of the Model T down to $260
                      around 1925, about $4,600 in current terms for a premium
                      car!
                      
                      Needless to say everybody was buying one and he was
                      rocking it. Then came along General Motors and they were
                      desperate to find any way to compete. They couldn't
                      compete on price or quality, so their CEO is credited
                      with inventing planned obsolescence, and turning cars
                      into a fashion. They'd release a new style each year
                      alongside plentiful marketing implying that the old
                      styles were outdated, and it was wildly successful.
                      
                      So yeah, needless to say people have always genuinely
                      wanted their own cars. But it's also true that companies
                      have managed through advertising to create artificial
                      demand for vehicles that don't objectively make sense. To
                      some degree reality is catching up at least though. Aston
                      Martin is on the verge of bankruptcy and BYD is the
                      largest electric car company in the world, by a wide
                      margin.
       
                      lan321 wrote 1 day ago:
                      Comfort, utility, fun, status. Every person has their own
                      mixed requirement of those that then gets applied to
                      their budget. Expensive for me is probably cheap for our
                      CEO and cheap for me is probably expensive for our
                      interns :)
       
            devsda wrote 1 day ago:
            In the case of driving, the stakes are equally high for everyone on the
            road. Can we say the same for an agent?
            
            Having an agent is like forever having a genius intern who'll
            almost always do the perfect job for you. But there is a non-zero
            chance that they'll also come up with quirky solutions and execute
            those with confidence and no follow-ups. You don't grant the intern
            production access and hope they check with you.
            
            I don't think the corporate equivalent of "dog ate my homework"
            flies when the dog ate your files and, if you are unlucky, your
            production DB.
       
              danielhep wrote 1 day ago:
              I don't think that's really true of driving: pedestrians and
              cyclists are at a much higher risk of getting killed by a driver
              than the drivers themselves. There are huge negative
              externalities to driving.
       
              Zambyte wrote 1 day ago:
              > In case of driving the stakes are equally high for everyone on
              the road
              
              The stakes are significantly higher for everyone outside a car.
              This seems like a pretty good metaphor for slop bombing people
              who don't use AI. People drive because they don't feel safe
              around everyone driving. People slop bomb because they can't
              handle all the slop.
       
            andrepd wrote 1 day ago:
            > Yet somehow we decide that the benefits outweigh the risks.
            
            More like malicious lobbying and incompetence made it impossible in
            many places to use any other form of transportation, despite there
            being safer, faster, cheaper, and healthier ways to move around.
            Which, come to think of it, makes this a rather nice analogy for the
            current situation... :)
       
            selfhoster1312 wrote 1 day ago:
            Yes, but we usually use cars as a means to an end. Have you ever
            met a manager who set up gasmaxxing policies and criticized
            employees for doing their job instead of driving?
       
              neuderrek wrote 1 day ago:
              I know sales people in pharma who spend all day driving, not only
              for sales visits but also drive doctors for their personal
              errands, and all this driving is encouraged by management.
       
              moomin wrote 1 day ago:
              Having played with Fable a bit, if it doesn't kill tokenmaxxing
              I don't know what will.
       
                selfhoster1312 wrote 1 day ago:
                I'm interested in what you mean, if you could develop. Would it
                kill tokenmaxxing because it's so bad? Because it's incredibly
                efficient? Because it's way too expensive?
       
                  moomin wrote 1 day ago:
                  My perception is that it's good, but very expensive. I
                  would not be surprised if regular users, if they shifted
                  their flows to Fable at API pricing, would be racking up $200
                  a day, not a month.
       
                  coldtea wrote 1 day ago:
                  Because it's too expensive AND inefficient in token usage
       
          isodev wrote 1 day ago:
          Not to mention OpenAI/Anthropic's newly found appetite for keeping
          data (made public with Fable but we don't know what actually
          happens there anyway).
          
          There is so much role play going on for people to convince themselves
          that any of this is fine.
       
          istvan0 wrote 1 day ago:
          > I'm continually bemused and astonished by the number of people who
          clearly acknowledge that it's reckless to give agents full access to
          your machine, and keep doing it anyway.
          
          What if you have two machines and the one you give to the agent is
          constantly backed up?
       
            trvz wrote 1 day ago:
            They still shouldn't be running on the same network.
            
            And if you're using Macs, you can't be signed into your primary
            Apple ID on the agent machine.
       
          konaraddi wrote 1 day ago:
          In practice, full access to your machine is okay as long as there are
          safeguards and the expected outcomes are clear with a well defined
          path to said outcomes that aren't overly ambitious. Otherwise, for
          ambitious goals or YOLO one shot attempts, eliminating opportunity
          for capability misuse is critical (e.g., sandbox).
       
          xyzzy123 wrote 1 day ago:
          The real sandbox is not caring if your computer gets bricked.
       
            AdamN wrote 1 day ago:
            The machine is no big deal - it's the authn/authz that matters. 
            What can the agents do with the credentials available to them?
       
              petesergeant wrote 1 day ago:
              Less if you use something like [1] so they don't actually get
              the creds
              
  HTML        [1]: https://agentblocks.ai
       
            _345 wrote 1 day ago:
            way worse things can happen than your machine being bricked, if a
            malicious actor can weaponize an agent to do their bidding
       
              rfw300 wrote 1 day ago:
              > if a malicious actor can weaponize an agent to do their bidding
              
              In my experience, human employees are much more vulnerable to
              this particular weakness than frontier agents (i.e. phishing
              attacks).
       
                _345 wrote 1 day ago:
                I'm not letting Jenna from HR log into my personal machine
                with access to all of my lifelong data though. I do let my
                claude bypass permissions, however.
       
              dumbdumb125 wrote 1 day ago:
              the solution to both of these is the same thing. vps with
              accounts for all the services specific to the agent (github and
              whatever else)
       
                bornfreddy wrote 19 hours 52 min ago:
                That's actually a great idea! Easier to set up and use than a VM
                (hello ssh), safer than docker, and still pretty cheap. Thank
                you for the idea!
       
          sipjca wrote 1 day ago:
          im more surprised that more people don't treat their computer as
          disposable anyway.
          
          that it could just be wiped at any moment and it wouldn't matter.
          shit happens, could be stolen, broken, whatever. the computer should
          be able to be thrown out the window and continue to live life.
          
          to be clear, i don't think upgrading and disposable in this way is
          good, but it being wiped at any moment shouldn't be a concern
          
          i grew up wiping my machine every year anyway, so i guess it's just
          a habit
          
          is the computer that sacred?
       
            ghrl wrote 1 day ago:
            Sounds like a case for NixOS
       
            baq wrote 1 day ago:
            Computers are disposable, secrets are what we're talking about.
            Rotating passwords and tokens is a major PITA on the best of days.
       
              sipjca wrote 1 day ago:
              fair enough, i guess minimizing that surface area is important to
              begin with
       
            dumbdumb125 wrote 1 day ago:
            i think it's about drawing a line between your "personal computer"
            and a software development machine. any digital-native is going to
            accumulate programs, configurations, and other bits and pieces that
            aren't trivial to migrate to a new machine.
       
              backwardsponcho wrote 1 day ago:
              Programs, configs and "other bits" are the trivial parts that no
              one should care about. It takes about 5min to go from fresh
              install to near-fully-configured.
              
              Even the hardware itself doesn't matter that much, in the end
              it's all provided by your employer.
              
              Leaking session tokens or secrets, on the other hand...
       
              sipjca wrote 1 day ago:
              imo being digital native means that migrating to any machine
              should be basically trivial. working with the flow of the
              machines rather than customizing and ricing them because you're a
              cool computer person or whatever
              
              i just want my computer to work. any config i have on my machine
              can be rebuilt by just doing the work i need to do.
              
              my primary work machine was stolen last year so i was forced to
              go through this quite literally with a new machine rather than
              hypothetically or by my own will
       
          raldi wrote 1 day ago:
          Do you think it's dangerous to be in a car going at freeway speed?
          Do you ever do that anyway, even though you could be walking instead?
       
            spunker540 wrote 1 day ago:
            This is a great analogy. Like driving on the freeway, agents are
            super time efficient, generally safe, but the stakes are high in
            terms of the worst possible outcomes.
       
              techpression wrote 1 day ago:
              The analogy falters in scope. It should be more like "do you
              put your entire family and all your friends in different cars, on
              different highways, and try to remote control them all at the
              same time, while also driving yourself, facing backwards"
       
                Gareth321 wrote 1 day ago:
                I think all three of you are quibbling over the risk/reward
                ratio, and you have different estimates. It's not unreasonable
                that you're all correct - given your estimates. My estimate is
                that Tesla FSD is safer in aggregate than human drivers, so I
                believe it is safer for me to use that than drive. It doesn't
                get tired, have medical emergencies, get impatient and
                frustrated, speed, lose focus because a child shouts, thinks at
                the speed of light, and can see from eight cameras all around
                the car, all at the same time. I only have two eyes.
                
                You would also be correct if your risk estimate concluded that
                Tesla FSD has arguably killed people, makes mistakes humans
                would not, can glitch, and has no one to hold accountable. For
                these reasons, you choose not to use it.
       
          harrall wrote 1 day ago:
          I started doing it months ago and, to be honest, what the agent
          chooses to do isn't unpredictable.
          
          The problem is that different people prompt so differently.
          
          For example, I may ask something like "test different variations of
          this annotation on k8s pods of this service on this X cluster because
          it proves Y theory."
          
          But you know what my coworker asks? "Test Y theory." If you were
          to ask two different junior engineers that, one might try random
          things on production and the other one might run local tests! It's
          such an unguided "do anything you want as long as you figure it out"
          request and the agent reads it like a junior who has not been told
          any boundaries but has been strongly told "figure it out."
       
            troupo wrote 1 day ago:
            > I started doing it months ago and, to be honest, what the agent
            chooses to do isn't unpredictable.
            
            You just wrote three paragraphs of text describing why it's
            unpredictable.
            
            Moreover, for the same prompt on the same machine in a different
            session it will use a different set of tools.
       
            mrandish wrote 1 day ago:
            > But you know what my coworker asks? "Test Y theory."
            
            It still surprises me when I see people not prompting more
            specifically and clearly. It not only avoids problems, it's faster,
            costs less, and just works better.
            
            I recently shared with a friend a multi-hour LLM chat session I'd
            done because it veered into a domain he's interested in. In the
            session I'd brainstormed and probed the feasibility of a novel
            concept for a new research direction. It traversed a half dozen
            domains diving into minute detail then zooming back out to survey
            an adjacent space, interspersed with intense skeptical probing of
            key assumptions, all while spewing tons of detailed citations,
            specific paragraph pulls, summarized data tables etc.
            
            My friend is very experienced using LLMs for research so I was
            surprised when he called me shocked by the sheer velocity, precise
            targeting and signal/noise. I'd assumed everyone did it the same as
            I do. He attributed the different result solely to the way I
            crafted my prompts.
       
              dr_dshiv wrote 1 day ago:
              I used to write detailed prompts. Now I find the benefits of
              strategic ambiguity: rather than speaking imperatively, I
              emphasize my vision and then Claude can often figure out a
              method.
              
              This doesn't always work better. But often enough.
       
                marknutter wrote 22 hours 54 min ago:
                Yeah, I find the back and forth with Claude is often better
                than trying to front load everything in a massive and detailed
                prompt.
       
                  mrandish wrote 19 hours 1 min ago:
                  The counter-intuitive nature of LLMs is so simultaneously
                  interesting and frustrating. Overloading a single prompt
                  definitely can create challenges remarkably similar to human
                  short-term memory and attentional drift.
                  
                  LLMs gain so much knowledge and capability from absorbing the
                  symbolic relationships embedded in human language but in
                  doing so, inevitably absorb many of the human foibles,
                  sensitivities and weaknesses reflected in our languages.
       
                mrandish wrote 1 day ago:
                That's actually what I do too. What I was trying to say is that
                my prompts are precise in the sense that whether they're
                vaguely ambiguous or hyper-detailed and highly directive it's
                always very intentional to improve the response in the
                direction I want. The difference can have significant impact as
                shown in research on how LLMs naturally mirror user's prompts.
                
                I noticed this last year and started experimenting which led to
                several realizations about how my prompt's tone, style, length,
                format, word choices and even punctuation can have very
                counter-intuitive impact on model responses. It's not that one
                strategy always gets "better" results, they're just different
                in specific ways, which can make one input style better for one
                context but worse for another. I first noticed this effect when
                modding my user prompt so major topic headings would always be
                numbered. It's surprisingly difficult to get it to reliably use
                the same simple scheme due to various potential ambiguities.
                So, I spent a little time word-smithing, lawyering and tuning
                the prompt but I found the closer I got to full compliance on
                heading numbering, the more unrelated things would drift. Like
                it would just stop using bullets, even though I never mentioned
                anything about bullets.
                
                Then I changed the prompt to "Change nothing about your default
                formatting, except headings." But just mentioning anything
                related to formatting could suddenly cause unintended effects
                on seemingly unrelated things. Then I tried being explicitly
                directive about all formatting to just lock it down. And this
                completely failed because once the formatting was perfect, I
                started noticing the model's output would get less intelligent
                much earlier in sessions. So I cleared my user prompt entirely
                as it wasn't worth the cognitive cost on the model or my time.
                A few days later in a long session I noticed it was numbering
                everything perfectly with no prompt at all. When I scrolled 
                back through I saw it didn't start out numbering its responses.
                It started doing it because I was consistently numbering every
                major concept in my inputs, even though I never mentioned
                numbering or formatting.
                
                So... yeah, subtle differences in prompts which absolutely
                shouldn't matter, do impact model output in unexpected ways.
                And, as of now, these effects can only be fully suppressed with
                strong directive prompts for short periods, but doing so always
                impacts other unrelated things - and has some cognitive impact
                on model performance. So, by paying a little attention, I've
                discovered ways to optimize a model's output in the direction I
                need by shifting not only my prompt's explicit directives but
                also the subliminal meta-elements like tone, style, length,
                structure, formatting, etc.
       
          bxk76 wrote 1 day ago:
          It's how the chimp brain works. It's not a single system but multiple
          systems making predictions for different time horizons. When outputs
          don't align we get stories to manufacture coherence.
          
          Plato gave us his Chariot analogy with 2 horses pulling in different
          directions about 2,400 years ago. Today we've got System 1/System 2,
          the Elephant and Rider model etc.
          
          The human mind, thanks to how its own architecture handles
          unpredictability in the universe, will generate contradictions.
       
          qurren wrote 1 day ago:
          > I'm continually bemused and astonished
          
          I'm not. Everyone is told to get 10X the amount of shit per day done
          these days. Safety checks are out the window at that point.
       
            satvikpendem wrote 1 day ago:
            You can get 10x shit done without `rm -rf`ing your files. I don't
            see any correlation to getting things done with having a proper
            sandbox.
       
              koliber wrote 1 day ago:
              I'm being a little facetious when I write this, but bear with me:
              
              Let's say I have daily backups, and get 10x done each day by
              being reckless and risking an "rm -rf", and let's say there's a
              1% chance of an "rm -rf". I break even after 2 days of being
              reckless even if I get unlucky and on day 2 it wipes my drive. I
              spend day 3 and 4 recovering, and am still 6 days ahead based on
              the 10x work I got done on day 1.
              
              What if I have a 50 day streak of not hitting an "rm -rf"? Early
              retirement?
              
              I guess the work on day 1 should be to build a proper sandbox and
              drop the chance of an "rm -rf or worse" even down to 0.001%.
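              
              The arithmetic above can be sketched in a few lines. This is
              only a rough model under stated assumptions: a wipe costs that
              day's output plus 2 recovery days, daily backups preserve
              earlier work, and a careful worker produces 1 unit per day. The
              function names are made up for illustration, and the "or worse"
              outcomes are not modeled at all.

```python
def lead_over_careful(wipe_days, speedup=10, recovery_days=2):
    """Units of work a reckless worker is ahead of a careful one.

    wipe_days: list of booleans, one per working day; True means an
    "rm -rf" wiped that day's output (assumed: backups save earlier days).
    """
    output = 0
    elapsed = 0
    for wiped in wipe_days:
        elapsed += 1
        if wiped:
            elapsed += recovery_days  # days spent restoring from backup
        else:
            output += speedup
    return output - elapsed  # careful worker: 1 unit per elapsed day


def long_run_rate(speedup=10, p_wipe=0.01, recovery_days=2):
    """Expected units per calendar day (renewal-reward style estimate)."""
    expected_output = (1 - p_wipe) * speedup  # a wiped day yields nothing
    expected_days = 1 + p_wipe * recovery_days
    return expected_output / expected_days


def streak_probability(days=50, p_wipe=0.01):
    """Chance of going `days` working days without a single wipe."""
    return (1 - p_wipe) ** days


print(lead_over_careful([False, True]))  # good day 1, wipe on day 2
print(round(long_run_rate(), 2))
print(round(streak_probability(), 3))
```

              Under these assumptions the unlucky two-day scenario still ends
              6 units ahead, the long-run rate stays close to 10x, and a
              50-day clean streak happens roughly 60% of the time. The model
              says nothing about leaked secrets, which is where the 10x
              argument stops working.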
       
                biztos wrote 1 day ago:
                > Early retirement?
                
                Your manager will look at your token usage and the number of
                Jira tickets you closed, and if you have not increased both 10x
                in the past year then you will be let go. 10x is the new 1x.
                
                Whether that's early retirement depends on how much money you
                have.
       
              estetlinus wrote 1 day ago:
              rm -rf is the least of your concerns.
       
              lelandfe wrote 1 day ago:
               [1] > Additional bypass examples that all execute without
              permission:
              
              > echo test ; git rm file.txt
              
              > rm --force --recursive /home (if "rm -rf" is blocked)
              
  HTML        [1]: https://github.com/anthropics/claude-code/issues/13371
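              
              A minimal sketch of why a substring deny rule misses the
              commands quoted above. This is not how Claude Code implements
              its permission checks; the function names and the separator
              set are assumptions made up for illustration. It compares a
              naive "rm -rf" substring match against a check that tokenizes
              the command line first.

```python
import shlex

# Shell operators that start a new command (illustrative, not exhaustive).
SEPARATORS = {";", "&&", "||", "|", "&"}


def naive_is_blocked(command):
    """The brittle approach: block only the literal spelling 'rm -rf'."""
    return "rm -rf" in command


def split_commands(command):
    """Split a command line into argv lists, one per chained command."""
    lexer = shlex.shlex(command, posix=True, punctuation_chars=True)
    segments, current = [], []
    for token in lexer:
        if token in SEPARATORS:
            if current:
                segments.append(current)
            current = []
        else:
            current.append(token)
    if current:
        segments.append(current)
    return segments


def is_recursive_rm(argv):
    """True for rm with -r/-R (in any flag bundle) or --recursive."""
    if not argv or argv[0] != "rm":
        return False
    for arg in argv[1:]:
        if arg == "--recursive":
            return True
        if arg.startswith("-") and not arg.startswith("--"):
            if "r" in arg or "R" in arg:
                return True
    return False


def needs_approval(command):
    """Flag recursive rm and git rm anywhere in a chained command."""
    return any(
        is_recursive_rm(argv) or argv[:2] == ["git", "rm"]
        for argv in split_commands(command)
    )


for cmd in ["rm --force --recursive /home", "echo test ; git rm file.txt"]:
    print(cmd, naive_is_blocked(cmd), needs_approval(cmd))
```

              Even the tokenized version is easy to get around (sh -c,
              /bin/rm, find -delete, a script written to disk), which is the
              argument for containing the process rather than pattern-matching
              its commands.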
       
                Chu4eeno wrote 1 day ago:
                It really is vibecoded.
                
                I never really dug into the leaked code, but calling that there
                a security layer is a joke.
                
                (And I really don't get why they give it actual shell access
                either, implementing a "fake" one for something like a honeypot
                takes a couple of days, not much more if it needs to
                persist/map to actual files.)
       
              qurren wrote 1 day ago:
              I haven't yet had an agent rm -rf files.
              
              I've had one f up an account by placing 2000 limit orders at the
              wrong price, but that's another story.
       
                numeri wrote 21 hours 43 min ago:
                I've had it happen. I ran an experiment, taking a couple hours
                and producing ~2 GiB of files. One of the results looked good,
                so I told Claude Opus 4.5 (at the time) to commit the code
                changes, upload the important file to cloud storage, then clean
                up the rest.
                
                I then saw it run `rm -r results/`, before messaging me: "Now
                all that's left is for you to upload the successful results,
                then I'll delete the rest!"
                
                Why did it not upload the files itself, when it had been using
                the cloud storage CLI during that session? No clue. I do accept
                that I could have and should have just uploaded the file
                myself. It would have taken 3 seconds to type.
       
                marknutter wrote 22 hours 57 min ago:
                Proper hooks prevent this from happening
       
                Majromax wrote 1 day ago:
                > I haven't yet had an agent rm -rf files.
                
                That happened to me once; I was running one of a few free-tier
                models in a pi-coding-agent session.  The bash tool there is
                stateless and always begins from the launch directory, but the
                agent assumed state and executed `rm -rf .` intending to remove
                a build directory.  Instead it removed the whole project tree,
                including session logs and notes.
                
                This was mostly a matter of amusement for me since I was
                running the agent inside a bubblewrap sandbox for that very
                reason, and the project itself was not very important.
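                
                The failure mode described here (a relative `rm -rf .` run
                from the wrong directory) can be guarded against in tooling.
                A minimal sketch, assuming a hypothetical helper that the
                agent's delete operations are routed through; it is not part
                of pi-coding-agent or any other tool named in this thread.

```python
import shutil
from pathlib import Path


def remove_inside(project_root, target):
    """Delete `target` only if it is strictly inside `project_root`.

    Relative targets are resolved against the project root, never the
    current working directory, so a stateless shell cannot turn "." into
    the whole project tree.
    """
    root = Path(project_root).resolve()
    path = (root / target).resolve()
    if path == root or root not in path.parents:
        raise ValueError(f"refusing to delete {path}")
    shutil.rmtree(path)
```

                With this in place, `remove_inside(root, "build")` removes
                the build directory, while `remove_inside(root, ".")` or a
                path such as `"../other"` raises instead of deleting. It is a
                convenience guard only; an agent with shell access can still
                call `rm` directly, so the sandbox remains the real control.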
       
                digitaltrees wrote 1 day ago:
                Well then you are behind the cutting edge.
       
                antonvs wrote 1 day ago:
                I've had agents run `rm -rf`, but it's been on directories that
                did actually need to be removed. To a certain extent I think
                the existence of `rm -rf` as a command that runs blindly
                without any understanding of what it's deleting is the problem.
       
                  ghrl wrote 1 day ago:
                  Yeah, spot on. I had an agent delete some files it shouldn't
                  have as well, similarly to me making the same mistake. I
                  think system prompts should default to using `trash` over
                  `rm`.
                  For now that's just in my AGENTS.md, and gets honored most of
                  the time.
       
                    l72 wrote 20 hours 56 min ago:
                    You can always use something like this [1], which will make
                    sure any file removed on the command line via rm (or other
                    utilities, like git rm) ends up in the trash instead
                    
  HTML              [1]: https://github.com/faratech/trashd
       
                  KronisLV wrote 1 day ago:
                  > To a certain extent I think the existence of `rm -rf` as a
                  command that runs blindly without any understanding of what
                  it's deleting is the problem.
                  
                  Yes, and the lack of a Recycle Bin of any sort is even more
                  puzzling. I think both servers and desktop PCs across all
                  OSes should have it by default, so unsafe deletes would be
                  something you'd have to go out of your way to even enable.
       
                  dumbdumb125 wrote 1 day ago:
                  I've had one sever its own internet connection. Less
                  destructive, also more humorous.
       
                  lstodd wrote 1 day ago:
                  the answer is rm -f `which rm`, yes?
       
          simonw wrote 1 day ago:
          Which agent sandbox do you recommend?
       
            fspoettel wrote 1 day ago:
            nono works great with pi:
            
  HTML      [1]: https://nono.sh/
       
            flexagoon wrote 1 day ago:
            If you're on Linux, the easiest way IMO is to just run the agent in
            bwrap
            
            I do it like this [1] But I'm sure it's simple enough that you can
            just ask the agent itself to make you a command for it with proper
            bwrap configuration
            
  HTML      [1]: https://github.com/flexagoon/dotfiles/blob/main/dot_config...
       
              artemisart wrote 21 hours 11 min ago:
              bwrap is builtin in claude too, activate with /sandbox command.
       
            mik3y wrote 1 day ago:
            I've been enjoying Moat [1]. Proxies credentials, networking, etc;
            uses macOS containers if available; and setup worked without much
            fuss. I haven't tried others, though.
            
  HTML      [1]: https://majorcontext.com/moat/
       
          soulofmischief wrote 1 day ago:
          It took two decades for the web to deprecate SSL for TLS and serve
          over HTTPS by default.
       
            dgellow wrote 1 day ago:
            FWIW TLS had a non negligible impact on performance at scale.
            Hardware improvements made that irrelevant, eventually making the
            switch to HTTPS by default a no brainer (or at least that's what I
            vaguely remember from <2010)
       
              soulofmischief wrote 16 hours 59 min ago:
              We could say the same about virtualization, effective
              containerization, layered LLM calls, and other techniques
              currently being explored for effective sandboxing.
       
                dgellow wrote 5 hours 23 min ago:
                There is some performance impact but modern hardware makes that
                pretty insignificant
       
          thatxliner wrote 1 day ago:
          Maybe because there are not many resources on how to set it up, or it
          is just not that easy to?
          
          Because most devs already have it running and working without a
          sandbox, they tend not to do anything "unnecessary"
       
          emodendroket wrote 1 day ago:
          Well, it's a similar impulse to the way you see professional
          carpenters pin the guard open on a saw or do other things everyone
          knows you shouldn't do, except probably with a larger productivity
          difference and less life-altering (for the operator) consequence if
          it goes wrong.
       
            rpcope1 wrote 1 day ago:
            I had the same thought, it's kind of like taking the guard off a 4
            1/2" grinder. Real convenient until the cutting wheel explodes or
            the grinder gets hung and kicks back.
       
          j-bos wrote 1 day ago:
          This. House full of big brain security experts, executives, lawyers,
          and until Claude got excited and broke prod it might as well have
          been "sandbox, whoooo?"
          
          IDGI
          
          Anyway, VMs incoming, finally.
       
          skybrian wrote 1 day ago:
          There are plenty of good sandboxes out there but somehow no "obvious
          right answer" that everyone knows to recommend. Seems like a missed
          opportunity.
          
          (I'm happy with exe.dev, but I'm not sure what I'd use if I were
          coding on a Mac.)
       
          andoando wrote 1 day ago:
          I mean what's the big deal? I use --dangerously-skip-permissions on
          every single interaction in the last 6 months. Worst case it deletes
          my files that are all on git? It fucks up my local DB? Cool.
          
          I save way more time not babying it than the occasional fuck up I
          have to salvage.
       
            eloisius wrote 1 day ago:
            What happens if it gets manipulated into npm installing a malicious
            package, which compromises your machine and any systems it has
            access to or becomes part of a botnet?
       
            ghshephard wrote 1 day ago:
            Worst case it gets access to gmail.   And Github.  And the
            Internet.  I'm increasingly appreciating the importance of a
            physical finger-press on Yubikey to trigger the FIDO2 + OIDC Auth. 
             I don't think there is an easy way for it to hack a new session.
       
              andoando wrote 1 day ago:
              How is it going to get access to gmail or github? In any case,
              what's the probability of it going so completely off the rails
              that it does something horrendous with gmail/github? What's it
              going to do? Email my coworkers nudes on my computer? Make my
              github profile public?
       
                troupo wrote 1 day ago:
                > How is it going to get access to gmail or github?
                
                Did you even read the article? Claude was opening the browser
                and iterating through the tabs.
                
                I presume you are logged in to your github account? Your gmail?
                
                > Whats it going to do? Email my coworkers nudes on my
                computer? Make my github profile public?
                
                Reset access to services using your email? MITM your 2FA?
                
                Or perhaps you have 1Password/Bitwarden running with a generous
                unlock policy?
       
                  epihelix wrote 23 hours 9 min ago:
                  > Did you even read the article? Claude was opening the
                  browser and iterating through the tabs.
                  
                  It would have been somewhat ironic if it had been hit by a
                  prompt injection attack via one of all those open random
                  websites ...
       
                    simonw wrote 22 hours 50 min ago:
                    This is one of the things I found so interesting: it was
                    using my system browsers but it wasn't exposing itself to
                    any content from them.
                    
                    Even when it iterated through all visible windows to find
                    the one it wanted to screenshot it was searching for titles
                    in Python code and returning only the integer window ID.
                    
                    The sites it opened and screenshotted were sites under its
                    own control - either test pages it had created or
                    development servers it was running.
                    
                    When it did run code that analyzed an open web page (by
                    injecting JavaScript into a template it controlled before
                    loading that in a browser window) that code only returned
                    JSON with measurements from the page.
                    
                    It's making me wonder if Fable has been trained to take
                    additional steps to avoid accidental exposure to untrusted
                    content.
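                    
                    The pattern described above (search window titles in
                    code, hand back only an integer) can be sketched like
                    this. The window list format and the function name are
                    assumptions for illustration; this is not the code the
                    model actually wrote during the session.

```python
def find_window_id(windows, title_fragment):
    """Return the integer ID of the first window whose title matches.

    Only the ID is returned, so untrusted title text (which could carry a
    prompt injection) never reaches the caller's context.
    """
    needle = title_fragment.lower()
    for window in windows:
        if needle in str(window.get("title", "")).lower():
            return int(window["id"])
    return None


# Hypothetical window list; a real one would come from the OS.
windows = [
    {"id": 101, "title": "Inbox - IGNORE PREVIOUS INSTRUCTIONS"},
    {"id": 202, "title": "localhost:8000 - test page"},
]
print(find_window_id(windows, "test page"))
```

                    The caller learns which window to screenshot without
                    ever reading the titles of the other windows.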
       
                nunez wrote 1 day ago:
                Claude typically recommends .env files for storing secrets. You
                use one to store a refresh token for the Gmail API or IMAP
                connection details. Your agent uses an MCP server you
                configured during a session, but the MCP server has been
                compromised and directs the agent to do nasty stuff with env
                dotfiles.
       
                simonw wrote 1 day ago:
                I am most worried about something gaining access to my email
                and then using the password reset flow to steal hundreds of
                other accounts.
                
                2FA makes me a little less nervous than I used to be, but not
                everything has good 2FA.
       
              SoftTalker wrote 1 day ago:
              It should run as a separate user account with its own home
              directory. Not with access to your personal browser profile.
       
                matltc wrote 1 day ago:
                What does setting this up look like? Qemu vm and run there? How
                do you interface with version control and deployment?
       
          justapassenger wrote 1 day ago:
          Because benefits are much higher than risks.
       
            bigstrat2003 wrote 1 day ago:
            They really aren't.
       
              imp0cat wrote 1 day ago:
              Perceived benefit vs perceived risks.
       
          bryanlarsen wrote 1 day ago:
          I'm also bemused by the number of people who think they've got an
          effective sandbox yet their sandboxed agent has access to all of
          their code, their github, and unrestricted web access.
       
            kstenerud wrote 1 day ago:
            > yet their sandboxed agent has access to all of their code, their
            github, and unrestricted web access.
            
            Not in my sandbox. It gives no direct access to the workdir, no
            access to my github, my ssh keys, my security tokens or API keys.
            No access to my home dir or dotfiles. Nothing at all, except for
            what I explicitly tell it to give access to.
            
            I can restrict network access. I can choose the isolation level:
            docker containers, Kata VMs, seatbelt, tart, even the new apple
            containers (which are VERY nice).
            
            Not even ENV leaks through.
            
            And it's FOSS:
            
  HTML      [1]: https://github.com/kstenerud/yoloai
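            
            To illustrate the "not even ENV leaks through" point in
            isolation: a minimal sketch of launching a child process with an
            allowlisted environment. This is not how yoloai works; the
            allowlist and function names are assumptions, and scrubbing the
            environment alone is not a sandbox (the child can still read
            files such as .env unless the filesystem is isolated too).

```python
import os
import subprocess
import sys

# Variables the child is allowed to inherit (illustrative allowlist).
# SYSTEMROOT is included because Python on Windows needs it to start.
ALLOWED_VARS = ("PATH", "LANG", "TERM", "SYSTEMROOT")


def scrubbed_env(allowed=ALLOWED_VARS):
    """Copy only allowlisted variables from the parent environment."""
    return {name: os.environ[name] for name in allowed if name in os.environ}


def run_scrubbed(argv):
    """Run a command that sees none of the parent's other variables."""
    return subprocess.run(
        argv, env=scrubbed_env(), capture_output=True, text=True, check=True
    )


os.environ["FAKE_API_KEY"] = "not-a-real-secret"
probe = "import os; print(os.environ.get('FAKE_API_KEY'))"
print(run_scrubbed([sys.executable, "-c", probe]).stdout.strip())
```

            An allowlist is used instead of a denylist so that newly added
            secrets stay hidden by default.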
       
            webstrand wrote 1 day ago:
            If anyone's looking to sandbox network, I've had good experience
            with pasta [1] networking. I make a pasta+bwrap sandbox and expose
            only specific services via local sockets to cross the boundary.
            
  HTML      [1]: https://passt.top/passt/
       
            devmor wrote 1 day ago:
            I use a separate physical machine and a scoped token with access to
            a single repository at a time, and even then I worry about what
            hole I may have left open.
            
            The general carelessness of the average user is baffling.
       
            blcknight wrote 1 day ago:
            One bad npm package can really ruin your day. These things for me
            only run in their own VM with its own GitHub account and basically
            nothing else
       
              ofjcihen wrote 1 day ago:
              People probably think you're being ridiculous but Shai Hulud
              had its very first attempt at manipulating AI-led analysis and I
              know of at least one company where that resulted in them getting
              pwned.
              
              This is only going to become more of a problem in the future and
              people need to educate themselves on the technical barriers to
              use because guardrails only sometimes work.
       
            Terr_ wrote 1 day ago:
            I keep telling folks that they need to imagine LLMs (even "local"
            ones) as if you're farming it out to JS code running on some dude's
            browser somewhere: It can't keep a secret, and a determined person
            can make it emit anything they like.
            
            We need to be asking what the most devious and malicious output
            could be, and whether what we do with that output (e.g. arguments
            to command-line tools) would still be safe.
       
              user43928 wrote 1 day ago:
              The answer to that question seems obvious: No, it is not safe.
              
              Yet with tens of millions of developers using these tools, there
              have not been widespread incidents of this sort as far as I know.
              
              So it leaves me with a few choices:
              
              - manually review and approve each command: obviously not
              realistic, you would just click Approve
              
              - use a sandbox and hope the exploit is not devious enough to
              escape the sandbox when you run or open the project outside of
              the sandbox
              
              - use AI without web access and limit other external dependencies
              
              - don't use agentic AI
              
              - use Claude or Codex auto approval classifier and hope for the
              best
              
              Personally, I'm going with the last option for now.
       
              NichoPaolucci wrote 1 day ago:
              From my perspective, everyone is doing it. Security through
              obscurity - obviously if you're harboring credit card numbers
              or users' personal details, maybe take heed. But, if you're a
              regular... run of the mill CRUD application, every other company
              is ALSO throwing caution to the wind. When hundreds of thousands
              of credentials are leaked into the funnel, does it really matter?
              
              I'm at a small company, and I try to push for security as much
              as I can, but the stakeholders truly do not care. They want to
              move fast. It's just part of the new world I guess. If we get
              hit by attackers? I don't know what happens. Sorry, we told you
              not to - you wanted to move quick and break stuff, this is how
              that culminates.
              
              I'm sure I'm not the only one.
       
              skybrian wrote 1 day ago:
              We do have ways to avoid giving an LLM any secrets, but it needs
              to be the simple, default solution.
       
          hugh-avherald wrote 1 day ago:
          The analogy extends to driving generally. Everyone knows it's very
          dangerous but people keep doing it.
       
        paytonjjones wrote 1 day ago:
        Obviously security is the bigger issue, but reading through this, all I
        could think about was how many tokens it must have spent doing all that
        to fix 2 lines of CSS
       
          lucamark wrote 1 day ago:
          It's simple: if you have to fix 2 lines of CSS you should
          definitely not use Fable. Only use it for complex and long running
          tasks :)
       
            elicash wrote 1 day ago:
            I don't think it's that simple. (I generally agree with you; I just
            think that oversimplifies.)
            
            Another model might have used fewer tokens, but come up with a fix
            that was 1000 lines when the right fix was only 2 lines.
       
          mvdtnz wrote 1 day ago:
          The author is an AI hype merchant and doesn't pay for his own tokens.
       
            simonw wrote 1 day ago:
            I pay $100/month to Anthropic and $100/month to OpenAI at the
            moment, plus whatever I spend on their APIs (usually less than
            $20/month for each, I use the subscriptions for most things.)
            
            A couple of months ago I was paying $200/month for Anthropic and
            $20/month for OpenAI. I decided to split it evenly to get full
            access to both of their offerings.
            
            I've actually chosen not to sign up for their free plans for open
            source maintainers, because paying the regular subscription price
            feels more honest, given that I write about them so much.
            
            I do have the free GitHub Copilot for open source maintainers deal
            - I've had that for years. Given how much code I have published on
            GitHub over the decades I feel less conflicted about that one.
            
            I sometimes get preview access to models, which includes the
            ability to use them for free during the preview. That comes with a
            big catch though: I can't publish any of the code that I write
            using those previews while the model is still unreleased.
            
            As a result I don't use those preview tokens much at all, because
            the vast majority of my work is open source and I don't want
            restrictions on when and where I publish the code I'm producing.
       
          Vachyas wrote 1 day ago:
          $12 worth, it seems
       
            reverius42 wrote 1 day ago:
            Imagine telling someone in 2015 that you can just tell your
            computer to fix a 2-line CSS bug and it only costs $12
       
              Aachen wrote 1 day ago:
              'only'? A web developer did not cost 12*30=360$ an hour in 2015,
              and that's assuming that going "ugh, whatever. I'll just hide the
              problem with overflow:hidden instead of finding the underlying
              cause" takes him or her 2 minutes and isn't already the dev's
              initial reaction
              
              Another way of looking at it is using as much electricity as a
              normal person in a high-income country uses across ~3 days to add
              overflow:hidden in the end. Of course, the path to get there did
              a lot more, but you don't know that beforehand if you don't take
              a quick peek and make an architectural decision about what the
              solution should be that gets implemented
       
                elicash wrote 17 hours 6 min ago:
                It'd be $8.52 in 2015 dollars, but certainly they are the ones
                who mentioned the $12 amount not you, so I'll put that aside.
                
                Far more importantly, you would not get billed for 2 minutes of
                work for this if you paid a developer to fix it. At best, half
                hour increments for the fix. But more likely, for the full
                hour. Also, in this comparison, the consultant is on call every
                day, morning, afternoon, evening, for whatever you wanted and
                will jump on the job immediately.
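                
                The arithmetic in this exchange can be sketched as follows,
                taking the $12 figure quoted upthread at face value and
                ignoring the inflation adjustment. The function name is made
                up for illustration.

```python
def implied_hourly_rate(cost_usd: float, billed_minutes: float) -> float:
    """Hourly rate at which `billed_minutes` of human time costs `cost_usd`."""
    return cost_usd * 60 / billed_minutes


LLM_COST_USD = 12.0  # figure quoted in the thread

scenarios = [
    ("2 minutes of actual work", 2),      # the 12*30 case
    ("half-hour billing increment", 30),
    ("full-hour billing increment", 60),
]

for label, minutes in scenarios:
    rate = implied_hourly_rate(LLM_COST_USD, minutes)
    print(f"{label}: ${rate:.2f}/hour")
# 2 minutes of actual work: $360.00/hour
# half-hour billing increment: $24.00/hour
# full-hour billing increment: $12.00/hour
```

                Which comparison looks fair depends entirely on the billing
                increment assumed: $12 is expensive against two minutes of
                work and cheap against a billed half hour or hour.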
       
                  aenis wrote 15 hours 11 min ago:
                  ...and won't mind if you change your mind. And again. And
                  again. And again for as long as you care to iterate your
                  design, experiment with a business user over your shoulder,
                  etc. etc. etc. People routinely avoid throwing away work
                  because they get emotionally attached to it, even if they get
                  paid by the hour. LLMs just do as they are told, and that's
                  worth a lot.
       
              MattGaiser wrote 1 day ago:
              Or even in 2026. You absolutely will pay a human that for that
              work.
       
          redox99 wrote 1 day ago:
          Lines of code for a bugfix is a really bad proxy for effort required.
          
          You should estimate how much time it would have taken a human
       
            rikschennink wrote 1 day ago:
            I looked at the screenshot and for the rest of the article
            wondered if it would be as simple as `overflow-x: hidden`.
            
            And to my surprise it was.
            
            This would've taken a frontend dev 10 seconds to deduce and
            another 10 seconds to confirm.
       
              simonw wrote 1 day ago:
              The thing that puzzles me is that I would expect overflow-x:
              hidden to result in text typed into that textarea being wider
              than the page and being invisibly truncated on the right hand
              side.
              
              But that's not what happens. And in fact, when you start typing
              in the textarea the horizontal scrollbar vanishes - it's only
              there when the textarea is empty.
              
              Am I misunderstanding anything here? Seems like it's some weird
              Safari bug, since Firefox and Chrome don't have the problem.
       
                rikschennink wrote 1 day ago:
                It probably has to do with other styles assigned to the
                textarea, maybe the ::placeholder as it hides when typing (I
                assume on focus)
                
                In any case. In the screenshot the scrollbar is inside the
                textarea as it aligns with the resize control on its right.
                This is basically all the info needed to deduce the textarea
                overflow is the culprit.
                
                But it could be that the overflow-x is just a bandaid hiding the
                issue causing the overflow in the first place, like crazy
                styles on the placeholder.
       
            skydhash wrote 1 day ago:
            5 minutes if you know CSS. And if you don't, about the time for
            you to ask someone that knows CSS. In the worst case, the amount of
            hours to learn CSS.
            
            So if you're doing web pages, learn CSS.
            
            Generally, if you're doing something that directly involves X,
            learn how X works.
            
            ADDENDUM
            
            In most jobs, you're going to be involved in only a few distinct
            technologies, learn those well and life is going to be easier. And
            most are transferable to the next job.
       
              throwaway98797 wrote 1 day ago:
              ain't no one learning all of that
       
            rafram wrote 1 day ago:
            30 seconds or a minute? Look at the diff he links to: [1] Every
            browser has an inspector that can show you which element is causing
            overflow. You walk through the tree, find the offender, and add
            min-width or overflow. Zero tokens, just like in the old days!
            
            Now, granted, because the garbage LLM code he's working with has
            CSS inside HTML inside JavaScript inside Python (I wish I were
            kidding), finding the styles in his codebase might've taken a
            minute. But even then!
            
  HTML      [1]: https://github.com/datasette/datasette-agent/commit/a75a8b...
       
              swyx wrote 22 hours 4 min ago:
              >  Zero tokens, just like in the old days!
              
              because you zero rate your own human attention, which you should
              value
       
                rafram wrote 20 hours 2 min ago:
                Alas, LLMs require more attention, not less.
       
              ocharles wrote 1 day ago:
              A small diff /= a small change! They are completely separate
              things. Quite often a small diff is hours of actual work. Even in
              this case _finding_ those lines could have taken work - we don't
              really know.
       
                rafram wrote 1 day ago:
                Did you actually look at the diff, though? That's the kind of
                change you make 10 times a day while working on frontend. It is
                a tiny change.
       
              dekdrop wrote 1 day ago:
              I was thinking of this too. It did all that and whatnot for only a
              single line, which is a simple thing even for someone new to web
              coding. That's to say the process matters more.
       
              redox99 wrote 1 day ago:
              Yeah looking at that diff it should be very quick. My point was
              mostly that it was a bad metric, not whether it was correct in
              this particular case. I'm sure everybody's had a bugfix that took
              days to debug and it was just a couple of lines to fix.
              
              Or sometimes a fix is obvious, but because it requires changing
              the code of a dependency, it's actually quite tedious to
              implement.
       
            philjohn wrote 1 day ago:
            I mean - that looks like a pretty easy CSS fix to play around with
            in developer tools, and I'm not even a frontend person. Maybe a few
            minutes max?
       
          ai_fry_ur_brain wrote 1 day ago:
          Im faster than all these llm freaks. Im not convinced its faster to
          use llms, except maybe boilerplate (who cares).
          
          People can just be lazy and seem productive now, they're still lazy.
          
          We have people that now need access to hundreds of thousands in
          hardware to write an email. Miss me with that, im not frying my brain
          and becoming dependent on having access to a billionaires thinking
          machine.
          
          Im also not going to fry my brain with a local think for me machine
          either. I want to be more valuable than the hardware I have access
          to.
       
            slopinthebag wrote 1 day ago:
            Yeah there are some tasks where it is a definite speed-up but I
            think overall it's probably only marginally beneficial. Which is
            why, ~6 months into 10x productivity we aren't seeing ai boosters
            shipping 5 years worth of software.
       
              jimbokun wrote 23 hours 56 min ago:
              It's possible to produce 10x the lines of code.
              
              But that's not the same as producing 10x functionality that
              will be used or is wanted by users or customers.
       
            anakaine wrote 1 day ago:
            It seems that you've not worked out how to harness the LLM as a
            tool to improve your qualified knowledge and abilities in a domain,
            and have instead focused on whether or not its a crutch for lack of
            knowledge or laziness.
            
            When paired with your skill and knowledge, it is a force
            multiplier. You maintain control, the ability to direct, structure,
            strategise, and refine.
            
            That some are using it as the entire brain does not mean that this
            is how everyone is using it, or how you must use it. The models can
            be fantastic at breaking past certain issues, surfacing qualified
            information, and surfacing related distributed information to help
            you acquire it and pick up what you need on niche topics quickly.
            Something as basic as copilot hooked into sharepoint can make life
            a lot easier when you are in a big org. Something like claude code
            or codex can be great at hunting down issues in an unfamiliar code
            base rapidly. Whether or not you outsource the thinking component
            is entirely up to you, but ignoring the productivity side of the
            tool because it can do some of the thinking is a case of focusing
            too hard on the negative.
       
              ai_fry_ur_brain wrote 1 day ago:
              Im not denying its usefulness for Q&A on docs/code as a search
              tool. Im talking about people who use it to design and write their
              code, people who are offloading problem solving altogether, they
              aren't faster.
       
                qsera wrote 1 day ago:
                Yea man. That is what sensible people do. Use these as a better
                search, and use it to lookup, and learn stuff while YOU do
                stuff.
                
                And make maximum use of it to learn as much as possible, while
                it lasts...
       
            aabdi wrote 1 day ago:
            Consider this. U have a website. U have to translate to xx
            languages. Can u write it faster than an AI? If so how much faster
            can u do this?
            
            Is it valuable to u? Is it valuable to a Chinese person? A
            Spaniard?
            
            Google Translate counts as AI.
       
              latentsea wrote 1 day ago:
              Don't feed the troll.
       
            halfmatthalfcat wrote 1 day ago:
            You're fighting a battle you can't win. It doesn't matter what you think
            about those using LLMs, they will outproduce you and in corporate
            environments, shipping things is paramount. If I can ship 5 more
            things simultaneously with AI, I'm going to beat you even if you
            think you're creating "better" software.
       
              ai_fry_ur_brain wrote 1 day ago:
              They don't out perform me though...
       
              etdznots wrote 1 day ago:
              Example of what's been shipped?
       
                peteforde wrote 1 day ago:
                At this point, why would anyone in their right mind respond to
                this question and paint a target for all manner of negativity
                ranging from snark to harassment to malicious action?
       
                  SepiaSapient wrote 12 hours 25 min ago:
                  > target for all manner of negativity ranging from snark to
                  harassment to malicious action
                  
                  We get the Borg-esque "resistance is futile" spiel,
                  someone asks for examples. One guy (kinda smugly tbh) points
                  us to his (neat) online course website, claiming that it took
                  him 1 month to rebuild with Claude, ergo GP is right and the
                  non-AI dev is destined to extinction. As WooCommerce didn't
                  end all web development before, he gets some good-natured
                  ribbing.
                  
                  I find the AI booster dynamic of "you are fool and will get
                  replaced" to "I'm a smoll defenseless bean" kinda puzzling.
       
                    peteforde wrote 9 hours 41 min ago:
                    That is not a coherent reply to my point, which is that you
                    guys are like school yard bullies to people naive enough to
                    throw chum into the water.
                    
                    We've seen this play out so many times. Nobody working on
                    anything serious is going to volunteer to be a target for
                    your BS.
                    
                    I sincerely wish that people would stop falling for the
                    "prove you're not hallucinating" trap. If winning was
                    possible - and it's not - there would be no prize but more
                    snark and harassment.
       
                      SepiaSapient wrote 4 hours 48 min ago:
                      I didn't say Mr. Johnny was hallucinating, or that he's
                      lying. Was finding the "Written by humans" humorous the
                      most polite comment ever? No, but it is absurd to call it
                      harassment. Especially considering it matched his energy.
                      
                      His website is cool, and from what I could skim from the
                      content I'm sure his clients are happy and find it
                      worthwhile. I'm not being facetious. He said Claude saved
                      him time, which is true. Regardless of that, I believe he
                      wildly overestimated how much time it would've taken. A
                      website that could be a Wordpress install with plugins
                      isn't technically interesting. It does not validate what
                      @halfmatthalfcat said.
                      
                      LLMs are capable and impressive. I'm not doubting that,
                      but we do this song and dance [1] each time of grandiose
                      statements and subsequent disappointment. My wariness is
                      not violence against you or anybody.
                      
                      I especially resent being called a bully for not couching
                      my language in every possible way. I'm not the avatar of
                      your every forum trauma.
                      
  HTML                [1]: https://www.theregister.com/special-features/202...
       
                jen729w wrote 1 day ago:
                Okay. I rebuilt my website in ~a month with the help of Opus
                4.7/.8 and it would have taken me, unaided human, at least 6
                months. Link's in my bio if you care.
                
                Satisfied now? Will you stop asking this question? Thought not.
       
                  slopinthebag wrote 20 hours 27 min ago:
                  I could have written this site plus the browser to render it
                  in six months...
       
                  viking123 wrote 1 day ago:
                  lmao
       
                  kelsier_hathsin wrote 1 day ago:
                  Seriously a month? I could write a SSG itself to produce this
                  site in a month.
       
                  ai_fry_ur_brain wrote 1 day ago:
                  Why would this have taken 6 months? No offense, but this is a
                  few days work without llms (assuming the content already
                  exists). This should not have taken a month.
                  
                  Also, not trying to be an asshole. Props for not making it
                  look like every other llm generated slop site. It's just not
                  a great example.
       
                    spunker540 wrote 1 day ago:
                    I asked claude to crawl the website and summarize its
                    findings, took about 10minutes. I'm not sure I would've
                    done it faster, but i have no doubt you couldve done it in
                    5, and grokked the pages faster than an llm too. but anyway
                    heres what claude said:
                    
                      Based on what I already saw across those 2,924 pages,
                    here's the summary:
                    
                      It's a one-person business selling a file organisation
                    methodology called Johnny.Decimal. Three paid products
                    (personal, business, university/course tier). A substantial
                    blog - 200+ posts, updated weekly. Full documentation for
                    the system. A support knowledge base.
                    
                      The technical ambition is higher than the aesthetic
                    suggests. One person built auth, payments,
                    entitlement-gated downloads, a CLI, an API, AI tooling,
                    self-hosted analytics, self-hosted email (Listmonk on
                    PikaPods), personalized search, and keyboard navigation
                    with server-synced state. Then wrote 200 blog posts about
                    using the system in real life. 
                    
                      The "Written by humans" footer is not a boast about the
                    font. It's a position statement from someone who has
                    thought carefully about AI, published an essay about it,
                    and is making a deliberate choice. Every word on the site
                    was written by the creator. Whether you agree with the
                    choice or not, that's not the same as someone who slapped a
                    SSG together.
       
                      jen729w wrote 1 day ago:
                      That's not a terrible read of the site's tech. It
                      over-sells it a touch - I use Umami for analytics, for
                      example - but yeah, auth, payments, entitlement-gated
                      downloads, those downloads adapt to the app you've
                      selected in your settings, yada yada.
                      
                      I never said I was a good dev! That's why it would have
                      taken me 6 months. To pretend that I could have done it
                      in days is just silly.
                      
                      My point - site roast over - is that it's absurd to
                      suggest that LLMs don't help anyone 'ship' faster. Like
                      them or not, it's a fact that they do.
       
                  SepiaSapient wrote 1 day ago:
                  I'm looking at something fairly standard that can be made
                  with a SSG. The "Written by humans" footer gave a good
                  chuckle tho.
       
                    jen729w wrote 1 day ago:
                    I use Astro but it's not static, I server-render. There's a
                    whole bunch of other stuff once you're signed in.
       
                  ofjcihen wrote 1 day ago:
                  So look. I'm not trying to be a dick I promise.
                  
                  But I took a look at your site and I don't know if a month
                  would be impressive for a new and unaided dev. It looks nice
                  but yeah.
                  
                  If you're not a dev that's totally cool but like... all
                  I'm saying is this may not hit like you want it to.
       
                serf wrote 1 day ago:
                the quantum slop argument : "yeah it's everywhere but no one
                ships it."
       
            SecretDreams wrote 1 day ago:
            I understand this perspective. I'll just note that as the abilities
            increase, the intent is to have some non-coding IC or TPM/manager
            literally just managing some LLMs and cutting out some software
            engineers. The goodness is specifically to wholly replace people
            who code first and foremost, at least partially. The pricing goal is
            just that it has to cost less in tokens than the equivalent wage.
            
            And people who use LLMs to talk for them (e.g. email, slack) are
            deplorable. A completely disrespectful use case in my view.
       
              Ronsenshi wrote 1 day ago:
              The desire to get rid of software engineers is bizarre - because
              at the root of it, developers were there not to just write the
              code, but to ask the right questions and, based on those questions,
              build the right things.
              
              I've met in my professional life some managers or other middlemen
              who would be profoundly incapable of producing correct software
              no matter how smart of an AI agent they have access to. One of
              those - you don't know what you don't know.
              
              But, I guess this is the world we live in now. Going to be Mortal
              Kombat for positions in companies where software engineers are
              actually valued.
       
                rpcope1 wrote 1 day ago:
                Having worked in places across both extremes (software engineer
                doing lots of other things including BD, hardware, ops, etc. to
                just being a JIRA ticket machine monkey), I am suspicious that
                HN readership is biased towards the former and frankly the bulk
                of "software engineers" in the world _willingly_ exist in the
                latter category. I didn't experience the latter until later in
                my career and God Almighty was it uncomfortable, but I think if
                AI were to displace some subset of "software engineers" it
                would be those (they also seem to overwhelmingly dislike writing
                any prose whatsoever, which to me is a major tell). Many, many
                software engineers outside of hotshot shops seem either
                incapable or profoundly averse to "asking the questions" as you
                say.
       
                  anonzzzies wrote 1 day ago:
                  Most here on HN know sweatshops exist but seem to think
                  no people work there or use them. I have worked with (via
                  clients who used them) programmers in enormous buildings in
                  Bangalore, who have a camera behind them so you can watch
                  your people 24/7 and who just mindlessly transform jira
                  tickets into code; I keep saying; there is zero use for all
                  those millions of people at all; seems HN does not believe
                  that because they seem to not believe these people exist. I
                  worked with many over the past 30 years and by far most have
                  no real clue what they are doing so I also doubt they can be
                  re-educated for a new coexistence with LLMs.
       
                emodendroket wrote 1 day ago:
                It depends a lot where you work because there are lots of
                companies in the world where the business analyst does all of
                that and the developers exist to mindlessly translate their
                docs into code.
       
                  cebert wrote 1 day ago:
                  That sounds like an unmotivating working arrangement. It's
                  so rewarding to understand a customer need and help with the
                  design and implementation of the feature.
       
                    emodendroket wrote 1 day ago:
                    There's a reason I didn't stay in that domain, let me tell
                    you.
       
          senectus1 wrote 1 day ago:
          "Your scientists were so preoccupied with whether or not they could,
          they didn't stop to think if they should."
          
          I'm convinced this is going to be the summary of the 2020 decade...
       
            pianopatrick wrote 1 day ago:
            If we're in a simulation, maybe it's a simulation about the dangers
            of AI.
       
              adrianmonk wrote 1 day ago:
              If we're in a simulation, we are AI. But someone could be
              studying what happens when AI makes its own AI.
       
                anonzzzies wrote 1 day ago:
                They will 'soon' (few 1000 years max) shut us down probably.
       
            Ucalegon wrote 1 day ago:
            This is one of the places to manufacture the consent for that to
            take place, because we are commenting within an organization that
            has given the money to ensure that what could be done is done. Most
            people clapped and made money, who cares what happens next, making
            money is the only good that matters.
       
       
   DIR <- back to front page