        _______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                             on Gopher (unofficial)
  HTML Visit Hacker News on the Web
       
       
       COMMENT PAGE FOR:
  HTML   How to setup a local coding agent on macOS
       
       
        smetannik wrote 4 hours 38 min ago:
        I wonder why something like LM Studio didn't work for the author?
       
          b3ing wrote 2 hours 36 min ago:
          That’s what I was wondering, lm studio and draw things are easy to
          use apps that handle much of the cruft for you
       
            freerunnering wrote 1 hour 42 min ago:
            I do a lot of fine tuning and development with small models
            themselves (not just using an LLM over a HTTP API). So downloading
            the models directly and running them from the CLI was natural for
            me, so that's what I reached for when I wanted to play around with
            this.
       
        jumploops wrote 4 hours 50 min ago:
        I've been quite impressed with DeepSeek v4 Flash running via antirez's
        ds4[0].
        
        It feels like a GPT-4 class model in terms of "stored knowledge" but is
        better at long-horizon tool calling than any of the GPT-4 class models.
        
        Running on a 128GB MBP M4 Max, I'm getting ~24 t/s on generation and
        ~200 t/s on prefill. I was expecting it to feel slow, and it certainly
        does when e.g. generating code, but it's surprisingly useful as a
        "machine orchestrator" for simple tasks.
        
        For non-agentic usecases, it's a decent enough model to converse with,
        and has the benefit of being entirely self-contained/private.
        
        [0]
        
  HTML  [1]: https://github.com/antirez/ds4
       
        anigbrowl wrote 5 hours 1 min ago:
        This video is realtime. And shows the agent responding at a perfectly
        usable speed.
        
        Alas, this video appears not to have been linked to the text that
        describes it. Perhaps I should ask an AI to generate an artistic
        rendering of the author's description.
       
          freerunnering wrote 1 hour 43 min ago:
          The video is stuck in an `` tag so you need to wait for it to load.
          On a slow connection it might just not show for a while. Though the
          video is only 1MB so should load in if you wait.
       
        everlier wrote 6 hours 23 min ago:
        You can also install Harbor and then it's:
        
        harbor up omlx opencode
       
        bicepjai wrote 6 hours 38 min ago:
        I assumed lmstudio is the obvious choice after ollama. Is there a
        reason lmstudio is not used widely?
       
          dofm wrote 6 hours 8 min ago:
          LM Studio is fine. Gorgeous actually. I've found it really helpful
          for understanding parameters, settings, general figuring out.
          
          But there is an incentive not to use it if you want to write an
          article that uses only open-source tools, because it isn't.
       
          stingraycharles wrote 6 hours 34 min ago:
          Yeah I’ve also been using it on macOS, my experience is that it
          works better with the metal API and has better performance.
       
        mark_l_watson wrote 7 hours 29 min ago:
        Nice writeup, thanks.
        
        I run something very similar, except that instead of using pi directly
        as the agentic harness I use little-coder, which wraps pi with
        reasonable defaults for running local models. Even though my local
        setup is a bit slow, it is a thrill to do real work completely locally.
       
        reenorap wrote 7 hours 52 min ago:
        My biggest pet peeve with all these articles on local AI is the only
        thing they talk about is tokens per second. No one mentions the quality
        of the answers. No one. I don't mind waiting a little longer if the
        quality is better. Quickly serving me slop doesn't make it more useful.
        Are people really only looking at tokens per second?
       
          frollogaston wrote 5 hours 43 min ago:
          The model already has its own quality benchmarks elsewhere. The
          article is just about running the model on X hardware, so the
          remaining question is then how fast it is. Or does the output quality
          somehow depend on the hardware too?
       
          ozim wrote 6 hours 44 min ago:
          A local model as such will give you "autocomplete on steroids" but it
          is not going to run away and implement a cross-project feature like
          a frontier model in, let's say, Cursor.
          
          So there is no value in testing quality of answers, but there is
          value in testing token speed.
          
          You just have to have correct expectations.
       
          akman wrote 7 hours 31 min ago:
          That's fair. There are even many dimensions to define 'quality' which
          include use case (coding? writing? multimedia?) and prompt. I suppose
          if you ask testers to provide benchmarks with their analysis, that
          might hamper their desire to share.
       
        LoganDark wrote 8 hours 57 min ago:
        I poured a couple days into custom Burn inference for Qwen3-Coder-Next
        only to find it doesn't come with a speculative decoder, so on my M4
        Max I can't push it much further than 120t/s. That's still kinda slow,
        though still faster than llama.cpp's 70.9t/s and MLX's 80.6t/s with the
        same model. Claude Fable 5 is recommending I use the Qwen3 MTP -- I
        worry that will compromise the quality somewhat, but might give it a
        try to see if I can get more usable speeds.
       
        rectang wrote 9 hours 1 min ago:
        Does anybody run a local agent on a Mac using an outboard GPU?
       
          benbojangles wrote 8 hours 46 min ago:
          I run a second Mac for local llm use and access it remotely using ssh
          from the first mac
       
        sleepybrett wrote 9 hours 27 min ago:
        or you can just load up ollama, have it load a local model and point
        claude or opencode at it...
        
        is this article old? It's not. I'm not sure why he went through all the
        bother of llama.cpp
       
          malkosta wrote 9 hours 24 min ago:
          That was exactly my same question. Then I finished reading the post.
          The reason is pretty clear, and written in the post: it is faster
          than ollama+mlx.
       
            sleepybrett wrote 8 hours 58 min ago:
            how much faster?
       
              freerunnering wrote 1 hour 36 min ago:
              I was benchmarking different models, different engines, and
              different draft models, I posted a video on twitter, and people
              started asking about the setup in the final screen recording. So
              the blog post isn't so much "how a beginner should setup
              something" it's "here's the setup I posted in the video".
              
              Original video: [1] And in the blog post there is a table showing
              the different speeds I got from different engines.
              
              Slowest combo was 38.1 tk/s, and the fastest was 72.2 tk/s. All
              from "the same" model.
              
  HTML        [1]: https://x.com/Freerunnering/status/2065275403548168398
       
        metadaemon wrote 9 hours 38 min ago:
        Has anyone compared a setup like this to just using LM Studio?
       
          CharlesW wrote 9 hours 18 min ago:
          Yes, I can confirm that LM Studio works great for this.
       
        hanifbbz wrote 9 hours 39 min ago:
        Here's a visual post for using LM Studio and VS Code (and Pi): [1] One
        way or another local AI is the future. I actually find weaker models
        more interesting because it keeps me sharp (at the cost of velocity of
        course).
        
  HTML  [1]: https://blog.alexewerlof.com/p/local-llms-for-agentic-coding
       
        jmkni wrote 9 hours 42 min ago:
        FYI you can open Claude code in the terminal, point it at this article
        and just tell it to "do it", if you're feeling extra lazy
       
          echelon wrote 9 hours 19 min ago:
          This is the way.
          
          I'm not Googling much of anything anymore. 9/10 times the information
          is awful, it's hard to parse out of whatever other spam it's
          surrounded by. Meanwhile, Claude will just do the thing one-shot or
          with a tiny bit of refinement.
          
          The gateway to knowledge and getting stuff done is the LLM.
          
          Google Search is a dinosaur.
          
          It feels like we're living a century into the future. Not even
          smartphones were this cool.
       
            kingofthehill98 wrote 9 hours 4 min ago:
            Yeah, if the future is "Claude, think for me" I'm happy to stay at
            the good old present.
       
              echelon wrote 9 hours 1 min ago:
               [1] [2] New decade, same old argument.
              
              It's not
              
              > "Claude, think for me"
              
              It's
              
              > "Claude, be my subordinate and get this done for me"
              
              Instead of complaining on the sidelines, I'm getting a shit ton
              of work done.
              
  HTML        [1]: https://en.wikipedia.org/wiki/Is_Google_Making_Us_Stupid...
  HTML        [2]: https://newsletter.pessimistsarchive.org/p/when-educator...
       
                wwweston wrote 6 hours 33 min ago:
                As one famous agent said: “I say your civilization because as
                soon as we started thinking for you it really became our
                civilization which is of course what this is all about.”
                
                An argument can be as old as the search engine and hold real
                value. There are ways in which unreflective search engine use
                has misled and mistrained people.
                
                There’s always been argument to be had about how we manage
                and offload attention, what we gain and what we lose when
                resistance is reduced. It’s part of reflection that’s been
                necessary in order to make progress solid ground, and is more
                necessary with non-deterministic tech.
                
                The phrase “Tactical tornados” may be older than web search
                and describes people who also got a lot done.
                
                Models can be incredibly helpful boosters and situationally
                effective subordinates… and also patchy as a real engineering
                IC or org.
       
                this_user wrote 8 hours 30 min ago:
                > Instead of complaining on the sidelines, I'm getting a shit
                ton of work done.
                
                Nah, you are just producing a bunch of slop and hope that
                nobody notices.
       
                sdevonoes wrote 8 hours 39 min ago:
                > I'm getting a shit ton of work done.
                
                It’s weird when people are proud of doing a ton of work. I’m
                the opposite: I’m proud that I’m doing minimal stuff without
                LLMs.
       
                ultrarunner wrote 8 hours 48 min ago:
                For what it's worth, even this reply reads like LLM output.
                It's not "quote describing the scenario", it's "some other
                linked-in-coded plot twist". If you're the average of the
                people you spend the most time around, and you spend the most
                time around a chatbot, do you start to absorb its speech
                patterns and logic structures?
                
                Yeah, good ol' present for me too then, thanks.
       
            tobyhinloopen wrote 9 hours 8 min ago:
            Claude “respond in a friendly way that I agree with this
            comment”
       
        vladgur wrote 9 hours 48 min ago:
        I have used omlx.ai with great success to both download multiple mlx
        models (including gemma and qwen) suited for my hardware AND to be able
        to automagically launch both open-source and close-source (claude code,
        codex) harnesses using these models. All from a web or desktop UI
        
        You would not need to follow a blog post with omlx IMHO
       
          dofm wrote 7 hours 45 min ago:
          FWIW I have not, on a 64GB M1 Max, seen any advantage from oMLX
          specifically or MLX generally over GGUF with llama.cpp.
          
          The Gemma 4 MLX builds I have found so far have been slower at the
          same quantisation and much slower with MTP.
          
          The built-in web UI for llama.cpp is really quite good once you have
          chosen your model. Otherwise I quite like LM Studio for tinkering.
          
          One thing I would say is that both Gemma-4 and Qwen 3.6 simply do not
          need a large chunk of the typical opencode system prompt. Better off
          without it.
       
          Dotnaught wrote 8 hours 45 min ago:
          In case anyone is looking for a sandbox to go with oMLX and Pi:
          
  HTML    [1]: https://github.com/Dotnaught/pi-sandbox
       
            zmmmmm wrote 5 hours 31 min ago:
            it looks handy but ...
            
                sbx policy set-default open
            
            just so the single pi sandbox can talk to localhost? ... this gives
            me some grave doubts about the rest of it being set up well.
       
            dofm wrote 7 hours 43 min ago:
            This is useful. I'm still tinkering with Multipass VMs because I
            need the whole VM environment anyway and I'm on Sequoia. But I'd be
            interested if you did anything like that with Apple's container CLI
            instead; sooner or later I will have to upgrade to Tahoe because I
            want to play with the container CLI (and apfel).
       
          fridder wrote 9 hours 45 min ago:
          It truly is the SOTA for local inference on mac. Even when there are
          regressions the dev(s) are insanely responsive. It is the most
          impressive open-source project I've seen in a while
       
            benbojangles wrote 8 hours 47 min ago:
            oMLX needs to incorporate use of macOS native shortcuts: macOS can
            almost instantly extract text from PDFs and do a bunch of other
            things using its ANE (Neural Engine), which keeps unified RAM free
            for LLM use. The two together would be awesome.
       
        Aurornis wrote 9 hours 50 min ago:
        > The benchmark prompt was:
        
        > Write a compact Python function that parses a unified diff and
        returns the changed file paths. Then explain two edge cases.
        
        > Each benchmark generated about 128 tokens.
        
        Generating 128 tokens is probably not enough for good benchmark
        results. MTP speedup depends on how often the predicted tokens are
        accepted. In my experience, the very early output has a higher
        acceptance rate, so short testing can give false positive speedups.
        
        llama.cpp includes a tool specifically for benchmarking that will sweep
        the arguments for you so you don't have to restart the server and send
        it prompts: [1] EDIT: Also the section about downloading the models
        should have mentioned that llama.cpp has a "-hf" argument that will
        download the models for you. I appreciate the author for sharing their
        experience, but for beginners this might not be the best guide to use.
        
  HTML  [1]: https://github.com/ggml-org/llama.cpp/blob/master/tools/llama-...
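        
        To make the acceptance-rate point concrete, here is a rough Python
        sketch of how many tokens come out per verification pass. It assumes
        every drafted token is accepted independently with the same
        probability, which is a textbook simplification and not how llama.cpp
        measures anything. The function name and the example rates (0.9 for
        "early output", 0.6 for "later output") are made up for illustration.

```python
def expected_tokens_per_step(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model pass.

    Assumption: each of the `draft_len` drafted tokens is accepted
    independently with probability `acceptance_rate`. The pass always
    yields at least one token (the target model's own prediction).
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if acceptance_rate == 1.0:
        # Every drafted token is accepted, plus one bonus token.
        return float(draft_len + 1)
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1.0 - acceptance_rate ** (draft_len + 1)) / (1.0 - acceptance_rate)


if __name__ == "__main__":
    # Hypothetical acceptance rates, not measurements from the article.
    for label, rate in [("early output", 0.9), ("later output", 0.6)]:
        tokens = expected_tokens_per_step(rate, draft_len=3)
        print(f"{label}: {tokens:.2f} tokens per pass")
```

        This counts tokens per pass, not wall-clock speedup, because drafting
        itself costs time. The point is only that a run short enough to stay
        in the high-acceptance region will overstate the average.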
       
          freerunnering wrote 1 hour 48 min ago:
          > I appreciate the author for sharing their experience, but for
          beginners this might not be the best guide to use.
          
          Yeah, I didn't write this as a proper developer guide. My screen
          recording started getting loads of favourites and I started getting
          messages asking about how I set it up, so I just threw up a quick
          rundown of how I set up this test.
          
          I literally just saw the Unsloth announcement about "Double the speed"
          and thought "Ha. I wonder if that will get it fast enough I'd
          actually be prepared to use it" and had a go at setting it up.
          
          I'd done tests before last year with things like Devstral, but they
          were always both so slow and dumb, I didn't want to bother.
          
          This finally hit the "wow, this is useable" level of both speed and
          intelligence.
       
          willXare wrote 6 hours 8 min ago:
          At 128 tokens, you’re benchmarking the overture, not the opera.
       
          reactordev wrote 7 hours 28 min ago:
          This is akin to saying “it runs on my machine” without actually
          examining the problem. Sad. You’re absolutely right that 128 tokens
          is nothing, it’s a little more than a hello response.
       
          liuliu wrote 9 hours 4 min ago:
          Realistically, you need to experiment with any user prompt + a good
          amount of system prompt (at least > 1000 tokens, but realistically,
          something in the range of 3000 tokens is probably good).
          
          llama.cpp includes tools for that. What you are looking for is a
          prefill before token generation, so you can measure it properly.
          Increasingly also, measuring token generation speed at longer context
          (32k or 64k) is important too.
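          
          A back-of-the-envelope sketch of why prefill belongs in the
          measurement. The throughput figures are borrowed from another
          comment in this thread (~200 t/s prefill, ~24 t/s generation on an
          M4 Max), the 3000-token prompt is the size suggested above, and 128
          is the article's output length. It assumes both rates stay constant
          regardless of context length, which they do not in practice.

```python
def request_seconds(prompt_tokens: int, output_tokens: int,
                    prefill_tps: float, gen_tps: float) -> tuple[float, float]:
    """Return (prefill_seconds, generation_seconds) for one request.

    Assumption: throughput is flat, i.e. it does not degrade as the
    context grows. Real runs slow down at longer context.
    """
    if prefill_tps <= 0 or gen_tps <= 0:
        raise ValueError("throughput must be positive")
    return prompt_tokens / prefill_tps, output_tokens / gen_tps


if __name__ == "__main__":
    prefill, generation = request_seconds(
        prompt_tokens=3000, output_tokens=128,
        prefill_tps=200.0, gen_tps=24.0,
    )
    print(f"prefill:    {prefill:.1f} s")
    print(f"generation: {generation:.1f} s")
```

          With those inputs the prompt processing takes longer than the
          generation, so a benchmark that reports only generation speed
          misses most of the wait.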
       
        attogram wrote 9 hours 57 min ago:
        8b max on a std 16gb macbook.  Anything more and your mac is toast
       
          benbojangles wrote 8 hours 43 min ago:
          70b on my M1 max 64gb
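          
          For anyone trying to square "8b on 16gb" with "70b on 64gb", a
          rough estimate of weight memory is parameters times bits per weight.
          This sketch ignores the KV cache, activations, and whatever macOS
          itself needs, and real quantization formats use somewhat more than
          their nominal bit width. The bit widths below are assumptions;
          neither comment says which quantization was used.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


if __name__ == "__main__":
    # Hypothetical quantization levels, for illustration only.
    for params, bits in [(8, 16), (8, 4), (70, 4)]:
        print(f"{params}B at {bits}-bit: ~{weight_gib(params, bits):.1f} GiB")
```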
       
        reddit_clone wrote 10 hours 0 min ago:
        >64 GB
        
        That's the rub.
        I have an M4 with 48G. I wonder if it is worth testing this out.
        
        My past attempts (with Ollama and various LLMs) were too slow to use.
       
          dofm wrote 7 hours 40 min ago:
          Some of these models will be a bit of a squeeze at Q4_0 I suspect;
          almost certainly they will be using CPU. Probably the 31B Gemma will
          be too much. Maybe not the Gemma-4 26B QAT.
          
          But if you just want to play around rather than code, you really
          might find the Gemma 4 12B model worth mucking about with just so
          you've gone through the steps. Especially if you want to muck about
          with image analysis or audio transcription.
          
          If you're writing PHP I think you could even find it good enough.
          I've been modestly surprised. You can do that basic fiddling with the
          Edge AI Gallery app, which can enable thinking and has a customisable
          system prompt and some agent support.
          
          You could also try the 14B Deepseek R1.
          
          Honestly even if it is not good enough, if you are anything like me,
          I think you'll find that going through this process is really quite
          educational — it has made a lot of things more concrete for me in a
          way that I have found reassuring and valuable.
       
          contingencies wrote 8 hours 33 min ago:
          M4 24GB here. You'll be fine, if you're anything like me minor
          latency is acceptable to obtain (a) privacy (b) reliability (c)
          CI/CD/guardrails (d) network independence (e) future-proofing vs.
          AIaaS. [1] gives you intelligent local hardware based model download
          recommendations. That said it probably depends heavily on your
          workload, process and polish expectations. See also [2]
          
  HTML    [1]: https://omlx.ai/
  HTML    [2]: https://news.ycombinator.com/item?id=48089091
       
            spike021 wrote 2 hours 23 min ago:
            what are you using on yours? I've got a M4 Pro 24GB also. tried the
            open source gpt one. it's alright but I found it can get stuck at
            times. maybe just my config in LM Studio.
       
          codazoda wrote 9 hours 1 min ago:
          I'm running an M3 on an Air with just 16GB. I can still get useful
          results without an internet connection in "chat mode". It's a
          different experience than using Claude, for sure, but it's workable.
          I typically use the Qwen variants these days.
       
            mark_l_watson wrote 7 hours 13 min ago:
            This might be useful when ‘coding in chat mode’: I have a few
            scripts that I run in a project directory that take a prompt from
            me and create a single long one-shot prompt that I can paste into
            a chat window, asking that any generated code be inside markdown
            code blocks for easier copy/pasting. Also, pardon the plug, but you
            can read my new tiny book free online that documents my experiences
            using agentic coding on my 16G Mac and my 32G Mac:
            
  HTML      [1]: https://leanpub.com/read/local-coding-agents
       
              codazoda wrote 6 hours 55 min ago:
              Looks cool, I’ll check out the book. Your download links (PDF
              and EPUB) are down for me.
              
              > NoSuchKey: The specified key does not exist…
       
          hkchad wrote 9 hours 50 min ago:
          I have a M5 MAX with 128, local models are toys compared to hosted
          ones. I've spent a lot of time and money trying to make it work even
          1/2 as well.
       
            dofm wrote 6 hours 5 min ago:
            It all depends on what you want to do, I guess.
            
            If you're seeking the kind of hands-off claude experience,
            obviously not. They are slow.
            
            If you want to learn how these things work, train them locally,
            tinker, play with the code, grasp the fundamentals, or just out of
            sheer bloody-mindedness and principle refuse to tether the
            functioning of your application to a cloud API...
       
        namnnumbr wrote 10 hours 23 min ago:
        oMLX ( [1] ) makes running the mlx inference server quite easy for
        those interested in UI-based hosting. oMLX also supports mtp or dflash
        drafting.
        
  HTML  [1]: https://github.com/jundot/omlx
       
          w10-1 wrote 9 hours 52 min ago:
          Agreed (not sure what you mean by UI-based hosting).
          
          oMLX does the caching I need to fit models that are near gross
          memory, and it handles most of the work in finding usable models. 
          After cobbling together various solutions over months, I now just use
          oMLX, often from Xcode.  I can tell the difference between Gemma-4
          (local/free) and Claude (paid) only on the largest tasks.
       
        dofm wrote 10 hours 23 min ago:
        Useful stuff in here that I wish I'd seen a few days ago :-)
        
        I am not convinced that the MTP setup for the QAT model adds very much
        in terms of speed on my M1 Max, but it is definitely worth
        experimenting with.
        
        Fiddling about with local models has done so much for my conceptual
        understanding of what is going on.
        
        FWIW and YMMV but I also found the Gemma 4 MTP head was occasionally
        breaking markup in Opencode, causing the thinking to display untidily
        and ultimately in some cases missing the stop token. So I've stopped
        using MTP there for now.
        
        Recent Qwen 3.6 models have developer role support so it will
        occasionally surprise you with a structured multiple choice
        questionnaire.
       
          mark_l_watson wrote 7 hours 19 min ago:
          when I started using QAT recently, I stopped trying to improve my
          configuration after that. I will try tuning my local environment
          again in a few months, but with QAT things are good enough for now.
       
          mft_ wrote 10 hours 5 min ago:
          I found a marginal downside to Qwen3.6-35B-A3B-MTP vs. the non-MTP
          equivalent on an M1 Max. I’ll maybe experiment with settings
          further though.
       
            freehorse wrote 9 hours 25 min ago:
            And the upsides of using draft models for MOE models with so low
            number of active parameters (as here or as in the article) are
            quite low, compared to dense models where you can get enormous
            speedups. I would prefer running the dense 27b models with
            speculative decoding instead.
       
              dofm wrote 8 hours 12 min ago:
              That is what I have learned, yes. Not tested the dense Qwen yet.
              IIRC the 31B Gemma was slow enough that I doubt MTP will help me
              much.
       
            dofm wrote 9 hours 49 min ago:
            Yeah. I think it might speed up time to first token but I am not
            sure how much that matters.
            
            I do enjoy their different personalities when they are tackling
            "explain this" type puzzles, though.
            
            Gemma writes so well — like a concise code blogger. It makes you
            understand that the thing we hate about AI slop writing is
            specifically the cheesy, marketingese sycophantic ChatGPT tone.
            It's a choice to sound that way.
            
            Qwen writes more tersely by default, like much English-language
            documentation in Chinese open source projects. A couple of lines,
            code example, fact, code example, line of blurb.
            
            I use this prompt every now and then with a new model. It's
            obviously a classic SQL puzzle but I've asked new web developers
            this in the past (prompted by discovering that a client's
            subcontractor didn't understand it and was therefore unable to
            migrate some code from relying on dodgy pre-MySQL 5.x behaviours)
            
            —
            
              I have a MySQL 5 table like this: [id, label, category, score].  
            It contains a list of items in different categories (text names
            like cat1, cat2, cat3) with a numerical score. Is there a way I can
            write a SQL query to find the item in each category that has the
            highest score, without using a subquery? No two entries in any
            category share a score.
            
            —
            
            I enjoy seeing what it deduces from the subtext.
            
            Without "thinking" mode on, they always initially fail and you need
            to prompt them to find the answer. With thinking mode, they both
            produce really nice explanations.
            
            For me, as an old freelancer who is pretty cynical about vibe
            coding or "agentic engineering", what I really want is an AI tool
            that can help me start to solve problems and help me find the right
            terminology or generate some boilerplate I can tinker with. Both of
            these models do fine at the kind of "starter" writing that I want
            when I am trying to untangle an idea.
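            
            For readers who want to check a model's answer to the puzzle
            above, one common subquery-free approach is a self LEFT JOIN that
            keeps rows with no higher-scoring row in the same category. This
            sketch runs it on SQLite as a stand-in for MySQL 5, with a made-up
            table name (items) and made-up rows. It is not necessarily the
            answer the puzzle's author had in mind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items ("
    "id INTEGER PRIMARY KEY, label TEXT, category TEXT, score INTEGER)"
)
# Hypothetical sample data; no two rows in a category share a score.
conn.executemany(
    "INSERT INTO items (label, category, score) VALUES (?, ?, ?)",
    [
        ("apple", "cat1", 3),
        ("pear", "cat1", 7),
        ("kiwi", "cat2", 5),
        ("plum", "cat2", 2),
        ("fig", "cat3", 9),
    ],
)

# For each row `a`, try to find a row `b` in the same category with a
# higher score. If none exists (b.id IS NULL), `a` is the category's top.
TOP_PER_CATEGORY = """
SELECT a.category, a.label, a.score
FROM items AS a
LEFT JOIN items AS b
  ON b.category = a.category AND b.score > a.score
WHERE b.id IS NULL
ORDER BY a.category
"""

rows = conn.execute(TOP_PER_CATEGORY).fetchall()
for category, label, score in rows:
    print(category, label, score)
```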
       
        ig0r0 wrote 10 hours 24 min ago:
        I wrote a similar post some time ago; I just used ollama and opencode:
        
  HTML  [1]: https://blog.kulman.sk/running-local-llm-coding-server/
       
          takethebus wrote 8 hours 52 min ago:
          this is the way, given anyone could swap for oh my pi / pi / etc
       
            mark_l_watson wrote 7 hours 22 min ago:
            yes, whether for home experiments or at work, it is good practice
            (good hygiene) to be able to swap out both agentic harnesses and
            models. It is important to have a good strategy for exporting
            skills, etc.
       
          sleepybrett wrote 9 hours 26 min ago:
          actually useful and the ollama gui could probably even simplify this
          more.
       
        c-hendricks wrote 10 hours 31 min ago:
        Not sure you really need huggingface-cli to download anything if you're
        just using llama.cpp. You can pass `-hf ...` and it will download the
        models for you. Set `LLAMA_CACHE` to change where the downloads go:
        
          LLAMA_CACHE="models" ./llama-server \
            -hf unsloth/gemma-4-31B-it-GGUF:UD-Q4_K_XL \
            ...
       
          dofm wrote 10 hours 27 min ago:
          Yes.
          
          -hfd for the draft model.
       
            c-hendricks wrote 10 hours 12 min ago:
            Nice, was wondering if there was a flag for the draft as well.
            
            Not knocking huggingface-cli, just find it's much easier for people
            to try out this stuff when they can just
            
              mise use --global github:ggml-org/llama.cpp
              LLAMA_CACHE="models" llama-server \
                -hf unsloth/gemma-4-26B-A4B-it-qat-GGUF:UD-Q4_K_XL \
                --host 0.0.0.0 \
                --port 11434 \
                ...
       
              dofm wrote 5 hours 38 min ago:
              --no-mmproj
              
              is also pretty useful if you're doing this just to try agentic
              coding and you're not processing images/voice. Stops it
              downloading the multimodal projector.
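              
              Pulling the flags from this subthread together, here is a small
              Python sketch that assembles the llama-server command line
              without running it. The helper name and its defaults are
              invented for illustration. The flags themselves (-hf, -hfd,
              --no-mmproj, --host, --port) and LLAMA_CACHE are the ones
              mentioned above. No draft repo is filled in because none is
              named in the thread.

```python
import os
import shlex
from typing import Optional


def llama_server_command(
    model_repo: str,
    draft_repo: Optional[str] = None,
    cache_dir: str = "models",
    host: str = "0.0.0.0",
    port: int = 11434,
    skip_mmproj: bool = True,
) -> tuple[list[str], dict[str, str]]:
    """Build (argv, env) for llama-server; hypothetical helper."""
    # LLAMA_CACHE controls where -hf downloads are stored.
    env = dict(os.environ, LLAMA_CACHE=cache_dir)
    argv = ["llama-server", "-hf", model_repo, "--host", host, "--port", str(port)]
    if draft_repo is not None:
        # -hfd downloads the draft model used for speculative decoding.
        argv += ["-hfd", draft_repo]
    if skip_mmproj:
        # Skip the multimodal projector when only doing text/coding.
        argv.append("--no-mmproj")
    return argv, env


argv, env = llama_server_command("unsloth/gemma-4-26B-A4B-it-qat-GGUF:UD-Q4_K_XL")

if __name__ == "__main__":
    print(f"LLAMA_CACHE={env['LLAMA_CACHE']} {shlex.join(argv)}")
    # To actually launch it: subprocess.run(argv, env=env, check=True)
```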
       
        cdolan wrote 10 hours 45 min ago:
        Is there a link to the video? It did not render when I went to the
        page. Curious about the real-time feel of this
       
          dewey wrote 10 hours 18 min ago:
          That's the direct link:
          
  HTML    [1]: https://ikyle.me/blog/2026/how-to-setup-a-local-coding-agent...
       
            c-hendricks wrote 9 hours 45 min ago:
            Note this is cut to just before the model responds, so not a great
            way for people to judge the real-time feel of this.
       
              freerunnering wrote 1 hour 40 min ago:
              The full video is on Twitter: [1] Plus a followup one where you
              see me type the question in and press enter (though that video is
              with Qwen 3.6, not Gemma 4)
              
  HTML        [1]: https://x.com/Freerunnering/status/2065275403548168398
  HTML        [2]: https://x.com/Freerunnering/status/2065354101878055038
       
       
   DIR <- back to front page