codevoid.de/1/hn/comments_47088037.gph

        _______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                             on Gopher (inofficial)
  HTML Visit Hacker News on the Web
       
       
       COMMENT PAGE FOR:
  HTML   Ggml.ai joins Hugging Face to ensure the long-term progress of Local AI
       
       
        jpcompartir wrote 1 day ago:
        This is great, brings clear benefits to both sides and the rest of us.
        
        Always rooting for Hugging Face
       
        mhher wrote 1 day ago:
        It's great to see the ggml team getting proper backing. Keeping
        inference in bare-metal C/C++ without the Python bloat is the only way
        local AI is going to scale efficiently. Well deserved for Georgi,
        Johannes, Piotr, and the rest of the team.
       
        am17an wrote 1 day ago:
        One often overlooked after that is ggml, the tensor library that runs
        llama.cpp is not based on pytorch, rather just plain cpp. In a world
        where pytorch dominates, it shows that alternatives are possible and
        are worthy to be pursued.
       
        car wrote 1 day ago:
        So great to see my two favorite Open Source AI projects/companies
        joining forces.
        
        Since I don't see it mentioned here, LlamaBarn is an awesome
        littleâbut mightyâMacOS menubar program, making access to
        llama.cpp's great web UI and downloading of tastefully curated models
        easy as pie. It automatically determines the available model- and
        context-sizes based on available RAM. [1] Downloaded models live in:
        
          ~/.llamabarn
        
        Apart from running on localhost, the server address and port can be set
        via CLI:
        
          # bind to all interfaces (0.0.0.0)
          defaults write app.llamabarn.LlamaBarn exposeToNetwork -bool YES
        
          # or bind to a specific IP (e.g., for Tailscale)
          defaults write app.llamabarn.LlamaBarn exposeToNetwork -string
        "100.x.x.x"
        
          # disable (default)
          defaults delete app.llamabarn.LlamaBarn exposeToNetwork
        
  HTML  [1]: https://github.com/ggml-org/LlamaBarn
       
          noisy_boy wrote 1 day ago:
          Github is showing me unicorn - is there an Linux equivalent? I have a
          old Thinkpad with a puny Nvidia GPU, can I hope to find anything
          useful to run on that?
       
            car wrote 22 hours 47 min ago:
            Building Llama.cpp from source with CUDA enabled should get you
            pretty far. llama-server has a really good web UI, the latest
            version supports model switching.
            
            As for models, plenty of GGUF quantized (down to 2-bit) available
            on HF and modelscope.
       
        ontouchstart wrote 1 day ago:
        I have played with both mlx-lm and llama.cpp after I bought a 24GB M5
        MacBook Pro last year.
        
        Then I fell down the rabbit holes of uv, rust and C++ and forgot about
        LLMs. Today after I saw this announcement and answered someoneâs
        question about how to set it up, when I got home, I decided play with
        llama.cpp again.
        
        I was surprised and impressed: [1] I am not going to use mlx-lm or
        lmstudio anymore. llama.cpp is so much fun.
        
  HTML  [1]: https://ontouchstart.github.io/rabbit-holes/llama.cpp/
       
        sbinnee wrote 1 day ago:
        I am happy for ggml team. They did so much work for quantization and
        actually made it available to everyone. Thank you.
       
        snowhale wrote 1 day ago:
        good to see them get proper backing. llama.cpp is basically
        infrastructure at this point and relying on volunteer maintainers for
        something this critical was starting to feel sketchy.
       
        moralestapia wrote 1 day ago:
        I hope Georgi gets a big fat check out of this, he deserves it 100%.
       
        cyanydeez wrote 1 day ago:
        Is there a local webui that integrates with Hugging face?
        
        Ollama and webui seem to rapidly lose their charm. Ollama now includes
        cloud apis which makes no sense as a local.
       
        forty wrote 1 day ago:
        Looks like someone tried to type "Gmail" while drunk...
       
          rkomorn wrote 1 day ago:
          Looks like Gargamel of Smurfs fame to me.
       
        lukebechtel wrote 1 day ago:
        Thank you Georgi <3
       
        kristianp wrote 1 day ago:
        > Towards seamless âsingle-clickâ integration with the transformers
        library
        
        That's interesting. I thought they would be somewhat redundant. They do
        similar things after all, except training.
       
        karmasimida wrote 1 day ago:
        Does local AI have a future? The models are getting ridiculously big
        and any storage hardware is hoarded by few companies for next 2 years
        and nvidia has stopped making consumer GPU for this year.
        
        It seems to me there is no chance local ML is going to be anywhere out
        of the toy status comparing to closed source ones in short term
       
          dust42 wrote 1 day ago:
          I am actually doing now a good part of dev with Qwen3-Coder-Next on
          an M1 64GB with Qwen Code CLI (a fork of Gemini CLI). I very much
          like
          
            a) to have an idea how much tokens I use and 
            b) be independent of VC financed token machines and 
            c) I can use it on a plane/train
          
          Also I never have to wait in a queue, nor will I be told to wait for
          a few hours. And I get many answers in a second.
          
          I don't do full vibe coding with a dozen agents though. I read all
          the code it produces and guide it where necessary.
          
          Last not least, at some point the VC funded party will be over and
          when this happens one better knows how to be highly efficient in AI
          token use.
       
            ttoinou wrote 1 day ago:
            How much tokens per seconds are you getting ?
            
            Whats the advantage of qwen code cli over opencode ?
       
              dust42 wrote 1 day ago:
              320 tok/s PP and 42 tok/s TG with 4bit quant and MLX. Llama.cpp
              was half for this model but afaik has improved a few days ago, I
              haven't yet tested though.
              
              I have tried many tools locally and was never really happy with
              any. I tried finally Qwen Code CLI assuming that it would run
              well with a Qwen model and it does. YMMV, I mostly do javascript
              and Python. Most important setting was to set the max context
              size, it then auto compacts before reaching it. I run with 65536
              but may raise this a bit.
              
              Last not least OpenCode is VC funded, at some point they will
              have to make money while Gemini CLI / Qwen CLI are not the
              primary products of the companies but definitely dog-fooded.
       
          rhdunn wrote 1 day ago:
          Mistral have small variants (3B, 8B, 14B, etc.), as do others like
          IBM Granite and Qwen. Then there are finetunes based on these models,
          depending on your workflow/requirements.
       
            karmasimida wrote 1 day ago:
            True, but anything remotely useful is 300B and above
       
              Eupolemos wrote 1 day ago:
              That is a very broad and silly position to take, especially in
              this thread.
              
              I use Devstral 2 and Gemini 3 daily.
       
        fancy_pantser wrote 1 day ago:
        Was Georgi ever approached by Meta? I wonder what they offered (I'm
        glad they didn't succeed, just morbid curiosity).
       
        mattfrommars wrote 2 days ago:
        I donât know if this warrants a separate thread here but I have to
        askâ¦
        
        How can I realistically get involved the AI development space? I feel
        left out with whatâs going on and living in a bubble where AI is
        forced into by my employer to make use of it (GitHub Copilot), what is
        a realistic road map to kinda slowly get into AI development, whatever
        that means
        
        My background is full stack development in Java and React, albeit
        development is slow.
        
        Iâve only messed with AI on very application side, created a local
        chat bot for demo purposes to understand what RAG is about to running
        models locally. But all of this is very superficial and I feel Iâm
        not in the deep with what AI is about. I get Iâm too âlateâ to be
        on the side of building the next frontier model and makes no sense,
        what else can I do?
        
        I know Python, next step is maybe do âLLM from scratchâ? Or I pick
        up Google machine learning crash course certificate? Or do recently
        released Nvidia Certification?
        
        Iâm open for suggestions
       
          swyx wrote 1 day ago:
          go thru workshops here
          
  HTML    [1]: https://www.youtube.com/@aiDotEngineer/
       
          w10-1 wrote 1 day ago:
          The competition for root and branch AI models and infrastructure is
          intense and skilled.
          
          But if you're adjacent to some leaf use-case for AI, you're likely
          already as good as anyone else at productizing it.
          
          And that's who is getting hired: people who show they can deliver
          product-market fit.
       
          breisa wrote 1 day ago:
          Maybe look into model finetuning/distilation. Unsloth [1] has great
          guides and provides everything you need to get started on Google
          Colab for free.
          
  HTML    [1]: https://unsloth.ai/
       
          fc417fc802 wrote 1 day ago:
          I'm not entirely clear what your goals are but roughly, just figure
          out an application that holds your interest and build a model for it
          from scratch. Probably don't start with an LLM though. Same as for
          anything else really. If you're interest in computer graphics then
          decide on a small scale project and go build it from scratch. Etc.
       
        simonw wrote 2 days ago:
        It's hard to overstate the impact Georgi Gerganov and llama.cpp have
        had on the local model space. He pretty much kicked off the revolution
        in March 2023, making LLaMA work on consumer laptops.
        
        Here's that README from March 10th 2023 [1] > The main goal is to run
        the model using 4-bit quantization on a MacBook. [...] This was hacked
        in an evening - I have no idea if it works correctly.
        
        Hugging Face have been a great open source steward of Transformers, I'm
        optimistic the same will be true for GGML.
        
        I wrote a bit about this here:
        
  HTML  [1]: https://github.com/ggml-org/llama.cpp/blob/775328064e69db1ebd7...
  HTML  [2]: https://simonwillison.net/2026/Feb/20/ggmlai-joins-hugging-fac...
       
          ushakov wrote 2 days ago:
          i am curious, why are your comments always pinned to the top?
       
            magicalhippo wrote 1 day ago:
            New comments get a boost, and as such are frequently near the top
            just due to that. Frequent upvotes also boosts. There might be
            other factors.
            
            However these things are dynamic and change over time. As I read
            the discussion just now, the GP comment was the ~5th top-level
            comment.
       
            satvikpendem wrote 1 day ago:
            They aren't pinned, people just vote on them, and more so because
            simonw is a recognizable name with lots of posts and comments.
       
            francispauli wrote 1 day ago:
            thanks for reminding me i need to follow his blog weekly again
       
            throwaway2027 wrote 2 days ago:
            Time flies and simonw his AI feedback isn't always received
            favorably, sometimes he pushes it too much.
       
            llm_nerd wrote 2 days ago:
            HN goes through phases. I remember when patio11 was the star of the
            hour on here. At another time it was that security guy (can't
            remember his name).
            
            And for those who think it's just organic with all of the upvotes,
            HN absolutely does have a +/- comment bias for users, and it does
            automatically feature certain people and suppress others.
       
              rymc wrote 2 days ago:
              the security you mean is probably tptacek ( [1] )
              
  HTML        [1]: https://news.ycombinator.com/user?id=tptacek
       
              imiric wrote 2 days ago:
              > And for those who think it's just organic with all of the
              upvotes, HN absolutely does have a bias for authors, and it does
              automatically feature certain people and suppress others.
              
              Exactly.
              
              There are configurable settings for each account, which might be
              automatically or manually setâI'm not sureâ, that control the
              initial position of a comment in threads, and how long it stays
              there. There might be a reward system, where comments from
              high-karma accounts are prioritized over others, and accounts
              with "strikes", e.g. direct warnings from moderators, are
              penalized.
              
              The difference in upvotes that account ultimately receives, and
              thus the impact on the discussion, is quite stark. The more
              visible a comment is, i.e. the more at the top it is, the more
              upvotes it can collect, which in turn makes it stay at the top,
              and so on.
              
              It's safe to assume that certain accounts, such as those of YC
              staff, mods, or alumni, or tech celebrities like simonw, are
              given the highest priority.
              
              I've noticed this on my own account. Before being warned for an
              IMO bullshit reason, my comments started to appear near the
              middle, and quickly float down to the bottom, whereas before they
              would usually be at the top for a few minutes. The quality of
              what I say hasn't changed, though the account's standing, and
              certainly the community itself, has.
              
              I don't mind, nor particularly care about an arbitrary number.
              This is a proprietary platform run by a VC firm. It would be
              silly to expect that they've cracked the code of online
              discourse, or that their goal is to keep it balanced. The
              discussions here are better on average than elsewhere because of
              the community, although that also has been declining over the
              years.
              
              I still find it jarring that most people would vote on a comment
              depending on if they agree with it or not, instead of engaging
              with it intellectually, which often pushes interesting comments
              to the bottom. This is an unsolved problem here, as much as it is
              on other platforms.
       
                Eisenstein wrote 1 day ago:
                There is a saying that if everyone you encounter seems to be
                unreasonable, maybe it isn't the other people that are being
                unreasonable.
                
                This isn't to say that social media is fair, or that people
                vote properly or that any ranking system based on agreement by
                readers is a good one. However, generally when you are getting
                negativity communicated to you and you are seeing consistently
                poor results around actions you take, it is going to be useful
                to examine the possibility that there is a difference in how
                you perceive what you are doing vs how others do. In that case
                spending time trying to figure out ways in which you are being
                wronged so that you can continue in the same manner is going to
                be time wasted.
       
                  llm_nerd wrote 20 hours 59 min ago:
                  You seem to be assuming that everything is organic and above
                  board on here. That it's all just user/community stimuli, and
                  if someone flies high well clearly it's great content, from
                  which we can infer the reverse as well.
                  
                  We don't have the source for HN, nor do we have the obvious
                  bias metadata that the moderators have put in place, but
                  simply paying attention betrays that manipulation mechanisms
                  exist and are heavily utilized.
                  
                  For instance I clearly have a "bad guy" flag on my account,
                  and frequently see my highly rated comments sorted below
                  literally greyed out comments. Comments older than mine, so
                  it isn't just the normal "well newer comments get a boost",
                  it's just that there is a comment "DEI" in place where some
                  people get a freebie boost and some people get a freebie
                  detriment. It's why often mediocre content and comments by
                  the core group is always floating high.
                  
                  And let me make it very clear that I do not care. I don't
                  harbour any delusions about some tight community or the like,
                  and HN is not important in my life or my ego. I also know
                  that it's basically a propaganda network for YC (I
                  mean...it's right in the URL), and good for them. It's their
                  site and they can do anything they want with it.
                  
                  I only commented because some people really think this place
                  is a meritocracy+democracy. That isn't how it works, even if
                  they really want people to think that.
       
                    Eisenstein wrote 14 hours 30 min ago:
                    No one is under the assumption that any social media space
                    is going to be meritocratic or democratic. The assumption
                    is that some percentage of users are manipulating it and
                    the backend and admins are doing the same. It is an
                    attention economy. I don't think anyone is naive about
                    this. My comment was merely a take on the 'the video game
                    controller is broken' excuse that everyone has when they
                    need to cover for their ego. Sometimes the controller is
                    broken, but it almost never is.
       
                  imiric wrote 1 day ago:
                  How are you getting persecution complex from what I said? If
                  anything, your comment might be feeding that delusion. :)
                  
                  My point is that HN definitely has certain weights associated
                  with accounts, which control the karma, visibility, and
                  ultimately discussion of certain topics.
                  
                  This problem doesn't affect only negativity or downvotes, but
                  upvotes as well. The most upvoted comments are not
                  necessarily of the highest quality, or contribute the most to
                  the discussion. They just happen to be the most visible, and
                  to generally align with the feeling of the hive mind.
                  
                  I know this because some of my own comments have been at the
                  top, without being anything special, while others I think
                  are, barely get any attention. I certainly examine my
                  thinking whenever it strongly aligns with the hive mind, as
                  this community does not particularly align with my values.
                  
                  I also tend to seek out comments near the bottom of threads,
                  and have dead comments enabled, precisely to counteract this
                  flawed system. I often find quality opinions there, so I
                  suggest everyone do the same as well.
                  
                  An essential feature of a healthy and interesting discussion
                  forum is to accomodate different viewpoints. That starts by
                  not burying those that disagree with the majority, or
                  boosting those that agree. AFAIK no online system has gotten
                  this right yet.
       
            simonw wrote 2 days ago:
            At a guess that's because my comment attracted more up-votes than
            the other top-level comments in the thread.
            
            I generally try to include something in a comment that's not
            information already under discussion - in this case that was the
            link and quote from the original README.
       
              ushakov wrote 2 days ago:
              of course your comment attracts more upvotes - it's at the top.
       
                seanhunter wrote 2 days ago:
                Itâs at the top because of upvotes.  They donât have an
                âif simonw: boostâ branch in the code.
       
                  ushakov wrote 1 day ago:
                  the code is not public, so we can't know. i think it's much
                  more nuanced and certain users' comments might get a
                  preferential treatment, based on factors other than the
                  upvote count - which itself is hidden from us.
       
                    satvikpendem wrote 1 day ago:
                    > certain users' comments might get a preferential
                    treatment
                    
                    This does not happen. It hasn't even happened when pg made
                    the forum in the first place.
       
                      dcrazy wrote 1 day ago:
                      I thought dang explicitly said it does happen? It
                      certainly happens for stories.
       
                    ComplexSystems wrote 1 day ago:
                    > the code is not public, so we can't know.
                    
                    I feel like you're making this statement in bad faith,
                    rather than honestly believing the developers of the forum
                    software here have built in a clause to pin simonw's
                    comments to the top.
       
                ontouchstart wrote 2 days ago:
                Attention feeds attention.
                
                Attention is ALL You Need.
       
            carbocation wrote 2 days ago:
            Because many of us think simonw has discerning taste on this topic
            and like to read what he has to say about it, so we upvote his
            comments.
       
              ushakov wrote 2 days ago:
              i don't doubt this. i just find it questionable that one
              particular poster always gets in the spotlight when AI is the
              topic - while other conversations in my opinion offer more
              interesting angles.
       
                colesantiago wrote 2 days ago:
                Agreed,
                
                I would like to see others, being promoted to the top rather
                than Simonâs constant shilling for backlinks to his blog
                every time an AI topic is on the front page.
       
                jonas21 wrote 2 days ago:
                Upvote the conversations that you find to be more interesting.
                If enough people do the same, they too will make it to the top.
       
                  coldtea wrote 1 day ago:
                  Parent implies there might be some "boosting" involved, in
                  which case, "upvote the conversations that you find to be
                  more interesting" wont change anything...
                  
                  Not saying this is the case, but it's what the comment
                  implies, so "just upvote your faves" doesn't really address
                  it.
       
        sheepscreek wrote 2 days ago:
        Curious about the financials behind this deal. Did they close above
        what they raised? Whatâs in it for HuggingFace?
       
        0xbadcafebee wrote 2 days ago:
        > The community will continue to operate fully autonomously and make
        technical and architectural decisions as usual. Hugging Face is
        providing the project with long-term sustainable resources, improving
        the chances of the project to grow and thrive. The project will
        continue to be 100% open-source and community driven as it is now.
        
        I want this to be true, but business interests win out in the end.
        Llama.cpp is now the de-facto standard for local inference; more and
        more projects depend on it. If a company controls it, that means that
        company controls the local LLM ecosystem. And yeah, Hugging Face seems
        nice now... so did Google originally. If we all don't want to be locked
        in, we either need a llama.cpp competitor (with a universal
        abstration), or it should be controlled by an independent nonprofit.
       
          zozbot234 wrote 2 days ago:
          Llama.cpp is an open source project that anyone can fork as needed,
          so any "control" over it really only extends to facilitating
          development of certain features.
       
            0xbadcafebee wrote 1 day ago:
            In practice, nobody does this, because you then have to keep the
            fork up to date with upstream plus your changes, and this is an
            endless amount of work.
       
        ukblewis wrote 2 days ago:
        Honestly Iâm shocked to be the only one I see of this opinion:
        HuggingFaceâs `accelerate`, `transformers` and `datasets` have been
        some of the worst open source Python libraries I have ever used that I
        had to use.
        They break backwards compatibility constantly, even on APIs which are
        not underscore/dunder named even on minor version releases without even
        documenting this, they refuse PRs fixing their lack of `overloads` type
        annotations which breaks type checking on their libraries and they just
        generally seem to have spaghetti code. I am not excited that another
        team is joining them and consolidating more engineering might in the
        hands of these people
       
          ukblewis wrote 2 days ago:
          And clearly I say all of this in my name and not my employers name
       
          ukblewis wrote 2 days ago:
          And I said all of that despite us continuing to use their platform
          and libraries extensivelyâ¦ We just donât have a choice due to
          their dominance of open source ML
       
        periodjet wrote 2 days ago:
        Prediction: Amazon will end up buying HuggingFace. Screenshot this.
       
        superkuh wrote 2 days ago:
        I'm glad the llama.cpp and the ggml backing are getting consistent
        reliable economic support. I'm glad that ggerganov is getting rewarded
        for making such excellent tools.
        
        I am somewhat anxious about "integration with the Hugging Face
        transformers library" and possible python ecosystem entanglements that
        might cause. I know llama.cpp and ggml already have plenty of python
        tooling but it's not strictly required unless you're quantizing models
        yourself or other such things.
       
        jgrahamc wrote 2 days ago:
        This is great news. I've been sponsoring ggml/llama.cpp/Georgi since
        2023 via Github. Glad to see this outcome. I hope you don't mind Georgi
        but I'm going to cancel my sponsorship now you and the code have found
        a home!
       
        stephantul wrote 2 days ago:
        Georgi is such a legend. Glad to see this happening
       
        segmondy wrote 2 days ago:
        Great news!  I have always worried about ggml and long term prospect
        for them and wished for them to be rewarded for their effort.
       
        option wrote 2 days ago:
        Isn't HF banned in China? Also, how are many Chinese labs on Twitter
        all the time?
        
        In either case - huge thanks to them for keeping AI open!
       
          dragonwriter wrote 2 days ago:
          > Isn't HF banned in China?
          
          I think, for some definition of âbannedâ, thatâs the case. It
          doesnât stop the Chinese labs from having organization accounts on
          HF and distributing models there. ModelScope is apparently the
          HF-equivalent for reaching Chinese users.
       
          disiplus wrote 2 days ago:
          I think in the West we think everything is blocked. But for example,
          if you book an eSIM, when you visit you already get direct access to
          Western services because they route it to some other server. Hong
          Kong is totally different: they basically use WhatsApp and Google
          Maps, and everything worked when I was there.
       
            embedding-shape wrote 2 days ago:
            But also yes, parent is right, HF is more or less inaccessible, and
            Modelscope frequently cited as the mirror to use (although many
            Chinese labs seems to treat HF as the mirror, and Modelscope as the
            "real" origin).
       
          woadwarrior01 wrote 2 days ago:
          HF is indeed banned in China. The Chinese equivalent of HF is
          ModelScope[1]:
          
  HTML    [1]: https://modelscope.cn/
       
        tkp-415 wrote 2 days ago:
        Can anyone point me in the direction of getting a model to run locally
        and efficiently inside something like a Docker container on a system
        with not so strong computing power (aka a Macbook M1 with 8gb of
        memory)?
        
        Is my only option to invest in a system with more computing power?
        These local models look great, especially something like [1] for
        assisting in penetration testing.
        
        I've experimented with a variety of configurations on my local system,
        but in the end it turns into a make shift heater.
        
  HTML  [1]: https://huggingface.co/AlicanKiraz0/Cybersecurity-BaronLLM_Off...
       
          yjftsjthsd-h wrote 2 days ago:
          With only 8 GB of memory, you're going to be running a really small
          quant, and it's going to be slow and lower quality. But yes, it
          should be doable. In the worst case, find a tiny gguf and run it on
          CPU with llamafile.
       
          0xbadcafebee wrote 2 days ago:
          8GB is not enough to do complex reasoning, but you could do very
          small simple things. Models like Whisper, SmolVLM, Quen2.5-0.5B,
          Phi-3-mini, Granite-4.0-micro, Mistral-7B, Gemma3, Llama-3.2 all work
          on very little memory. Tiny models can do a lot if you tune/train
          them. They also need to be used differently: system prompt preloaded
          with information, few-shot examples, reasoning guidance, single-task
          purpose, strict output guidelines. See [1] for an example. For each
          small model, check if Unsloth has a tuned version of it; it reduces
          your memory footprint and makes inference faster.
          
          For your Mac, you can use Ollama, or MLX (Mac ARM specific, requires
          different engine and different model disk format, but is faster).
          Ramalama may help fix bugs or ease the process w/MLX. Use either
          Docker Desktop or Colima for the VM + Docker.
          
          For today's coding & reasoning models, you need a minimum of 32GB
          VRAM combined (graphics + system), the more in GPU the better.
          Copying memory between CPU and GPU is too slow so the model needs to
          "live" in GPU space. If it can't fit all in GPU space, your CPU has
          to work hard, and you get a space heater. That Mac M1 will do 5-10
          tokens/s with 8GB (and CPU on full blast), or 50 token/s with 32GB
          RAM (CPU idling). And now you know why there's a RAM shortage.
          
  HTML    [1]: https://github.com/acon96/home-llm
       
            BoredomIsFun wrote 1 day ago:
            >  Mistral-7B
            
            Is hopelessly dated. There are much better newer models around.
       
          Hamuko wrote 2 days ago:
          I tried to run some models on my M1 Max (32 GB) Mac Studio and it was
          a pretty miserable experience. Slow performance and awful results.
       
          ontouchstart wrote 2 days ago:
          This is the easiest set up on a Mac. You need at least 16gb on a
          MacBook:
          
  HTML    [1]: https://github.com/ggml-org/llama.cpp/discussions/15396
       
          HanClinto wrote 2 days ago:
          Maybe check out Docker Model Runner -- it's built on llama.cpp (in a
          good way -- not like Ollama) and handles I think most of what you're
          looking for? [1] As far as how to find good models to run locally, I
          found this site recently, and I liked the data it provides:
          
  HTML    [1]: https://www.docker.com/blog/run-llms-locally/
  HTML    [2]: https://localclaw.io/
       
          mft_ wrote 2 days ago:
          Thereâs no way around needing a powerful-enough system to run the
          model. So you either choose a model that can fit on what you have
          âi.e. via a small model, or a quantised slightly larger modelâ or
          you access more powerful hardware, either by buying it or renting it.
           (IME you donât need Docker. For an easy start just install LM
          Studio and have a play.)
          
          I picked up a second-hand 64GB M1 Max MacBook Pro a while back for
          not too much money for such experimentation. Itâs sufficiently fast
          at running any LLM models that it can fit in memory, but the gap
          between those models and Claude is considerable. However, this might
          be a path for you?
          It can also run all manner of diffusion models, but there the
          performance suffers (vs. an older discrete GPU) and youâre waiting
          sometimes many minutes for an edit or an image.
       
            ryandrake wrote 2 days ago:
            I wasn't able to have very satisfying success until I bit the
            bullet and threw a GPU at the problem. Found an actually reasonably
            priced A4000 Ada generation 20GB GPU on eBay and never looked back.
            I still can't run the insanely large models, but 20GB should hold
            me over for a while, and I didn't have to upgrade my 10 year old
            Ivy Bridge vintage homelab.
       
            sigbottle wrote 2 days ago:
            Are mac kernels optimized compared to CUDA kernels? I know that the
            unified GPU approach is inherently slower, but I thought a ton of
            optimizations were at the kernel level too (CUDA itself is a moat)
       
              ttoinou wrote 1 day ago:
              Thereâs this developer called nightmedia who converts a lot of
              models to apple MLX. I can run Qwen3 coder next at 60 tps on my
              m4 max. It works
       
              liuliu wrote 1 day ago:
              Depending on what you do. If you are doing token generations,
              compute-dense kernel optimization is less interesting (as, it is
              memory-bounded) than latency optimizations else where (data
              transfers, kernel invocations etc). And for these, Mac devices
              actually have a leg than CUDA kernels (as pretty much Metal
              shaders pipelines are optimized for latencies (a.k.a. games)
              while CUDA shaders are not (until cudagraph introduction, and of
              course there are other issues).
       
              bigyabai wrote 2 days ago:
              Mac kernels are almost always compute shaders written in Metal.
              That's the bare-minimum of acceleration, being done in a
              non-portable proprietary graphics API. It's optimized in the
              loosest sense of the word, but extremely far from "optimal"
              relative to CUDA (or hell, even Vulkan Compute).
              
              Most people will not choose Metal if they're picking between the
              two moats. CUDA is far-and-away the better hardware architecture,
              not to mention better-supported by the community.
       
          zozbot234 wrote 2 days ago:
          The general rule of thumb is that you should feel free to quantize
          even as low as 2 bits average if this helps you run a model with more
          active parameters.  Quantized models are not perfect at all, but
          they're preferable to the models with fewer, bigger parameters.  With
          8GB usable, you could run models with up to 32B active at heavy
          quantization.
       
            zargon wrote 1 day ago:
            A large model (100B+, the more the better) may be acceptable at
            2-bit quantization, depending on the task. But not a small model.
            Especially not for technical tasks. On top of that, one still needs
            room for OS, software and KV cache. 8GB is just not very useful for
            local LLMs. That said, it can still be entertaining to try out a
            4-bit 8B model for the fun of it.
       
              zozbot234 wrote 1 day ago:
              100B+ is the amount of total parameters, whereas what matters
              here is active - very different for sparse MoE models. You're
              right that there's some overhead for the OS/software stack but
              it's not that much.  KV-cache is a good candidate for being
              swapped out, since it only gets a limited amount of writes per
              emitted token.
       
                zargon wrote 1 day ago:
                Total parameters, not active parameters, is the property that
                matters for model robustness under extreme quantization.
                
                Once you're swapping from disk, the performance will be quite
                unusable for most people. And for local inference, KV cache is
                the worst possible choice to put on disk.
       
          xrd wrote 2 days ago:
          I think a better bet is to ask on reddit. [1] Everytime I ask the
          same thing here, people point me there.
          
  HTML    [1]: https://www.reddit.com/r/LocalLLM/
       
        androiddrew wrote 2 days ago:
        One of the few acquisitions I do support
       
        dhruv3006 wrote 2 days ago:
        Huggingface is actually something thats driving good in the world.
        Good to see this collab/
       
        the__alchemist wrote 2 days ago:
        Does anyone have a good comparison of HuggingFace/Candle to Burn? I am
        testing them concurrently, and Burn seems to have an easier-to-use API.
        (And can use Candle as a backend, which is confusing) When I ask on
        Reddit or Discord channels, people overwhelmingly recommend Burn, but
        provide no concrete reasons beyond "Candle is more for inference while
        Burn is training and inference". This doesn't track, as I've done
        training on Candle. So, if you've used both: Thoughts?
       
          csunoser wrote 2 days ago:
          I have used both (albeit 2 years ago, and things change really fast).
          At the time, Candle didn't have 2d conv backprop with strides
          properly implemented. And getting Burn running libtch backend was
          just a lot simpler.
          
          I did use candle for wasm based inference for teaching purposes -
          that was reasonably painless and pretty nice.
       
        mythz wrote 2 days ago:
        I consider HuggingFace more "Open AI" than OpenAI - one of the few
        quiet heroes (along with Chinese OSS) helping bring on-premise AI to
        the masses.
        
        I'm old enough to remember when traffic was expensive, so I've no idea
        how they've managed to offer free hosting for so many models. Hopefully
        it's backed by a sustainable business model, as the ecosystem would be
        meaningfully worse without them.
        
        We still need good value hardware to run Kimi/GLM in-house, but at
        least we've got the weights and distribution sorted.
       
          Tepix wrote 2 days ago:
          It's insane how much traffic HF must be pushing out of the door. I
          routinely download models that are hundreds of gigabytes in size from
          them. A fantastic service to the sovererign AI community.
       
            Onavo wrote 2 days ago:
            Bandwidth is not that expensive. The Big 3 clouds just want to milk
            customers via egress. Look at Hetzner or CloudFlare R2 if you want
            to get get an idea of commodity bandwidth costs.
       
            razster wrote 2 days ago:
            My fear is that these large "AI" companies will lobby to have these
            open source options removed or banned, growing concern. I'm not
            sure how else to explain how much I enjoy using what HF provides, I
            religiously browse their site for new and exciting models to try.
       
              toofy wrote 1 day ago:
              itâs only a matter of time. we have all seen first hand how â¦
              wrong â¦ these companies behave, almost on a regular basis.
              
              thereâs a small tinfoil hat part of me that suspects part of
              their obscene investments and cornering the hardware market is
              driven by an conscious attempt to stop open source local from
              taking off. they want it all, the money, the control, and to be
              the only source of information to us.
       
              dotancohen wrote 1 day ago:
              How do you choose which models to try for which workflows? Do you
              have objective tests that you run, or do you just get a feel for
              them while using them in your daily workflow?
       
              throwaway27448 wrote 1 day ago:
              They can try. I don't think they'll be able to get the toothpaste
              back in the tube. The data will just move our of the country.
       
                seanmcdirmid wrote 1 day ago:
                Many of the models on hugging face are already Chinese. Itâs
                kind of obvious that local AI is going to flourish more in
                China than the USA due to hardware constraints.
       
              culi wrote 2 days ago:
              ModelScope is the Chinese equivalent of Hugging Face and a good
              back up. All the open models are Chinese anyways
       
                thot_experiment wrote 1 day ago:
                Not true! Mistral is really really good, but I agree that there
                isn't a single decent open model from the USA.
       
                  CamperBob2 wrote 1 day ago:
                  To be fair there are lots of worse models than OpenAI's
                  GPT-OSS-120b.  It's not a standout when positioned next to
                  the latest releases from China, but prior to the current wave
                  it was considered one of the stronger local models you can
                  reasonably run.
       
                  culi wrote 1 day ago:
                  Mistral is cool and I wish them success but it consistently
                  ranks extremely low on benchmarks while still being
                  expensive. Chinese models like DeepSeek might rank almost as
                  low as Mistral but they are significantly cheaper. And Kimi
                  is the best of both worlds with incredible benchmark results
                  while still being incredibly cheap
                  
                  I know things change rapidly so I'm not counting them out
                  quite yet but I don't see them as a serious contender
                  currently
       
                    BoredomIsFun wrote 1 day ago:
                    >  it consistently ranks extremely low on benchmarks
                    
                    As general purpose chatbots small Mistral models are better
                    than comparably sized Chiniese models, as they have better
                    SimpleQA scores and general knowledge of Western culture.
       
                      seanmcdirmid wrote 1 day ago:
                      Itâs really hard to beat qwen coder, especially for
                      role play where the instruction following is really
                      useful. I donât think their corpus is lacking in
                      western knowledge, although I wonder if Chinese users get
                      even better results from it?
       
                        BoredomIsFun wrote 1 day ago:
                        > Itâs really hard to beat qwen coder, for role play
                        
                        I am not sure if you actually tried that. Mistrals are
                        widely asccepted go-to models for roleplay and creative
                        writing. No Qwens are good at prose, except for their
                        latest big Qwen 3.5.
                        
                        > I donât think their corpus is lacking in western
                        knowledge,
                        
                        It absolutely does, especially pop culture knowledge.
       
                          seanmcdirmid wrote 1 day ago:
                          Instruct and coder just follow instructions so well
                          though. I guess Iâve just never been able to make
                          mistral work well, I guess.
       
                            BoredomIsFun wrote 1 day ago:
                            Qwen3 30B A3B and that big 400+ B Coder were
                            absolutely terrible at editing fiction. I would
                            tell them what to change in the prose and they'd
                            just regurgitate text with no changes.
       
                              seanmcdirmid wrote 18 hours 56 min ago:
                              Did you try asking Gemini what model to use and
                              how to configure/set it up? It has worked wonders
                              for me, ironically (since Iâm using a big model
                              to setup smaller local models).
       
                                BoredomIsFun wrote 12 hours 55 min ago:
                                > Did you try asking Gemini what model to use
                                and how to configure/set it up?
                                
                                That would besuboptimal, as Gemini has too old
                                knowledge cutoff. I am long past the need for
                                such an advice anyway, as I've been using local
                                models since mid 2024.
       
                                  seanmcdirmid wrote 8 hours 25 min ago:
                                  Gemini will search the web for most things
                                  (at least if you are using it via the web
                                  search interface), it isnât limited to the
                                  knowledge it was trained on. Actually, Iâm
                                  a bit mortified that not everyone knows this.
                                  If you ask Gemini (from the search interface)
                                  about a current event that happened
                                  yesterday, they will use search to pull in
                                  context and work with that. Also about    model
                                  that was released yesterday, it can do that.
                                  
                                  Itâs only a very low level model access
                                  where search isnât used. Local models also
                                  need to be configured to use search, and I
                                  haven't had a use case to do that yet.
                                  
                                  Gemini seems to call this âgrounding with
                                  google searchâ. If you have Gemini
                                  installed in your enterprise, it will also
                                  search internal data sources for context.
       
                    thot_experiment wrote 1 day ago:
                    Sure, benchmarks are fake and I use Mistral over
                    equivalently sized models most of the time because it's
                    better in real life. It runs plenty fast for me, I don't
                    pay for inference.
       
                    Eupolemos wrote 1 day ago:
                    Why are you talking price when we are talking local AI?
                    
                    That doesn't make any sense to me. Am I missing something?
       
                      dirasieb wrote 1 day ago:
                      15 missed calls from your local power company
       
                      culi wrote 1 day ago:
                      Your electricity is free?
       
                        thot_experiment wrote 23 hours 38 min ago:
                        for almost the entire year, yes.
       
                        seanmcdirmid wrote 1 day ago:
                        Apple silicon is crazy efficient as well as being
                        comparable to GPUs in performance for max and ultra
                        chips.
       
                        cpburns2009 wrote 1 day ago:
                        If you have the hardware to run expensive models, is
                        the cost of electricity much of a factor? According to
                        Google, the average price in the Silicon Valley Area is
                        $0.448 per kWh. An RTX 5090 costs about $4,000 and has
                        a peak power consumption of 1000 W. Maxing out that GPU
                        for a whole year would cost $3,925 at that rate. It's
                        not particularly more expensive than that hardware
                        itself.
       
                          culi wrote 1 day ago:
                          At that point it'd be cheaper to get an expensive
                          subscription to a cloud platform AI product. I
                          understand the case for local LLMs but it seems silly
                          to worry about pricing for cloud-based offerings but
                          not worry about pricing for locally run models.
                          Especially since running it locally can often be more
                          expensive
       
            vardalab wrote 2 days ago:
            Yup, I have downloaded probably a terabyte in the last week,
            especially with the Step 3.5 model being released and Minimax
            quants. I wonder what my ISP thinks. I hope they don't cut me off.
            They gave me a fast lane, they better let me use it, lol
       
              fc417fc802 wrote 1 day ago:
              Even fairly restrictive data caps are in the range of 6 Tb per
              month. P2P at a mere 100 Mb works out to 1 TiB per 24 hours.
              
              Hypothetically my ISP will sell me unmetered 10 Gb service but I
              wonder if they would actually make good on their word ...
       
                3eb7988a1663 wrote 1 day ago:
                I have a 1.2TB cap before you start getting charged extra, so
                you might need to recalibrate your restrictive level.
       
                  fc417fc802 wrote 1 day ago:
                  Is that with a WISP by chance? Or in a developing country? Or
                  are there really wired providers with such low caps in the
                  western world in this day and age?
       
                    Zetaphor wrote 1 day ago:
                    ATT once told me if I don't pay for their TV service then
                    my home gigabit fiber would have a 1TB cap. They had an
                    agreement with the apartment building so I had no other
                    choice of provider.
       
                      fc417fc802 wrote 1 day ago:
                      Buy our off brand netflix or else we'll make it so you
                      can't watch netflix. How is that legal?
       
                        Zetaphor wrote 1 day ago:
                        The law is written by the highest bidder, and the
                        telecom lobbyists are very generous
       
                    zargon wrote 1 day ago:
                    Comcast.
       
                    nagaiaida wrote 1 day ago:
                    well it's my wired cap a stone's throw from buildings with
                    google cloud logos on the side in a major us city, so...
       
          Fin_Code wrote 2 days ago:
          I still don't know why they are not running on torrent. Its the
          perfect use case.
       
            freedomben wrote 2 days ago:
            That would shut out most people working for big corp, which is
            probably a huge percentage of the user base.  It's dumb, but that's
            just the way corp IT is (no torrenting allowed).
       
              zozbot234 wrote 2 days ago:
              It's a sensible option, even when not everyone can really use it.
               Linux distros are routinely transfered via torrent, so why not
              other massive, open-licensed data?
       
                thot_experiment wrote 1 day ago:
                I have terabytes of linux isos I got via torrents, many such
                cases!
       
                freedomben wrote 2 days ago:
                Oh as an option, yeah I agree it makes a ton of sense. I just
                would expect a very, very small percentage of people to use the
                torrent over the direct download. With Linux distros, the vast
                majority of downloads still come from standard web servers.
                When I download distro images I opt for torrents, but very few
                people do the same
       
                  Const-me wrote 1 day ago:
                  > very small percentage of people to use the torrent over the
                  direct download
                  
                  BitTorrent protocol is IMO better for downloading large
                  files. When I want to download something which exceeds couple
                  GB, and I see two links direct download and BitTorrent, I
                  always click on the torrent.
                  
                  On paper, HTTP supports range requests to resume partial
                  downloads. IME, it seems modern web browsers neglected to
                  implement it properly. They wonât resume after browser is
                  reopened, or the computer is restarted. Command-line HTTP
                  clients like wget are more reliable, however many web servers
                  these days require some session cookies or one-time query
                  string tokens, and itâs hard to pass that stuff from
                  browser to command-line.
                  
                  I live in Montenegro, CDN connectivity is not great here.
                  Only a few of them like steam and GOG saturate my 300
                  megabit/sec download link. Others are much slower, e.g.
                  windows updates download at about 100 megabit/sec. BitTorrent
                  protocol almost always delivers the 300 megabit/sec
                  bandwidth.
       
                  zrm wrote 2 days ago:
                  With Linux distros they typically put the web link right on
                  the main page and have a torrent available if you go look for
                  it, because they want you to try their distro more than they
                  want to save some bandwidth.
                  
                  Suppose HF did the opposite because the bandwidth saved is
                  more and they're not as concerned you might download a
                  different model from someone else.
       
            heliumtera wrote 2 days ago:
            How can you be the man in the middle in a truly P2P environment?
       
          sowbug wrote 2 days ago:
          Why doesn't HF support BitTorrent? I know about hf-torrent and
          hf_transfer, but those aren't nearly as accessible as a link in the
          web UI.
       
            embedding-shape wrote 2 days ago:
            > Why doesn't HF support BitTorrent?
            
            Harder to track downloads then. Only when clients hit the tracker
            would they be able to get download states, and forget about private
            repositories or the "gated" ones that Meta/Facebook does for their
            "open" models.
            
            Still, if vanity metrics wasn't so important, it'd be a great
            option. I've even thought of creating my own torrent mirror of HF
            to provide as a public service, as eventually access to models will
            be restricted, and it would be nice to be prepared for that moment
            a bit better.
       
              Barbing wrote 1 day ago:
              That would be a very nice service. I think folks might rely on it
              for a number of reasons, including that we'll want to see how
              biases changed over time. What got sloppier, shillier...
       
              jimbob45 wrote 2 days ago:
              Wouldnât it still provide massive benefits if they could
              convince/coerce their most popular downloaded models to move to
              torrenting?
       
                intrasight wrote 1 day ago:
                Benefit to you, but great downside to the three letter agencies
                that inject their goods into these models.
       
              homarp wrote 2 days ago:
              how are all the private trackers tracking ratios?
       
              taminka wrote 2 days ago:
              most of the traffic is probably from open weights, just seed
              those, host private ones as is
       
              sowbug wrote 2 days ago:
              I thought of the tracking and gate questions, too, when I vibed
              up an HF torrent service a few nights ago. (Super annoying BTW to
              have to download the files just to hash the parts, especially
              when webseeds exist.) Model owners could disable or gate torrents
              the same way they gate the models, and HF could still measure
              traffic by .torrent downloads and magnet clicks.
              
              It's a bit like any legalization question -- the black market
              exists anyway, so a regulatory framework could bring at least
              some of it into the sunlight.
       
                embedding-shape wrote 2 days ago:
                > Model owners could disable or gate torrents the same way they
                gate the models, and HF could still measure traffic by .torrent
                downloads and magnet clicks.
                
                But that'll only stop a small part, anyone could share the
                infohash and if you're using the dht/magnet without .torrent
                files or clicks on a website, no one can count those downloads
                unless they too scrape the dht for peers who are reporting
                they've completed the download.
       
                  fc417fc802 wrote 1 day ago:
                  > unless they too scrape the dht for peers who are reporting
                  they've completed the download.
                  
                  Which can be falsified. Head over to your favorite tracker
                  and sort by completed downloads to see what I mean.
       
                  sowbug wrote 2 days ago:
                  Right, but that's already happening today. That's the
                  black-market point.
       
          data-ottawa wrote 2 days ago:
          Can we toss in the work unsloth does too as an unsung hero?
          
          They provide excellent documentation and theyâre often very quick
          to get high quality quants up in major formats. Theyâre a very
          trustworthy brand.
       
            swyx wrote 1 day ago:
            not that unsung! we've given them our biggest workshop spot every
            single year we've been able to and will do until they are tired of
            us
            
  HTML      [1]: https://www.youtube.com/@aiDotEngineer/search?query=unslot...
       
              danielhanchen wrote 1 day ago:
              Appreciate it immensely haha :) Never tired - always excited and
              pumped for this year!
       
            danielhanchen wrote 1 day ago:
            Oh thank you - appreciate it :)
       
            disiplus wrote 2 days ago:
            Yeah, they're the good guys. I suspect the open source work is
            mostly advertisements for them to sell consulting and services to
            enterprises. Otherwise, the work they do doesn't make sense to
            offer for free.
       
              danielhanchen wrote 1 day ago:
              Haha for now our primary goal is to expand the market for local
              AI and educate people on how to do RL, fine-tuning and running
              quants :)
       
                WanderPanda wrote 1 day ago:
                Amazing work and people should really appreciate that the
                opportunity costs of your work are immense (given the hype).
                
                On another note: I'm a bit paranoid about quantization. I know
                people are not good at discerning model quality at these levels
                of "intelligence" anymore, I don't think a vibe check really
                catches the nuances. How hard would it be to systematically
                evaluate the different quantizations? E.g. on the Aider
                benchmark that you used in the past?
                
                I was recently trying Qwen 3 Coder Next and there are benchmark
                numbers in your article but they seem to be for the official
                checkpoint, not the quantized ones. But it is not even really
                clear (and chatbots confuse them for benchmarks of the
                quantized versions btw.)
                
                I think systematic/automated benchmarks would really bring the
                whole effort to the next level. Basically something like the
                bar chart from the Dynamic Quantization 2.0 article but always
                updated with all kinds of recent models.
       
                  danielhanchen wrote 1 day ago:
                  Thanks! Yes we actually did think about that - it can get
                  quite expensive sadly - perplexity benchmarks over short
                  context lengths with small datasets are doable, but it's not
                  an accurate measure sadly. We're actually investigating
                  currently what would be the best efficient course of action
                  on evaluating quants - will keep you posted!
       
                  jychang wrote 1 day ago:
                  > How hard would it be to systematically evaluate the
                  different quantizations? E.g. on the Aider benchmark that you
                  used in the past?
                  
                  Very hard. $$$
                  
                  The benchmarks are not cheap to run. It'll cost a lot to run
                  them for each quant of each model.
       
                    danielhanchen wrote 1 day ago:
                    Yes sadly very expensive :( Maybe a select few quants could
                    happen - we're still figuring out what is the most
                    economical and most efficient way to benchmark!
       
                      illusive4080 wrote 1 day ago:
                      Roughly how much does it cost to run one of the popular
                      benchmarks? Are we talking $1,000, $10,000, or $100k?
       
                        danielhanchen wrote 10 hours 1 min ago:
                        Oh it's more time that's the issue - each benchmark
                        takes 1-3 hours ish to run on 8 GPUs, so running on all
                        quants per model release can be quite painful.
                        
                        Assume AWS spot say $20/hr B200 for 8 GPUs, then $20
                        ish per quant, so assuming benchmark is on BF16, 8bit,
                        6, 5, 4, 3, 2 bits then 7 ish tests so $140 per model
                        ish to $420 ish/hr. Time wise 7 hours to 1 day ish.
                        
                        We could run them after a model release which might
                        work as well.
                        
                        This is also on 1 benchmark.
       
                  Zetaphor wrote 1 day ago:
                  This would be amazing
       
                    danielhanchen wrote 1 day ago:
                    Working on it! :)
       
              arcanemachiner wrote 2 days ago:
              I hope that is exactly what is happening. It benefits them, and
              it benefits us.
       
            cubie wrote 2 days ago:
            I'm a big fan of their work as well, good shout.
       
              danielhanchen wrote 1 day ago:
              Thank you!
       
          zozbot234 wrote 2 days ago:
          > We still need good value hardware to run Kimi/GLM in-house
          
          If you stream weights in from SSD storage and freely use swap to
          extend your KV cache it will be really slow (multiple seconds per
          token!) but run on basically anything.    And that's still really good
          for stuff that can be computed overnight, perhaps even by batching
          many requests simultaneously.  It gets progressively better as you
          add more compute, of course.
       
            Aurornis wrote 2 days ago:
            > it will be really slow (multiple seconds per token!)
            
            This is fun for proving that it can be done, but that's 100X slower
            than hosted models and 1000X slower than GPT-Codex-Spark.
            
            That's like going from real time conversation to e-mailing someone
            who only checks their inbox twice a day if you're lucky.
       
              zozbot234 wrote 1 day ago:
              You'd need real rack-scale/datacenter infrastructure to properly
              match the hosted models that are keeping everything in fast VRAM
              at all times, and then you only get reasonable utilization on
              that by serving requests from many users.  The ~100X slower tier
              is totally okay for experimentation and non-conversational use
              cases (including some that are more agentic-like!), and you'd
              reach ~10X (quite usable for conversation) by running something
              like a good homelab.
       
            HPsquared wrote 2 days ago:
            At a certain point the energy starts to cost more than renting some
            GPUs.
       
              fc417fc802 wrote 1 day ago:
              Aren't decent GPU boxes in excess of $5 per hour? At $0.20 per
              kWhr (which is on the high side in the US) running a 1 kW
              workstation 24/7 would work out to the same price as 1 hour of
              GPU time.
              
              The issue you'll actually run into is that most residential
              housing isn't wired for more than ~2kW per room.
       
              vardalab wrote 2 days ago:
              Yeah, that is hard to argue with because I just go to OpenRouter
              and play around with a lot of models before I decide which ones I
              like. But there's something special about running it locally in
              your basement
       
                dotancohen wrote 1 day ago:
                I'd love to hear more about this. How do you decide that you
                like a model? For which use cases?
       
        beoberha wrote 2 days ago:
        Seems like a great fit - kinda surprised it didnât happen sooner. I
        think we are deep in the valley of local AI, but Iâd be willing to
        bet it breaks out in the next 2-3 years. Hereâs hoping!
       
          breisa wrote 1 day ago:
          I mean they already supported the project quite a bit. @ngxson and
          maybe others? from Huggingface are big contributors to llama.cpp.
       
        dmezzetti wrote 2 days ago:
        This is really great news. I've been one of the strongest supporters of
        local AI dedicating thousands of hours towards building a framework to
        enable it. I'm looking forward to seeing what comes of it!
       
          logicallee wrote 2 days ago:
          >I've been one of the strongest supporters of local AI, dedicating
          thousands of hours towards building a framework to enable it.
          
          Sounds like you're very serious about supporting local AI. I have a
          query for you (and anyone else who feels like donating) about whether
          you'd be willing to donate some memory/bandwidth resources p2p to
          hosting an offline model:
          
          We have a local model we would like to distribute but don't have a
          good CDN.
          
          As a user/supporter question, would you be willing to donate some
          spare memory/bandwidth in a simple dedicated browser tab you keep
          open on your desktop that plays silent audio (to not be put in the
          background and deloaded) and then allocates 100mb -1 gb of RAM and
          acts as a webrtc peer, serving checksumed models?[1] (Then our server
          only has to check that you still have it from time to time, by
          sending you some salt and a part of the file to hash and your tab
          proves it still has it by doing so). This doesn't require any trust,
          and the receiving user will also hash it and report if there's a
          mismatch.
          
          Our server federates the p2p connections, so when someone downloads
          they do so from a trusted peer (one who has contributed and passed
          the audits) like you.  We considered building a binary for people to
          run but we consider that people couldn't trust our binaries, or would
          target our build process somehow, we are paranoid about trust,
          whereas a web model is inherently untrusted and safer.    Why do all
          this?
          
          The purpose of this would be to host an offline model: we
          successfully ported a 1 GB model from C++ and Python to WASM and
          WebGPU (you can see Claude doing so here, we livestreamed some of
          it[2]), but the model weights at 1 GB are too much for us to host.
          
          Please let us know whether this is something you would contribute a
          background tab to hosting on your desktop. It wouldn't impact you
          much and you could set how much memory to dedicate to it, but you
          would have the good feeling of knowing that you're helping people run
          a trusted offline model if they want - from their very own browser,
          no download required. The model we ported is fast enough for anyone
          to run on their own machines.  Let me know if this is something you'd
          be willing to keep a tab open for. [1] filesharing over webrtc works
          like this: [1] you can try it in 2 browser tabs. [2]  and some other
          videos
          
  HTML    [1]: https://taonexus.com/p2pfilesharing/
  HTML    [2]: https://www.youtube.com/watch?v=tbAkySCXyp0and
       
            HanClinto wrote 2 days ago:
            Hosting model weights for projects like this I think is something
            that you could upload to a space in Hugging Face?
            
            What services would you need that Hugging Face doesn't provide?
       
            echoangle wrote 2 days ago:
            Maybe stupid question but why not just put it in a torrent?
       
              liuliu wrote 2 days ago:
              It is very simple. Storage / bandwidth is not expensive.
              Residential bandwidth is. If you can convince people to install a
              bandwidth-related software on their residential homes, you can
              then charge other people $5 to $10 per 1GiB bandwidth (useful for
              botnet mostly, get around DDOS protections and other reCAPTCHA
              tasks).
       
                logicallee wrote 2 days ago:
                Thank you for your suggestion.    Below is only our
                plans/intentions, we welcome feedback about it:
                
                We are not going to do what you suggest. Instead, our approach
                is to use the RAM people aren't using at the moment for a fast
                edge cache close to their area.
                
                We've tried this architecture and get very low latency and high
                bandwidth. People would not be contributing their resources to
                anything they don't know about.
       
              logicallee wrote 2 days ago:
              Torrents require users to download and install a torrent client!
              In addition, we would like to retain the possibility of giving
              live updates to the latest version of a sovereign fine-tuned
              file, torrents don't autoupdate. We want to keep improving what
              people get.
              
              Finally, we would like the possibility of setting up market
              dynamics in the future: if you aren't currently using all your
              ram, why not rent it out? This matches the p2p edge architecture
              we envision.
              
              In addition, our work on WebGPU would allow you to rent out your
              gpu to a background tab whenever you're not using it. Why have
              all that silicon sit idle when you could rent it out?
              
              You could also donate it to help fine tune our own sovereign
              model.
              
              All of this will let us bootstrap to the point where we could be
              trusted with a download.
              
              We have a rather paranoid approach to security.
       
            liuliu wrote 2 days ago:
            > We have a local model we would like to distribute but don't have
            a good CDN.
            
            That is not true. I am serving models off Cloudflare R2. It is 1
            petabyte per month in egress use and I basically pay peanuts (~$200
            everything included).
       
              logicallee wrote 2 days ago:
              1 petabyte per month is 1 million downloads of a 1 GB file. We
              intend to scale to more than 1 million downloads per month. We
              have a specific scaling architecture in mind. We're qualified to
              say this because we've ported a billion parameter model to run in
              your browser - fast - on either webgpu or wasm. (You can see us
              doing it live at the youtube link in my comment above.) There is
              a lot of demand for that.
       
                dirasieb wrote 1 day ago:
                how about you work on achieving 1 million downloads per month
                first? talk about putting the horse before the carriage
       
                liuliu wrote 2 days ago:
                The bandwidth is free on Cloudflare R2. I paid money for
                storage (~10TiB storage of different models). If you only host
                1GiB file there, you are only paying $0.01 per month I believe.
       
        HanClinto wrote 2 days ago:
        I'm regularly amazed that HuggingFace is able to make money. It does so
        much good for the world.
        
        How solid is its business model? Is it long-term viable? Will they ever
        "sell out"?
       
          bityard wrote 2 days ago:
          Their business model is essentially the same as GitHub. Host lots of
          stuff for free and build a community around it, sell the
          upscaled/private version to businesses. They are already profitable.
       
            HanClinto wrote 2 days ago:
            This is what Sourceforge did too, and they still had the DevShare
            adware thing didn't they?
            
            GitHub is great -- huge fan. To some degree they "sold out" to
            Microsoft and things could have gone more south, but thankfully
            Microsoft has ruled them with a very kind hand, and overall I'm
            extremely happy with the way they've handled it.
            
            I guess I always retain a bit of skepticism with such things, and
            the long-term viability and goodness of such things never feels
            totally sure.
       
          heliumtera wrote 2 days ago:
          >Will they ever "sell out"?
          
          Oh no, never. Don't worry, the usual investors are very well known
          for fighting for user autonomy (AMD, Nvidia, Intel,IBM, Qualcomm)
          
          They are all very pro consumers and all backers are certainly here
          for your enjoyment only
       
            zozbot234 wrote 2 days ago:
            These are all big hardware firms, which makes a lot of sense as a
            classic 'commoditize the complement' play. Not exactly
            pro-consumer, but not quite anti-consumer either!
       
              smallerize wrote 2 days ago:
              heliumtera is being sarcastic.
       
              5o1ecist wrote 2 days ago:
              > AMD, Nvidia, Intel, IBM, Qualcomm
              
              > but not quite anti-consumer either!
              
              All of them are public companies, which means that their default
              state is anti-consumer and pro-shareholder. By law they are
              required to do whatever they can to maximize profits. History
              teaches that shareholders can demand whatever they want, with the
              respective companies following orders, since nobody ever really
              has to suffer consequences and any and all potential fines are
              already priced in, in advance, anyway.
              
              Conversely, this is why Valve is such a great company. Valve is
              probably one of the only few actual pro-consumer companies out
              there.
              
              Fun Fact! Rarely is it ever mentioned anywhere, but Valve is not
              a public company! Valve is a private company! That's why they can
              operate the way they do! If Valve was a public company, then
              greedy, crooked billionaire shareholders would have managed to
              get rid of Gabe a long time ago.
       
                RussianCow wrote 1 day ago:
                > By law they are required to do whatever they can to maximize
                profits.
                
                I know it's a nit-pick, but I hate that this always gets
                brought up when it's not actually true. Public corporations
                face pressure from investors to maximize returns, sure, but
                there is no law stating that they have to maximize profits at
                all costs. Public companies can (and often do) act against the
                interest of immediate profits for some other gain. The only
                real leverage that investors have is the board's ability to
                fire executives, but that assumes that they have the necessary
                votes to do so. As a counter-example, Mark Zuckerberg still
                controls the majority of voting power at Meta, so he can
                effectively do whatever he wants with the company without major
                consequence (assuming you don't consider stock price
                fluctuations "major").
                
                But I say this not to take away from your broader point, which
                I agree with: the short-term profit-maximizing culture is
                indeed the default when it comes to publicly traded
                corporations. It just isn't something inherent in being
                publicly traded, and in the inverse, private companies often
                have the same kind of culture, so that's not a silver bullet
                either.
       
                  chucksmash wrote 1 day ago:
                  It's a worthwhile point to make because if people believe
                  that misconception then it lets companies wash their hands of
                  flagrantly bad behavior. "Gosh, we should really get around
                  to changing the law that makes them act that way."
       
                  5o1ecist wrote 1 day ago:
                  You're perfectly right and I don't consider it a nitpick. I
                  really should be more precise about this, instead of
                  spreading inaccuracies. Thank you!
       
                HanClinto wrote 2 days ago:
                Great points.
                
                Valve is one of my top favorite companies right now. Love the
                work they're doing, and their products are amazing.
                
                Can hardly wait for the Steam Frame.
       
          microsoftedging wrote 2 days ago:
          FT had a solid piece a few weeks back: "Why AI start-up Hugging Face
          turned down a $500mn Nvidia deal"
          
  HTML    [1]: https://giftarticle.ft.com/giftarticle/actions/redeem/9b4eca...
       
            jackbravo wrote 2 days ago:
            sounds very interesting, but even though it says giftarticle.ft, I
            got blocked by a paywall.
       
              culi wrote 2 days ago:
              find the Bypass Paywalls Clean extension. Never worry about a
              paywall again
       
              nerevarthelame wrote 2 days ago:
               [1] To summarize, they rejected Nvidia's offer because they
              didn't want one outsized investor who could sway decisions. And
              "the company was also able to turn down Nvidia due to its stable
              finances. Hugging Face operates a 'freemium' business model.
              Three per cent of customers, usually large corporations, pay for
              additional features such as more storage space and the ability to
              set up private repositories."
              
  HTML        [1]: https://archive.is/zSyUc
       
                bee_rider wrote 2 days ago:
                Freemium seems to be working pretty well for themâwhatâs
                the alternative website, after all. They seem to command their
                niche.
       
          dmezzetti wrote 2 days ago:
          They have paid hosting - [1] and paid accounts. Also consulting
          services. Seems like a pretty good foundation to me.
          
  HTML    [1]: https://huggingface.co/enterprise
       
            julien_c wrote 2 days ago:
            and a lot of traction on paid (private in particular) storage these
            days; sneak peek at new landing page:
            
  HTML      [1]: https://huggingface.co/storage
       
          I_am_tiberius wrote 2 days ago:
          I once tried hugging face because I wanted I worked through some
          tutorial. They wanted my credit card details during the registration
          as far as I remember. After a month they invoiced me some amount of
          money and I had no idea what it was. To be honest, I don't understand
          what exactly they do and what services I was paying for, but I
          cancelled my account and never touched it again. For me that was a
          totally intransparent process.
       
            in-silico wrote 19 hours 25 min ago:
            Sounds like a personal skill issue
       
            shafyy wrote 2 days ago:
            Their pricing seems pretty transparent:
            
  HTML      [1]: https://huggingface.co/pricing
       
        geooff_ wrote 2 days ago:
        As someone who's been in the "AI" space for a while its strange how
        Hugging Face went from one of the biggest name to not a part of the
        discussion at all.
       
          segmondy wrote 2 days ago:
          part of what discussion?   anyone in the AI space knows and uses HF,
          but the public doesn't give a care and why should they?  It's just an
          advanced site were nerds download AI stuff.   HF is super valuable
          with their transformers library, their code, tutorials, smol-models,
          etc, but how does it translate to investor dollars?
       
          LatencyKills wrote 2 days ago:
          It isn't necessary to be part of the discussion if you are truly
          adding value (which HF continues to do). It's nice to see a company
          doing what it does best without constantly driving the hype train.
       
          r_lee wrote 2 days ago:
          I think that's because there's less local AI usage now since there's
          all kinds of image models by the big labs, so there's really no rush
          of people self hosting stable diffusion etc anymore
          
          the space moved from Consumer to Enterprise pretty fast due to models
          getting bigger
       
            zozbot234 wrote 2 days ago:
            Today's free models are not really bigger when you account for the
            use of MoE (with ever increasing sparsity, meaning a smaller
            fraction of active parameters), and better ways of managing KV
            caching. You can do useful things with very little RAM/VRAM, it
            just gets slower and slower the more you try to squeeze it where it
            doesn't quite belong.  But that's not a problem if you're willing
            to wait for every answer.
       
              r_lee wrote 2 days ago:
              yeah, but I mean more like the old setups where you'd just load a
              model on a 4090 or something, even with MoE it's a lot more
              complex and takes more VRAM, right? like it just seems not
              justifiable for most hobbyists
              
              but maybe I'm just slightly out of the loop
       
                zozbot234 wrote 2 days ago:
                With sparse MoE it's worth running the experts in system RAM
                since that allows you to transparently use mmap and inactive
                experts can stay on disk.  Of course that's also a slowdown
                unless you have enough RAM for the full set, but it lets you
                run much larger models on smaller systems.
       
        mnewme wrote 2 days ago:
        Huggingface is the silent GOAT of the AI space, such a great community
        and platform
       
          lairv wrote 2 days ago:
          Truly amazing that they've managed to build an open and profitable
          platform without shady practices
       
            al_borland wrote 2 days ago:
            Itâs such a sad state of affairs when shady practices are so
            normal that finding a company without them is noteworthy.
       
        jimmydoe wrote 2 days ago:
        Amazing. I like the openness of both project and really excited for
        them.
        
        Hopefully this does not mean consolidation due to resource dry up but
        true fusion of the bests.
       
        rvz wrote 2 days ago:
        This acquisition is almost the same as the acquisition of Bun by
        Anthropic.
        
        Both $0 revenue "companies", but have created software that is
        essential to the wider ecosystem and has mindshare value; Bun for
        Javascript and Ggml for AI models.
        
        But of course the VCs needed an exit sooner or later. That was
        inevitable.
       
          andsoitis wrote 2 days ago:
          I believe ggml.ai was funded by angel investors, not VC.
       
       
   DIR <- back to front page