       Hacker News on Gopher (unofficial)
       
       COMMENT PAGE FOR:
       Making a vintage LLM from scratch
       
       
        dennysora-main wrote 23 hours 27 min ago:
        Recently, I started a personal project to build an LLM from zero.
        
        I've spent a ton of time reading up on math, ML, and DL through books,
        open courses, and papers, while also studying all the major open-source
        LLM architectures.
        
        Since I only have one DGX Spark machine to run experiments, I can't
        train a massive LLM from the get-go. Instead, I'm experimenting with an
        auto-scaling parameter mechanism, which has led me to create a pretty
        unconventional and fun architecture!
        
        Why go through all this effort when modern LLMs can basically write
        simple LLMs themselves, and I clearly can't out-compute the big tech
        giants?
        
        Honestly, it's because I'm obsessed with the core mechanics of LLMs. I
        want to build something exclusively for myself and hopefully discover
        some completely undiscovered mechanisms along the way.
        
        Just keeping a record and sharing my progress—having fun with it is
        truly the biggest reward!
        
        I'll share it when I get a chance!
       
          croqaz wrote 20 hours 41 min ago:
          Do share! I read all the blog posts where people share their
          experiences of building small scale LLMs "from scratch".
       
          charcircuit wrote 22 hours 29 min ago:
          Most hobbyists rent the compute for training models instead of
          needing to purchase it all outright.
       
            dennysora-main wrote 14 hours 55 min ago:
            It's mainly just my personal preference to run a local machine. It
            gives me better privacy and security, and I can keep all my heavy
            data and projects right there.
            
            Cloud rentals are usually billed hourly. Since I constantly tweak
            the architecture and run it again, having a local rig completely
            kills any cost anxiety—it's just a one-off payment.
            
            Plus, regular users can't even get access to H100s anyway. I
            applied on AWS and GCP before and couldn't get them.
       
        macwhisperer wrote 1 day ago:
        super inspiring! thanks for sharing!
       
          croqaz wrote 20 hours 39 min ago:
          Thank you very much! It is humbling and motivating to see other
          people interested in this.
       
        HexPhantom wrote 1 day ago:
        Instead of always trying to make models more current and general, there
        may be value in making them deliberately narrow, historically
        constrained and weird in a well-defined way
       
        tancop wrote 1 day ago:
        > These samples have very good scores overall, but they are useless. I
        am guessing it's not English text... I counted a few hundred examples
        mostly from LOC-PD and other few hundred in the OTA datasets. Imagine
        if I feed that crap to my LLM, what will it learn?
        
        I'm pretty sure it's real text in Welsh. There might be typos from OCR,
        but yeah, that's what the language really looks like. I don't speak it,
        but it's easy to recognize.
       
          HexPhantom wrote 1 day ago:
          Yeah, that seems like an important distinction
       
          croqaz wrote 1 day ago:
          It looks like ROT13 text to me, I hope it's not Welsh. Don't want to
          offend anyone if that's their actual language :)
       
            throw310822 wrote 1 day ago:
            It's actually Welsh, and the funny thing is that one of the
            sentences in the example "gibberish" text (although with some
            further OCR errors) means:
            
            "It will be easy for the knowledgeable to fix the few errors that
            remain [in the text]." (Bydd yn rwydd iawn i'r cyfarwydd ddiwygio'r
            ychydig.)
            
            Which is exactly what the OP is doing.
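            
            A quick way to test the ROT13 guess from this subthread is to
            decode the text and see whether common English words show up. The
            sketch below is only an illustration: the stopword list, the 0.2
            threshold, and the function names are assumptions, not part of
            the OP's pipeline.

```python
import codecs
import re

# Small hand-picked set of English function words (an assumption for this sketch).
STOPWORDS = {"the", "and", "of", "to", "is", "in", "that", "it", "this", "for", "with", "was"}


def stopword_ratio(text):
    """Fraction of alphabetic tokens that are common English words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return sum(t in STOPWORDS for t in tokens) / len(tokens)


def looks_like_rot13_english(text, threshold=0.2):
    """True if ROT13-decoding makes the text look clearly more English."""
    decoded = codecs.decode(text, "rot13")
    before = stopword_ratio(text)
    after = stopword_ratio(decoded)
    return after >= threshold and after > before


# The Welsh sentence quoted above, and an English sentence encoded with ROT13.
welsh = "Bydd yn rwydd iawn i'r cyfarwydd ddiwygio'r ychydig."
print(looks_like_rot13_english(welsh))
print(looks_like_rot13_english("Guvf vf gur grkg bs gur obbx."))
```

            A check like this can only rule ROT13 in or out. Telling Welsh
            apart from OCR noise would need a real language-identification
            step.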
       
        mg794613 wrote 1 day ago:
        "The code is semi-vibe-coded with whatever LLM I had with VS-Code and
        PI (OpenRouter models)."
        
        I appreciate the honesty, but now there's no journey, and that's what
        I'm interested in.
          I can ask an LLM myself.
       
          croqaz wrote 20 hours 42 min ago:
          That's a fair point, TBH. I said in my post that this LLM is first of
          all a learning project, and I skipped an important step: the training
          loop. On the other hand, how many data scientists write their own
          training loops? Is it even worth it? And how much learning do you
          want from one project? Where do you stop? Why use "Huggingface
          Transformers" when you can write it from scratch, for learning? Why
          use Torch when you can write it from scratch, for learning? Why use
          Python when you can write it in C? And so on. It's cheating, right?
          
          In my case, I decided to skip the training loop and focus on the data
          processing, the hyperparameters, and the rest of the higher-level
          steps, which took a ton of time anyway. That reduced the friction.
          
          I do get your point, though. Now that I know how to train an LLM,
          maybe I'll write a training loop from scratch as a project, to learn
          how to do it.
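          
          For readers wondering what a training loop "from scratch" boils
          down to, here is a toy sketch: a character-level bigram model
          trained with plain SGD in pure Python. It is not the code from the
          post (which relied on existing libraries), and the names
          `train_bigram`, `epochs`, and `lr` are made up for the example. A
          real LLM loop adds batching, an optimizer, learning-rate
          scheduling, and checkpointing on top of the same
          forward/loss/backward/update cycle.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def train_bigram(text, epochs=50, lr=0.5):
    """Fit a table of next-character logits with per-sample SGD."""
    vocab = sorted(set(text))
    idx = {ch: i for i, ch in enumerate(vocab)}
    # Training pairs: (current character, next character).
    pairs = [(idx[a], idx[b]) for a, b in zip(text, text[1:])]
    n = len(vocab)
    logits = [[0.0] * n for _ in range(n)]
    history = []  # mean cross-entropy loss per epoch
    for _ in range(epochs):
        total_loss = 0.0
        for x, y in pairs:
            probs = softmax(logits[x])           # forward pass
            total_loss += -math.log(probs[y])    # cross-entropy loss
            for j in range(n):
                # Gradient of the loss w.r.t. logit j is p_j - 1[j == y].
                grad = probs[j] - (1.0 if j == y else 0.0)
                logits[x][j] -= lr * grad        # SGD update
        history.append(total_loss / len(pairs))
    return vocab, logits, history


vocab, logits, history = train_bigram("hello world hello world")
print(f"first epoch loss: {history[0]:.3f}, last epoch loss: {history[-1]:.3f}")
```

          The loss should fall across epochs as the table memorizes which
          character tends to follow which.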
       
          skerit wrote 1 day ago:
          I've been creating my own little from-scratch LLM for months now with
          Claude's help. I can safely say I learned a thing or two along the
          way.
       
          abetusk wrote 1 day ago:
          This is like a modern form of "I could do that in a weekend". Try
          reading the article before making such statements.
          
          There's a lot of pre-processing, experimentation and validation that
          went into this project. The training data collection and sanitization
          alone is a big undertaking.
          
          As for the blog post itself, from the article:
          
          > Note: This blog post is 100% written by me. No AI has been used
          whatsoever.
          
          Put another way: You can ask the LLM yourself to do this project?
          Please do, share your prompt, I'd like to see it.
       
          JayNitram wrote 1 day ago:
          I get what you are saying, but at the same time, I was bored on a
          Saturday and 'vibe coded' a small VR game. Nothing special, but I had
          the LLM throw down a structure, and then I walked through it, looking
          at where the code was placed and thinking about why, and about how
          different things were handled. It was basically exactly like my job:
          jump into some okay-working legacy app with code I have never
          actually seen, try to get my brain around it, then personally tweak
          things until the app performs the way I want.
       
        cyberge99 wrote 1 day ago:
        There are certain things you can only truly learn by doing. I remember
        doing Linux From Scratch over a weekend, and the depth of Linux
        knowledge I gained still sticks with me to this day.
        
        Thanks for the writeup. A more granular followup would be cool too.
       
          charcircuit wrote 20 hours 20 min ago:
          The depth of running configure, make, make install? If you want depth
          in Linux, I recommend looking at its source repository and reading
          the documentation or code. Or, these days, asking an AI to help
          explain it to you.
       
          croqaz wrote 1 day ago:
          "A more granular followup would be cool too"
          
          Do you mind expanding on this? More granular in what way? What
          would you like to know that is missing from the post?
       
          HexPhantom wrote 1 day ago:
          You may not build your daily system that way afterwards, but the
          mental model sticks
       
          breezybottom wrote 1 day ago:
          Except in this case he vibe-coded it
       
        rxm wrote 1 day ago:
        Nice project. I’m curious to see how it writes after instruction tuning.
       
        croqaz wrote 2 days ago:
        I am creating my tiny Llama 340M base model from scratch. If you're
        curious about the steps, challenges and cost, read on. I am still
        working on the instruct model.
       
          giancarlostoro wrote 1 day ago:
          I feel like this is the true frontier: making smaller models that can
          do more than their predecessors. If we can crack this space to where
          you can get reasonable outputs from "mediocre hardware", it would be
          worthwhile, even if it's somewhat inferior to frontier models. We
          can't forget that not long ago, frontier models were nowhere near as
          good as they are today, and tomorrow's models will likely be even
          better.
       
            LoganDark wrote 1 day ago:
            Qwen seems to be going in a good direction -- hundreds of experts
            on their MoE models. Extremely low active-weight counts while still
            performing quite admirably. I look forward to models with many,
            many more experts, to the point where anyone with enough random
            access can generate hundreds or thousands of tokens per second.
            Because right now, 80–120 t/s is pretty slow.
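            
            To make the "low active-weight count" point concrete, here is
            a rough sketch of the arithmetic behind top-k expert routing. All
            numbers and names (`moe_param_counts`, `top_k_experts`, the
            expert sizes) are hypothetical and do not describe any actual
            Qwen configuration. Per-layer structure is ignored for
            simplicity.

```python
def moe_param_counts(shared_params, n_experts, params_per_expert, top_k):
    """Return (total, active) parameter counts; 'active' is what one token touches."""
    total = shared_params + n_experts * params_per_expert
    active = shared_params + top_k * params_per_expert
    return total, active


def top_k_experts(router_scores, k):
    """Indices of the k highest-scoring experts for one token."""
    ranked = sorted(range(len(router_scores)), key=lambda i: router_scores[i], reverse=True)
    return ranked[:k]


# Hypothetical model: 1B shared weights, 128 experts of 200M each, 8 used per token.
total, active = moe_param_counts(1_000_000_000, 128, 200_000_000, 8)
print(f"total={total:,} active={active:,} ({active / total:.1%} of weights per token)")

# Router scores for four experts; the two largest win.
print(top_k_experts([0.1, 2.0, -1.0, 1.5], 2))
```

            Under these made-up numbers, each token only reads a small slice
            of the weights, which is why generation speed depends more on
            memory access than on total model size.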
       
            croqaz wrote 1 day ago:
            That's exactly what I had in mind. When I started this, I kept
            coming back to one question: "Can this model size actually
            generate logical English text?" I played with a few different
            models of the same size and was really depressed to see how bad
            they are. But then I discovered more and more tiny models, and
            LaMini-125M, LaMini-256M, nanowhale-100m, and
            SmolLM2-135M-Instruct are very decent. So I decided to give it a
            try.
       
              skerit wrote 1 day ago:
              I've been working on something like this too, for quite a while!
              Though I'm trying to get a non-quadratic-attention LLM (or SLM)
              up and running.
              
              And anyway, I think the most important thing is dataset quality.
              Dumping in whatever dataset you find on Huggingface is a recipe
              for mediocrity, so I'm also spending a lot of time on that.
       
              giancarlostoro wrote 1 day ago:
              In my case, I have a local branch where I'm experimenting with
              BitNet since it can run on a CPU too.
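              
              As a rough illustration of why ternary weights suit CPUs, the
              sketch below assumes the absmean scheme described for BitNet
              b1.58: divide by the mean absolute weight, round, and clip to
              {-1, 0, 1}. The function names are invented for the example,
              activation quantization and training details are left out, and
              this is not the code from the branch mentioned above.

```python
def absmean_ternary(weights, eps=1e-8):
    """Quantize a 2-D list of floats to {-1, 0, 1} plus one float scale."""
    flat = [abs(w) for row in weights for w in row]
    scale = sum(flat) / len(flat)  # mean absolute weight
    quantized = [
        [max(-1, min(1, round(w / (scale + eps)))) for w in row]
        for row in weights
    ]
    return quantized, scale


def ternary_matvec(quantized, scale, x):
    """Matrix-vector product that needs only additions and subtractions."""
    out = []
    for row in quantized:
        acc = 0.0
        for q, xi in zip(row, x):
            if q == 1:
                acc += xi
            elif q == -1:
                acc -= xi
            # q == 0 contributes nothing, so it is skipped entirely
        out.append(acc * scale)  # one multiply per output row
    return out


q, s = absmean_ternary([[0.9, -0.05, -1.2, 0.3]])
print(q, s)
print(ternary_matvec(q, s, [1.0, 2.0, 3.0, 4.0]))
```

              With weights restricted to three values, the inner loop has no
              multiplications, which is the property that makes CPU
              inference attractive.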
       
       