URI:
        _______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                              on Gopher (unofficial)
  HTML Visit Hacker News on the Web
       
       
       COMMENT PAGE FOR:
  HTML   How Taalas “prints” LLM onto a chip?
       
       
        albert_e wrote 22 hours 2 min ago:
        Does this offer truly "deterministic" responses when temperature is set
        to zero?
        
        (Of course excluding any cosmic rays / bit flips)?
        
         I didn't see an editable temperature parameter on their chatjimmy
         demo site -- only a topK.
       
        konaraddi wrote 22 hours 3 min ago:
        Imagine a Framework* laptop with these kinds of chips that could be
        swapped out as models get better over time
        
        *Framework sells laptops and parts such that in theory users can own a
        ~~ship~~ laptop of Theseus over time without having to buy a whole new
        laptop when something breaks or needs upgrade.
       
        throwaway85825 wrote 22 hours 22 min ago:
        Few customers value tokens anywhere near what it costs the big API
        vendors. When the bubble pops the only survivors will be whoever can
        offer tokens at as close to zero cost as possible. Also whoever is
        selling hardware for local AI.
       
          ramraj07 wrote 22 hours 13 min ago:
           Those of us who use AI to get real work done in the real products
           we build very much appreciate the value of each token, given how
           much operational overhead it offsets. A bubble pop, if one does
           indeed happen, would at best be as disruptive as the dot-com bust.
       
        ramshanker wrote 22 hours 22 min ago:
         I can imagine this becoming a mainstream PCIe expansion card. Like
         back in the day we had separate graphics cards, audio cards, etc. Now
         an AI card. So to upgrade the PC to the latest model, we could buy a
         new card, load up the drivers and boom, intelligence upgrade for the
         PC. This would be so cool.
       
        trebligdivad wrote 22 hours 33 min ago:
         Hmm, I guess you'll end up with a pile of used boards, which is not a
         great kind of waste; but I guess they will get reused for a few
         generations.
         A problem is that it isn't just the chips that would be thrown out
         but the whole board, which gets silly.
       
        qoez wrote 1 day ago:
        > It took them two months, to develop chip for Llama 3.1 8B. In the AI
        world where one week is a year, it's super slow. But in a world of
        custom chips, this is supposed to be insanely fast.
        
         Llama 3.1 is like 2 years old at this point. Taking two months to
         convert a model that only updates every 2 years is very fast
       
        bsenftner wrote 1 day ago:
         I'm surprised people are surprised. Of course this is possible, and of
         course this is the future. This has been demonstrated already: why do
         you think we even have GPUs at all? Because we did this exact same
         transition from running in software to largely running in hardware for
         all 2D and 3D computer graphics. And these LLMs are practically the
         same math; it's all just obvious and inevitable, if you're paying
         attention to what we have and what we do to have what we have.
       
          pwarner wrote 21 hours 48 min ago:
          I'd be kind of shocked if Nvidia isn't playing with this.
          
          I don't expect it's like super commercially viable today, but for
          sure things need to trend to radically more efficient AI solutions.
       
          JKCalhoun wrote 23 hours 43 min ago:
          "This has been demonstrated already…"
          
          I think burning the weights into the gates is kinda new.
          
          ("Weights to gates." "Weighted gates"? "Gated weights"?)
       
            dogma1138 wrote 23 hours 12 min ago:
            Not really new, this is 80’s-90’s Neuron MOS Transistor.
            
            It’s also not that different than how TPUs work where they have
            special registers in their PEs for weights.
       
          IshKebab wrote 1 day ago:
          > Because we did this exact same transition from running in software
          to largely running in hardware for all 2D and 3D Computer Graphics.
          
          We transitioned from software on CPUs to fixed GPU hardware... But
          then we transitioned back to software running on GPUs! So there's no
          way you can say "of course this is the future".
       
          the__alchemist wrote 1 day ago:
          I believe this is a CPU/GPU vs ASIC comparison, rather than CPU vs
          GPU. They have always(ish) coexisted, being optimized for different
          things:  ASICs have cost/speed/power advantages, but the design is
          more difficult than writing a computer program, and you can't
          reprogram them.
          
          Generally, you use an ASIC to perform a specific task. In this case,
          I think the takeaway is the LLM functionality here is
          performance-sensitive, and has enough utility as-is to choose ASIC.
       
            RobotToaster wrote 22 hours 54 min ago:
            It reminds me of the switch from GPUs to ASICs in bitcoin mining. 
            I've been expecting this to happen.
       
            GTP wrote 23 hours 44 min ago:
             The middle ground here would be an FPGA, but I believe you would
             need a very expensive one to implement an LLM on it.
       
              dogma1138 wrote 23 hours 18 min ago:
               FPGAs would be less efficient than GPUs.
               
               FPGAs don’t scale; if they did, all GPUs would’ve been
               replaced by FPGAs for graphics a long time ago.
               
               You use an FPGA when spinning a custom ASIC doesn’t make
               financial sense and a generic processor such as a CPU or GPU
               is overkill.
              
              Arguably the middle ground here are TPUs, just taking the most
              efficient parts of a “GPU” when it comes to these workloads
              but still relying on memory access in every step of the
              computation.
       
                jgalt212 wrote 22 hours 2 min ago:
                 I thought it was because the number of logic elements in a GPU
                 is orders of magnitude higher than in an FPGA, rather than
                 just processing speed.  And GPU processing is inherently
                 parallel, so the GPU beats the FPGA just based on transistor
                 count.
       
        peteforde wrote 1 day ago:
        I would appreciate some clarification on the "store 4 bits of data with
        one transistor" part.
        
        This doesn't sound remotely possible, but I am here to be convinced.
       
          ajb wrote 1 day ago:
           They declined to say [1], except that it's fully digital, so not
           an analog multiplier.
          
  HTML    [1]: https://www.eetimes.com/taalas-specializes-to-extremes-for-e...
       
        kioku wrote 1 day ago:
        I’m just wondering how this translates to computer manufacturers like
        Apple. Could we have these kinds of chips built directly into computers
        within three years? With insanely fast, local on-demand performance
        comparable to today’s models?
       
          arisAlexis wrote 1 day ago:
          and run an outdated model for 3 years while progress is exponential?
          what is the point of that
       
            selcuka wrote 22 hours 38 min ago:
            > what is the point of that
            
            Planned obsolescence? /s
            
            Jokes aside, they can make the "LLM chip" removable. I know almost
            nothing is replaceable in MacBooks, but this could be an exception.
       
            padjo wrote 1 day ago:
             Is progress still exponential? It feels like it's flattening to
             me. It's hard to quantify, but if you could get Opus 4.2 to work
             at the speed of the Taalas demo and run locally, I feel like I'd
             get an awful lot done.
       
            ivan_gammel wrote 1 day ago:
            When output is good enough, other considerations become more
            important. Most people on this planet cannot afford even an AI
            subscription, and cost of tokens is prohibitive to many low margin
            businesses. Privacy and personalization matter too, data
            sovereignty is a hot topic. Besides, we already see how focus has
             shifted to orchestration, which can be done on CPU and is cheap -
             software optimizations may compensate for hardware deficiencies, so
            it’s not going to be frozen. I think the market for local
            hardware inference is bigger than for clouds, and it’s going to
            repeat Android vs iOS story.
       
            r0b05 wrote 1 day ago:
            Yeah, the space moves so quickly that I would not want to couple
            the hardware with a model that might be outdated in a month. There
             are some interesting talking points, but a general-purpose
             programmable ASIC makes more sense to me.
       
            RobertDeNiro wrote 1 day ago:
            It won’t stay exponential forever.
       
          xattt wrote 1 day ago:
           Is it possible to supplement the model with a diff for updates on
           modular memory, or would that severely impact perf?
       
            mips_avatar wrote 23 hours 13 min ago:
            I imagine you could do something like a LORA
       
            baq wrote 1 day ago:
            this design at 7 transistors per weight is 99.9% burnt in the
            silicon forever.
       
        MarcLore wrote 1 day ago:
        The form factor discussion is fascinating but I think the real unlock
        is latency. Current cloud inference adds 50-200ms of network overhead
        before you even start generating tokens. A dedicated ASIC sitting on
        PCIe could serve first token in microseconds.
        
        For applications like real-time video generation or interactive agents
        that need sub-100ms response loops, that difference is everything. The
        cost per inference might be higher than a GPU cluster at scale, but the
        latency profile opens up use cases that simply aren't possible with
        current architectures.
        
        Curious whether Taalas has published any latency benchmarks beyond the
        throughput numbers.
       
          cedws wrote 22 hours 43 min ago:
          The network latency bit deserves more attention. I’ve been trying
          to find out where AI companies are physically serving LLMs from but
          it’s difficult to find information about this. If I’m sitting in
          London and use Claude, where are the requests actually being served?
          
          The ideal world would be an edge network like Cloudflare for LLMs so
          a nearby POP serves your requests. I’m not sure how viable this is.
          On classic hardware I think it would require massive infra buildout,
          but maybe ASICs could be the key to making this viable.
       
          muyuu wrote 1 day ago:
          latency and control, and reliability of bandwidth and associated
          costs - however this isn't just the pull for specialised hardware but
          for local computing in general, specialised hardware is just the most
          extreme form of it
          
          there are tasks that inherently benefit from being centralised away,
          like say coordination of peers across a large area - and there are
          tasks that strongly benefit from being as close to the user as
          possible, like low latency tasks and privacy/control-centred tasks
          
          simultaneously, there's an overlapping pull to either side caused by
          the monetary interests of corporations vs users - corporations want
          as much as possible under their control, esp. when it's monetisable
          information but most things are at volume, and users want to be the
          sole controller of products esp. when they pay for them
          
          we had dumb terminals already being pushed in the 1960s, the "cloud",
          "edge computing" and all forms of consolidation vs segregation
          periods across the industry, it's not going to stop because there's
          money to be made from the inherent advantages of those models and
          even the industry leaders cannot prevent these advantages from
          getting exploited by specialist incumbents
          
           once leaders consolidate, inevitably they seek to maximise profit
          and in doing so they lower the barrier for new alternatives
          
          ultimately I think the market will never stop demanding just having
          your own *** computer under your control and hopefully own it, and
          only the removal of this option will stop this demand; while
          businesses will never stop trying to control your computing, and
          providing real advantages in exchange for that, only to enter cycles
          of pushing for growing profitability to the point average users keep
          going back and forth
       
        briansm wrote 1 day ago:
        I wonder if you could use the same technique (RAM models as ROM) for
        something like Whisper Speech-to-text, where the models are much
        smaller (around a Gigabyte) for a super-efficient single-chip speech
        recognition solution with tons of context knowledge.
       
        coppsilgold wrote 1 day ago:
        How feasible would it be to integrate a neural video codec into the
        SoC/GPU silicon?
        
         There would be model size constraints, and questions about what
         quality they could achieve under those constraints.
        
        Would be interesting if it didn't make sense to develop traditional
        video codecs anymore.
        
        The current video<->latents networks (part of the generative AI model
        for video) don't optimize just for compression. And you probably
        wouldn't want variable size input in an actual video codec anyway.
       
        708145_ wrote 1 day ago:
        Is Taalas' approach scalable to larger models?
       
        m101 wrote 1 day ago:
        So if we assume this is the future, the useful life of many
        semiconductors will fall substantially. What part of the semiconductor
        supply chain would have pricing power in a world of producing many more
        different designs?
        
        Perhaps mask manufacturers?
       
          ivan_gammel wrote 1 day ago:
           It might not be that bad. “Good enough” open-weight models are
          almost there, the focus may shift to agentic workflows and effective
          prompting. The lifecycle of a model chip will be comparable to
          smartphones, getting longer and longer, with orchestration software
          being responsible for faster innovation cycles.
       
            m101 wrote 1 day ago:
            If you’re running at 17k tokens / s what is the point of multiple
            agents?
       
              ivan_gammel wrote 1 day ago:
              Different skills and context. Llama 3.1 8B has just 128k context
              length, so packing everything in it may be not a great idea. You
              may want one agent analyzing the requirements and designing
              architecture, one writing tests, another one writing
               implementation, and a third doing code review. With LLMs it
               also matters not just what you have in context but also what
               is absent, so that the model will not overthink it.
              
              EDIT: just in case, I define agent as inference unit with
              specific preloaded context, in this case, at this speed they
              don’t have to be async - they may run in sequence in multiple
              iterations.
       
        brainless wrote 1 day ago:
         If we can print ASICs at low cost, this will change how we work with
        models.
        
        Models would be available as USB plug-in devices. A dense < 20B model
        may be the best assistant we need for personal use. It is like graphic
        cards again.
        
        I hope lots of vendors will take note. Open weight models are abundant
        now. Even at a few thousand tokens/second, low buying cost and low
        operating cost, this is massive.
       
        lm28469 wrote 1 day ago:
        Who's going to pay for custom chips when they shit out new models every
        two weeks and their deluded CEOs keep promising AGI in two release
        cycles?
       
          casey2 wrote 22 hours 32 min ago:
          Probably the datacenters that serve those models?
       
          spyder wrote 1 day ago:
           It all depends on how cheap they can get. 
           And another interesting thought: what if you could stack them? For
           example, you have a base model module, then new ones come out that
           can work together with the old ones and expand their capabilities.
       
          amelius wrote 1 day ago:
          I'm guessing this development will make the fabrication of custom
          chips cheaper.
          
          Exciting times.
       
          imtringued wrote 1 day ago:
          Almost all LLM companies have some sort of free tier that does
          nothing but lose them money.
       
          lancebeet wrote 1 day ago:
          You obviously don't believe that AGI is coming in two release cycles,
          and you also don't seem to have much faith in the new models
          containing massive improvements over the last ones. So the answer to
          who is going to pay for these custom chips seems to be you.
       
            lm28469 wrote 1 day ago:
             Why would I buy chips to run handicapped models when the 10+ LLM
             players all offer free-tier access to their 1T+ parameter models?
       
              grosswait wrote 23 hours 57 min ago:
              Do you think the free gravy train will run forever?
       
              K0balt wrote 1 day ago:
              Not all applications are chatbots. Many potential uses for
              LLMs/VLAMs are latency constrained.
       
          NinjaTrance wrote 1 day ago:
          To run Llama 3.1 8B locally, you would need a GPU with a minimum of
          16 GB of VRAM, such as an NVIDIA RTX 3090.
          
           Taalas promises 10x higher throughput while being 10x cheaper and
           using 10x less electricity.
          
          Looks like a good value proposition.
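           As a rough check on that figure, here is a weights-only memory
           sketch (it ignores KV cache and activations, and the precisions
           are generic examples, not Taalas specs):

```python
# Weights-only memory footprint for an 8B-parameter model.
# Illustrative precisions; ignores KV cache and activation memory.
params = 8e9
for name, bytes_per_param in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    gib = params * bytes_per_param / 2**30
    print(f"{name}: {gib:.1f} GiB")
```

           At fp16 the weights alone land just under 15 GiB, consistent with
           the 16 GB VRAM minimum mentioned above.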
       
            lm28469 wrote 1 day ago:
            What do you do with 8b models ? They can't even reliably create a
            .txt file or do any kind of tool calling
       
          brainless wrote 1 day ago:
          New GPUs come out all the time. New phones come out (if you count all
          the manufacturers) all the time. We do not need to always buy the new
          one.
          
          Current open weight models < 20B are already capable of being useful.
          With even 1K tokens/second, they would change what it means to
          interact with them or for models to interact with the computer.
       
            lm28469 wrote 1 day ago:
            hm yeah I guess if they stick to shitty models it works out, I was
            talking about the models people use to actually do things instead
            of shitposting from openclaw and getting reminders about their next
            dentist appointment.
       
              imtringued wrote 1 day ago:
              Considering that enamel regrowth is still experimental (only
              curodont exists as a commercial product), those dentist
              appointments are probably the most important routine healthcare
              appointments in your life. Pick something that is actually
              useless.
       
              brainless wrote 1 day ago:
              The trick with small models is what you ask them to do. I am
              working on a data extraction app (from emails and files) that
               works entirely locally. I applied for the Taalas API because it
               would be an awesome fit.
              
              dwata: Entirely Local Financial Data Extraction from Emails Using
              Ministral 3 3B with Ollama: [1]
              
  HTML        [1]: https://youtu.be/LVT-jYlvM18
  HTML        [2]: https://github.com/brainless/dwata
       
        thesz wrote 1 day ago:
         8B coefficients are packed into 53B transistors, about 6.6
         transistors per coefficient. A two-input NAND gate takes 4
         transistors and a register takes about the same. One coefficient
         gets processed (multiplied, and the result added to a sum) with less
         than two two-input NAND gates.
        
        I think they used block quantization: one can enumerate all possible
        blocks for all (sorted) permutations of coefficients and for each layer
        place only these blocks that are needed there. For 3-bit coefficients
        and block size of 4 coefficients only 330 different blocks are needed.
        
         Matrices in Llama 3.1 are 4096x4096, 16M coefficients. They can be
         compressed into only 330 blocks, if we assume that all coefficient
         permutations are present, plus a network of correct permutations of
         inputs and outputs.
        
         Assuming that blocks are the most area-consuming part, we have a
         per-block transistor budget of about 250 thousand transistors, or
         about 30 thousand two-input NAND gates per block.
         
         250K transistors per block * 330 blocks / 16M coefficients = about 5
         transistors per coefficient.
        
        Looks very, very doable.
        
        It does look doable even for FP4 - these are 3-bit coefficients in
        disguise.
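         The arithmetic above can be checked in a few lines (the 330 comes
         from counting multisets of four 3-bit values, C(11, 4); the other
         figures are the parent comment's own numbers):

```python
from math import comb

# Distinct blocks of 4 sorted 3-bit coefficients: multisets of size 4
# drawn from 8 possible values, i.e. C(8 + 4 - 1, 4).
blocks = comb(8 + 4 - 1, 4)
print(blocks)  # 330

# Overall density: 53B transistors holding 8B coefficients.
print(53e9 / 8e9)  # 6.625 transistors per coefficient

# Amortized cost if each 4096x4096 matrix (16M coefficients) is served
# by 330 shared blocks of ~250K transistors each.
print(250_000 * blocks / 16e6)  # ~5.2 transistors per coefficient
```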
       
          amelius wrote 1 day ago:
          I'm looking forward to the model.toVHDL() method in PyTorch.
       
            Simboo wrote 21 hours 52 min ago:
            Deep Differentiable Logic Gate Networks
       
            androiddrew wrote 22 hours 26 min ago:
            Is this a thing?
       
        punnerud wrote 1 day ago:
        Could we all get bigger FPGAs and load the model onto it using the same
        technique?
       
          generuso wrote 1 day ago:
          You could [1], but it is not very cheap -- the 32GB development board
          with the FPGA used in the article used to cost about $16K.
          
  HTML    [1]: https://arxiv.org/abs/2401.03868
       
          wmf wrote 1 day ago:
          FPGAs have really low density so that would be ridiculously
          inefficient, probably requiring ~100 FPGAs to load the model. You'd
          be better off with Groq.
       
            menaerus wrote 1 day ago:
             Not sure what you're on about, but I think what you said is
             incorrect. You can use a high-density HBM-enabled FPGA with
             (LP)DDR5 and a sufficient number of logic elements to implement
             the inference. The reason we don't see it in action is most
             likely that such FPGAs are insanely expensive and not as
             available off-the-shelf as GPUs are.
       
          fercircularbuf wrote 1 day ago:
          I thought about this exact question yesterday. Curious to know why we
          couldn't, if it isn't feasible. Would allow one to upgrade to the
          next model without fabricating all new hardware.
       
        cpldcpu wrote 1 day ago:
        I wonder how well this works with MoE architectures?
        
        For dense LLMs, like llama-3.1-8B, you profit a lot from having all the
        weights available close to the actual multiply-accumulate hardware.
        
        With MoE, it is rather like a memory lookup. Instead of a 1:1 pairing
        of MACs to stored weights, you suddenly are forced to have a large
        memory block next to a small MAC block. And once this mismatch becomes
        large enough, there is a huge gain by using a highly optimized memory
        process for the memory instead of mask ROM.
        
        At that point we are back to a chiplet approach...
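         The mismatch can be sketched with illustrative numbers (hypothetical
         parameter counts and expert ratios, not any real model's
         configuration):

```python
# Dense model: every stored weight participates in every token,
# so MACs and stored weights pair up 1:1.
dense_params = 8e9
dense_macs_per_token = dense_params

# MoE: all experts must be stored on-die, but only a few fire per
# token, e.g. 2 of 16 experts -> most stored weights idle each step.
moe_total_params = 8e9
active_fraction = 2 / 16
moe_macs_per_token = moe_total_params * active_fraction

print(dense_macs_per_token / dense_params)    # 1.0 MAC per stored weight
print(moe_macs_per_token / moe_total_params)  # 0.125 -> mostly idle storage
```

         The lower the active fraction, the more the chip looks like a big
         memory with a small MAC block attached, which is the point above.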
       
          brainless wrote 1 day ago:
           If each of the expert models were etched in silicon, it would
           still be a massive speed boost, wouldn't it?
           
           I feel printing the ASIC is the main blocker here.
       
          pests wrote 1 day ago:
           For comparison, I wanted to note how Google handles MoE
           architectures with its TPUv4.
          
          They use Optical Circuit Switches, operating via MEMS mirrors, to
          create highly reconfigurable, high-bandwidth 3D torus topologies. The
          OCS fabric allows 4,096 chips to be connected in a single pod, with
          the ability to dynamically rewire the cluster to match the
          communication patterns of specific MoE models.
          
          The 3D torus connects 64-chip cubes with 6 neighbors each. TPUv4 also
          contains 2 SparseCores which specialize handling high-bandwidth,
          non-contiguous memory accesses.
          
          Of course this is a DC level system,  not something on a chip for
          your pc, but just want to express the scale here.
          
          *ed: SpareCubes to SparseCubes
       
        moralestapia wrote 1 day ago:
         >HOW NVIDIA GPUs process stuff? (Inefficiency 101)
         
         Wow. Massively ignorant take. A modern GPU is an amazing feat of
         engineering, particularly at making computation more efficient (low
         power/high throughput).
         
         Then the post proceeds to explain, wrongly, how inference is
         supposedly implemented and draws conclusions from there ...
       
          imtringued wrote 1 day ago:
          The way modern Nvidia GPUs perform inference is that they have a
          processor (tensor memory accelerator) that directly performs tensor
          memory operations which directly concedes that GPGPU as a paradigm is
          too inefficient for matrix multiplication.
       
          wmf wrote 1 day ago:
          Arguably DRAM-based GPUs/TPUs are quite inefficient for inference
          compared to SRAM-based Groq/Cerebras. GPUs are highly optimized but
          they still lose to different architectures that are better suited for
          inference.
       
          beAroundHere wrote 1 day ago:
          Hey, Can you please point out explain the inaccuracies in the
          article?
          
          I had written this post to have a higher level understanding of
          traditional vs Taalas's inference. So it does abstracts lots of
          things.
       
        londons_explore wrote 1 day ago:
        So why only 30,000 tokens per second?
        
        If the chip is designed as the article says, they should be able to do
        1 token per clock cycle...
        
        And whilst I'm sure the propagation time is long through all that
        logic, it should still be able to do tens of millions of tokens per
        second...
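         A back-of-envelope on the gap, assuming a ~1 GHz clock (Taalas
         hasn't published a clock speed, so that figure is a guess):

```python
clock_hz = 1e9              # assumed clock; not a published Taalas figure
tokens_per_s = 30_000       # the quoted throughput
cycles_per_token = clock_hz / tokens_per_s
print(round(cycles_per_token))  # ~33,333 cycles per forward pass
```

         Tens of thousands of cycles per token suggests the forward pass is
         sequenced through shared hardware rather than fully unrolled to one
         token per cycle.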
       
          menaerus wrote 1 day ago:
           Reading from and writing to memory alone takes much more than a
           clock cycle.
       
          wmf wrote 1 day ago:
          You still need to do a forward pass per token. With massive batching
          and full pipelining you might be able to break the dependencies and
          output one token per cycle but clearly they aren't doing that.
       
            amelius wrote 1 day ago:
            More aggressive pipelining will probably be the next step.
       
        villgax wrote 1 day ago:
         This read itself is slop lol, literally dances around the term
         printing as if it's some inkjet printer
       
        kinduff wrote 1 day ago:
         Very nice read, thank you for sharing; this is very well written.
       
        abrichr wrote 1 day ago:
        ChatGPT Deep Research dug through Taalas' WIPO patent filings and
        public reporting to piece together a hypothesis. Next Platform notes at
        least 14 patents filed [1]. The two most relevant:
        
         "Large Parameter Set Computation Accelerator Using Memory with
         Parameter Encoding" [2]
         
         "Mask Programmable ROM Using Shared Connections" [3]
         
         The "single transistor multiply" could be
         multiplication by routing, not arithmetic. Patent [2] describes an
        accelerator where, if weights are 4-bit (16 possible values), you
        pre-compute all 16 products (input x each possible value) with a shared
        multiplier bank, then use a hardwired mesh to route the correct result
        to each weight's location. The abstract says it directly: multiplier
        circuits produce a set of outputs, readable cells store addresses
        associated with parameter values, and a selection circuit picks the
        right output. The per-weight "readable cell" would then just be an
        access transistor that passes through the right pre-computed product.
        If that reading is correct, it's consistent with the CEO telling EE
        Times compute is "fully digital" [4], and explains why 4-bit matters so
        much: 16 multipliers to broadcast is tractable, 256 (8-bit) is not.
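         A software sketch of that multiply-by-routing reading (hypothetical;
         based only on the abstract, with `route_multiply` an invented name):

```python
def route_multiply(x, weights):
    """Precompute x*w for all 16 possible 4-bit weight values once
    (the shared multiplier bank); each weight then merely selects,
    i.e. routes, the matching precomputed product."""
    products = [x * w for w in range(16)]   # 16 shared products
    return [products[w] for w in weights]   # per-weight: selection only

print(route_multiply(3, [0, 5, 15]))  # [0, 15, 45]
```

         This also makes the bitwidth sensitivity concrete: 4-bit weights
         need a bank of 16 products per input, while 8-bit would need 256.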
        
        The same patent reportedly describes the connectivity mesh as
        configurable via top metal masks, referred to as "saving the model in
        the mask ROM of the system." If so, the base die is identical across
        models, with only top metal layers changing to encode
        weights-as-connectivity and dataflow schedule.
        
        Patent [3] covers high-density multibit mask ROM using shared drain and
        gate connections with mask-programmable vias, possibly how they hit the
        density for 8B parameters on one 815mm2 die.
        
        If roughly right, some testable predictions: performance very sensitive
        to quantization bitwidth; near-zero external memory bandwidth
        dependence; fine-tuning limited to what fits in the SRAM sidecar.
        
        Caveat: the specific implementation details beyond the abstracts are
        based on Deep Research's analysis of the full patent texts, not my own
        reading, so could be off. But the abstracts and public descriptions
        line up well. [1] [2] [3] [4]
        
  HTML  [1]: https://www.nextplatform.com/2026/02/19/taalas-etches-ai-model...
  HTML  [2]: https://patents.google.com/patent/WO2025147771A1/en
  HTML  [3]: https://patents.google.com/patent/WO2025217724A1/en
  HTML  [4]: https://www.eetimes.com/taalas-specializes-to-extremes-for-ext...
       
          generuso wrote 1 day ago:
          LSI Logic and VLSI Systems used to do such things in 1980s -- they
          produced a quantity of "universal" base chips, and then relatively
          inexpensively and quickly customized them for different uses and
          customers, by adding a few interconnect layers on top. Like hardwired
          FPGAs. Such semi-custom ASICs were much less expensive than full
          custom designs, and one could order them in relatively small lots.
          
          Taalas of course builds base chips that are already closely tailored
          for a particular type of models. They aim to generate the final chips
          with the model weights baked into ROMs in two months after the
          weights become available. They hope that the hardware will be
          profitable for at least some customers, even if the model is only
          good enough for a year. Assuming they do get superior speed and
          energy efficiency, this may be a good idea.
       
          cpldcpu wrote 1 day ago:
          It could simply be bit serial. With 4-bit weights you only need
          four serial addition steps, which is not an issue if the weights
          are stored nearby in a ROM.
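
          The bit-serial idea can be sketched in software: each weight bit
          contributes one shifted add, so a 4-bit weight costs four serial
          steps. A toy illustration with unsigned weights, not the actual
          circuit:

```python
def bitserial_mul(x, w, bits=4):
    """Multiply activation x by a `bits`-bit unsigned weight w using only
    shifts and adds: one serial step per weight bit."""
    acc = 0
    for i in range(bits):          # serial step i handles weight bit i
        if (w >> i) & 1:
            acc += x << i          # add the shifted partial product
    return acc

# Exhaustive check over all 4-bit weights against plain multiplication.
assert all(bitserial_mul(x, w) == x * w
           for x in range(32) for w in range(16))
```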
       
        sargun wrote 1 day ago:
        Isn’t the highly connected nature of the model layers problematic to
        build into the physical layer?
       
        rustybolt wrote 1 day ago:
        Note that this doesn't answer the question in the title, it merely asks
        it.
       
          alcasa wrote 1 day ago:
          Frankly, the most critical question is whether they can really take
          shortcuts on DV etc., which are the main reasons nobody else tapes
          out new chips for every model. Note that their current architecture
          only allows some LoRA-adapter-based fine-tuning; even a model with
          just an updated cutoff date would require new masks, which is kind
          of insane, but props to them if they can make it work.
          
          From some announcements 2 years ago, it seems like they missed their
          initial schedule by a year, if that's indicative of anything.
          
          For their hardware to make sense a couple of things would need to be
          true:
          1. A model is good enough for a given usecase that there is no need
          to update/change it for 3-5 years. Note they need to redo their
          HW-Pipeline if even the weights change.
          2. This application is also highly latency-sensitive and benefits
          from power efficiency.
          3. That application is large enough in scale to warrant doing all
          this instead of running on last-gen hardware.
          
          Maybe some edge-computing and non-civilian use-cases might fit that,
          but given the lifespan of models, I wonder if most companies wouldn't
          consider something like this too high-risk.
          
          But maybe some non-text applications, like TTS, audio/video gen,
          might actually be a good fit.
       
            K0balt wrote 1 day ago:
            TTS, speech recognition, ocr/document parsing,
            Vision-language-action models, vehicle control, things like that do
            seem to be the ideal applications. Latency constraints limit the
            utility of larger models in many applications.
       
          beAroundHere wrote 1 day ago:
          Yeah, I wrote the blog to wrap my head around the question of how
          someone would even print weights onto a chip, and how to even
          start thinking in that direction.
          
          I didn't explore the actual manufacturing process.
       
            pixelmelt wrote 1 day ago:
            You should add an RSS feed so I can follow it!
       
              beAroundHere wrote 1 day ago:
              I don't post blogs often, so haven't added RSS there, but will
              do. I mostly post to my linkblog[1], hence have RSS there.
              
  HTML        [1]: https://www.anuragk.com/linkblog
       
        owenpalmer wrote 1 day ago:
        > Kinda like a CD-ROM/Game cartridge, or a printed book, it only holds
        one model and cannot be rewritten.
        
        Imagine a slot on your computer where you physically pop out and
        replace the chip with different models, sort of like a Nintendo DS.
       
          kilroy123 wrote 1 day ago:
          This is what I've been wanting! Just like those eGPUs you would plug
          into your Mac. You would have a big model or device capable of
          running a top-tier model under your desk. All local, completely
          private.
       
          Someone wrote 1 day ago:
          Would somewhat work except for the power usage.
          
          I doubt it would scale linearly, but for home use 170 tokens/s at
          2.5W would be cool; 17 tokens/s at 0.25W would be awesome.
          
          On the other hand, this may be a step towards positronic brains ( [1]
          )
          
  HTML    [1]: https://en.wikipedia.org/wiki/Positronic_brain
       
          roncesvalles wrote 1 day ago:
          That slot is called USB-C. I can fully imagine inference ASICs coming
          in powerbank form factor that you'd just plug and play.
       
            bagful wrote 22 hours 55 min ago:
            [delayed]
       
            amelius wrote 1 day ago:
            > USB-C
            
            With these speeds you can run it over USB2, though maybe power is
            limiting.
       
              GTP wrote 23 hours 41 min ago:
              You would likely need external power anyway.
       
              Hendrikto wrote 1 day ago:
              USB-C is just a form factor and has nothing to do with which
              protocol you run at which speeds.
       
                amelius wrote 23 hours 44 min ago:
                I wasn't talking about the form factor.
       
            ekianjo wrote 1 day ago:
            Not if you need 200w power to run inference.
       
              stavros wrote 1 day ago:
              USB-C can do up to 240W. These days I power all my devices with a
              USB hub, even my Lipo charger.
       
            zupa-hu wrote 1 day ago:
            This would be a hell of a hot power bank. It uses about as much
            power as my oven. So probably more like inside a huge cooling
            device outside the house. Or integrated into the heating system of
            the house.
            
            (Still compelling!)
       
              fennecbutt wrote 1 day ago:
              *the whole server uses 2.2kw or whatever, not a single board. I
              think that was for 8 boards or something.
       
            XorNot wrote 1 day ago:
            Pretty sure it'd just be a thumbdrive. Are the Taalas chips
            particularly large in surface area?
       
              thesz wrote 1 day ago:
              800 mm2, about 90mm per side, if imagined as a square. Also, 250
              W of power consumption.
              
              The form factor should be anything but thumbdrive.
       
                pfortuny wrote 1 day ago:
                mmmhhhhh 800mm2 ~= (30mm)2, which is more like a (biggish)
                thumb drive.
       
                  baq wrote 1 day ago:
                  the radiator wouldn't be though
       
                  thesz wrote 1 day ago:
                  Thanks!
                  
                  I haven't had my coffee yet. ;)
       
                    pfortuny wrote 22 hours 17 min ago:
                    Shit happens :D
       
                      bdangubic wrote 22 hours 15 min ago:
                      always after the coffee :)
       
              dmurray wrote 1 day ago:
              The only product they've announced at the moment [0] is a PCI-e
              card. It's more like a small power bank than a big thumb drive.
              
              But sure, the next generation could be much smaller. It doesn't
              require battery cells, (much) heat management, or ruggedization,
              all of which put hard limits on how much you can miniaturise
              power banks.
              
              [0]
              
  HTML        [1]: https://taalas.com/the-path-to-ubiquitous-ai/
       
                yonatan8070 wrote 23 hours 39 min ago:
                I wouldn't call that size a small power bank. That chip is in
                the same ballpark as gaming GPUs, and based on the VRMs in the
                picture it probably draws about as much power.
                
                But as you said, the next generations are very likely to shrink
                (especially with them saying they want to do top of the line
                models in 2 generations), and with architecture improvements it
                could probably get much smaller.
       
                ChrisMarshallNY wrote 1 day ago:
                I’m old enough to remember your typical computer filling
                warehouse-sized buildings.
                
                Nowadays, your average cellphone has more computing power than
                those behemoths.
                
                I have a micro SD card with 256GB capacity, and I think they
                are up to 2TB. On a device the size of a fingernail.
       
          Onavo wrote 1 day ago:
          Yeah maybe you can call it PCIe.
       
          8cvor6j844qw_d6 wrote 1 day ago:
          A cartridge slot for models is a fun idea. Instead of one chip
          running any model, you get one model or maybe a family of models per
          chip at (I assume) much better perf/watt. Curious whether the
          economics work out for consumer use or if this stays in the
          embedded/edge space.
       
            sixtyj wrote 1 day ago:
            Plug it into the skull bone. Neuralink + a slot for a model that
            you can buy in a grocery store instead of a prepaid Netflix card.
       
          beAroundHere wrote 1 day ago:
          That's the kind of hardware I'm rooting for, since it'll encourage
          open-weights models and would be much more private.
          
          In fact, I was thinking robots of the future could have such
          slots, where they can use different models depending on the task
          they're given. Like a hardware MoE.
       
            NitpickLawyer wrote 1 day ago:
            > Since it'll encourage open-weights models
            
            Is this accurate? I don't know enough about hardware, but perhaps
            someone could clarify: how hard would it be to reverse engineer
            this to "leak" the model weights? Is it even possible?
            
            There are some labs that sell access to their models (mistral,
            cohere, etc) without having their models open. I could see a world
            where more companies can do this if this turns out to be a viable
            way. Even to end customers, if reverse engineering is deemed
            impossible. You could have a device that does most of the inference
            locally and only "call home" when stumped (think alexa with local
            processing for intent detection and cloud processing for the rest,
            but better).
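
            The local-first, call-home-when-stumped pattern is simple to
            sketch: route on the local model's confidence. Everything below
            is a hypothetical stand-in, not any real product's API:

```python
# Confidence-gated routing sketch: answer on-device unless the local
# model is unsure, then escalate to the cloud. local_model and
# cloud_model are hypothetical stand-ins.
THRESHOLD = 0.8

def local_model(query):
    # Tiny intent table with made-up confidence scores.
    known = {
        "set a timer": ("timer_intent", 0.95),
        "play music": ("music_intent", 0.90),
    }
    return known.get(query, ("unknown", 0.10))

def cloud_model(query):
    return f"cloud answer for: {query}"

def answer(query):
    intent, confidence = local_model(query)
    if confidence >= THRESHOLD:
        return intent                  # handled entirely on-device
    return cloud_model(query)          # "call home" when stumped
```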
       
              yonatan8070 wrote 23 hours 32 min ago:
              It's likely possible to extract model weights from the chip's
              design, but you'd need tooling at the level of an Intel R&D lab,
              not something any hobbyist could afford.
              
              I doubt anyone would have the skills, wallet, and tools to RE one
              of these and extract model weights to run them on other hardware.
              Maybe state actors like the Chinese government or similar could
              pull that off.
       
        Hello9999901 wrote 1 day ago:
        This would be a very interesting future. I can imagine Gemma 5 Mini
        running locally on hardware, or a hard-coded "AI core" like an ALU or
        media processor that supports particular encoding mechanisms like
        H.264, AV1, etc.
        
        Other than the obvious costs (though Taalas seems to be bringing
        back the structured ASIC era, so costs shouldn't be that high [1]),
        I'm curious why this isn't getting much attention from larger
        companies. Of course,
        this wouldn't be useful for training models but as the models further
        improve, I can totally see this inside fully local + ultrafast + ultra
        efficient processors.
        
  HTML  [1]: https://en.wikipedia.org/wiki/Structured_ASIC_platform
       
          RobotToaster wrote 22 hours 41 min ago:
          > I'm curious why this isn't getting much attention from larger
          companies.
          
          I can see two potential reasons:
          
          1) Most of the big players seem convinced that AI is going to
          continue to improve at the rate it did in 2025; if their
          assumption is somehow correct, by the time any chip entered mass
          production it would be obsolete.
          
          2) The business model of the big players is to sell expensive
          subscriptions, and train on and sell the data you give it.  Chips
          that allow for relatively inexpensive offline AI aren't conducive to
          that.
       
          JKCalhoun wrote 23 hours 42 min ago:
          Apple should have done this yesterday. A local AI on my phone/Macbook
          is all I really want from this tech.
          
          The cloud-based AIs (OpenAI, etc.) are today's AOL.
       
            post-it wrote 22 hours 36 min ago:
            The hardware isn't there yet. Apple's neural engine is neat and has
            some uses but it just isn't in the same league as Claude right now.
            We'll get there.
       
          roncesvalles wrote 1 day ago:
          Well even programmable ASICs like Cerebras and Groq give
          many-multiples speedup over GPUs and the market has hardly reacted at
          all.
       
            mips_avatar wrote 23 hours 14 min ago:
            The problem with Groq was they only allowed LoRA on Llama 8B and
            70B, and you had to have an enterprise contract; it wasn't
            self-service.
       
            IshKebab wrote 1 day ago:
            Cerebras gives a many multiple speedup but it's also many multiples
            more expensive.
       
            brainless wrote 1 day ago:
            Seems both Nvidia (Groq) and OpenAI (Codex Spark) are now invested
            in the ASIC route one way or another.
       
            fooker wrote 1 day ago:
            > market has hardly reacted at all
            
            Guess who acqui-hired Groq to push this into GPUs?
            
            The name GPU has been an anachronism for a couple of years now.
       
        rustyhancock wrote 1 day ago:
        Edit: reading the below it looks like I'm quite wrong here but I've
        left the comment...
        
        The single transistor multiply is intriguing.
        
        I'd assume they're layers of FMA units operating in the log domain.
        
        But everything tells me that would be too noisy and error prone to
        work.
        
        On the other hand my mind is completely biased to the digital world.
        
        If they stay in the log domain and use a resistor network for
        multiplication, with the transistor just exponentiating for the
        addition, that seems genuinely ingenious.
        
        Mulling it over, actually the noise probably doesn't matter. It'll
        average to 0.
        
        It's essentially compute and memory baked together.
        
        I don't know much about the area of research so can't tell if it's
        innovative but it does seem compelling!
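
        For what it's worth, the log-domain trick the comment entertains is
        easy to show numerically (multiplication becomes addition of logs),
        even though, per the replies below, it's probably not what Taalas
        actually does:

```python
import math

def log_mul(a, b):
    """Multiply two positive numbers by adding in the log domain and
    exponentiating back: the analog trick the comment speculates about."""
    return math.exp(math.log(a) + math.log(b))

print(log_mul(3.0, 4.0))  # ~12.0
```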
       
          jsjdjrjdjdjrn wrote 1 day ago:
          I'd expect this is analog multiplication with voltage levels being
          ADC'd out for the bits they want. If you think about it, it makes the
          whole thing very analog.
       
            jsjdjrjdjdjrn wrote 1 day ago:
            Note: reading further down, my speculation is wrong.
       
          generuso wrote 1 day ago:
          The document referenced in the blog does not say anything about the
          single transistor multiply.
          
          However, [1] provides the following description: "Taalas’ density
          is also helped by an innovation which stores a 4-bit model parameter
          and does multiplication on a single transistor, Bajic said (he
          declined to give further details but confirmed that compute is still
          fully digital)."
          
  HTML    [1]: https://www.eetimes.com/taalas-specializes-to-extremes-for-e...
       
            londons_explore wrote 1 day ago:
            It'll be different gates on the transistor for the different bits,
            and you power only one set depending on which bit of the result you
            wish to calculate.
            
            Some would call it a multi-gate transistor, whilst others would
            call it multiple transistors in a row...
       
              hagbard_c wrote 1 day ago:
              That, or a resistor ladder with 4 bit branches connected to a
              single gate, possibly with a capacitor in between, representing
              the binary state as an analogue voltage, i.e. an analogue-binary
              computer. If it works for flash memory it could work for this
              application as well.
       
            rustyhancock wrote 1 day ago:
            That's much more informative, I think my original comment is quite
            off the mark then.
       
       
   DIR <- back to front page