codevoid.de/1/hn/comments_47144464.gph

        _______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                             on Gopher (inofficial)
       
       
       COMMENT PAGE FOR:
  HTML   Mercury 2: Fast reasoning LLM powered by diffusion
       
       
        findjashua wrote 14 hours 49 min ago:
        failed the car wash test.
        
        i think instead of positioning as a general purpose reasoning model,
        they'd have more success focusing on a specific use case (eg coding
        agent) and benchmark against the sota open models for the use case (eg
        qwen3-coder-next)
       
          Jianghong94 wrote 14 hours 44 min ago:
          Honestly I don't understand why they/any fast-and-error-prone model
          position themselves as coding agents; my experience tells me that I'd
          much rather work with a slow-but-correct model and let it run a
          longer session than handhold a fast-but-wrong model.
       
        mlhpdx wrote 15 hours 6 min ago:
        > Proxylity LLC is a technology company that builds and deploys
        diffusion-based large language models and multimodal AI platforms for
        enterprise use.
        
        Um, no it isn't. Presumably this is the answer to any question about
        a company it doesn't know? That's some hardcore bias baking.
       
        Ross00781 wrote 15 hours 13 min ago:
        The diffusion-based approach is fascinating. Traditional transformer
        LLMs generate tokens sequentially, but diffusion models can
        theoretically refine the entire output space iteratively. If they've
        cracked the latency problem (diffusion is typically slower), this could
        open new architectures for reasoning tasks where quality matters more
        than speed. Would love to see benchmark comparisons on multi-step
        reasoning vs GPT-4/Claude.
       
          genodethrowaway wrote 12 hours 6 min ago:
          ai slop
       
        Karuma wrote 16 hours 57 min ago:
        A simple test I just did:
        
        Me: What are some of Maradona's most notable achievements in football?
        
        Mercury 2 (first sentence only): Dieadona's most notable football
        achievements include:
        
        Notice the spelling of "Dieadona" instead of "Maradona". Even any local
        3B model can answer this question perfectly fine and instantly...
        Mercury 2 was so incredibly slow and full of these kinds of
        unforgivable mistakes.
       
        LarsDu88 wrote 19 hours 21 min ago:
        Imagine this type of generation on a custom Taalas-style ASIC 18
        months from now, running a Sonnet-quality model for a speedup of 5
        orders of magnitude.
        
        The future looks crazy
       
          pennomi wrote 18 hours 46 min ago:
          I have been saying this for a while now. We have barely scratched the
          surface on both algorithmic and hardware optimizations for AI. I
          suspect we will definitely get many orders of magnitude speed up on
          high quality AI.
          
          The real question is if it ends up "smart enough" or we take that
          extra compute budget and push the boundary further. Right now it
          seems making the models larger really only works up to a certain
          point.
       
            LarsDu88 wrote 2 hours 41 min ago:
            The big problem with AI has been that it has always been so energy
            intensive compared to biological intelligence. However, once you
            bake the models into ASICs, suddenly the power consumption goes way
            down, and moreover the inference WILL be ~250X faster than it
            currently is (which is already on par with the speed of a human
            thinking).
            
            That's a very scary inflection point. Imagine in 24 months, an Opus
            4.6 level Diffusion based model etched directly onto silicon using
            the latest TSMC process node.
            
            At that point knowledge work will be incredibly commoditized.
            
            I have Opus 4.6 one-shotting recreations of 90s videogames for less
            than the inflation adjusted cost of buying those original games
            when they were released! Now cut that cost down by 250X!
       
        sorenjan wrote 22 hours 4 min ago:
        Julia Turc recently did a video about diffusion LLMs as a paid
        collaboration with Inception:
        
  HTML  [1]: https://www.youtube.com/watch?v=-VGeHZqOk_s
       
        anshumankmr wrote 1 day ago:
        It is capable of that seahorse faux pas.
        '''Is there a seahorse emoji?
        
        Mercury 2
        
        Thought for a minute
        
        Well?
        
        Mercury 2
        Today at 3:06 PM
        
        Yes - Unicode includes a seahorse emoji. It is U+1F9A0 and renders
        as:
        
        (seahorse)
        
        (If your device or browser doesn't show the graphic, you may see a
        placeholder box.)
        
        Ask
        
        Explain
        
        '''
       
          espadrine wrote 22 hours 37 min ago:
          Interestingly, while it uses diffusion, it generates incorrect
          information, and it doesn't fix it when later in the text it realizes
          that it is incorrect:
          
            > The snail you're likely thinking of has a different code point:
            > The correct code point for the snail emoji is U+1F40C?
            > No - U+1F40C is SEAHORSE. The snail's true code point is U+1F40C?
            > Wait, this is still confusing.
            >
            > Let's set the record straight:
            >
            >    Snail - U+1F40C? No, that's wrong.
            >    Snail - U+1F40C? Still wrong.
            >
            > The snail emoji is actually U+1F40C? I'm mixing things up.
            >
            > Correct snail code point
            >
            > The snail emoji is U+1F40C? No...
            >
            > After double-checking the official Unicode tables (Unicode 13.0
            > and later), the snail is U+1F40C? The...
            > Given the confusion, the safest answer is: the Unicode Standard
            > defines U+1F40C as "SEAHORSE". If your device shows a snail,
            > it's a rendering quirk, not a change in the underlying code point.
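          
          The code points quoted in these transcripts can be checked locally.
          A minimal sketch, assuming Python 3.7+ (whose bundled Unicode
          database already covers U+1F9A0); the helper names `emoji_name` and
          `has_character_named` are illustrative, not from any library:

```python
import unicodedata


def emoji_name(code_point: int) -> str:
    """Return the official Unicode character name for a code point."""
    return unicodedata.name(chr(code_point), "<unassigned>")


def has_character_named(name: str) -> bool:
    """True if the Unicode database contains a character with this name."""
    try:
        unicodedata.lookup(name)
        return True
    except KeyError:
        return False


print(emoji_name(0x1F9A0))              # MICROBE
print(emoji_name(0x1F40C))              # SNAIL
print(has_character_named("SEAHORSE"))  # False
```

          Neither quoted code point is a seahorse, and the database has no
          character named SEAHORSE at all.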
       
            pennomi wrote 18 hours 43 min ago:
            I think you'll have to do multi-shot generation to correct this,
            each diffusion is going to represent a single "thought".
            
            Though with the speed it's running that's not necessarily a
            deal breaker. I suspect diffusion models will need different
            harnesses to be effective.
       
        DoctorOetker wrote 1 day ago:
        > Mercury 2 doesn't decode sequentially. It generates responses through
        parallel refinement, producing multiple tokens simultaneously and
        converging over a small number of steps. Less typewriter, more editor
        revising a full draft at once.
        
        There has been quite some progress unifying DDPM & SGM as SDE
        
        > DDPM and Score-Based Models: The objective function of DDPMs
        (maximizing the ELBO) is equivalent to the score matching objectives
        used to train SGMs.
        
        > SDE-based Formulation: Both DDPMs and SGMs can be unified under a
        single SDE framework, where the forward diffusion is an Ito SDE and the
        reverse process uses score functions to recover data.
        
        > Flow Matching (Continuous-Time): Flow matching is equivalent to
        diffusion models when the source distribution corresponds to a
        Gaussian. Flow matching offers "straight" trajectories compared to the
        often curved paths of diffusion, but they share similar training
        objectives and weightings.
        
        Is there a similar connection between modern transformers and
        diffusion?
        
        Suppose we look at each layer or residual connection between layers,
        the context window of tokens (typically a power of 2), what is
        incrementally added to the embedding vectors is a function of the
        previous layer outputs, and if we have L layers, what is then the
        connection between those L "steps" of a transformer and similarly
        performing L denoising refinements of a diffusion model?
        
        Does this allow fitting a diffusion model to a transformer and vice
        versa?
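        
        One way to make the question concrete, offered as a hedged sketch
        rather than an established equivalence: a transformer's residual
        stream and an Euler step of a flow-matching sampler share the same
        additive form. The symbols h, f, x, v_theta and Delta t are notation
        assumed here, not taken from the Mercury announcement.

```latex
\begin{aligned}
\text{transformer, layer } \ell:\quad
  & h_{\ell+1} = h_\ell + f_\ell(h_\ell),
  && \ell = 0,\dots,L-1 \\
\text{sampler, step } k:\quad
  & x_{t_{k+1}} = x_{t_k} + \Delta t \, v_\theta(x_{t_k}, t_k),
  && k = 0,\dots,L-1
\end{aligned}
```

        The forms match but the roles differ: each f_l has its own weights and
        acts on hidden states inside a single forward pass, while v_theta
        reuses the same weights at every step and each step is itself a full
        forward pass. The shared form alone does not settle whether one can be
        fitted to the other.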
       
        vinhnx wrote 1 day ago:
        Here is the research paper "Mercury: Ultra-Fast Language Models Based
        on Diffusion" from last year (2025):
        
  HTML  [1]: https://arxiv.org/pdf/2506.17298
       
        swiftcoder wrote 1 day ago:
        Are there any open-weights diffusion LLM models I can play with on my
        local hardware? Curious about the performance delta of this style of
        model in more resource constrained scenarios (i.e. consumer Nvidia GPU,
        not H100s in the datacenter)
       
          nikhil_99 wrote 5 hours 12 min ago:
          llada, dream, cdlm, fast-dllm, sdar.
          i might have missed some.
       
        herlon214 wrote 1 day ago:
        This looks really nice. When will it be available on OpenRouter?
       
        Ross00781 wrote 1 day ago:
        Diffusion-based reasoning is fascinating - curious how it handles
        sequential dependencies vs traditional autoregressive. For complex
        planning tasks where step N heavily depends on steps 1-N, does the
        parallel generation sometimes struggle with consistency? Or does the
        model learn to encode those dependencies in a way that works well
        during parallel sampling?
       
        smusamashah wrote 1 day ago:
        Does it mean if it was embedded on a Taalas chip, it could generate
        ~50,000+ tokens per second?
       
          Havoc wrote 18 hours 14 min ago:
          Think pretty much anything is going to get an enormous speed boost if
          the model isn't undergoing mem latency but is just inherently baked
          into the circuits ASIC style
       
        vicchenai wrote 1 day ago:
        The iteration speed advantage is real but context-specific. For agentic
        workloads where you're running loops over structured data -- say,
        validating outputs or exploring a dataset across many small calls --
        the latency difference between a 50 tok/s model and a 1000+ tok/s one
        compounds fast. What would take 10 minutes wall-clock becomes under a
        minute, which changes how you prototype.
        
        The open question for me is whether the quality ceiling is high enough
        for cases where the bottleneck is actually reasoning, not iteration
        speed. volodia's framing of it as a "fast agent" model (comparable tier
        to Haiku 4.5) is honest -- for the tasks that fit that tier, the 5x
        speed advantage is genuinely interesting.
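        
        The 10-minutes-to-under-a-minute figure can be reproduced with simple
        arithmetic. A rough sketch, assuming the loop is entirely
        generation-bound (no network, tool, or queueing time); the function
        name `wall_clock_seconds` is made up for illustration:

```python
def wall_clock_seconds(total_tokens: int, tokens_per_second: float) -> float:
    """Time spent generating, ignoring network and tool latency."""
    return total_tokens / tokens_per_second


# Tokens a 50 tok/s model emits in 10 minutes of pure generation.
total_tokens = 50 * 10 * 60

slow = wall_clock_seconds(total_tokens, 50)    # 600.0 s
fast = wall_clock_seconds(total_tokens, 1000)  # 30.0 s
print(f"{slow:.0f} s -> {fast:.0f} s ({slow / fast:.0f}x)")  # 600 s -> 30 s (20x)
```

        Any fixed per-call overhead shrinks the real-world ratio, so 20x is an
        upper bound under these assumptions.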
       
        dmix wrote 1 day ago:
        I tried Mercury 1 in Zed for inline completions and it was
        significantly slower than Cursor's autocomplete. Big reason why I
        switched back to Cursor (free) + Claude Code
       
        rancar2 wrote 1 day ago:
        My attempt with trying one of their OOTB prompts in the demo [1]
        resulted in:
        "The server is currently overloaded. Please try again in a moment."
        
        And a pop-up error of:
        "The string did not match the expected pattern."
        
        That happened three times, then the interface stopped working.
        
        I was hoping to see how this stacked up against Taalas demo, which
        worked well and was so fast every time I've hit it this past week.
        
  HTML  [1]: https://chat.inceptionlabs.ai
       
        davistreybig wrote 1 day ago:
        This is unbelievably fast
       
        serjester wrote 1 day ago:
        There's a potentially amazing use case here around parsing PDFs to
        markdown. It seems like a task with insane volume requirements, low
        budget, and the kind of thing that doesn't benefit much from
        autoregression. Would be very curious if your team has explored this.
       
        exabrial wrote 1 day ago:
        I believe Jimmy Chat is still faster by an order of magnitude...
       
          poly2it wrote 1 day ago:
          What does Jimmy Chat have to do with diffusion models?
       
        nowittyusername wrote 1 day ago:
        Nice, I'm excited to try this for my voice agent, at worst it could be
        used to power the human facing agent for latency reduction.
       
          volodia wrote 1 day ago:
          Would love to hear about your experience. Send us an email.
       
        dw5ight wrote 1 day ago:
        this looks awesome!!
       
        chriskanan wrote 1 day ago:
        I can see some promise with diffusion LLMs, but getting them comparable
        to the frontier is going to require a ton of work and these closed
        source solutions probably won't really invigorate the field to find
        breakthroughs. It is too bad that they are following the path of OpenAI
        with closed models without details as far as I can tell.
       
        lprimeisafk wrote 1 day ago:
        It fails the car wash test
       
          emi2k01 wrote 1 day ago:
          The test:
          
          > The car wash is 100m away from my house. Should I walk or drive?
          
          I didn't find a way to share the conversation, so here is the full
          response in `medium` settings:
          
          ---
          
          Walking 100 m is a very short distance, roughly a minute or two at a
          normal pace. In most situations, walking will be quicker overall
          because you avoid the time needed to get into the car, start it, and
          find parking. It also saves fuel and is better for the environment.
          
          *Consider these factors before deciding:*
          
          | Factor | Walking | Driving |
          |--------|---------|---------|
          | *Time* | ~1-2 min (plus a few seconds to step out) | ~1 min driving, plus ~1-2 min to get in/out and park |
          | *Energy/Fuel* | None | Uses a small amount of fuel |
          | *Weather* | Comfortable if dry & mild | Better if it's raining, very hot, or icy |
          | *Physical ability* | Easy for most people | Not needed for such a short trip |
          | *Car availability* | Not required | Must have a car nearby and be willing to use it |
          | *Environmental impact* | Zero emissions | Small emissions |
          
          *Recommendation:*  
          If the weather is pleasant and you don't need the car for anything
          else right away, walking is the simplest, fastest, and most
          eco-friendly choice. Drive only if you're dealing with inclement
          weather, have heavy items to carry, or need the car immediately after
          the wash.
          
          Do you have any specific constraints (e.g., rain, heavy bags, time
          pressure) that might affect the decision?
       
            rtfeldman wrote 1 day ago:
            If a stranger asks me, "Should I walk or drive to this car wash?"
            then I assume they're asking in good faith and both options are
            reasonable for their situation. So it's a safe assumption that
            they're not going there to get their car washed. Maybe they're
            starting work there tomorrow, for example, and don't know how
            pedestrian-friendly the route is.
            
            Is the goal behind evaluating models this way to incentivize
            training them to assume we're bad-faith tricksters even when asking
            benign questions like how best to traverse a particular 100m? I
            can't imagine why it would be desirable to optimize for that
            outcome.
            
            (I'm not saying that's your goal personally - I mean the goal
            behind the test itself, which I'd heard of before this thread.
            Seems like a bad test.)
       
              zamalek wrote 1 day ago:
              > I need to get my car washed; should I drive or walk to the car
              wash that is 100m away?
              
              > Walking 100 m is generally faster, cheaper, and better for the
              environment than driving such a short distance. If you have a car
              that's already running and you don't mind a few extra
              seconds, walking also avoids the hassle of finding parking or
              worrying about traffic.
       
                rtfeldman wrote 1 day ago:
                That's a much better test!
       
        volodia wrote 1 day ago:
        Co-founder / Chief Scientist at Inception here. If helpful, I'm happy
        to answer technical questions about Mercury 2 or diffusion LMs more
        broadly.
       
          Topfi wrote 23 hours 2 min ago:
          Have been following your models and semi-regularly ran them through
          evals since early summer. With the existing Coder and Mercury models,
          I always found that the trade-offs were not worth it, especially as
          providers with custom inference hardware could push model tp/s and
          latency increasingly higher.
          
          I can see some very specific use cases for an existing PKM project,
          especially using the edit model for tagging and potentially retrieval,
          both of which I am using Gemini 2.5 Flash-Lite still.
          
          The pricing makes this very enticing and I'll really try to get
          Mercury 2 going, if tool calling and structured output are truly
          consistently possible with this model to a similar degree as Haiku
          4.5 (which I still rate very highly) that may make a few use cases
          far more possible for me (as long as Task adherence, task inference
          and task evaluation aren't significantly worse than Haiku 4.5).
          Gemini 3 Flash was less ideal for me, partly because while it is
          significantly better than 3 Pro, there are still issues regarding CLI
          usage that make it unreliable for me.
          
          Regardless of that, I'd like to provide some constructive feedback:
          
          1.) Unless I am mistaken, I couldn't find a public status page. Doing
          some very simple testing via the chat website, I got an error a few
          times and wanted to confirm whether it was server load/known or not,
          but couldn't
          
          2.) Your homepage looks very nice, but parts of it struggle, both on
          Firefox and Chromium, with poor performance to the point where it
          affects usability. The highlighting of the three recommended queries
          on the homepage lags heavily, same for the header bar and the
          switcher between Private and Commercial on the Early Access page
          switches at a very sluggish pace. The band showcasing your partners
          also lags below. I did remove the very nice looking diffusion
          animation you have in the background and found that memory and CPU
          usage returned to normal levels and all described issues were
          resolved, so perhaps this could be optimized further. It makes the
          experience of navigating the website rather frustrating and first
          impressions are important, especially considering the models are also
          supposed to be used in coding.
          
          3.) I can understand if that is not possible, but it would be great
          if the reasoning traces were visible on the chat homepage. Will check
          later whether they are available on the API.
          
          4.) Unless I am mistaken, I can't see the maximum output tokens
          anywhere on the website or documentation. Would be helpful if that
          were front and center. Is it still at roughly 15k?
          
          5.) Consider changing the way web search works on the chat website.
          Currently, it is enabled by default but only seems to be used by the
          model when explicitly prompted to do so (and even then the model
          doesn't search in every case). I can understand why web search is
          used sparingly as the swift experience is what you want to put front
          and center and every web search adds latency, but may I suggest
          disabling web search by default and then setting the model up so,
          when web search is enabled, that resource is more consistently relied
          upon?
          
          6.) "Try suggested prompt" returns an empty field if a user goes from
          an existing chat back to the main chat page. After a reload, the
          suggested prompt area contains said prompts again.
          
          One thing that I very much like and that has gotten my mind racing
          for PKM tasks are the follow up questions which are provided
          essentially instantly. I can see some great value, even combining
          that with another model's output to assist a user in exploring
          concepts they may not be familiar with, but will have to test,
          especially on the context/haystack front.
       
          bananapub wrote 1 day ago:
          would diffusion models benefit from things like Cerebras hardware?
       
          smusamashah wrote 1 day ago:
          Will it be possible to put this on a Taalas chip and reach even
          higher speeds?
       
          mynti wrote 1 day ago:
          I always wondered how these models would reason correctly. I suppose
          they are diffusing fixed blocks of text for every step and after the
          first block comes the next and so on (that is how it looks in the
          chat interface anyways). But what happens if at the end of the first
          block it would need information about reasoning at the beginning of
          the first block? Autoregressive Models can use these tokens to refine
          the reasoning but I guess that Diffusion Models can only adjust their
          path after every block? Is there a way maybe to have dynamic block
          length?
       
          bcherry wrote 1 day ago:
          you mention voice ai in the announcement but I wonder how this works
          in practice. most voice AI systems are bound not by full response
          latency but just by time-to-first-non-reasoning-token (because once
          it heads to TTS, the output speed is capped at the speed of speech
          and even the slowest models are generating tokens faster than that
          once they start going).
          
          what do ttft numbers look like for mercury 2? I can see how at least
          compared to other reasoning models it could improve things quite a
          bit but i'm wondering if it really makes reasoning viable in voice
          given it seems total latency is still in single digit seconds, not
          hundreds of milliseconds
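          
          The concern can be written as a simple latency budget. This is a
          sketch with made-up inputs, not measured Mercury 2 numbers: the
          function name, the 0.2 s TTFT, and the 500 reasoning tokens are all
          assumptions chosen only to show the shape of the trade-off.

```python
def time_to_first_spoken_token(ttft_s: float, reasoning_tokens: int,
                               tokens_per_second: float) -> float:
    """Seconds of silence before the first answer token can reach TTS.

    Assumes every reasoning token must be generated before the answer starts.
    """
    return ttft_s + reasoning_tokens / tokens_per_second


# Hypothetical inputs: 0.2 s to first token, 500 hidden reasoning tokens.
for label, tps in [("50 tok/s", 50.0), ("1000 tok/s", 1000.0)]:
    silence = time_to_first_spoken_token(ttft_s=0.2, reasoning_tokens=500,
                                         tokens_per_second=tps)
    print(f"{label}: {silence:.1f} s of silence")
```

          Under these assumptions higher throughput cuts the silence from
          roughly ten seconds to under one, but the result stays above a few
          hundred milliseconds unless the reasoning is also short.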
       
            PranayKumarJain wrote 23 hours 29 min ago:
            Spot on about the TTFT bottleneck. In the voice world, the
            "thinking" silence is what kills the illusion.
            
            At eboo.ai, we see this constantly: even with faster models, the
            orchestrator needs to be incredibly tight to keep the total loop
            under 500-800ms. If Mercury 2 can consistently hit low enough TTFT
            to keep the turn-taking natural, that would be a game changer for
            "smart" voice agents.
            
            Right now, most "reasoning" in voice happens asynchronously or with
            very awkward filler audio. Lowering that floor is the real
            challenge.
       
              orthoxerox wrote 19 hours 3 min ago:
              > Spot on about the TTFT bottleneck. In the voice world, the
              "thinking" silence is what kills the illusion.
              
              Are you an LLM? Because you sound like one.
       
                RussianCow wrote 16 hours 14 min ago:
                It's almost like LLMs are trained on human writing...
       
          gok wrote 1 day ago:
          Do you use fully bidirectional attention or is it at all causal?
       
          nl wrote 1 day ago:
          I had a very odd interaction somewhat similar to how weak transformer
          models get into a loop: [1] What causes this?
          
  HTML    [1]: https://gist.github.com/nlothian/cf9725e6ebc99219f480e0b72b3...
       
            volodia wrote 1 day ago:
            This looks like an inference glitch that we are working on fixing,
            thank you for flagging.
       
          nowittyusername wrote 1 day ago:
          How does the whole kv cache situation work for diffusion models? Like
          are there latency and computation/monetary savings for caching? is
          the curve similar to auto regressive caching options? or maybe such
          things dont apply at all and you can just mess with system prompt and
          dynamically change it every turn because there's no savings to be
          had? or maybe you can make dynamic changes to the head but also get
          cache savings because of diffusion based architecture?... so many
          ideas...
       
            volodia wrote 1 day ago:
            There are many ways to do it, but the simplest approach is block
            diffusion: [1] There are also more advanced approaches, for example
            FlexMDM, which essentially predicts length of the "canvas" as it
            "paints tokens" on it.
            
  HTML      [1]: https://m-arriola.com/bd3lms/
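            
            For readers unfamiliar with block diffusion, here is a toy sketch
            of the decoding schedule only: blocks are produced left to right,
            and within a block several masked slots are filled per step. It is
            not Mercury's or BD3-LM's implementation. `toy_denoiser` is a
            hypothetical stand-in that reads a known target instead of running
            a model, and the block size and tokens-per-step values are
            arbitrary.

```python
import random

MASK = "_"


def toy_denoiser(block, target, rng):
    """Hypothetical stand-in for a model: for each masked slot, return
    (position, predicted_token, confidence). A real model would predict from
    context; this toy reads the known target and invents a confidence."""
    return [(i, target[i], rng.random())
            for i, tok in enumerate(block) if tok == MASK]


def generate(target_tokens, block_size=4, tokens_per_step=2, seed=0):
    """Return (tokens, refinement_steps) for block-wise parallel unmasking."""
    rng = random.Random(seed)
    output, steps = [], 0
    # Blocks are produced left to right (the autoregressive part).
    for start in range(0, len(target_tokens), block_size):
        target = target_tokens[start:start + block_size]
        block = [MASK] * len(target)
        # Within a block, several slots are committed per step (parallel part).
        while MASK in block:
            guesses = toy_denoiser(block, target, rng)
            guesses.sort(key=lambda g: g[2], reverse=True)
            for pos, tok, _ in guesses[:tokens_per_step]:
                block[pos] = tok
            steps += 1
        output.extend(block)
    return output, steps


tokens = "the quick brown fox jumps over the lazy dog".split()
result, steps = generate(tokens)
print(" ".join(result), "|", steps, "steps for", len(tokens), "tokens")
```

            With these settings the 9 tokens take 5 refinement steps instead
            of 9 sequential ones. In this toy, a slot is never revised once
            committed, and earlier blocks stay fixed while later ones are
            filled in.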
       
          techbro92 wrote 1 day ago:
          Do you think you will be moving towards drifting models in the future
          for even more speed?
       
            volodia wrote 1 day ago:
            Not imminently, but hard to predict where the field will go
       
          kristianp wrote 1 day ago:
          How big is Mercury 2? How many tokens is it trained on?
          
          Is its agentic accuracy good enough to operate, say, coding agents
          without needing a larger model to do more difficult tasks?
       
            volodia wrote 1 day ago:
            You can think of Mercury 2 as roughly in the same intelligence tier
            as other speed-optimized models (e.g., Haiku 4.5, Grok Fast,
            GPT-Mini-class systems). The main differentiator is latency:
            it's ~5x faster at comparable quality.
            
            We're not positioning it as competing with the largest models
            (Opus 4.5, etc.) on hardest-case reasoning. It's more of a
            "fast agent" model (like Composer in Cursor, or Haiku 4.5 in
            some IDEs): strong on common coding and tool-use tasks, and
            providing very quick iteration loops.
       
              bjt12345 wrote 1 day ago:
              If latency is the differentiator, would you be chasing the edge
              compute marketplace, e.g. mobile edge compute AI agents?
       
              xanth wrote 1 day ago:
              Are you dogfooding it on simple tasks? If so what do you use it
              for regularly and what do you avoid?
       
              nayroclade wrote 1 day ago:
              Is the approach fundamentally limited to smaller models? Or could
              you theoretically train a model as powerful as the largest
              models, but much faster?
       
          CamperBob2 wrote 1 day ago:
          Seems to work pretty well, and it's especially interesting to see
          answers pop up so quickly!  It is easily fooled by the usual trick
          questions about car washes and such, but seems on par with the better
          open models when I ask it math/engineering questions, and is
          obviously much faster.
       
            volodia wrote 1 day ago:
            Thanks for trying it and for the thoughtful feedback, really
            appreciate it. And we're actively working on improving quality
            further as we scale the models.
       
        dhruv3006 wrote 1 day ago:
        I am a little underwhelmed by anything diffusion at the moment - they
        didn't really deliver.
       
          quotemstr wrote 1 day ago:
          What isn't these days? I've found it pointless to get upset about it.
       
            dhruv3006 wrote 1 day ago:
            We need a new architecture - i wonder what ilya is cooking.
       
        arjie wrote 1 day ago:
        Please pre-render your website on the server. Client-side JS means that
        my agent cannot read the press-release and that reduces the chance I am
        going to read it myself. Also, day one OpenRouter increases the chance
        that someone will try it.
       
        nylonstrung wrote 1 day ago:
        I'm not sold on diffusion models.
        
        Other labs like Google have them but they have simply trailed the
        Pareto frontier for the vast majority of use cases
        
        Here's more detail on how price/performance stacks up
        
  HTML  [1]: https://artificialanalysis.ai/models/mercury-2
       
          nylonstrung wrote 1 day ago:
          I changed my mind: this would be perfect for a fast edit model ala
          Morph Fast Apply [1] It looks like they are offering this in the form
          of "Mercury Edit" and I'm keen to try it
          
  HTML    [1]: https://www.morphllm.com/products/fastapply
       
          ainch wrote 1 day ago:
          This understates the possible headroom as technical challenges are
          addressed - text diffusion is significantly less developed than
          autoregression with transformers, and Inception are breaking new
          ground.
       
            nylonstrung wrote 1 day ago:
            Very good point - if as much energy/money as has gone into
            ChatGPT-style transformer LLMs were put into diffusion, there's a
            good chance it would outperform in every dimension
       
          volodia wrote 1 day ago:
          I'd push back a bit on the Pareto point.
          
          On speed/quality, diffusion has actually moved the frontier. At
          comparable quality levels, Mercury is >5x faster than similar AR
          models (including the ones referenced on the AA page). So for a fixed
          quality target, you can get meaningfully higher throughput.
          
          That said, I agree diffusion models today don't yet match the very
          largest AR systems (Opus, Gemini Pro, etc.) on absolute intelligence.
          That's not surprising: we're starting from smaller models and
          gradually scaling up. The roadmap is to scale intelligence while
          preserving the large inference-time advantage.
       
        mhitza wrote 1 day ago:
        Comment retracted. My bad, missed some details.
       
          pants2 wrote 1 day ago:
          Reading such obvious LLM-isms in the announcement just makes me
          cringe a bit too, ex.
          
          > We optimize for speed users actually feel: responsiveness in the
          moments users experience - p95 latency under high concurrency,
          consistent turn-to-turn behavior, and stable throughput when systems
          get busy.
       
          selcuka wrote 1 day ago:
          I think your comment is a bit unfair.
          
          > no reasoning comparison
          
          Benchmarks against reasoning models: [1]
          
          > no demo
          
          [2]
          
          > no info on numbers of parameters for the model
          
          This is a closed model. Do other providers publish the number of
          parameters for their models?
          
          > testimonials that don't actually read like something used in
          production
          
          Fair point.
          
  HTML    [1]: https://www.inceptionlabs.ai/blog/introducing-mercury-2
  HTML    [2]: https://chat.inceptionlabs.ai/
       
            volodia wrote 1 day ago:
            Just to clarify one point: Mercury (the original v1, non-reasoning
            model) is already used in production in mainstream IDEs like Zed:
            [1]
            
            Mercury v1 focused on autocomplete and next-edit prediction.
            Mercury 2 extends that into reasoning and agent-style workflows,
            and we have editor integrations available (docs linked from the
            blog). I'd encourage folks to try the models!
            
  HTML      [1]: https://zed.dev/blog/edit-prediction-providers
       
            mhitza wrote 1 day ago:
            You are right, I edited my post (twice actually). Missed the chat
            the first time around (though it's hard to see it as a reasoning
            model when the chain of thought is hidden, or not obvious. I guess
            this is the new normal), and also missed the reasoning table
            because the text is pretty small on mobile and I thought it was
            another speed benchmark.
       
              selcuka wrote 1 day ago:
              I tried their chat demo again, and if you set reasoning effort to
              "High", you sometimes see the chain of thought before the answer
              (click the "Thought for n seconds" text to expand it).
              
              That being said, the chain is pretty basic. It's possible that
              they don't disclose the full follow-up prompt list.
       
        ilaksh wrote 1 day ago:
        It seems like the chat demo is really suffering from the effect of
        everything going into a queue. You can't actually tell that it is fast
        at all. The latency is not good.
        
        Assuming that's what is causing this. They might show some kind of
        feedback when it actually makes it out of the queue.
       
          volodia wrote 1 day ago:
          Thank you for your patience. We are working to handle the surge in
          demand.
       
        cjbarber wrote 1 day ago:
        It could be interesting to do the metric of intelligence per second.
        
        ie intelligence per token, and then tokens per second
        
        My current feel is that if Sonnet 4.6 was 5x faster than Opus 4.6, I'd
        be primarily using Sonnet 4.6. But that wasn't true for me with prior
        model generations, in those generations the Sonnet class models didn't
        feel good enough compared to the Opus class models. And it might shift
        again when I'm doing things that feel more intelligence bottlenecked.
        
        But fast responses have an advantage of their own: they give you faster
        iteration. Kind of like how I used to like OpenAI Deep Research, but
        then switched to o3-thinking with web search enabled after that came
        out because it was 80% of the thoroughness in 20% of the time, which
        tended to be better overall.
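        
        A minimal sketch of the decomposition above (intelligence per second
        as intelligence per token times tokens per second). The function name
        and every number are hypothetical, chosen only to show how a weaker
        but much faster model can come out ahead on the combined rate; none
        of them come from a real benchmark.

```python
# Hypothetical numbers for illustration only; none come from a real benchmark.
def quality_per_second(quality_per_token: float, tokens_per_second: float) -> float:
    """Compose the two rates: (quality / token) * (tokens / second)."""
    return quality_per_token * tokens_per_second


# A "smarter but slower" model versus a "weaker but faster" one.
slow_smart = quality_per_second(quality_per_token=1.0, tokens_per_second=50.0)
fast_modest = quality_per_second(quality_per_token=0.8, tokens_per_second=250.0)

print(slow_smart, fast_modest)
```

        The hard part is the first factor: the sketch assumes some agreed
        per-token quality score exists, which is exactly what is in dispute.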
       
          irishcoffee wrote 20 hours 2 min ago:
          I really thought this was sarcasm. Intelligence per token?
          Intelligence at all, in a token? We don't even agree on how to
          measure _human_ intelligence! I just can't. Artificially
          intelligent indeed. Probably the perfect term for it, you know in
          lieu of authentic intelligence.
          
          picard_facepalm.jpg
       
          jakubtomanik wrote 23 hours 41 min ago:
          Intelligence per second is a great metric. I never could fully
          articulate why I like Gemini 3 Flash but this is exactly why. It's
          smart enough and unbelievably fast. Thanks for sharing this
       
          jdthedisciple wrote 1 day ago:
          Interesting suggestion.
          
          Maybe we could use some sort of entropy-based metric as a proxy for
          that?
       
          dmichulke wrote 1 day ago:
          Useful for evaluating people as well
       
          estsauver wrote 1 day ago:
          I think there's clearly a "Speed is a quality of its own" axis. When
          you use Cerebras (or Groq) to develop an API, the turnaround speed
          of iterating on jobs is so much faster (and cheaper!) than using
          frontier high intelligence labs, it's almost a different product.
          
          Also, I put together a little research paper recently [1]--I think
          there's probably an underexplored option of "Use frontier AR model
          for a little bit of planning then switch to diffusion for generating
          the rest." You can get really good improvements with diffusion
          models!
          
  HTML    [1]: https://estsauver.com/think-first-diffuse-fast.pdf
       
            refulgentis wrote 1 day ago:
            I'm very worried for both.
            
            Cerebras requires a $3K/year membership to use APIs.
            
            Groq's been dead for about 6 months, even pre-acquisition.
            
            I hope Inception is going well, it's the only real democratic
            target at this. Gemini 2.5 Flash Lite was promising but it never
            really went anywhere, even by the standards of a Google preview
       
              Leynos wrote 1 day ago:
              Cerebras are on OpenRouter.
       
              behnamoh wrote 1 day ago:
              Once again, it's a tech that Google created but never turned into
              a product. AFAIK in their demo last year, Google showed a special
              version of Gemini that used diffusion. They were so excited about
              it (on the stage) and I thought that's what they'd use in Google
              search and Gmail.
       
              estsauver wrote 1 day ago:
              I am currently using their APIs on a paygo plan, I think it might
              just be a capacity issue for new sign ups.
       
              ainch wrote 1 day ago:
              I don't think it's a good comparison given Inception work on
              software and Cerebras/Groq work on hardware. If Inception
              demonstrate that diffusion LLMs work well at scale (at a
              reasonable price) then we can probably expect all the other
              frontier labs to copy them quickly, similarly to OpenAI's
              reasoning models.
       
                refulgentis wrote 1 day ago:
                Definitely depends on what you're buying, maybe some of the
                audience here was buying Groq and Cerebras chips? I don't think
                they sold them but can't say for sure.
                
                If you're a poor schmoke like me, you'd be thinking of them as
                API vendors of ~1000 token/s LLMs.
                
                Especially because Inception v1's been out for a while and we
                haven't seen a follow-the-leader effect.
                
                Coincidentally, that's one of my biggest questions: why not?
       
              7thpower wrote 1 day ago:
              What do you mean by Groq is dead since about 6 months ago? Not
              refuting your point, but I'm curious.
       
                refulgentis wrote 1 day ago:
                No new model since GPT-OSS 120B, er maybe Kimi K2 not-thinking?
                Basically there were a couple of models it would normally
                obviously support, and it didn't.
                
                Something about that Nvidia sale smelled funny to me because
                the # was yuge, yet, the software side shut down decently
                before the acquisition.
                
                But that's 100% speculation, wouldn't be shocked if it was:
                
                "We were never looking to become profitable just on API users,
                but we had to have it to stay visible. So, yeah, once it was
                clear an Nvidia sale was going through, we stopped working 16
                hours a day, and now we're waiting to see what Nvidia wants to
                do with the API"
       
                  vessenes wrote 18 hours 28 min ago:
                  The Groq purchase was designed to not trigger federal
                  oversight of mergers, so you buy out the "interesting"
                  part, leave a skeleton team and a line of business you
                  don't care about -> no CFIUS, no mandatory FTC reporting ->
                  smoother process.
       
              nl wrote 1 day ago:
              Taalas is interesting. 16,000 TPS for Llama on a chip.
              
  HTML        [1]: https://taalas.com/
       
                Nihilartikel wrote 21 hours 18 min ago:
                Neat! I had been wondering if anyone was trying to implement a
                model in silico. We're getting closer to having chatty talking
                toasters every day now!
       
                  empath75 wrote 20 hours 45 min ago:
                  "What is my purpose..."
                  
  HTML            [1]: https://www.youtube.com/watch?v=sa9MpLXuLs0
       
                replete wrote 1 day ago:
                It's exciting to see, but look at the die size for only an 8B
                model
       
                DeathArrow wrote 1 day ago:
                I wonder how many tokens per second they can get if they put
                Mercury 2 on a chip.
       
                micw wrote 1 day ago:
                On a very old model, it's more like 16.000 garbage words/s
       
                  patapong wrote 22 hours 58 min ago:
                  I do wonder if there are tasks where 16k garbage words/s are
                  more useful than 200 good words per second. Does anyone have
                  any ideas? Data extraction perhaps?
       
                    pnocera wrote 14 hours 14 min ago:
                    A politician communication agent maybe...
       
                  nl wrote 1 day ago:
                  Llama 3.1 8B is pretty useful for some things. I use it to
                  generate SQL pretty reliably for example.
                  
                  They are doing an updated model in a month or so anyway, then
                  a frontier level one "by summer".
       
                    numeri wrote 18 hours 35 min ago:
                    but Taalas had to quantize Llama 3.1 8B to death to get it
                    to fit. It can't produce coherent non-English text at all.
       
              freeqaz wrote 1 day ago:
              You can call Cerebras APIs via OpenRouter if you specify them as
              the provider in your request fyi. It's a bit pricier but it
              exists!
       
                andai wrote 1 day ago:
                I used their API normally (pay per token) a few weeks ago.
                Their Coding Plan appears to be permanently sold out though.
       
          volodia wrote 1 day ago:
          We agree! In fact, there is an emerging class of models aimed at fast
          agentic iteration (think of Composer, the Flash versions of
          proprietary and open models). We position Mercury 2 as a strong model
          in this category.
       
            estsauver wrote 1 day ago:
            Do you guys all think you'll be able to convert open source models
            to diffusion models relatively cheaply ala the d1 // LLaDA series
            of papers? If so, that seems like an extremely powerful story where
            you get to retool the much, much larger capex of open models into
            high performance diffusion models.
            
            (I can also see a world where it just doesn't make sense to share
            most of the layers/infra and you diverge, but curious how you all
            see the approach.)
       
          bigbuppo wrote 1 day ago:
          Maybe make that intelligence per token per relative unit of hardware
          per watt. If you're burning 30 tons of coal to be 0.0000000001%
          better than the 5 tons of coal option because you're throwing more
          hardware at it, well, it's not much of a real improvement.
       
            estsauver wrote 1 day ago:
            I think the fast inference options have historically been only
            marginally more expensive than their slow cousins. There's a whole
            set of research about optimal efficiency, speed, and intelligence
            Pareto curves. If you can deliver even an outdated low
            intelligence/old model at high efficiency, everyone will be
            interested. If you can deliver a model very fast, everyone will be
            interested. (If you can deliver a very smart model, everyone is
            obviously the most interested, but that's the free space.)
            
            But to be clear, 1000 tokens/second is WAY better. Anthropic's
            Haiku serves at ~50 tokens per second.
       
          josephg wrote 1 day ago:
          Yeah I agree with this. We might be able to benchmark it soon (if we
          can't already) by asking different agentic code models to produce
          some relatively simple pieces of software. Fast models can iterate
          faster. Big models will write better code on the first attempt, and
          need less loop debugging. Who will win?
          
          At the moment I'm loving Opus 4.6 but I have no idea if its extra
          intelligence makes it worth using over Sonnet. Some data would be
          great!
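          
          One rough way to frame the "who will win?" question, as a sketch
          rather than a benchmark: if each attempt passes independently with
          some probability, the number of attempts is geometric and the
          expected time to a working result is time per attempt divided by
          that probability. The function name, the timings, and the pass
          rates below are all hypothetical.

```python
def expected_seconds_to_success(seconds_per_attempt: float, p_success: float) -> float:
    """Expected wall-clock time until the first passing attempt.

    Assumes independent attempts, so the number of attempts is geometric
    with mean 1 / p_success. Real agent loops learn from failed attempts,
    so this is only a first-order estimate.
    """
    if not 0.0 < p_success <= 1.0:
        raise ValueError("p_success must be in (0, 1]")
    return seconds_per_attempt / p_success


# Hypothetical figures: a slow model that usually gets it right versus a
# fast model that needs more loops.
big_model = expected_seconds_to_success(seconds_per_attempt=60.0, p_success=0.5)
fast_model = expected_seconds_to_success(seconds_per_attempt=6.0, p_success=0.25)

print(big_model, fast_model)
```

          Under these made-up inputs the fast model wins, but the answer
          flips as soon as its pass rate drops far enough, which is why
          measured data would be needed.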
       
            estsauver wrote 1 day ago:
            For what it's worth, most people already are doing this! Some of
            the subagents in Claude Code (Explore, I think even compaction)
            default to Haiku and then you have to manually override it with an
            env variable if you want to change it.
            
            Imagine the quality of life upgrade of getting compaction down to a
            few second blip, or the "Explore" going 20 times faster! As these
            models get better, it will be super exciting!
       
              embedding-shape wrote 1 day ago:
              > Imagine the quality of life upgrade of getting compaction down
              to a few second blip, or the "Explore" going 20 times faster! As
              these models get better, it will be super exciting!
              
              I'm awaiting the day the small and fast models come anywhere
              close to acceptable quality, as of today, neither
              GPT5.3-codex-spark nor Haiku are very suitable for either
              compaction or similar tasks, as they'll miss so much considering
              they're quite a lot dumber.
              
              Personally I do it the other way, the compaction done by the
              biggest model I can run, the planning as well, but then actually
              following the step-by-step "implement it" is done by a small
              model. It seemed to me like letting a smaller model do the
              compaction or writing overviews just makes things worse, even if
              they get a lot faster.
       
          nubg wrote 1 day ago:
          Interesting perspective. Perhaps also the user would adapt his
          queries knowing he can only do small (but very fast) steps. I wonder
          who would win!
       
        tl2do wrote 1 day ago:
        Genuine question: what kinds of workloads benefit most from this speed?
        In my coding use, I still hit limitations even with stronger models, so
        I'm interested in where a much faster model changes the outcome rather
        than just reducing latency.
       
          storus wrote 23 hours 7 min ago:
          I'd say using them as draft models for some strong AR model, speeding
          it up 3x. Diffusion generates a bunch of tokens extremely fast; those
          can then be passed over to an AR model to accept/reject instead of
          it generating them itself.
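          
          A toy sketch of that draft-then-verify idea, in the style of greedy
          speculative decoding. Everything here is an assumption made for
          illustration: the "models" are plain Python functions, the drafter
          is not a real diffusion model, and acceptance is exact match with
          the target's greedy token rather than the probabilistic rule used
          in speculative sampling.

```python
from typing import Callable, List, Tuple

Token = str
# A toy "model": maps a token prefix to its next greedy token.
NextToken = Callable[[List[Token]], Token]


def draft_block(draft: NextToken, prefix: List[Token], k: int) -> List[Token]:
    """Stand-in for the fast drafter: propose k tokens after the prefix."""
    out: List[Token] = []
    for _ in range(k):
        out.append(draft(prefix + out))
    return out


def verify_block(target: NextToken, prefix: List[Token],
                 proposal: List[Token]) -> List[Token]:
    """Accept proposed tokens while they match the target's greedy choice.

    On the first mismatch, keep the target's own token and stop, so every
    round advances by at least one token. A real system would score all
    positions in one parallel forward pass; this loop is sequential only
    to keep the sketch short.
    """
    accepted: List[Token] = []
    for tok in proposal:
        expected = target(prefix + accepted)
        if tok != expected:
            accepted.append(expected)
            return accepted
        accepted.append(tok)
    return accepted


def generate(target: NextToken, draft: NextToken, prompt: List[Token],
             n_tokens: int, k: int = 4) -> Tuple[List[Token], int]:
    """Return n_tokens of output plus the number of verify rounds used."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < n_tokens:
        out.extend(verify_block(target, out, draft_block(draft, out, k)))
        rounds += 1
    return out[len(prompt):len(prompt) + n_tokens], rounds


SENTENCE = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]


def toy_target(prefix: List[Token]) -> Token:
    return SENTENCE[len(prefix) % len(SENTENCE)]


def toy_draft(prefix: List[Token]) -> Token:
    # Agrees with the target except at every fifth position.
    return "???" if len(prefix) % 5 == 4 else toy_target(prefix)


tokens, rounds = generate(toy_target, toy_draft, prompt=[], n_tokens=12)
print(tokens, rounds)
```

          The output is identical to what the target would produce alone;
          the only thing the drafter changes is how many target rounds are
          needed, so any speedup depends on how often the drafts are accepted.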
       
          corysama wrote 1 day ago:
          Coding auto-complete?
       
          volodia wrote 1 day ago:
          There are a few: fast agents, deep research, real-time voice, coding.
          The other thing is that when you have a fast reasoning model, you
          can spend more effort on thinking in the same latency budget, which
          pushes up quality.
       
          quotemstr wrote 1 day ago:
          Once you make a model fast and small enough, it starts to become
          practical to use LLMs for things as mundane as spell checking,
          touchscreen-keyboard tap disambiguation, and database query planning.
          If the fast, small model is multimodal, use it in a microwave to make
          a better DWIM auto-cook.
          
          Hell, want to do syntax highlighting? Just throw buffer text into an
          ultra-fast LLM.
          
          It's easy to overlook how many small day-to-day heuristic schemes can
          be replaced with AI. It's almost embarrassing to think about all the
          totally mundane uses to which we can put fast, modest intelligence.
       
          cjbarber wrote 1 day ago:
          I've tried a few computer use and browser use tools and they feel
          relatively tok/s bottlenecked.
          
          And in some sense, all of my claude code usage feels tok/s
          bottlenecked. There's never really a time where I'm glad to wait for
          the tokens, I'd always prefer faster.
       
          layoric wrote 1 day ago:
          I think it would assist in exploring multiple solution spaces in
          parallel, and I can see it, with the right user in the loop and a
          harness wrapping tools like compilers, static analysis, tests, etc.,
          being able to iterate very quickly on multiple solutions. An example
          might be, "I need to optimize this SQL query" pointed at a locally
          running Postgres. Multiple changes could be tested and combined,
          using the explain plan to validate performance and a test to check
          for correct results. Then only valid solutions would be presented to
          the developer for review. I don't personally care about the model's
          'opinion' or recommendations; using them for architectural choices
          is, IMO, a flawed use of a coding tool.
          
          It doesn't change the fact that the most important thing is
          verification/validation of their output, either from tools or from a
          developer reviewing/making decisions. But even if you don't want that
          approach, diffusion models are just a lot more efficient, it seems.
          I'm interested to see if they are just a better match for common
          developer tasks to assist with validation/verification systems, not
          just writing (likely wrong) code faster.
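          
          A small sketch of that generate-then-validate loop. To stay
          self-contained it uses Python's built-in sqlite3 with an in-memory
          database as a stand-in for the local Postgres mentioned above; the
          table, the data, and the candidate queries are all made up, and the
          candidates are hard-coded where a model would normally supply them.

```python
import sqlite3


def validated_candidates(conn, baseline_sql, candidate_sqls):
    """Return (sql, plan) for candidates whose rows match the baseline's.

    Matching on one dataset is evidence, not proof, of equivalence; a real
    harness would test against several datasets.
    """
    expected = sorted(conn.execute(baseline_sql).fetchall())
    valid = []
    for sql in candidate_sqls:
        try:
            rows = sorted(conn.execute(sql).fetchall())
        except sqlite3.Error:
            continue  # candidate does not even run; drop it
        if rows == expected:
            # Keep the query plan so a reviewer can compare performance.
            plan = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
            valid.append((sql, plan))
    return valid


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders (customer, total) VALUES (?, ?)",
    [("ann", 10.0), ("bob", 25.0), ("ann", 40.0)],
)
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer)")

# Baseline: correct, but the string concatenation prevents index use.
baseline = "SELECT id, total FROM orders WHERE customer || '' = 'ann'"
candidates = [
    "SELECT id, total FROM orders WHERE customer = 'ann'",  # same rows
    "SELECT id, total FROM orders WHERE customer = 'bob'",  # wrong rows
    "SELECT id, total FROM orderz WHERE customer = 'ann'",  # does not run
]

survivors = validated_candidates(conn, baseline, candidates)
for sql, plan in survivors:
    print(sql, plan)
```

          Only the first candidate survives, so the developer reviews one
          query and its plan instead of three unverified suggestions.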
       
          irthomasthomas wrote 1 day ago:
          multi-model arbitration, synthesis, parallel reasoning etc. Judging
          large models with small models is quite effective.
       
        dvt wrote 1 day ago:
        What excites me most about these new four-figure tokens/second models
        is that you can essentially do multi-shot prompting (+ nudging) and
        the user doesn't even feel it, potentially fixing some of the weird
        hallucinatory/non-deterministic behavior we sometimes end up with.
       
          lostmsu wrote 1 day ago:
          Regular models are very fast if you do batch inference. GPT-OSS 20B
          gets close to 2k tok/s on a single 3090 at bs=64 (might be
          misremembering details here).
       
            rahimnathwani wrote 1 day ago:
            Right but everyone else is talking about latency, not throughput.
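            
            To put rough numbers on the latency/throughput distinction, a
            back-of-the-envelope sketch using the figures quoted in this
            thread (about 2k tok/s aggregate at bs=64, and about 1000 tok/s
            for a single fast stream). It assumes the batch throughput is
            split evenly across requests and ignores prompt processing; the
            500-token reply length is an arbitrary example.

```python
def per_stream_tokens_per_second(aggregate_tok_s: float, batch_size: int) -> float:
    """Aggregate throughput is shared across the batch; each request sees a slice."""
    return aggregate_tok_s / batch_size


def seconds_for_reply(reply_tokens: int, stream_tok_s: float) -> float:
    """Time for one user to receive a full reply at a given per-stream rate."""
    return reply_tokens / stream_tok_s


batched_stream = per_stream_tokens_per_second(2000.0, 64)
batched_wait = seconds_for_reply(500, batched_stream)
fast_wait = seconds_for_reply(500, 1000.0)

print(batched_stream)  # 31.25 tok/s seen by each request
print(batched_wait)    # 16.0 seconds for a 500-token reply
print(fast_wait)       # 0.5 seconds for the same reply
```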
       
          volodia wrote 1 day ago:
          That is also our view! We see Mercury 2 as enabling very fast
          iteration for agentic tasks. A single shot at a problem might be less
          accurate, but because the model has a shorter execution time, it
          enables users to iterate much more quickly.
       
       
   DIR <- back to front page