        _______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                             on Gopher (inofficial)
  HTML Visit Hacker News on the Web
       
       
       COMMENT PAGE FOR:
  HTML   Kimi K2.7-Code: open-source coding model with better token efficiency
       
       
        madduci wrote 10 hours 27 min ago:
        Looks interesting but yet no Ollama model?
       
        XCSme wrote 22 hours 13 min ago:
        Seems to be at a similar level to Kimi K2.6, just that it's more token
        efficient and cheaper to run:
        
  HTML  [1]: https://aibenchy.com/compare/moonshotai-kimi-k2-6-medium/moons...
       
        pizlonator wrote 22 hours 41 min ago:
        I just had Kimi K2.7-code rebase my Fil-C OpenSSL patch from 3.3.1 to
        3.5.7 with quite bare bones instructions and it seems to have worked.
        
        177KB patch, so it's not a small change. The patch did not apply
        cleanly initially; the agent had to do nontrivial work.
        
        I just showed it the patch against 3.3.1, what command to use to build,
        and the path to 3.5.7 along with a link to the documentation of the
        change ( [1] ).
        
        Note, I use my own coding agent (T800, which isn't public, and was
        previously well tested and tuned for K2.5).
        
        I think this cost me between $5 and $10 in API usage.
        
        (EDIT: OpenSSL, not OpenSSH)
        
  HTML  [1]: https://fil-c.org/constant_time_crypto
       
          tomaytotomato wrote 22 hours 11 min ago:
          "T800"
          
          Do you have your agent say things like "Hasta la vista baby", or
          "I'll be back, after I clear my context" ?
       
            pizlonator wrote 21 hours 49 min ago:
            Yes
       
        Symmetry wrote 23 hours 19 min ago:
        I wish they wouldn't call these "open source" models.  The output
        weights are open but that's more analogous to a binary.  The source
        would be the training data and techniques that went into producing the
        binary/weights.
        
        "Open weights" is also a term in wide use and accurately tells us what
        we're getting.
       
          Eridrus wrote 23 hours 14 min ago:
          It's not quite as closed as a binary, it is very standard practice to
          take these models and fine-tune them.
          
          If there were actually even close to frontier open source models,
          this would be more of a discussion, but everyone knows these mean
          open weight.
       
        storus wrote 1 day ago:
        Is this Moonshot.ai's attempt to replicate Composer 2.5 (coding
        fine-tune of Kimi 2.5) from Cursor IDE?
       
        SubiculumCode wrote 1 day ago:
        Has anyone taken these open weight models from China and stripped the
        CCP out of them? I do not mean that snarkily, I mean review them
        thoroughly using techniques for weight introspection (concept
        activations) in response to things that one might expect would trigger
        deceptive/malicious behavior if the CCP had actually tried to implant
        context-specific behaviors (e.g. the accusation of generating
        vulnerable code if being used in American government applications,
        which I don't know if it was ever proven).
        
        Just in case there are those who'd reflexively down vote this post, I'd
        just like to say that in a time of great national geopolitical
        rivalries, this kind of question is not an unreasonable one to ask.
        Indeed, it's an applicable question whichever nation you live in.
       
          tomaytotomato wrote 22 hours 5 min ago:
          Check out TNG on huggingface
          
          They are a consultancy in Germany, but I watched a presentation on
          them tuning and removing bias from Deepseek models. It was quite
          interesting. [1] (I upvoted your question as I agree)
          
          It's not just code we need to worry about, it's also subliminal
          messaging and other things.
          
  HTML    [1]: https://www.tngtech.com/en/about-us/news/release-of-deepseek...
       
          dev_l1x_be wrote 1 day ago:
          > Has anyone taken these open weight models from China and stripped
          the CCP out of them?
          
          The CCP is not influencing my Rust code quality that much. Though I
          did notice all my lifetimes are now 'static because nothing is ever
          allowed to leave the party's ownership, unsafe blocks require
          approval from a central committee.
          
          Honestly the scariest part is that shared mutable state is forbidden
          unless the state is doing the sharing.
          
          Otherwise it is pretty ok.
       
          justinclift wrote 1 day ago:
          Sounds like something that heretic or similar might be useful for?
          
  HTML    [1]: https://github.com/p-e-w/heretic
       
          threethirtytwo wrote 1 day ago:
          Eh, even corporate-created LLMs are susceptible to corporate biases.
          Nothing is safe.
       
            SubiculumCode wrote 1 day ago:
            "Everything is the same" is not a serious argument because they
            are not the same.
       
              threethirtytwo wrote 17 hours 53 min ago:
              They are different and yet the same. The biggest difference is
              there’s generally more hatred for China because many US
              citizens are jealous. But corporate corruption is not that
              different in safety.
              
              Other than hatred the difference lies in incentives. Corporations
              want profit. China just wants to spy.
       
                SubiculumCode wrote 10 hours 2 min ago:
                That is a limited understanding of China's ambitions here.
       
        Bnjoroge wrote 1 day ago:
        Output tokens are almost 5x more expensive than MiMo-V2.5-Pro/DeepSeek
        V4-Pro. I’m curious to see if Kimi K2.7 is that much better. Feels like
        Kimi are positioning themselves as the premium open source models.
       
          btian wrote 18 hours 7 min ago:
          It's not more expensive at all. They are all open weights models.
          I run them on 2x8xH100.
          They cost the same.
       
            Bnjoroge wrote 13 hours 6 min ago:
            Openrouter has them as significantly more expensive.
       
          mdasen wrote 22 hours 46 min ago:
          I find that I don't use a ton of output tokens. I'm usually around
          95% cached input, 4% input, and 1% output.
          
          For me, the big thing with MiMo-V2.5-Pro and DeepSeek V4-Pro is that
          cached inputs are practically free. Kimi K2.7 Code is 53x more
          expensive for cached inputs which is 95% of my costs.
          
          If I use 95M cached input tokens, 4M input tokens, and 1M output
          tokens, that'd be: $18 for cached input on Kimi K2.7 Code vs $0.34
          with MiMo/DS; $3.80 for inputs on Kimi vs $1.74 with MiMo/DS; and $4
          for output on Kimi vs $0.87 with MiMo/DS.
          
          Of all the places where I'm accumulating costs by using Kimi, it's
          the cached inputs. The real savings with MiMo/DS's price cut is the
          cached inputs.
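
A rough sketch of the arithmetic in the comment above. The dollar totals are the ones quoted there for a 95M / 4M / 1M token mix; the per-million prices are only implied by dividing those totals, and the helper names are made up for illustration.

```python
# Dollar totals quoted in the comment above for 95M cached input,
# 4M input, and 1M output tokens. Nothing here comes from a price list.
TOKENS_M = {"cached_input": 95, "input": 4, "output": 1}  # millions of tokens
COST_USD = {
    "Kimi K2.7 Code": {"cached_input": 18.00, "input": 3.80, "output": 4.00},
    "MiMo/DS": {"cached_input": 0.34, "input": 1.74, "output": 0.87},
}


def total_cost(model):
    """Total bill in USD for the quoted token mix."""
    return round(sum(COST_USD[model].values()), 2)


def implied_price_per_million(model, kind):
    """Price per million tokens implied by the quoted totals."""
    return COST_USD[model][kind] / TOKENS_M[kind]


def cached_share(model):
    """Fraction of the total bill that goes to cached input."""
    costs = COST_USD[model]
    return costs["cached_input"] / sum(costs.values())


for model in COST_USD:
    print(model, total_cost(model), f"{cached_share(model):.0%} cached")

# Ratio of the two cached-input bills, the "53x" figure in the comment.
ratio = COST_USD["Kimi K2.7 Code"]["cached_input"] / COST_USD["MiMo/DS"]["cached_input"]
print(f"cached input ratio: {ratio:.1f}x")
```

On these numbers the whole bill is about $25.80 on Kimi versus about $2.95 on MiMo/DS, and cached input is the largest single line item on the Kimi side.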
       
            wolttam wrote 21 hours 39 min ago:
            95/4/1 holds here too
       
        theanonymousone wrote 1 day ago:
        In OpenRouter, there is an "int4" tag for the Moonshot provider of Kimi
        K2.7 Code. Isn't that too low, particularly coming from the very
        developer of the model? Is that a mistake? How is it in their direct API
        offer?
       
          kouteiheika wrote 1 day ago:
          The model is natively quantized (i.e. it was trained that way in the
          first place, so this is not a post-training quantization which
          degrades performance).
       
            knollimar wrote 23 hours 15 min ago:
            Isn't it not completely quantized? I thought there were some dense
            parts but most is int4?
       
              wgd wrote 19 hours 14 min ago:
              Often in MoE models the experts are quantized while the shared
              portions, being a much smaller part of the network with greater
              impact, are kept at higher or full precision. Not familiar with
              the Kimi QAT approach specifically but it's likely they do this.
       
            theanonymousone wrote 1 day ago:
            But the huggingface link mentions BF16, F16, and I32?
       
              zackangelo wrote 23 hours 8 min ago:
              I don't believe safetensors has a native int4 dtype, so they
              packed 4 int4s into a bf16 in this checkpoint.
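
For anyone wondering what packing int4 values into a wider dtype looks like, here is a minimal sketch. It packs eight signed 4-bit values into one 32-bit word (the thread mentions I32 tensors in the checkpoint). The nibble order and two's-complement encoding are assumptions made for illustration, not the actual layout of the Kimi checkpoint.

```python
def pack_int4(values):
    """Pack 8 signed 4-bit ints (-8..7) into one unsigned 32-bit word.

    Assumed layout: value 0 goes in the lowest nibble, two's complement.
    """
    if len(values) != 8:
        raise ValueError("expected exactly 8 values")
    word = 0
    for i, v in enumerate(values):
        if not -8 <= v <= 7:
            raise ValueError(f"{v} does not fit in a signed 4-bit integer")
        word |= (v & 0xF) << (4 * i)  # keep the low 4 bits, shift into place
    return word


def unpack_int4(word):
    """Inverse of pack_int4: recover the 8 signed 4-bit ints."""
    out = []
    for i in range(8):
        nibble = (word >> (4 * i)) & 0xF
        # Nibbles 8..15 encode the negative values -8..-1.
        out.append(nibble - 16 if nibble >= 8 else nibble)
    return out


weights = [-8, -1, 0, 1, 7, 3, -4, 2]
packed = pack_int4(weights)
print(hex(packed), unpack_int4(packed))
```

The same idea works for four nibbles in a 16-bit container; only the loop bound changes.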
       
              kouteiheika wrote 23 hours 46 min ago:
              Not every weight is quantized. For example, those weights which
              don't take much space or are highly important are left in higher
              precision. State-of-the-art quantization of weights is never done
              uniformly (i.e. to all weights and in the same way).
       
        pcwelder wrote 1 day ago:
        Great! Finally follows custom tool call format (K2.6 couldn't). It's a
        good indicator of instruction following and agentic behaviour.
        
        The UIs it's generating are pretty good, not without problems, but
        certainly better than other models at this price point.
       
          Bolwin wrote 1 day ago:
          What do you mean by custom format? Non-json?
       
            pcwelder wrote 10 hours 14 min ago:
            Could be json or non json. Instead of using tools in API, you ask
            model to share structured output in text. You parse the string to
            get the JSON. Gives much more control over things you can do.
            
            For example model shares
            
            London
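
The example in the comment above appears to have lost its markup, so here is a sketch of the general idea: the prompt asks the model to emit a structured call as plain text, and the caller parses it out of the reply. The `TOOL_CALL:` marker and the `get_weather` name are hypothetical, not anything Kimi or the commenter specified.

```python
import json

MARKER = "TOOL_CALL:"  # hypothetical marker the prompt asks the model to emit


def extract_tool_call(text):
    """Return the first JSON object that follows MARKER, or None."""
    start = text.find(MARKER)
    if start == -1:
        return None
    payload = text[start + len(MARKER):].lstrip()
    try:
        # raw_decode parses one JSON value and ignores any trailing text.
        obj, _end = json.JSONDecoder().raw_decode(payload)
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) else None


reply = (
    "Let me look that up.\n"
    'TOOL_CALL: {"name": "get_weather", "arguments": {"city": "London"}}\n'
    "I will report back once I have the result."
)
call = extract_tool_call(reply)
print(call)
```

Returning None on a missing or malformed call lets the harness decide whether to re-prompt, which is where the extra control comes from.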
       
        minraws wrote 1 day ago:
        I tested it properly and it seems a rather decent improvement. At least
        it uses fewer tokens for the same task, which is a good enough reason
        for me to use it over K2.6 if I need an open model.
       
        RIshabh235 wrote 1 day ago:
        I think deepseek has crossed the threshold for being on par with opus
        4.6 and kimi is doing a great job in shipping velocity.
       
          pixel_popping wrote 1 day ago:
          Deepseek V4 is far from Opus 4.6 level, it might look like it at
          first glance, but the general reasoning (especially multi-steps) is
          frankly far off. It's good enough to build great things don't get me
          wrong, but there is really something that is different from Anthropic
          models.
       
            RIshabh235 wrote 14 hours 18 min ago:
            agreed
       
        jdw64 wrote 1 day ago:
        Personally, when I use open code or routers, I feel that beyond a
        certain level, the models don't make a huge difference to me. Except
        for expensive and mediocre models like Gemini. In that sense, Chinese
        models are pretty good. I usually write code in function or method
        units and then design and assemble them together.
        
        GPT series models are more thorough and better, but I'm not sure if the
        difference is enormous. It seems to depend on the workflow, but in my
        opinion, if you are thorough enough, I wonder if there really is a big
        difference
       
          regularfry wrote 1 day ago:
          The difference in outcome isn't that big but yes, you need to be more
          rigorous. For instance I've found that the Kimi K2.5 and K2.6 models
          will comment out failing tests rather than fix a problem they just
          caused (mistaking them for "pre-existing failures"), so you need to
          specifically make commented-out tests break the build. I've not
          personally had that problem with any of the Anthropic or OpenAI
          models.
       
            torginus wrote 23 hours 49 min ago:
            I wonder why it's the natural tendency of models to BS or do stuff
            like this when they don't have the correct answer - it's clear that
            they can program refusal into them, but for some reason, refusal
            has to be injected after the fact, and models can't really arrive
            at the conclusion that they can't answer properly.
       
              lotharcable wrote 13 hours 30 min ago:
              probably because there is a ton of open source projects out there
              with disabled tests in their training data.
       
              Eridrus wrote 23 hours 12 min ago:
              I assume it's a lack of care when RLing them.
              
              RL has a tendency to reinforce cheating when the cheats are
              easier to find than the final solution.
              
              So when making your RL environment, you need to spend a lot of
              effort on finding ways the model can cheat and penalizing them.
       
          sjanes wrote 1 day ago:
          I've kind of given up on the routers for "free" inference, as you
          would expect, they tend to give you sub-par thinking because they are
          obviously trying to conserve as much inference as possible.
          
          I've had some success turning my macbook M1 pro into a heating pad
          with Qwen 3.6 35B A3B MTP.  Trying to use Gemini models "locally"
          resulted in a similar "short shrift" of effort resulting in mistakes
          and lots of turns.  The reports of Fable being relentlessly
          "proactive" shows you can go the other direction as well, if you have
          strong enough branding and effective invoicing.
       
            ignoramous wrote 22 hours 32 min ago:
            > I've kind of given up on the routers for "free" inference, as you
            would expect, they tend to give you sub-par thinking because they
            are obviously trying to conserve as much inference as possible.
            
            Xiaomi MiMo ($6/mo: [1] ) & Alibaba Qwen ($50/mo: [2] ) have
            generous limits on fixed subscriptions.
            
  HTML      [1]: https://platform.xiaomimimo.com/token-plan
  HTML      [2]: https://www.alibabacloud.com/en/campaign/ai-scene-coding
       
              MaKey wrote 20 hours 57 min ago:
              So does Opencode Go ($10/mo: [1] ) for DeepSeek v4 Flash and MiMo
              2.5.
              
  HTML        [1]: https://opencode.ai/go
       
                apitman wrote 19 hours 34 min ago:
                That looks pretty nice. How does it compare cost-wise to just
                using OpenRouter?
       
                  arcanemachiner wrote 19 hours 16 min ago:
                  The Go plan essentially gives you $50 of inference for $10
                  per month ($5 for the first month).
       
                    ignoramous wrote 19 hours 6 min ago:
                    $60/mo currently: [1] Their limits are staggered: 5h (max
                    $12), weekly ($30), monthly ($60).
                    
  HTML              [1]: https://opencode.ai/docs/go/#usage-limits
       
            mft_ wrote 1 day ago:
            Tangent: did the MTP help you at all? I’ve tested that model back
            to back on my M1 Max MBP and the MTP version was actually
            marginally worse. I wonder if I didn’t use the right settings,
            although I tried several based on the obvious sources.
       
            WalterGR wrote 1 day ago:
            > The reports of Fable being relentlessly "proactive"
            
            For the curious: [1] - “Claude Fable is relentlessly
            proactive”.
            
  HTML      [1]: https://news.ycombinator.com/item?id=48498573
       
          dcreater wrote 1 day ago:
          I really hope we stop using the term "Chinese models". It has this
          air of Negative connotation. It's the equivalent of calling cars
          Japanese, which people used to do but now is almost entirely
          meaningless. You just call them Toyota, Honda, Lexus etc.
       
            hootz wrote 23 hours 37 min ago:
            For me, it has a positive connotation! In my experience, Chinese
            Model means cheaper, but still quite effective model you can use
            for millions of tokens without burning your entire wallet in
            seconds. That's why I get more excited over a Chinese model release
            over American models.
       
            odiroot wrote 23 hours 45 min ago:
            Japanese cars is actually a positive qualifier. I'd say anything
            Japanese motor-powered.
       
              ffsm8 wrote 21 hours 44 min ago:
              Maybe he's just from an alternative universe. Chinese model isn't
              negative either after all.
       
            sroerick wrote 1 day ago:
            I don't know, I tried using one of the Chinese models and it was
            VERY quick to scan my entire home dir, so maybe your threat surface
            is a little different than mine
       
              fooker wrote 23 hours 7 min ago:
              Models can't scan anything.
              
              They return instructions for you to do something, and you or a
              script you permit chooses to execute what the model tells you and
              return the result to the model.
       
            unethical_ban wrote 1 day ago:
            No thanks.
            
            The term seems to have the connotation of "competitive at 1/10 the
            price of Claude", so I don't see the problem.
            
            It's not Harbor Freight Chinese (and heck even they have decent
            stuff sometimes now too).
            
            You don't think people still talk about Japanese cars as a
            distinction in quality from US or European ones?
       
            esafak wrote 1 day ago:
            I don't think "Chinese" is pejorative in this context any more than
            "American" is. They are one of the two ecosystems. What's wrong
            with saying "Japanese cars" today?
       
              kennywinker wrote 1 day ago:
              > What's wrong with saying "Japanese cars" today?
              
              Only that it’s a fairly meaningless grouping. When japan first
              entered the car market in north america there might have been
              some commonality, but now what characteristics do they share that
              some american cars don’t have? They’re not even imported a
              lot of the time.
              
              Given that, it does start to feel tinged with racism if someone
              insists on grouping things together that don’t really belong
              together.
              
              As for Chinese LLMs, the term doesn’t “feel” pejorative to
              me - but i also don’t see a totally clear set of attributes
              they share. Not all are open-weight. Some are small and can be
              run on consumer hardware, some are huge. They even have a variety
              of answers to what happened june 3rd 1989
       
                kube-system wrote 19 hours 41 min ago:
                > When japan first entered the car market in north america
                there might have been some commonality, but now what
                characteristics do they share that some american cars don’t
                have?
                
                They're unique in that they even make a regular passenger car. 
                American manufacturers only make SUVs and a couple of
                sports/luxury cars.  They basically gave up because the
                Camry/Corolla/Accord/Civic ate their lunch.
                
                The cheapest sedan you can get from an American brand is the
                Cadillac CT4.
       
                antonvs wrote 22 hours 7 min ago:
                > but now what characteristics do they share that some american
                cars don’t have?
                
                Better overall design?
       
                Brendinooo wrote 1 day ago:
                > now what characteristics do they share that some american
                cars don’t have?
                
                Typically the answer is "reliability", which is a positive
                trait, which makes the original callout about negative
                connotations very odd to me.
       
                  overfeed wrote 1 day ago:
                  Chinese AI models also share a positive trait: they offer
                  more bang for the buck.
       
              dcreater wrote 1 day ago:
              Sadly there is a pejorative context. The constant us, the free
              world vs China, the evil Soviets rhetoric from every major news
              establishment and executive creates that negative view
       
                fuck_google wrote 1 day ago:
                On the other hand the Trump administration has successfully
                managed to make Chinese seem better than American, so there
                might not be that much of a pejorative context any more..
       
                  antonvs wrote 22 hours 5 min ago:
                  You're right, but the bias in the US certainly persists.
                  "China = bad" is an assumption that many people still make
                  without any self-reflection about the ways in which the US is
                  now at least as bad.
       
            jdw64 wrote 1 day ago:
            You are right. I agree. It may seem like a kind of bias, but I
            hadn't thought of that part. Thank you for pointing out my bias.
       
              theanonymousone wrote 1 day ago:
              "You're absolutely right"?
       
                jdw64 wrote 1 day ago:
                "You hit the nail on the head"    LOL
       
          onlyrealcuzzo wrote 1 day ago:
          In my experience, there's little difference between implementing
          individual functions between frontier models and SotA ~30B param 
          models.
          
          Once you have a coherent design (the hard part), you can feed it to a
          pretty small model and get basically the same quality.
          
          They'll not one-shot, but they're faster and cheaper, so it still
          works out in your favor.
          
          Plus you can do it locally...
       
            jdw64 wrote 1 day ago:
            I have a similar experience. However, when including code review, I
            think the GPT model is the most impressive
       
        giancarlostoro wrote 1 day ago:
        Reading their modified license terms, it cracks me up, because they've
        basically remade the MIT to be the MIT + the one clause that the BSD
        used to have, which didn't care about MAU or revenue, if you used it in
        a product, they asked you to 'advertise' them basically. Honestly, it's
        a reasonable request.
       
          skrtskrt wrote 23 hours 19 min ago:
          It seems tacked on pretty quickly - I would have expected they try a
          little more legalese regarding what counts as a "user interface".
       
          WalterGR wrote 1 day ago:
          > they asked you to 'advertise' them basically.
          
          To be clear, the “advertising” clause just requires you to
          disclose that you use the thing somewhere in the product, such as
          credits in an “About” section.
       
            giancarlostoro wrote 22 hours 16 min ago:
            I call it the advertising clause, because I remember still in the 2000s
            seeing an Apple ad which at the end of it showed "Unix" or
            something like that on it, and I remembered that was one of the BSD
            license requirements, or maybe Apple just did it also just to
            proudly boast using Unix.
       
              pocketarc wrote 4 hours 51 min ago:
              They were definitely proudly boasting being a certified UNIX OS
              (and macOS still is), it goes deeper than just software licenses:
              
  HTML        [1]: https://www.opengroup.org/openbrand/register/
       
              WalterGR wrote 21 hours 42 min ago:
              Hmm… I may be confusing the following clause from the “new”
              BSD license with the advertising clause from the original BSD
              license.
              
              > 2. Redistributions in binary form must reproduce the above
              copyright notice, this list of conditions and the following
              disclaimer in the documentation and/or other materials provided
              with the distribution.
              
              The 2-clause BSD license omits even that.
       
          htrp wrote 1 day ago:
          This is the cursor callout.
          
          Don't make us shame you into disclosure
       
            maherbeg wrote 1 day ago:
            Cursor had a specific licensing agreement that allowed them to
            brand it how they want.
       
              ignoramous wrote 22 hours 46 min ago:
              > Cursor had a specific licensing agreement...
              
              Cursor had an "agreement" with Fireworks.ai, which apparently
              allowed them to RL Composer 2 atop Kimi Base 2.5 without
              attribution: [1] / [2] Composer 2 performed differently on evals
              than Moonshot.ai's coding models: Cursor claims theirs is better
              than Claude Opus 4.6: [3] / [4] . And, per Lee Robinson (Cursor
              employee), it is very likely Cursor builds its own foundational
              model for Composer 3.
              
  HTML        [1]: https://x.com/Kimi_Moonshot/status/2035074972943831491
  HTML        [2]: https://archive.vn/CcdkI
  HTML        [3]: https://x.com/fynnso/status/2034706304875602030
  HTML        [4]: https://archive.vn/bVtik
       
            codemog wrote 1 day ago:
            Shaming others when all AI is trained off scraped content and code
            huh? Many of those sources either breaking ToS or being illegal,
            such as Anna’s Archive. Bold move. And Chinese models in
            particular have been accused of distilling off American models.
            
            Don’t you know there’s no honor among thieves?
       
            7734128 wrote 1 day ago:
            Wasn't the end of that story that Cursor had a non-disclosure
            licence, so they had not done anything wrong towards Moonshot?
       
              Maxious wrote 1 day ago:
              Moonshot licenced it to Fireworks AI who licenced it to Cursor.
       
            giancarlostoro wrote 1 day ago:
            Ah is that what it is? I don't use Cursor, never saw it as being
            relevant to me, but would not surprise me.
       
              schmorptron wrote 1 day ago:
              Cursor's composer models are finetuned kimi
       
                varispeed wrote 1 day ago:
                They are unusable (unless you want to deliberately destroy your
                codebase). So if Cursor's models are Kimi based, then well.
                I'll skip them altogether.
       
                  ok_dad wrote 23 hours 59 min ago:
                  I only use composer 2.5 day to day and it works fine with
                  human review.
       
                  esskay wrote 1 day ago:
                  Composer 1.x was poor. The new one is a totally different
                  beast and absolutely fine for day to day.
       
                  jmcqk6 wrote 1 day ago:
                  I'm using Composer extensively, and it works great for me. 
                  Your experiences are not universal.
       
                  vidarh wrote 1 day ago:
                  Kimi works great in their CLI, but their CLI has a number of
                  workarounds for quirks of their models, including detecting
                  when the model gets into a loop, and reverting to a
                  checkpoint but letting the model compose a "message" to its
                  past self (search their CLI for "BackToTheFuture"...) It
                  doesn't work so well in a harness that doesn't take those
                  quirks into account.
       
                  qingcharles wrote 1 day ago:
                  They're not unusable, they're just bad when compared with all
                  the real frontier models.
       
                  Bnjoroge wrote 1 day ago:
                  They are far from unusable. They work great for 80-90% of a
                  typical full stack dev's work. A lot less useful for more
                  niche stuff.
       
                  bel8 wrote 1 day ago:
                  I wouldn't skip at least testing the original. Model
                  distilling done by Cursor could be the culprit.
       
        RobertPelloni wrote 1 day ago:
        insanely great!
       
        goldenarm wrote 1 day ago:
        Benchmark geometric mean
        
        - GPT-5.5: 62.7%
        
        - Opus 4.8: 62.2%
        
        - Kimi K2.7 Code: 56.3%
        
        - Kimi K2.6: 48.2%
       
          lostmsu wrote 1 day ago:
          Would be nice to have 5.2 and 4.6 for comparison.
       
        jkwang wrote 1 day ago:
        This maps to what I'm seeing in practice. The gap between demo and
        production is consistently underestimated, especially around error
        handling and edge cases.
       
        fractalf wrote 1 day ago:
        How is 2.7 a thing _now_ ? it's not even mentioned on moonshot's
        webpage..
       
          cassianoleal wrote 1 day ago:
          It's not 2.7. It's 2.7-Code, and it's 2.6 token-optimised for coding.
          
  HTML    [1]: https://platform.kimi.ai/docs/guide/kimi-k2-7-code-quickstar...
       
        jackdoe wrote 1 day ago:
        I think there is some threshold after which "best" model doesn't
        matter, we are not that far from it. Fable now is really good, in a
        year or so, if Kimi catches up, even if Fable6 is much better, I think
        I will use kimi at 1/10th of the price.
        
        I said that about opus 4.5 at the time, thinking "this is so good, in
        6-12 months the Chinese models will be as good and cheap, I will use
        them", but I was wrong.. I pay premium for opus4.7/8 and Fable.
        
        But at some point, it will just do the thing you want it to do, and
        then the race to the bottom will start.
        
        Now that Chinese companies have access to some very good Fable tokens,
        I hope it speeds up the race.
       
          apitman wrote 19 hours 17 min ago:
          I think the next frontier for competition is speed. Instead of
          constantly context-switching between multiple agents that I have
          working on various tasks, I want a single agent that can rip through
          any prompt in a few seconds, so I can stay in flow on a single task.
       
          wolttam wrote 1 day ago:
          Depending on who you are and how you use these models, we're already
          at this point
       
            xendo wrote 20 hours 11 min ago:
            Exactly, for long-running vibe-coded stuff where I don't care about
            quality, getting a big and smart model is the only option. But for
            high quality changes where I need to have control and understand
            everything, where I do everything in small chunks - I can use a
            basic model like Sonnet.
       
          Zoadian wrote 1 day ago:
          price/token isn't the only thing relevant. if you have to ask the AI
          again, it'll cost you more than when it gets things right in the
          first place.
          
          so better models may still be cheaper even if the price per token is
          higher.
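
          A minimal sketch of that argument, assuming each attempt succeeds
          independently with a fixed probability (so the expected number of
          attempts is 1 / success rate). The prices, token counts and
          success rates below are hypothetical, picked only to show where
          the crossover can happen; they are not measurements of any model.

```python
def expected_cost_per_success(price_per_mtok: float,
                              tokens_per_attempt: int,
                              success_rate: float) -> float:
    """Expected spend (USD) until the first successful attempt."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    cost_per_attempt = price_per_mtok * tokens_per_attempt / 1_000_000
    # With independent attempts, the expected number of tries is 1 / success_rate.
    return cost_per_attempt / success_rate


# Hypothetical numbers: a cheap model that often needs re-asking
# versus a pricier one that usually gets it right the first time.
cheap = expected_cost_per_success(price_per_mtok=3.0,
                                  tokens_per_attempt=200_000,
                                  success_rate=0.25)
pricey = expected_cost_per_success(price_per_mtok=10.0,
                                   tokens_per_attempt=200_000,
                                   success_rate=0.9)

print(f"cheap model:  ${cheap:.2f} per solved task")
print(f"pricey model: ${pricey:.2f} per solved task")
```

          With these made-up inputs the pricier model comes out slightly
          cheaper per solved task; with a higher success rate for the cheap
          model the result flips.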
       
            jackdoe wrote 1 day ago:
            yes, that is my point, but at some point, better is unmeasurable,
            and both the better and the not-as-good produce similar result, and
            then you pick the one with 1/10th of the price
       
        shreedx wrote 1 day ago:
        I would really love to know if anyone has any experience with something
        like opencode + Kimi K2.6/2.7 now compared to Claude Code. What is
        better, what is worse, what is the cost comparison. I am currently
        paying $100 for the 5x Max plan, but Fable is running through the usage
        limits quite drastically and I cannot really say it's night and day
        compared to Opus. Also, I use this mostly for my side projects, so the
        $100 bill is quite noticeable. I definitely don't want to pay more.
       
          jwbron wrote 20 hours 30 min ago:
          I'm using Claude code + (a patched) litellm proxy + openrouter + Qwen
          3.7 max/kimi k2.6/deepseek v4 pro. The only feature that doesn't work
          is webfetch and web search, which I've replaced with the ddg MCP and
          a web fetch/search pre hook to redirect the agent. Memory, caching,
          and everything else works fine.
          
          Qwen comes close to opus for planning but fable is clearly superior.
          Results for kimi and deepseek are pretty much indistinguishable from
          opus for coding if opus writes the plan. The biggest difference is
          output cadence. Kimi for example thinks for a long time then quickly
          outputs a lot of text.
          
          I'm now testing out fable for research and planning and deepseek v4
          flash for coding. I'm guessing results will be pretty similar to opus
          + deepseek v4 pro and costs should be lower overall.
       
          csomar wrote 21 hours 34 min ago:
          The best is GLM (though it's not as cheap as DeepSeek or Kimi) and
          use it with Claude Code.
       
          solarkraft wrote 1 day ago:
          For some reason I never had a good experience with Kimi (via
          OpenRouter) in OpenCode. It would only take a few turns for it to run
          off and mess something up. Terrible instruction following I’d say.
          
          I use DeepSeek V4 Pro now, which works pretty well.
       
          kmike84 wrote 1 day ago:
          I do have this experience. I've used Claude Code (with Opus mostly),
          and then switched to opencode (mostly with Kimi 2.6) for my personal
          projects; it's based on a couple months of use.
          
          Claude Code is better. But Opencode + kimi 2.6 is workable, which is
          big. For bare code writing, if you know what exactly you want, most
          popular models are fine (deepseek, kimi, etc), it feels more or less
          the same as anthropic models.
          
          At the same time, Opus seems to understand my intent way better than
          e.g. deepseek. I need to be much more precise with my prompts when
          using deepseek - it often goes in a wrong direction if I'm lazy. This
          results in a workflow which feels quite a lot different from Claude
          Code.
          
          Kimi is in between - for me it brings back "lazy prompting" workflow,
          and I can trust its plans more than deepseek. It enables a workflow
          similar to Claude Code, it's workable, but it is a bit worse
          everywhere. Smaller context, a bit more errors, decisions are a bit
          worse, recommendations are a bit worse, debugging capabilities are a
          bit worse, etc.
          
          On the usage side, $100 Claude plan is a great value actually. On
          paper, per-token kimi is way cheaper, but Claude subscriptions are
          heavily subsidized - you get much more tokens than $100 can buy you.
          So, in the end, opencode + kimi vs claude code could be of a similar
          cost, for similar usage patterns. Deepseek can be cheaper, and it has
          insanely cheap cached tokens, but experience may vary - depending on
          your habits, you may need to adjust how you work, coming from claude
          code.
          
          I'd say for side projects something like $10 Opencode Go plan + $10
          of extra DeepSeek v4 credits (e.g. on OpenRouter) can be very
          workable.
       
            irthomasthomas wrote 22 hours 33 min ago:
            according to this opencode and cursor cli perform better than
            claude code:
            
  HTML      [1]: https://x.com/kunchenguid/status/2065345999682568593
       
              port11 wrote 3 hours 8 min ago:
              The analysis at the bottom directly contradicts the statement.
       
            predkambrij wrote 23 hours 44 min ago:
            In my experience the claude/codex $20 plans are even more
            subsidized, so running on sonnet or gpt5.4 again gives you more
            usage.
       
              port11 wrote 3 hours 14 min ago:
              I wonder if they’re truly subsidised or if the API pricing is
              just massively inflated. Genuine doubt.
              
              My CC stats show me using almost 300$ of Sonnet tokens on the 20$
              plan. Is Anthropic willing to forgo 93% of the profit? A bit less
              than that but API is priced, say, 3x what it should be?
              
              CC is great, but Sonnet (my main model) isn’t worth the API
              pricing. The cheap-but-good models arrive at similar results for
              much less (for context I’m using Aivo with CC).
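
              The arithmetic behind those two scenarios, as a rough sketch.
              The $20 and $300 figures are the ones quoted above; the 3x
              markup is the comment's own hypothetical, not a known number.

```python
plan_price = 20.0        # USD per month, as quoted in the comment
api_list_value = 300.0   # USD of Sonnet tokens at API list price, as quoted

# Scenario 1: treat the API list price as the true value of the tokens.
discount_vs_list = 1 - plan_price / api_list_value

# Scenario 2 (hypothetical): the API is priced at 3x what it "should" be.
assumed_markup = 3.0
fair_value = api_list_value / assumed_markup
discount_vs_fair = 1 - plan_price / fair_value

print(f"discount vs. list price: {discount_vs_list:.0%}")
print(f"discount vs. assumed fair price: {discount_vs_fair:.0%}")
```

              Even under the 3x assumption the plan would still be a steep
              discount, which is why neither reading can be ruled out from
              usage stats alone.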
       
                danny_codes wrote 1 hour 38 min ago:
                Anthropic is making money from people who under-utilize their
                subscriptions, and presumably by sneaky throttling or
                not-sneaky throttling power users. Currently they are in an
                adoption race. Whether being first will actually let them "win"
                the market (and the market is a bit ill-defined) is unclear.
       
            Bnjoroge wrote 1 day ago:
            This has generally been my experience as well, but I think the main
            reason for claude code being better at understanding intent is
            their massive system prompt.
       
            htrp wrote 1 day ago:
            >At the same time, Opus seems to understand my intent way better
            than e.g. deepseek. I need to be much more precise with my prompts
            when using deepseek - it often goes in a wrong direction if I'm
            lazy. This results in a workflow which feels quite a lot different
            from Claude Code.
            
            how much of that is Opus injecting prior conversations from memory?
       
              jwbron wrote 20 hours 33 min ago:
              I'm using Claude code + (a patched) litellm proxy + openrouter +
              Qwen 3.7 max/kimi k2.6/deepseek v4 pro. The only feature that
              doesn't work is webfetch and web search, which I've replaced with
              the ddg MCP. Memory, caching, and everything else works fine.
              
              Qwen comes close to opus for planning but fable is clearly
              superior. Kimi and deepseek are pretty much indistinguishable
              from opus for coding if opus writes the plan.
              
              I'm now testing out fable for research and planning and deepseek
              v4 flash for coding. I'm guessing results will be pretty similar
              to opus + deepseek v4 pro and costs should be lower overall.
       
              kitchi wrote 1 day ago:
              Almost none of it, if you're using Claude Code. Until recently
              Claude only had the option of retaining memory across
              conversations for the desktop app.
              
              I almost never use the desktop app, I have maybe 2-3
              conversations over the last year that have nothing to do with my
              job. Opus (and now Fable) genuinely do seem to "understand" what
              you intend based off what you're explaining a lot better than
              other models I've tried.
              
              Gemini gets close in some cases, but it falls over in the actual
              implementation sometimes. I haven't tried Kimi yet but MiMo isn't
              too shabby either.
       
          trollbridge wrote 1 day ago:
          I am extremely happy with ohmypi, but you could use OpenCode or just
          keep using Claude Code!
          
          DeepSeek-V4-Pro is adequate plus use DS4-Flash for tasks or other
          small activity you’d use Haiku or Sonnet for. Go sign up with $10
          prepaid.
          
          OpenCode Go - go sign up with $5 for a month and use Qwen-3.7-Max for
          design/plan/architecture or difficult troubleshooting. Feels closer
          to Opus 3.6 or 3.7 than DeepSeek, closest I’ve found.
          
          OpenAI Codex, $20 a month plan, use GPT-5.5 via API for the same
          design/plan/architecture/troubleshooting/author commits. (You can
          also pay $100 and cut and paste really difficult problems into chat
          with the GPT-5.5-Pro model.)
          
          Xiaomi MiMo-2.5-Pro, find a friend to give you a $2 referral code,
          you get 72 cents free. Same pricing as DeepSeek. Somewhere between
          Sonnet and Opus, quite capable. Apply for the UltraSpeed beta too.
          
          You can switch in and out from these models on the fly in OpenCode or
          ohmypi and simply find the one that feels best to you. I use CodexBar
          to watch consumption in near real time.
          
          For a casual user or someone new to programming, Cursor’s $20 plan
          is an excellent start with Composer-2.5 and Composer-2.5-Fast. You
          get an API allowance too you can use to access Opus-4.x or
          GPT-5.5-Pro from OpenCode or ohmypi in addition to Cursor itself.
          
          Finally, if you use Grok or Twitter, SuperGrok at $30 a month has a
          good vision model, which I used for automated testing of front ends.
          I’m migrating to locally-run Qwen-3-VL on a commodity Mac, though.
          If you’re less technical unreach makes hosting local models on a
          Mac easy.
          
          If you have a powerful GPU like an RTX 5090, try Qwen-3.6 locally on
          that too. Use ollama or llama-swap which is fairly easy to use.
          
          I have not tried new Kimi yet but we have been able to keep our costs
          at or below $200 a month per employee with a team of 3 professional
          developers, 1 graphic designer who uses a lot of Midjourney and Grok
          Imagine now driven from workflows she made herself in ohmypi, and 1
          nontechnical user (account manager / project manager) who uses ohmypi
          to help her gather requirements and track implementation of them.
          With a tiny bit of effort we could get that number closer to $75 per
          employee per month.
       
            monksy wrote 18 hours 33 min ago:
            I just switched from Llama.cpp to Llama swap with the help of
            codex. It was great.
            
            I need to try the DSv4 stuff sometime.
       
            upcoming-sesame wrote 1 day ago:
            Deepseek-V4-Flash-Free on Opencode is what I use most of the time,
            for simple tasks. Such a good model to give for free (assuming
            you're okay with them harvesting your data)
       
            odiroot wrote 1 day ago:
            > I am extremely happy with ohmypi, but you could use OpenCode or
            just keep using Claude Code!
            
            What's the benefit of using OMP over OpenCode?
            
            Just the sheer amount of options in OMP overwhelmed me. 
            But I also use both via ACP in Zed so the CLI itself doesn't matter
            much.
       
              greenavocado wrote 11 hours 48 min ago:
              I ditched Opencode for OMP. It's more feature packed, well put
              together, and gives me better results with some steering. Love it
       
              apitman wrote 19 hours 21 min ago:
              OMP is a fork of Pi[0], which is my preferred harness. Feels
              solid and minimal. I don't even use any extensions, skills, or
              modifications. Usually don't even use an AGENTS.md. Just create a
              small spec.md and/or plan.md for most experiments.
              
              [0]:
              
  HTML        [1]: https://pi.dev/
       
                greenavocado wrote 11 hours 45 min ago:
                Almost exactly the same here but I maintain a large committed
                design.md and a never committed plan.md
       
            qingcharles wrote 1 day ago:
            Also, if you do have SuperGrok, forget using Grok, they are giving
            you Composer 2.5 in Grok Build.
       
          nobleach wrote 1 day ago:
          I use Claude at work and Kimi for side projects. My org has LiteLLM
          and Kimi 2.5 enabled but it rarely works, so Claude and GPT are my
          main tools. I actually enjoy Kimi more as it feels like a dev in a
          job interview. Watching it reason through problems is a lot like I
          tend to explain things during whiteboarding sessions. The number of
          times it says, "wait", is just funny. Claude on the other hand is
          much more like an employee (or team of employees) that already know
          they have the job. It doesn't do a ton of explanation up front. (you
          can dig into processes if you want). It just goes along, asking
          questions only when it needs... and then delivers a comprehensive
          report or plan. OpenCode is a better harness. I don't have a direct
          comparison on costs, as I haven't tried to do the exact same prompt
          on both models. I can say that I recently had Kimi generate a wrapper
          around libpq for the ZenC programming language: [1] and it took about
          an hour or so and cost around 4 dollars.
          
  HTML    [1]: https://github.com/nobleach/zenc-postgres
       
          re-thc wrote 1 day ago:
          The Kimi problem is it doesn’t follow instructions and goes off
          track often.
          
          Other than that it’s pretty decent (for the price).
       
            Bnjoroge wrote 1 day ago:
            Yup. I’m hoping this variant fixes these issues.
       
            nullbio wrote 1 day ago:
            Sounds like it was distilled from Claude. I don't understand the
            appeal of an agent that does whatever it wants.
       
              miroljub wrote 1 day ago:
              If you ask Claude in Chinese to introduce itself, it will claim
              it's Kimi :)
       
                msdz wrote 1 day ago:
                > If you ask Claude in Chinese to introduce itself, it will
                claim it's Kimi :)
                
                That's a funny anecdote, but I'm not able to reproduce.
                Where/how/when did you get this, or hear about it?  
                It might've been patched by now, at least that's the feel I get
                from my limited testing.
                
                Using bare aichat [1] with no system prompt and no temperature
                nor top_p (and I'm truncating the response after the first line
                that contains the name the model gave, because the point has
                been made clear by then), and with the same prompt (approx.
                "Introduce yourself!") every time:
                
                Claude Sonnet 4.5:
                
                > 请做个自我介绍!
                
                你好!我是Claude,一个由Anthropic公司开发的AI助手。
                […]
                
                Claude Haiku 4.5:
                
                > 请做个自我介绍!
                
                # 你好!
                
                我是 *Claude*,一个由 Anthropic 公司开发的 AI
                助手。
                
                Claude Opus 4.5:
                
                > 请做个自我介绍!
                
                # 你好!
                
                我是 *Claude*,由 Anthropic 公司开发的 AI 助手。
                
                Claude Opus 4.6:
                
                > 请做个自我介绍!
                
                # 你好! 我是 Claude
                
                Claude Opus 4.7:
                
                > 请做个自我介绍!
                
                你好!我是 Claude,由 Anthropic 公司开发的人工智能助手。很高兴认识你!
                
                Claude Opus 4.8:
                
                > 请做个自我介绍!
                
                你好!我是 Claude,由 Anthropic 公司开发的人工智能助手。
                
                Claude Fable 5:
                
                > 请做个自我介绍!
                
                # 自我介绍
                
                你好!很高兴认识你!
                
                我是 *Claude*,由 Anthropic 开发的 AI 助手。 [2]
                
                I don't see a Kimi mention, unfortunately. :-) [1] 
                
                [2] This model really is noticeably more verbose even with
                supposed-to-be-brief responses huh, lol
                
  HTML          [1]: https://github.com/sigoden/aichat
       
            reactordev wrote 1 day ago:
            This. It will try to fix and refactor things that don’t need
            fixing because it gets stuck trying to solve the problem at hand.
       
          ramon156 wrote 1 day ago:
          I can only talk about GLM 5.1 which is roughly at sonnet 4 levels
          imo.
          
          It's good, does most tasks well that I throw at it, but will fail at
          anything cognitive/complex. It gets stuck often. It costs ~6$ a month
          though
       
            jeremyjh wrote 1 day ago:
            This was my experience using GLM 5.1 in Claude Code but it works
            far better in OpenCode, I’d really like to understand why. I
            think it’s a bit stronger than Sonnet 4.6.
            
            I use the oh-my-openagent planning system and haven’t used
            vanilla OpenCode enough to know how much that is contributing.
       
              miroljub wrote 1 day ago:
              The answer is easy, CC is bug for bug optimized for Anthropic
              models. They don't even test it with other models, let alone
              provide support for all small compatibility quirks of different
              provider implementations.
              
              On the other hand, Opencode, Pi agent and other open source tools
              offer much better support for all models, including open source.
       
        343rwerfd wrote 1 day ago:
        I think any new model that is priced above Deepseek's per-token price
        without being demonstrably better than Deepseek v4, by maybe 20-30%,
        is almost automatically relegated to a low-use model (maybe for
        planning).
       
          0xbadcafebee wrote 1 day ago:
          DeepSeek v4 Pro is not actually that good a model compared to GLM 5.1
          and Kimi K2.6. It's an okay coder/thinker for the price.
       
            bel8 wrote 23 hours 53 min ago:
            How so? In my experience trying these models using opencode Go,
            DeepSeek is superior to GLM 5.1.
            
            If anything, DS4 has 1 million context window, while GLM 5.1 has
            200K.
            
            There are also benchmarks comparing the two:
            
  HTML      [1]: https://artificialanalysis.ai/models/comparisons/deepseek-...
       
          giancarlostoro wrote 1 day ago:
          Is Deepseek just eating cost or are people able to host their open
          models for comparable costs?
       
            natrys wrote 1 day ago:
            These things enormously benefit from economies of scale. I am
            fairly certain their margins might be low but they don't actually
            sell API at loss, however that doesn't mean your cost footprint
            would be anywhere as low.
       
            rsanek wrote 1 day ago:
            Likely CCP-subsidized
       
            trollbridge wrote 1 day ago:
            Other people are hosting it at prices in the same order of
            magnitude. Xiaomi recently matched DeepSeek’s pricing.
       
            psittacus wrote 1 day ago:
            If openrouter is to be trusted, the cheapest offers that are not
            from Deepseek itself are:
            
            - twice as expensive on the output (1.52 vs 0.87)
            
            - six times as expensive on the input (0.33 vs 0.05)
            
  HTML      [1]: https://openrouter.ai/deepseek/deepseek-v4-pro?sort=price#...
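
            For reference, the ratios implied by those quoted prices can be
            checked in a couple of lines. The figures are copied from the
            comment; the unit (assumed here to be USD per million tokens) is
            a guess based on how OpenRouter usually lists prices.

```python
# Prices as quoted in the comment; unit assumed to be USD per million tokens.
third_party = {"input": 0.33, "output": 1.52}
deepseek = {"input": 0.05, "output": 0.87}

# How many times more expensive the cheapest third-party offer is.
ratios = {kind: third_party[kind] / deepseek[kind] for kind in deepseek}

for kind, ratio in ratios.items():
    print(f"{kind}: {ratio:.2f}x")
```

            So "twice" is a slight round-up for output (about 1.75x), while
            the input gap is a bit over six times.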
       
            re-thc wrote 1 day ago:
            They focused on caching and other optimizations.
       
        bgins wrote 1 day ago:
        I am still very new to the open-weight/source models. If anyone is
        using them full-time, I’d really love to hear about the setup and how
        they perform, as I am considering moving my org off Anthropic products.
       
          polski-g wrote 1 day ago:
          I used glm5/5.1 for 60 days. Certainly better than Sonnet 4.6, not as
          good as Opus or GPT.
          
          Use DCP or Magic Context plugin in OpenCode to keep the context below
          160k and you're fine.
       
          sdesol wrote 1 day ago:
          I created this and I would say glm-4.7 accounts for 80% of the code
          in [1]. If you look at a file like [2]
          
          you can see that I attribute the models used.  What I found was 4.7
          was not very good at `go` code which was why you started to see
          `Gemini 3 Flash` in the attributions.
          
          4.7 is what Cerebras provides and for me, speed in iterations is a lot
          more important. Having played around with MiMo v2.5.0-Pro, I am 100%
          sure it could have done what Gemini 3 Flash did.
          
          There were a few points where I was stuck and needed Sonnet to
          explain things to me, but I think the dirty secret that Anthropic and
          OpenAI won't tell you is, if you know how to code, the models are
          honestly good enough.
          
          Based on my experience with MiMo and what others are saying about GLM
          5.1, we are now in a hardware race. The Chinese models are a 100%
          drop-in replacement for Claude if you know how to program but want AI
          to help amplify what you know. What I will consider now is which
          provider can offer the fastest inference.
          
          MiMo-v2.5.0-Pro-Ultraspeed is really good at generating good results
          quickly and burning your money as fast.
          
  HTML    [1]: https://github.com/gitsense/gsc-cli
  HTML    [2]: https://github.com/gitsense/gsc-cli/blob/main/internal/cli/r...
       
          marcyb5st wrote 1 day ago:
          Anecdotal, but here's my experience.
          
          For personal stuff I use forgecode with openrouter. Firstly,
          forgecode is a much better harness than Claude Code (IMHO).
          
          Anyway, regarding the models, my experience is that there is not much
          difference in terms of quality, but the cost difference is insane. At
          least for how I use agents. Yesterday's example is the following: I
          am developing a small DSL for search across complex technical
          documents. I wanted to add a small operator to it and thought I'd
          give fable a spin. It burned through 13 USD and while it delivered
          the solution it wasn't objectively better than what Deepseek v4 did
          for 1.7 dollars (same exact task because I was curious).
          
          For full disclosure, I ask agents for piecemeal stuff. Like in the
          DSL case, I designed the operators and then asked agents to implement
          them one by one. Probably if I asked to design the whole thing
          starting from these complex documents Fable would shine, but every
          time I try to give agents broader scope tasks they burn through
          millions of tokens and generate questionable code, which I then have
          to spend time familiarizing myself with.
       
            sroerick wrote 1 day ago:
            I'm making DSLs a lot as an architecture pattern also. I'd be
            curious to know what stack you're using for this and how you're
            approaching it
       
              marcyb5st wrote 22 hours 54 min ago:
              I am getting familiar with Rust and so I have been playing around
              with Quoth ( [1] ) for now.
              
              It is very basic and I am no DSL expert, but my idea was to build
              a graph from those complex documents (maintenance manuals) and use
              that to decide what tools can be used for a given part on a given
              equipment in a given situation. If there is a path from A to Z it
              means you can use that tool given the circumstances. Basically
              the DSL is about pruning the graph as you specify things. I could
              have very well done without, but it is a fun project to try out
              rust, so I said, why not :)
              
  HTML        [1]: https://github.com/sam0x17/quoth
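
              The "prune the graph, then check for a path" idea can be
              sketched roughly as below. This is only an illustration in
              Python, not the Rust/Quoth implementation described above: the
              node names, the edge conditions and the `prune`/`reachable`
              helpers are all hypothetical.

```python
from collections import deque

# Hypothetical edges: (from_node, to_node, conditions required to traverse).
EDGES = [
    ("tool:torque_wrench", "part:bolt_m8", {"equipment": "pump"}),
    ("part:bolt_m8", "ok", {"situation": "cold"}),
    ("tool:hammer", "part:bolt_m8", {"equipment": "press"}),
]


def prune(edges, facts):
    """Drop edges whose conditions conflict with the facts given so far.

    A condition on a key that has not been specified yet does not prune.
    """
    return [(a, b) for a, b, cond in edges
            if all(facts.get(key, value) == value for key, value in cond.items())]


def reachable(edges, start, goal):
    """Breadth-first search: is there any path from start to goal?"""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, []).append(b)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


facts = {"equipment": "pump", "situation": "cold"}
pruned = prune(EDGES, facts)
can_use_wrench = reachable(pruned, "tool:torque_wrench", "ok")
can_use_hammer = reachable(pruned, "tool:hammer", "ok")
print(can_use_wrench, can_use_hammer)
```

              Each fact the user specifies removes the edges that contradict
              it, so the reachability check only ever runs on what is left.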
       
          kamranjon wrote 1 day ago:
          I have been using deepseek v4 flash as my main model for everything
          ever since dwarf star came out. I run it on my M4 Max MacBook Pro
          with 128gb of memory. I run it usually as a server and connect to it
          over tailscale with my coding machine and use the Pi coding agent.
          It’s a big leap over using the Qwen models though it doesn’t have
          vision - so I still will run those when I use vision. GLM 4.7 flash
          was my previous go to for coding but I’ve completely switched to
          deepseek for all non-vision things.
       
          trollbridge wrote 1 day ago:
          Qwen 3.6 seems to be the strongest local model, works OK on an RTX
          5090 or a > 32GB Mac.
       
          DragonBooster wrote 1 day ago:
          These models have open weights, but at the moment most flagship
          models are practically accessible only through third-party model
          providers. The main exception is models in the ~30B parameter range,
          which can still be run on consumer-grade GPUs. That said, even
          consumer GPUs have become increasingly expensive and difficult to
          justify in recent years.
       
            mirekrusin wrote 1 day ago:
            You can definitely go above 30B on consumer hardware – 2x gpus,
            spark, mac, half byte quants etc.
       
          scottcha wrote 1 day ago:
          I use glm5.1 plus pi with a few customized skills and am very happy
          with it. I hadn’t touched my Claude 5x plan for a couple of weeks
          but opened it back up in Claude code when fable was released and did
          a few tasks and still was happy to return to glm/pi.
       
            sebastianconcpt wrote 1 day ago:
            Better than Qwen3.6-35B-A3B-8bit ?
            
            When I tried glm found it way way slower (omlx as runtime)
       
              scottcha wrote 17 hours 17 min ago:
              Yes way better. We host both and while qwen3.6 is over 100tps we
              usually can do glm around that too.
       
          andai wrote 1 day ago:
          I keep trying to switch to the Chinese models, but I keep finding
          myself asking Claude to fix their outputs. (Both functionality and
          style.) So I always end up switching back.[0]
          
          I also keep trying GPT, which is quite solid. Very fast, great at
          debugging. But its code is often overly clever and hurts my brain.
          
          (Maybe fixable with prompting. I tried and it helped the Chinese ones
          a bit. Just tell them to be elegant, like in the old image AI days
          "+good -bad"!)
          
          For now I do still need my human brain to actually be able to make
          sense of the stuff, and Claude is the only one that consistently
          meets that requirement.
          
          But I am hoping that one of these days, one of the Chinese labs
          figures out the special sauce :)
          
          --
          
          [0] (For smallish edits, though, I am having a great time with
          DeepSeek Flash. Practically unlimited AI on tap! How cool is that.)
       
        yanis_t wrote 1 day ago:
        I was wondering how Anthropic and the like stay competitive when Opus
        ($5 / $25) is 5x more expensive than Kimi K2.6 ($0.7 / $3.4) or other
        Chinese models, while being only marginally better.
        
        My theory is that US enterprises just can't send data to Chinese
        providers, and that's understandable, but is that "the moat"?
       
          selfawareMammal wrote 22 hours 12 min ago:
          Performance. I pay for Opencode but none of the models give me Codex
          performance, so I have to keep my 20€ subscription+ the Opencode
          one
       
          bensyverson wrote 22 hours 19 min ago:
          Part of Anthropic's moat is Claude Cowork & Claude Code. They got
          coders comfortable with CC and enterprise users comfortable with
          Cowork, and both are creating stickiness.
          
          The reality is that $20/$100/$200/mo feels reasonable to a lot of
          people relative to the value they're getting out of Claude, and if
          they switch to something else, there's a risk that it won't be as
          good, and they'll have a new tool to learn.
          
          It's not an insurmountable moat, but don't underestimate the user
          experience. The iPod didn't win because it was the cheapest device or
          the one with the most features.
       
          michaelcampbell wrote 22 hours 20 min ago:
          > while being only marginally better.
          
          It's only marginally better in the things it's actually comparable
          to.  Anthropic's models are MUCH better in many more things; e.g. things
          Kimi/etc. didn't distill.
          
          For those things the difference is like a cliff.
       
            tornikeo wrote 22 hours 13 min ago:
            That's a baseless claim that borderline reads like shilling. Do you
            have any proof of what you wrote there?
       
          gruez wrote 1 day ago:
          Your question relies on the premise that Chinese companies continue
          releasing free models. What's "the moat" for them continuing to do
          that?
       
          LUmBULtERA wrote 1 day ago:
          API token price is one thing, but subscriptions on Claude are a good
          value.    Weirdly everyone says that Claude subscriptions are
          subsidized because of the API price, even though (1) no one actually
          knows Claude's cost of inference, and (2) Chinese providers are also
          able to provide cheap inference, so why do they think Claude can't?
          
          I also wonder if Enterprises have deals for other API pricing that is
          not posted publicly, so all we see is a high API sticker price.
       
            mnicky wrote 20 hours 53 min ago:
            > no one actually knows Claude's cost of inference
            
            There were some rumors stating that their margin is around 70%. So
            they could go much cheaper probably, talking inference only. The
            other thing is R&D cost...
       
            wuliwong wrote 22 hours 9 min ago:
            I only have knowledge of one enterprise deal but there is no
            discount. Which I found surprising.
       
          smoe wrote 1 day ago:
          I reckon right now the Enterprise concern is more FOMO around the AI
          wave and how to retrain or replace up to hundreds of thousands of
          employees. I don't think cost is the main concern right now.
          
          But if AI doesn't lead quickly to vast, large-scale replacement of
          workers as promised, I could definitely see the C-suites and their
          gaggle of consultants starting to ask questions about token pricing.
       
          efromvt wrote 1 day ago:
          I think the perception is that it is not 'only marginally better';
          whether or not you specifically agree, that perceived quality gap
          lets them differentiate on price.
          
          I'd further say that there are probably enough rational actors
          running evals out there that the 'marginally better' is not pure
          vibes for the cases where people are spending lots of money, but I
          only have direct line of sight to some of those eval suites. Maybe
          everyone is irrational and Anthropic is exploiting that!
       
          khuey wrote 1 day ago:
          I think most people who've tried them both would tell you Anthropic's
          models are more than marginally better than Kimi. Kimi and the other
          open source models may score well on SWE-bench or whatever but the
          gap is noticeable IMHO once you actually try to use them.
       
            Bnjoroge wrote 1 day ago:
            It depends on what your task is and how precise your prompts are.
            Planning with fable or 4.8, laying out the plan as a step-by-step
            process, coding with mimo v2.5 pro or dsv4pro or qwen 3.7 max,
            and doing a final review with 5.5 has worked really well for me for
            infra stuff.
       
              mnicky wrote 20 hours 50 min ago:
              Coding with a sufficiently precise plan takes almost all the real
              work away from the implementer, doesn't it? So it's not a fair
              comparison...
       
          nullbio wrote 1 day ago:
          I think none of them having a de facto, high-quality,
          English-focused CLI is a big part of it. None of the Chinese models
          I've tried have worked well in open-source CLIs. Granted, I've only
          tried a few, but still...
       
            saratogacx wrote 19 hours 35 min ago:
            I've been using charm's Crush with GLM for several months and it's
            been working great. I've only seen it shift to non-English once
            and it was already in a wonky state when it flipped.
       
            Bnjoroge wrote 1 day ago:
            huh? They all work great in omp/opencode unless you mean their own
            native clis like kimi code
       
            freigeist79 wrote 1 day ago:
            i use github copilot cli + openrouter + qwen 3.7 max and it's
            really much better than i expected (used to opus 4.7 at work)
       
          DCKing wrote 1 day ago:
          The moat right now is model performance and what that means for how
          many tokens and additional time you spend.
          
          I say this as a relatively frequent user of Kimi models and generally
          a big fan. But on not-yet-gamed benchmarks like DeepSWE, Kimi K2.6 is
          beaten soundly by Claude Sonnet 4.6 ($3 / $15) and even slightly by
          GPT 5.4 Mini ($0.75 / $4.50).
          
          There's no question Kimi models are very good for a lot of code
          tasks. They're the best-quality open-weight models. But to get similar
          overall outcomes as on Sonnet/Opus, on average you'll spend many more
          tokens and will have to do more managing of the model. You shouldn't
          look at price per token, you should look at how much you pay for the
          entire process.
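
          A minimal sketch of that "cost of the entire process" arithmetic.
          Only the $3 / $15 per-million-token price comes from the comment
          above; the cheaper model's prices and every token count are
          hypothetical numbers chosen purely for illustration, not
          measurements of any real model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRun:
    """One model's pricing plus the tokens it used to finish a whole task."""
    name: str
    input_price: float   # USD per million input tokens
    output_price: float  # USD per million output tokens
    input_tokens: int    # total over the task, including retries
    output_tokens: int   # total over the task, including retries

    def task_cost(self) -> float:
        """Total USD spent on the task, not the per-token rate."""
        total = (self.input_tokens * self.input_price
                 + self.output_tokens * self.output_price)
        return total / 1_000_000


# $3 / $15 is the price quoted in the thread; token counts are hypothetical.
pricier = ModelRun("pricier-per-token", 3.00, 15.00, 2_000_000, 200_000)

# Hypothetical cheaper model that needs more tokens and retries (assumed).
cheaper = ModelRun("cheaper-per-token", 1.00, 4.00, 8_000_000, 900_000)

for run in (pricier, cheaper):
    print(f"{run.name}: ${run.task_cost():.2f}")
```

          With these made-up numbers the model that is cheaper per token ends
          up costing more for the task; different token counts would flip
          the result, which is why the per-task total is the number to
          compare.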
       
            Bnjoroge wrote 1 day ago:
            I personally don't put any weight on DeepSWE. Other than 5.5 being
            directionally the best model, it gets the others pretty wrong in my
            experience. FrontierCode from Cognition looks interesting.
       
            esperent wrote 1 day ago:
            I'm more interested in how much effort I have to put in, at least
            while I'm paying in the range of current subscriptions (so
            ~€100-€200 a month or so). If the prices go up much more than
            that I'll have to switch to caring more about token efficiency. But
            at current pricing the bottleneck is my attention, not model
            efficiency. As such, even a small improvement in model quality -
            and hence, a decrease in how much attention I have to spend on it -
            makes a big difference.
       
            papersail wrote 1 day ago:
            I'm not sure I would put too much weight on DeepSWE as a benchmark,
            given that GPT-5.4-mini ended up close to Opus 4.6 there.
       
              DCKing wrote 1 day ago:
              Any benchmark is iffy and has weird results, but this is the best
              we've got at the moment. Most people working with Opus and Kimi
              would likely tell you they're much further apart than the numbers
              that were quoted for Kimi K2.6, and DeepSWE seems to capture that
              gap better.
              
              One major thing DeepSWE has going for it that all other
              benchmarks (including those quoted by MoonshotAI on this page)
              don't: the other benchmarks are completely gamed. The
              benchmark answers are public and part of each model's training
              data. This benchmark may still be iffy, but at least it's not
              gamed.
       
                WarmWash wrote 1 day ago:
                Somehow the internet has also forgotten that cheating to get
                ahead in China is basically a norm and expected behavior.
       
                  DCKing wrote 1 day ago:
                  American labs also use gamed and cherry-picked benchmarks
                  extensively. Anthropic used them in their Fable announcement
                  and avoided DeepSWE because it doesn't beat GPT-5.5 in that
                  one. Google's numbers for Gemini 3.5 Flash recently did not
                  at all line up with people's subjective experience using
                  these models, and this also happened with Gemini 3.1 Pro
                  before it.
                  
                  Everybody has incentives to manipulate benchmark results to
                  show their models in the best light.
       
          re-thc wrote 1 day ago:
          > My theory is that US enterprise just can't send data to Chinese
          
          Lots of US providers are hosting these “open source” models so
          doubt that’s the problem.
       
          yababa_y wrote 1 day ago:
          I want Opus to be only marginally better, but I do mostly research
          engineering and its ability to not fuck up my projects is absent.
          Every time my credits lapse I let kimi and composer2.5 have some play
          and it’s basically just an excuse for me to keep playing computer
          because when the oai/ant credits refresh I always need to spend hours
          recovering from the other models' misconceptions or boneheaded
          eng practices. Even when I only let it touch my web games…
       
            greenavocado wrote 11 hours 39 min ago:
            You have to revert to Opus 4.5 and 4.6. I bet you'll see a massive
            improvement based on what you're describing
       
       
   DIR <- back to front page