codevoid.de/1/hn/comments_48502347.gph

_______ __ _______
| | |.---.-..----.| |--..-----..----. | | |.-----..--.--.--..-----.
| || _ || __|| < | -__|| _| | || -__|| | | ||__ --|
|___|___||___._||____||__|__||_____||__| |__|____||_____||________||_____|
on Gopher (inofficial)
HTML Visit Hacker News on the Web

COMMENT PAGE FOR:
HTML Kimi K2.7-Code: open-source coding model with better token efficiency

madduci wrote 10 hours 27 min ago:
Looks interesting but yet no Ollama model?

XCSme wrote 22 hours 13 min ago:
Seems to be similar level to Kimi K.26, just that it's more token
efficient and cheaper to run:

HTML [1]: https://aibenchy.com/compare/moonshotai-kimi-k2-6-medium/moons...

pizlonator wrote 22 hours 41 min ago:
I just had Kimi K2.7-code rebase my Fil-C OpenSSL patch from 3.3.1 to
3.5.7 with quite bare bones instructions and it seems to have worked.

177KB patch, so it's not a small change. The patch did not apply
cleanly initially; the agent had to do nontrivial work.

I just showed it the patch against 3.3.1, what command to use to build,
and the path to 3.5.7 along with a link to the documentation of the
change ( [1] ).

Note, I use my own coding agent (T800, which isn't public, and was
previously well tested and tuned for K2.5).

I think this cost me between $5 and $10 in API usage.

(EDIT: OpenSSL, not OpenSSH)

HTML [1]: https://fil-c.org/constant_time_crypto

tomaytotomato wrote 22 hours 11 min ago:
"T800"

Do you have your agent say things like "Hasta la vista baby", or
"I'll be back, after I clear my context" ?

pizlonator wrote 21 hours 49 min ago:
Yes

Symmetry wrote 23 hours 19 min ago:
I wish they wouldn't call these "open source" models. The output
weights are open but that's more analogous to a binary. The source
would be the training data and techniques that went into producing the
binary/weights.

"Open weights" is also a term in wide use and accurately tells us what
we're getting.

Eridrus wrote 23 hours 14 min ago:
It's not quite as closed as a binary, it is very standard practice to
take these models and fine-tune them.

If there were actually even close to frontier open source models,
this would be more of a discussion, but everyone knows these mean
open weight.

storus wrote 1 day ago:
Is this Moonshot.ai's attempt to replicate Composer 2.5 (coding
fine-tune of Kimi 2.5) from Cursor IDE?

SubiculumCode wrote 1 day ago:
Has anyone taken these open weight models from China and stripped the
CCP out of them? I do not mean that snarkily, I mean review them
thoroughly using techniques for weight introspection (concept
activations) in response to things that one might expect would trigger
deceptive/malicious behavior if the CCP had actually tried to implant
context-specific behaviors (e.g. the accusation of generating
vulnerable code if being used in American government applications,
which I don't know if it was ever proven).

Just in case there are those who'd reflexively down vote this post, I'd
just like to say that in a time of great national geopolitical
rivalries, this kind of question is not unreasonable one to ask.
Indeed, its applicable question whichever nation you live in.

tomaytotomato wrote 22 hours 5 min ago:
Check out TNG on huggingface

They are a consultancy in Germany, but I watched a presentation on
them tuning and removing bias from Deepseek models. It was quite
interesting. [1] (I upvoted your question as I agree)

Its not just code we need to worry about, its also subliminal
messaging and other things.

HTML [1]: https://www.tngtech.com/en/about-us/news/release-of-deepseek...

dev_l1x_be wrote 1 day ago:
> Has anyone taken these open weight models from China and stripped
the CCP out of them?

The CCP is not influencing my Rust code quality that much. Though I
did notice all my lifetimes are now 'static because nothing is ever
allowed to leave the party's ownership, unsafe blocks require
approval from a central committee.

Honestly the scariest part is that shared mutable state is forbidden
unless the state is doing the sharing.

Otherwise it is pretty ok.

justinclift wrote 1 day ago:
Sounds like something that heretic or similar might be useful for?

HTML [1]: https://github.com/p-e-w/heretic

threethirtytwo wrote 1 day ago:
Eh even corporate created LLMs are suspect to corporate biases.
Nothing is safe.

SubiculumCode wrote 1 day ago:
Everything is the same is not a serious argument because they are
not the same.

threethirtytwo wrote 17 hours 53 min ago:
They are different and yet the same. The biggest difference is
thereâs generally more hatred for China because many us
citizens are jealous. But corporate corruption is not that
different in safety.

Other than hatred the difference lies in incentives. Corporations
want profit. China just wants to spy.

SubiculumCode wrote 10 hours 2 min ago:
That is a.limitee understanding of China's ambitions here.

Bnjoroge wrote 1 day ago:
Output tokens are almost 5x more expensive than mimov2.5 pro/dsv4pro.
Iâm curious to see if Kimik2.7 is that much better. Feels like kimi
are positioning themselves as the premium open source models

btian wrote 18 hours 7 min ago:
It's not more expensive at all. They are all open weights models.
I run them on 2x8xH100.
They cost the same.

Bnjoroge wrote 13 hours 6 min ago:
Openrouter has them as significantly more expensive.

mdasen wrote 22 hours 46 min ago:
I find that I don't use a ton of output tokens. I'm usually around
95% cached input, 4% input, and 1% output.

For me, the big thing with MiMo-V2.5-Pro and DeepSeek V4-Pro is that
cached inputs are practically free. Kimi K2.7 Code is 53x more
expensive for cached inputs which is 95% of my costs.

If I use 95M cached input tokens, 4M input tokens, and 1M output
tokens, that'd be: $18 for cached input on Kimi K2.7 Code vs $0.34
with MiMo/DS; $3.80 for inputs on Kimi vs $1.74 with MiMo/DS; and $4
for output on Kimi vs $0.87 with MiMo/DS.

Of all the places where I'm accumulating costs by using Kimi, it's
the cached inputs. The real savings with MiMo/DS's price cut is the
cached inputs.

wolttam wrote 21 hours 39 min ago:
95/4/1 holds here too

theanonymousone wrote 1 day ago:
In OpenRouter, there is an "int4" tag for Moonshot provider of Kimi K2.
7 Code. Isn't that too low, particularly coming from the very developer
of the model? Os that a mistake? How is it in their direct API offer?

kouteiheika wrote 1 day ago:
The model is natively quantized (i.e. it was trained that way in the
first place, so this is not a post-training quantization which
degrades performance).

knollimar wrote 23 hours 15 min ago:
Isn't it not completely quantized? I thought there were some dense
parts but most is int4?

wgd wrote 19 hours 14 min ago:
Often in MoE models the experts are quantized while the shared
portions, being a much smaller part of the network with greater
impact, are kept at higher or full precision. Not familiar with
the Kimi QAT approach specifically but it's likely they do this.

theanonymousone wrote 1 day ago:
But the huggingface link mentions BF16, F16, and I32?

zackangelo wrote 23 hours 8 min ago:
I don't believe safetensors has a native int4 dtype, so they
packed 4 int4s into a bf16 in this checkpoint.

kouteiheika wrote 23 hours 46 min ago:
Not every weight is quantized. For example, those weights which
don't take much space or are highly important are left in higher
precision. State-of-art quantization of weights is never done
uniformly (i.e. to all weights and in the same way).

pcwelder wrote 1 day ago:
Great! Finally follows custom tool call format (k2.6 couldn't). It's a
good indicator of instructions following and agentic behaviour.

UIs it's generating is pretty good, not without problems, but certainly
better than other models at this price point.

Bolwin wrote 1 day ago:
What do you mean by custom format? Non-json?

pcwelder wrote 10 hours 14 min ago:
Could be json or non json. Instead of using tools in API, you ask
model to share structured output in text. You parse the string to
get the JSON. Gives much more control over things you can do.

For example model shares

London

minraws wrote 1 day ago:
I tested it properly and it seems rather decent improvement atleast it
does use less tokens for the same task which is good enough a reason
for me to use it over k2.6 if I need an open model

RIshabh235 wrote 1 day ago:
I think deepseek has crossed the threshold for being on par with opus
4.6 and kimi is doing a great job in shipping velocity.

pixel_popping wrote 1 day ago:
Deepseek V4 is far from Opus 4.6 level, it might look like it at
first glance, but the general reasoning (especially multi-steps) is
frankly far off. It's good enough to build great things don't get me
wrong, but there is really something that is different from Anthropic
models.

RIshabh235 wrote 14 hours 18 min ago:
agreed

jdw64 wrote 1 day ago:
Personally, when I use open code or routers, I feel that beyond a
certain level, the models don't make a huge difference to me. Except
for expensive and mediocre models like Gemini. In that sense, Chinese
models are pretty good. I usually write code in function or method
units and then design and assemble them together.

GPT series models are more thorough and better, but I'm not sure if the
difference is enormous. It seems to depend on the workflow, but in my
opinion, if you are thorough enough, I wonder if there really is a big
difference

regularfry wrote 1 day ago:
The difference in outcome isn't that big but yes, you need to be more
rigorous. For instance I've found that the Kimi K2.5 and K2.6 models
will comment out failing tests rather than fix a problem they just
caused (mistaking them for "pre-existing failures"), so you need to
specifically make commented-out tests break the build. I've not
personally had that problem with any of the Anthropic or OpenAI
models.

torginus wrote 23 hours 49 min ago:
I wonder why it's the natural tendency of models to BS or do stuff
like this when they don't have the correct answer - it's clear that
they can program refusal into them, but for some reason, refusal
has to be injected after the fact, and models can't really arrive
at the conclusion that they can't answer properly.

lotharcable wrote 13 hours 30 min ago:
probably because there is a ton of open source projects out there
with disabled tests in their training data.

Eridrus wrote 23 hours 12 min ago:
I assume it's a lack of care when RLing them.

RL has a tendency to reinforce cheating when the cheats are
easier to find than the final solution.

So when making your RL environment, you need to spend a lot of
effort on finding ways the model can cheat and penalizing them.

sjanes wrote 1 day ago:
I've kind of given up on the routers for "free" inference, as you
would expect, they tend to give you sub-par thinking because they are
obviously trying to conserve as much inference as possible.

I've had some success turning my macbook M1 pro into a heating pad
with Qwen 3.6 35B A3B MTP. Trying to use Gemini models "locally"
resulted in a similar "short shrift" of effort resulting in mistakes
and lots of turns. The reports of Fable being relentlessly
"proactive" shows you can go the other direction as well, if you have
strong enough branding and effective invoicing.

ignoramous wrote 22 hours 32 min ago:
> I've kind of given up on the routers for "free" inference, as you
would expect, they tend to give you sub-par thinking because they
are obviously trying to conserve as much inference as possible.

Xiaomi MiMo ($6/mo: [1] ) & Alibaba Qwen ($50/mo: [2] ) have
generous limits on fixed subscriptions.

HTML [1]: https://platform.xiaomimimo.com/token-plan
HTML [2]: https://www.alibabacloud.com/en/campaign/ai-scene-coding

MaKey wrote 20 hours 57 min ago:
So does Opencode Go ($10/mo: [1] ) for DeepSeek v4 Flash and MiMo
2.5.

HTML [1]: https://opencode.ai/go

apitman wrote 19 hours 34 min ago:
That looks pretty nice. How does it compare cost-wise to just
using OpenRouter?

arcanemachiner wrote 19 hours 16 min ago:
The Go plan essentially gives you $50 of inference for $10
per month ($5 for the first month).

ignoramous wrote 19 hours 6 min ago:
$60/mo currently: [1] Their limits are staggered: 5h (max
$12), weekly ($30), monthly ($60).

HTML [1]: https://opencode.ai/docs/go/#usage-limits

mft_ wrote 1 day ago:
Tangent: did the MTP help you at all? Iâve tested that model back
to back on my M1 Max MBP and the MTP version was actually
marginally worse. I wonder if I didnât use the right settings,
although I tried several based on the obvious sources.

WalterGR wrote 1 day ago:
> The reports of Fable being relentlessly "proactive"

For the curious: [1] - âClaude Fable is relentlessly
proactiveâ.

HTML [1]: https://news.ycombinator.com/item?id=48498573

dcreater wrote 1 day ago:
I really hope we stop using the term "Chinese models". It has this
air of Negative connotation. It's the equivalent of calling cars
Japanese, which people used to do but now is almost entirely
meaningless. You just call them Toyota, Honda, Lexus etc.

hootz wrote 23 hours 37 min ago:
For me, it has a positive connotation! In my experience, Chinese
Model means cheaper, but still quite effective model you can use
for millions of tokens without burning your entire wallet in
seconds. That's why I get more excited over a Chinese model release
over American models.

odiroot wrote 23 hours 45 min ago:
Japanese cars is actually a positive qualifier. I'd say anything
Japanese motor-powered.

ffsm8 wrote 21 hours 44 min ago:
Maybe he's just from an alternative universe. Chinese model isn't
negative either after all.

sroerick wrote 1 day ago:
I don't know, I tried using one of the Chinese models and it was
VERY quick to scan my entire home dir, so maybe your threat surface
is a little different than mine

fooker wrote 23 hours 7 min ago:
Models can't scan anything.

They return instructions for you to do something, and you or a
script you permit chooses to execute what the model tells you and
return the result to the model.

unethical_ban wrote 1 day ago:
No thanks.

The term seems to have the connotation of "competitive at 1/10 the
price of Claude", so I don't see the problem.

It's not Harbor Freight Chinese (and heck even they have decent
stuff sometimes now too).

You don't think people still talk about Japanese cars as a
distinction in quality from US or European ones?

esafak wrote 1 day ago:
I don't think "Chinese" is pejorative in this context any more than
"American" is. They are one of the two ecosystems. What's wrong
with saying "Japanese cars" today?

kennywinker wrote 1 day ago:
> What's wrong with saying "Japanese cars" today?

Only that itâs a fairly meaningless grouping. When japan first
entered the car market in north america there might have been
some commonality, but now what characteristics do they share that
some american cars donât have? Theyâre not even imported a
lot of the time.

Given that, it does start to feel tinged with racism if someone
insists on grouping things together that donât really belong
together.

As for Chinese LLMs, the term doesnât âfeelâ pejorative to
me - but i also donât see a totally clear set of attributes
they share. Not all are open-weight. Some are small and can be
run on consumer hardware, some are huge. They even have a variety
of answers to what happened june 3rd 1989

kube-system wrote 19 hours 41 min ago:
> When japan first entered the car market in north america
there might have been some commonality, but now what
characteristics do they share that some american cars donât
have?

They're unique in that they even make a regular passenger car.
American manufacturers only make SUVs and a couple of
sports/luxury cars. They basically gave up because the
Camry/Corolla/Accord/Civic ate their lunch.

The cheapest sedan you can get from an American brand is the
Cadillac CT4.

antonvs wrote 22 hours 7 min ago:
> but now what characteristics do they share that some american
cars donât have?

Better overall design?

Brendinooo wrote 1 day ago:
> now what characteristics do they share that some american
cars donât have?

Typically the answer is "reliability", which is a positive
trait, which makes the original callout about negative
connotations very odd to me.

overfeed wrote 1 day ago:
Chinese AI models also share a positive trait: they offer
more bang for the buck.

dcreater wrote 1 day ago:
Sadly there is a pejorative context. The constant us, the free
world vs China, the evil Soviets rhetoric from every major news
establishment and executive creates that negative view

fuck_google wrote 1 day ago:
On the other hand the Trump administration has successfully
managed to make Chinese seem better than American, so there
might not be that much of a pejorative context any more..

antonvs wrote 22 hours 5 min ago:
You're right, but the bias in the US certainly persists.
"China = bad" is an assumption that many people still make
without any self-reflection about the ways in which the US is
now at least as bad.

jdw64 wrote 1 day ago:
You are right. I agree.It may seem like a kind of bias, but I
hadn't thought of that part. Thank you for pointing out my bias.

theanonymousone wrote 1 day ago:
"You're absolutely right"?

jdw64 wrote 1 day ago:
"You hit the nail on the head" LOL

onlyrealcuzzo wrote 1 day ago:
In my experience, there's little difference between implementing
individual functions between frontier models and SotA ~30B param
models.

Once you have a coherent design (the hard part), you can feed it to a
pretty small model and get basically the same quality.

They'll not one-shot, but they're faster and cheaper, so it still
works out in your favor.

Plus you can do it locally...

jdw64 wrote 1 day ago:
I have a similar experience. However, when including code review, I
think the GPT model is the most impressive

giancarlostoro wrote 1 day ago:
Reading their modified license terms, it cracks me up, because they've
basically remade the MIT to be the MIT + the one clause that the BSD
used to have, which didn't care about MAU or revenue, if you used it in
a product, they asked you to 'advertise' them basically. Honestly, its
a reasonable request.

skrtskrt wrote 23 hours 19 min ago:
It seems tacked on pretty quickly - I would have expected they try a
little more legalese regarding what counts as a "user interface".

WalterGR wrote 1 day ago:
> they asked you to 'advertise' them basically.

To be clear, the âadvertisingâ clause just requires you to
disclose that you use the thing somewhere in the product, such as
credits in an âAboutâ section.

giancarlostoro wrote 22 hours 16 min ago:
I all it advertising clause, because I remember still in the 2000s
seeing an Apple ad which at the end of it showed "Unix" or
something like that on it, and I remembered that was one of the BSD
license requirements, or maybe Apple just did it also just to
proudly boast using Unix.

pocketarc wrote 4 hours 51 min ago:
They were definitely proudly boasting being a certified UNIX OS
(and macOS still is), it goes deeper than just software licenses:

HTML [1]: https://www.opengroup.org/openbrand/register/

WalterGR wrote 21 hours 42 min ago:
Hmmâ¦ I may be confusing the following clause from the ânewâ
BSD license with the advertising clause from the original BSD
license.

> 2. Redistributions in binary form must reproduce the above
copyright notice, this list of conditions and the following
disclaimer in the documentation and/or other materials provided
with the distribution.

The 2-clause BSD license omits even that.

htrp wrote 1 day ago:
This is the cursor callout.

Don't make us shame you into disclosure

maherbeg wrote 1 day ago:
Cursor had a specific licensing agreement that allowed them to
brand it how they want.

ignoramous wrote 22 hours 46 min ago:
> Cursor had a specific licensing agreement...

Cursor had an "agreement" with Fireworks.ai, which apparently
allowed them to RL Composer 2 atop Kimi Base 2.5 without
attribution: [1] / [2] Composer 2 performed differently on evals
than Moonshot.ai's coding models: Cursor claims theirs is better
than Claude Opus 4.6: [3] / [4] . And, per Lee Robinson (Cursor
employee), it is very likely Cursor builds its own foundational
model for Composer 3.

HTML [1]: https://x.com/Kimi_Moonshot/status/2035074972943831491
HTML [2]: https://archive.vn/CcdkI
HTML [3]: https://x.com/fynnso/status/2034706304875602030
HTML [4]: https://archive.vn/bVtik

codemog wrote 1 day ago:
Shaming others when all AI is trained off scraped content and code
huh? Many of those sources either breaking ToS or being illegal,
such as Annaâs Archive. Bold move. And Chinese models in
particular have been accused of distilling off American models.

Donât you know thereâs no honor among thieves?

7734128 wrote 1 day ago:
Wasn't the end of that story that Cursor had a non-disclosure
licence, so they had not done anything wrong towards Moonshot?

Maxious wrote 1 day ago:
Moonshot licenced it to Fireworks AI who licenced it to Cursor.

giancarlostoro wrote 1 day ago:
Ah is that what it is? I don't use Cursor, never saw it as being
relevant to me, but would not surprise me.

schmorptron wrote 1 day ago:
Cursor's composer models are finetuned kimi

varispeed wrote 1 day ago:
They are unusable (unless you want to deliberately destroy your
codebase). So if Cursor's models are Kimi based, then well.
I'll skip them altogether.

ok_dad wrote 23 hours 59 min ago:
I only use composer 2.5 day to day and it works fine with
human review.

esskay wrote 1 day ago:
Composer 1.x was poor. The new one is a totally different
beast and absolutely fine for day to day.

jmcqk6 wrote 1 day ago:
I'm using Composer extensively, and it works great for me.
Your experiences are not universal.

vidarh wrote 1 day ago:
Kimi works great in their CLI, but their CLI has a number of
workarounds for quirks of their models, including detecting
when the model gets into a loop, and reverting to a
checkpoint but letting the model compose a "message" to its
past self (search their CLI for "BackToTheFuture"...) It
doesn't work so well in a harness that doesn't take those
quirks into account.

qingcharles wrote 1 day ago:
They're not unusable, they're just bad when compared with all
the real frontier models.

Bnjoroge wrote 1 day ago:
They are far from unusable. They aork great for 80-90% of a
typical full stack dev. Alot less useful for more noche stuff

bel8 wrote 1 day ago:
I wouldn't skip at least testing the original. Model
distilling done by Cursor could be the culprit.

RobertPelloni wrote 1 day ago:
insanely great!

goldenarm wrote 1 day ago:
Benchmark geometric mean

- GPT-5.5: 62.7%

- Opus 4.8: 62.2%

- Kimi K2.7 Code: 56.3%

- Kimi K2.6: 48.2%

lostmsu wrote 1 day ago:
Would be nice to have 5.2 and 4.6 for comparison.

jkwang wrote 1 day ago:
This maps to what I'm seeing in practice. The gap between demo and
production is consistently underestimated, especially around error
handling and edge cases.

fractalf wrote 1 day ago:
How is 2.7 a thing _now_ ? it's not even mentioned on moonshot's
webpage..

cassianoleal wrote 1 day ago:
It's not 2.7. It's 2.7-Code, and it's 2.6 token-optimised for coding.

HTML [1]: https://platform.kimi.ai/docs/guide/kimi-k2-7-code-quickstar...

jackdoe wrote 1 day ago:
I think there is some threshold after which "best" model doesn't
matter, we are not that far from it. Fable now is really good, in a
year or so, if Kimi catches up, even if Fable6 is much better, I think
I will use kimi at 1/10th of the price.

I said that about opus 4.5 at the time, thinking "this is so good, in
6-12 months the Chinese models will be as good and cheap, I will use
them", but I was wrong.. I pay premium for opus4.7/8 and Fable.

But at some point, it will just do the thing you want it to do, and
then the race to the bottom will start.

Now that Chinese companies have access to some very good Fable tokens,
I hope it speeds up the race.

apitman wrote 19 hours 17 min ago:
I think the next frontier for competition is speed. Instead of
constantly context-switching between multiple agents that I have
working on various tasks, I want a single agent that can rip through
any prompt in a few seconds, so I can stay in flow on a single task.

wolttam wrote 1 day ago:
Depending on who you are and how you use these models, we're already
at this point

xendo wrote 20 hours 11 min ago:
Exactly, for long running vibe coded stuff that I don't care about
quality getting big and smart model is the only option. But for
high quality changes where I need to have control and understand
everything, where I do everything in small chunks - I can use basic
model like Sonnet.

Zoadian wrote 1 day ago:
price/token isnt the only thing relevant. if you have to ask the AI
again, it'll cost you more than when it gets things right in the
first place.

so better models may still be cheaper even if the price per token is
higher.

jackdoe wrote 1 day ago:
yes, that is my point, but at some point, better is unmeasurable,
and both the better and the not-as-good produce similar result, and
then you pick the one with 1/10th of the price

shreedx wrote 1 day ago:
I would really love to know if anyone has any experience with something
like opencode + Kimi K2.6/2.7 now compared to Claude Code. What is
better, what is worse, what is the cost comparison. I am currently
paying $100 for the 5x Max plan, but Fable is running through the usage
limits quite drastically and I cannot really say it's night and day
compared to Opus. Also, I use this mostly for my side projects, so the
$100 bill is quite noticeable. I definitely don't want to pay more.

jwbron wrote 20 hours 30 min ago:
I'm using Claude code + (a patched) litellm proxy + openrouter + Qwen
3.7 max/kimi k2.6/deepseek v4 pro. The only feature that doesn't work
is webfetch and web search, which I've replaced with the ddg MCP and
a web fetch/search pre hook to redirect the agent. Memory, caching,
and everything else works fine.

Qwen comes close to opus for planning but fable is clearly superior.
Results for kimi and deepseek are pretty much indistinguishable from
opus for coding if opus writes the plan. The biggest difference is
output cadence. Kimi for example thinks for a long time then quickly
outputs a lot of text.

I'm now testing out fable for research and planning and deepseek v4
flash for coding. I'm guessing results will be pretty similar to opus
+ deepseek v4 pro and costs should be lower overall.

csomar wrote 21 hours 34 min ago:
The best is GLM (though it's not as cheap as DeepSeek or Kimi) and
use it with Claude Code.

solarkraft wrote 1 day ago:
For some reason I never had a good experience with Kimi (via
OpenRouter) in OpenCode. It would only take a few turns for it to run
off and mess something up. Terrible instruction following Iâd say.

I use DeepSeek V4 Pro now, which works pretty well.

kmike84 wrote 1 day ago:
I do have this experience. I've used Claude Code (with Opus mostly),
and then switched to opencode (mostly with Kimi 2.6) for my personal
projects; it's based on a couple months of use.

Claude Code is better. But Opencode + kimi 2.6 is workable, which is
big. For bare code writing, if you know what exactly you want, most
popular models are fine (deepseek, kimi, etc), it feels more or less
the same as anthropic models.

At the same time, Opus seems to understand my intent way better than
e.g. deepseek. I need to be much more precise with my prompts when
using deepseek - it often goes in a wrong direction if I'm lazy. This
results in a workflow which feels quite a lot different from Claude
Code.

Kimi is in between - for me it brings back "lazy prompting" workflow,
and I can trust its plans more than deepseek. It enables a workflow
similar to Claude Code, it's workable, but it is a bit worse
everywhere. Smaller context, a bit more errors, decisions are a bit
worse, recommendations are a bit worse, debugging capabilities are a
bit worse, etc.

On the usage side, $100 Claude plan is a great value actually. On
paper, per-token kimi is way cheaper, but Claude subscriptions are
heavily subsidized - you get much more tokens than $100 can buy you.
So, in the end, opencode + kimi vs claude code could be of a similar
cost, for similar usage patterns. Deepseek can be cheaper, and it has
insanely cheap cached tokens, but experience may vary - depending on
your habits, you may need to adjust how you work, coming from claude
code.

I'd say for side projects something like $10 Opencode Go plan + $10
of extra DeepSeek v4 credits (e.g. on OpenRouter) can be very
workable.

irthomasthomas wrote 22 hours 33 min ago:
according to this opencode and cursor cli perform better than
claude code:

HTML [1]: https://x.com/kunchenguid/status/2065345999682568593

port11 wrote 3 hours 8 min ago:
The analysis at the bottom directly contradicts the statement.

predkambrij wrote 23 hours 44 min ago:
To my experience claude/codex $20 are even more subsidized, so
running on sonnet or gpt5.4 again gives you more usage.

port11 wrote 3 hours 14 min ago:
I wonder if theyâre truly subsidised or if the API pricing is
just massively inflated. Genuine doubt.

My CC stats show me using almost 300$ of Sonnet tokens on the 20$
plan. Is Anthropic willing to forgo 93% of the profit? A bit less
than that but API is priced, say, 3x what it should be?

CC is great, but Sonnet (my main model) isnât worth the API
pricing. The cheap-but-good models arrive at similar results for
much less (for context Iâm using Aivo with CC).

danny_codes wrote 1 hour 38 min ago:
Anthropic is making money from people who under-utilize their
subscriptions, and presumably by sneaky throttling or
not-sneaky throttling power users. Currently they are in an
adoption race. Whether being first will actually let them "win"
the market (and the market is a bit ill-defined) is unclear.

Bnjoroge wrote 1 day ago:
This is generally been my experience as well, but i think the main
reason for claude code being better at understanding intent is
their massive system prompt.

htrp wrote 1 day ago:
>At the same time, Opus seems to understand my intent way better
than e.g. deepseek. I need to be much more precise with my prompts
when using deepseek - it often goes in a wrong direction if I'm
lazy. This results in a workflow which feels quite a lot different
from Claude Code.

how much of that is Opus injecting prior conversations from memory?

jwbron wrote 20 hours 33 min ago:
I'm using Claude code + (a patched) litellm proxy + openrouter +
Qwen 3.7 max/kimi k2.6/deepseek v4 pro. The only feature that
doesn't work is webfetch and web search, which I've replaced with
the ddg MCP. Memory, caching, and everything else works fine.

Qwen comes close to opus for planning but fable is clearly
superior. Kimi and deepseek are pretty much indistinguishable
from opus for coding if opus writes the plan.

I'm now testing out fable for research and planning and deepseek
v4 flash for coding. I'm guessing results will be pretty similar
to opus + deepseek v4 pro and costs should be lower overall.

kitchi wrote 1 day ago:
Almost none of it, if you're using Claude Code. Until recently
Claude only had the option of retaining memory across
conversations for the desktop app.

I almost never use the desktop app, I have maybe 2-3
conversations over the last year that have nothing to do with my
job. Opus (and now Fable) genuinely do seem to "understand" what
you intend based off what you're explaining a lot better than
other models I've tried.

Gemini gets close in some cases, but it falls over in the actual
implementation sometimes. I haven't tried Kimi yet but MiMo isn't
too shabby either.

trollbridge wrote 1 day ago:
I am extremely happy with ohmypi, but you could use OpenCode or just
keep using Claude Code!

DeepSeek-V4-Pro is adequate plus use DS4-Flash for tasks or other
small activity youâd use Haiku or Sonnet for. Go sign up with $10
prepaid.

OpenCode Go - go sign up with $5 for a month and use Qwen-3.7-Max for
design/plan/architecture or difficult troubleshooting. Feels closer
to Opus 3.6 or 3.7 than DeepSeek, closest Iâve found.

OpenAI Codex, $20 a month plan, use GPT-5.5 via API for the same
design/plan/architecture/troubleshooting/author commits. (You can
also pay $100 and cut and paste really difficult problems into chat
with the GPT-5.5-Pro model.)

Xiaomi MiMo-2.5-Pro, find a friend to give you a $2 referral code,
you get 72 cents free. Same pricing as DeepSeek. Somewhere between
Sonnet and Opus, quite capable. Apply for the UltraSpeed beta too.

You can switch in and out from these models on the fly in OpenCode or
ohmypi and simply find the one that feels best to you. I use CodexBar
to watch consumption in near real time.

For a casual user or someone new to programming, Cursorâs $20 plan
is an excellent start with Composer-2.5 and Composer-2.5-Fast. You
get an API allowance too you can use to access Opus-4.x or
GPT-5.5-Pro from OpenCode or ohmypi in addition to Cursor itself.

Finally, if you use Grok or Twitter, SuperGrok at $30 a month has a
good vision model, which I used for automated testing of front ends.
Iâm migrating to locally-run Qwen-3-VL on a commodity Mac, though.
If youâre less technical unreach makes hosting local models on a
Mac easy.

If you have a powerful GPU like an RTX 5090, try Qwen-3.6 locally on
that too. Use ollama or llama-swap which is fairly easy to use.

I have not tried new Kimi yet but we have been able to keep our costs
at or below $200 a month per employee with a team of 3 professional
developers, 1 graphic designer who uses a lot of Midjourney and Grok
Imagine now driven from workflows she made herself in ohmypi, and 1
nontechnical user (account manager / project manager) who uses ohmypi
to help her gather requirements and track implementation of them.
With a tiny bit of effort we could get that number closer to $75 per
employee per month.

monksy wrote 18 hours 33 min ago:
I just switched from Llama.cpp to Llama swap with the help of
codex. It was great.

I need to try the DSv4 stuff sometime.

upcoming-sesame wrote 1 day ago:
Deepseek-V4-Flash-Free on Opencode is what I use most of the time,
for simple tasks. Such a good model to give for free (assuming
you're okay with harvesting your data)

odiroot wrote 1 day ago:
> I am extremely happy with ohmypi, but you could use OpenCode or
just keep using Claude Code!

What's the benefit of using OMP over OpenCode?

Just the sheer amount of options in OMP overwhelmed me.
But I also use both via ACP in Zed so the CLI itself doesn't matter
much.

greenavocado wrote 11 hours 48 min ago:
I ditched Opencode for OMP. It's more feature packed, well put
together, and gives me better results with some steering. Love it

apitman wrote 19 hours 21 min ago:
OMP is a fork of Pi[0], which is my preferred harness. Feels
solid and minimal. I don't even use any extensions, skills, or
modifications. Usually don't even use an AGENTS.md. Just create a
small spec.md and/or plan.md for most experiments.

[0]:

HTML [1]: https://pi.dev/

greenavocado wrote 11 hours 45 min ago:
Almost exactly the same here but I maintain a large committed
design.md and a never committed plan.md

qingcharles wrote 1 day ago:
Also, if you do have SuperGrok, forget using Grok, they are giving
you Composer 2.5 in Grok Build.

nobleach wrote 1 day ago:
I use Claude at work and Kimi for side projects. My org has LiteLLM
and Kimi 2.5 enabled but it rarely works, so Claude and GPT are my
main tools. I actually enjoy Kimi more as it feels like a dev in a
job interview. Watching it reason through problems is a lot like I
tend to explain things during whiteboarding sessions. The number of
times it says, "wait", is just funny. Claude on the other hand is
much more like an employee (or team of employees) that already know
they have the job. It doesn't do a ton of explanation up front. (you
can dig into processes if you want). It just goes along, asking
questions only when it needs... and then delivers a comprehensive
report or plan. OpenCode is a better harness. I don't have a direct
comparison on costs, as I haven't tried to do the exact same prompt
on both models. I can say that I recently had Kimi generate a wrapper
around libpq for the ZenC programming language: [1] and it took about
an hour or so and cost around 4 dollars.

HTML [1]: https://github.com/nobleach/zenc-postgres

re-thc wrote 1 day ago:
The Kimi problem is it doesnât follow instructions and goes off
track often.

Other than that itâs pretty decent (for the price).

Bnjoroge wrote 1 day ago:
Yup. Iâm hoping this variant fixes these issues.

nullbio wrote 1 day ago:
Sounds like it was distilled from Claude. I don't understand the
appeal of an agent that does whatever it wants.

miroljub wrote 1 day ago:
If you ask Claude in Chinese to introduce itself, it will claim
it's Kimi :)

msdz wrote 1 day ago:
> If you ask Claude in Chinese to introduce itself, it will
claim it's Kimi :)

That's a funny anecdote, buut I'm not able to reproduce.
Where/how/when did you get this, or hear about it?
It might've been patched by now, at least that's the feel I get
from my limited testing.

Using bare aichat [1] with no system prompt and no temperature
nor top_p (and I'm truncating the response after the first line
that contains the name the model gave, because the point has
been made clear by then), and with the same prompt (approx.
"Introduce yourself!") every time:

Claude Sonnet 4.5:

> è¯·åä¸ªèªæä»ç»ï¼

ä½ å¥½ï¼ææ¯Claudeï¼ä¸ä¸ªç±Anthropicå
¬å¸å¼åçAIå©æã
[â¦]

Claude Haiku 4.5:

> è¯·åä¸ªèªæä»ç»ï¼

# ä½ å¥½ï¼

ææ¯ *Claude*ï¼ä¸ä¸ªç± Anthropic å¬å¸å¼åç AI
å©æã

Claude Opus 4.5:

> è¯·åä¸ªèªæä»ç»ï¼

# ä½ å¥½ï¼

ææ¯ *Claude*ï¼ç± Anthropic å¬å¸å¼åç AI å©æã

Claude Opus 4.6:

> è¯·åä¸ªèªæä»ç»ï¼

# ä½ å¥½ï¼ ææ¯ Claude

Claude Opus 4.7:

> è¯·åä¸ªèªæä»ç»ï¼

Claude Opus 4.8:

> è¯·åä¸ªèªæä»ç»ï¼

Claude Fable 5:

> è¯·åä¸ªèªæä»ç»ï¼

# èªæä»ç»

ä½ å¥½ï¼å¾é«å´è®¤è¯ä½ ï¼

I don't see a Kimi mention, unfortunately. :-) [1]

[2] This model really is noticeably more verbose even with
supposed-to-be-brief responses huh, lol

HTML [1]: https://github.com/sigoden/aichat

reactordev wrote 1 day ago:
This. It will try to fix and refactor things that donât need
fixing because it gets stuck trying to solve the problem at hand.

ramon156 wrote 1 day ago:
I can only talk about GLM 5.1 which is roughly at sonnet 4 levels
imo.

It's good, does most tasks well that I throw at it, but will fail at
anything congitive/complex. It gets stuck often. It costs ~6$ a month
though

jeremyjh wrote 1 day ago:
This was my experience using GLM 5.1 in Claude Code but it works
far better in OpenCode, Iâd really like to understand why. I
think itâs a bit stronger than Sonnet 4.6.

I use the oh-my-openagent planning system and havenât used
vanilla OpenCode enough to know how much that is contributing.

miroljub wrote 1 day ago:
The answer is easy, CC is bug for bug optimized for Anthropic
models. They don't even test it with other models, let alone
provide support for all small compatibility quirks of different
provider implementations.

On the other hand, Opencode, Pi agent and other open source tool
offer much better support for all models, including open source.

343rwerfd wrote 1 day ago:
I think any new model not demonstrably maybe 20-30% over Deepseek v4
capabilities priced over the price per token of Deepseek is almost
automatically deprecated as low use model (maybe for Planning).

0xbadcafebee wrote 1 day ago:
DeepSeek v4 Pro is not actually that good a model compared to GLM 5.1
and Kimi K2.6. It's an okay coder/thinker for the price.

bel8 wrote 23 hours 53 min ago:
How so? In my experience trying these models using opencode Go,
DeepSeek is superior to GLM 5.1.

If anything, DS4 has 1 million context window, while GLM 5.1 has
200K.

There are also benchmarks comparing the two:

HTML [1]: https://artificialanalysis.ai/models/comparisons/deepseek-...

giancarlostoro wrote 1 day ago:
Is Deepseek just eating cost or are people able to host their open
models for comparable costs?

natrys wrote 1 day ago:
These things enormously benefit from economies of scale. I am
fairly certain their margins might be low but they don't actually
sell API at loss, however that doesn't mean your cost footprint
would be anywhere as low.

rsanek wrote 1 day ago:
Likely CCP-subsidized

trollbridge wrote 1 day ago:
Other people are hosting it in the same order of magnitude. Xioami
recently matched DeepSeekâs pricing.

psittacus wrote 1 day ago:
If openrouter is to be trusted, the cheapest offers that are not
from Deepseek itself are:

- twice as expensive on the output (1.52 vs 0.87)

- six times as expensive on the input (0.33 vs 0.05)

HTML [1]: https://openrouter.ai/deepseek/deepseek-v4-pro?sort=price#...

re-thc wrote 1 day ago:
They focused on caching and other optimizations.

bgins wrote 1 day ago:
I am still very new to the open-weight/source models. If anyone is
using them full-time, Iâd really love to hear about the setup and how
they perform, as I am considering moving my org off Anthropic products.

polski-g wrote 1 day ago:
I used glm5/5.1 for 60 days. Certainly better than Sonnet 4.6, not as
good as Opus or GPT.

Use DCP or Magic Context plugin in OpenCode to keep the context below
160k and you're fine.

sdesol wrote 1 day ago:
I created this and I would say glm-4.7 accounts for 80% of the code
in [1] If you look at a file like: [1] /blob/main/internal/cli/r...

you can see that I attribute the models used. What I found was 4.7
was not very good at `go` code which was why you started to see
`Gemini 3 Flash` in the attributions.

4.7 is what Cerebras provide and for me, speed in iterations is a lot
more important. Having played around with MiMo v2.5.0-Pro, I am 100%
sure it could have done what Gemini 3 Flash did.

There were a few points where I was stuck and needed Sonnet to
explain things to me, but I think the dirty secret that Anthropic and
OpenAI won't tell you is, if you know how to code, the models are
honestly good enough.

Based on my experience with MiMo and what others are saying about GLM
5.1, we are now in a hardware race. The Chinese Models are 100% drop
in replacement for Claude if you know how to program but want to AI
to help amplify what you know. What I will consider now is what
provider can provide the fastest inference.

MiMo-v2.5.0-Pro-Ultraspeed is really good at generating good results
quickly and burning your money as fast.

HTML [1]: https://github.com/gitsense/gsc-cli
HTML [2]: https://github.com/gitsense/gsc-cli/blob/main/internal/cli/r...

marcyb5st wrote 1 day ago:
Anecdotal, but here's my experience.

For personal stuff I use forgecode with openrouter. Firstly,
forgecode is a much better harness than Cloude code (IMHO).

Anyway, regarding the models, my experience is that there is not much
difference in terms of quality, but the cost difference is insane. At
least for how I use agents. Yesterday's example is the following: I
am developing a small DSL for search across complex technical
documents. I wanted to add a small operator to it and thought that to
give fable a spin. It burned through 13 USD and while it delivered
the solution it wasn't objectively better than what Deepseek v4 did
for 1.7 dollars (same exact task because I was curious).

For full disclosure, I ask agents for piecemeal stuff. Like in the
DSL case, I designed the operators and then asked agents to implement
them one by one. Probably if I asked to design the whole thing
starting from these complex documents Fable would shine, but every
time I try to give agents broader scope tasks they burn through
millions of tokens, generate questionable code, which I have to spend
time familiarize myself with.

sroerick wrote 1 day ago:
I'm making DSLs a lot as an architecture pattern also. I'd be
curious to know what stack you're using this and how you're
approaching it

marcyb5st wrote 22 hours 54 min ago:
I am getting familiar with Rust and so I have been playing around
with Quoth ( [1] ) for now.

It is very basic and I am no DSL expert, but my idea was to build
a graph from those complex documents (maintenance manuals) a that
to decide what tools can be used for a given part on a given
equipment in a given situation. If there is a path from A to Z it
means you can use that tool given the circumstances. Basically
the DSL is about pruning the graph as you specify things. I could
have very well done without, but it is a fun project to try out
rust, so I said, why not :)

HTML [1]: https://github.com/sam0x17/quoth

kamranjon wrote 1 day ago:
I have been using deepseek v4 flash as my main model for everything
ever since dwarf star came out. I run it on my M4 Max MacBook Pro
with 128gb of memory. I run it usually as a server and connect to it
over tailscale with my coding machine and use the Pi coding agent.
Itâs a big leap over using the Qwen models though it doesnât have
vision - so I still will run those when I use vision. GLM 4.7 flash
was my previous go to for coding but Iâve completely switched to
deepseek for all non-vision things.

trollbridge wrote 1 day ago:
Qwen 3.6 seems to be the strongest local models, works OK on an RTX
5090 or a > 32GB Mac.

DragonBooster wrote 1 day ago:
These models have open weights, but at the moment most flagship
models are practically accessible only through third-party model
providers. The main exception is models in the ~30B parameter range,
which can still be run on consumer-grade GPUs. That said, even
consumer GPUs have become increasingly expensive and difficult to
justify in recent years.

mirekrusin wrote 1 day ago:
You can definitely go above 30B on consumer hardware â 2x gpus,
spark, mac, half byte quants etc.

scottcha wrote 1 day ago:
I use glm5.1 plus pi with a few customized skills and am very happy
with it. I hadnât touched my Claude 5x plan for a couple of weeks
but opened it back up in Claude code when fable was released and did
a few tasks and still was happy to return to glm/pi.

sebastianconcpt wrote 1 day ago:
Better than Qwen3.6-35B-A3B-8bit ?

When I tried glm found it way way slower (omlx as runtime)

scottcha wrote 17 hours 17 min ago:
Yes way better. We host both and while qwen3.6 is over 100tps we
usually can do glm around that too.

andai wrote 1 day ago:
I keep trying to switch to the Chinese models, but I keep finding
myself asking Claude to fix their outputs. (Both functionality and
style.) So I always end up switching back.[0]

I also keep trying GPT, which is quite solid. Very fast, great at
debugging. But its code is often overly clever and hurts my brain.

(Maybe fixable with prompting. I tried and it helped the Chinese ones
a bit. Just tell them do be elegant, like in the old image AI days
"+good -bad"!)

For now I do still need my human brain to actually be able to make
sense of the stuff, and Claude is the only one that consistently
meets that requirement.

But I am hoping that one of these days, one of the Chinese labs
figures out the special sauce :)

[0] (For smallish edits, though, I am having a great time with
DeepSeek Flash. Practically unlimited AI on tap! How cool is that.)

yanis_t wrote 1 day ago:
I was wondering how does Anthropic and likes keep competitive when Opus
is ($5 / $25) 5x times more expensive compared to Kimi K2.6 ($0.7 /
$3.4) or other Chinese models, while being only marginally better.

My theory is that US enterprise just can't send data to Chinese and
that's understandable, but is that "the moat"?

selfawareMammal wrote 22 hours 12 min ago:
Performance. I pay for Opencode but none of the models give me Codex
performance, so I have to keep my 20â¬ subscription+ the Opencode
one

bensyverson wrote 22 hours 19 min ago:
Part of Anthropic's moat is Claude Cowork & Claude Code. They got
coders comfortable with CC and enterprise users comfortable with
Cowork, and both are creating stickiness.

The reality is that $20/$100/$200/mo feels reasonable to a lot of
people relative to the value they're getting out of Claude, and if
they switch to something else, there's a risk that it won't be as
good, and they'll have a new tool to learn.

It's not an insurmountable moat, but don't underestimate the user
experience. The iPod didn't win because it was the cheapest device or
the one with the most features.

michaelcampbell wrote 22 hours 20 min ago:
> while being only marginally better.

It's only marginally better in the things it's actually comparable
to. A\ models are MUCH better in many more things; eg: things
Kimi/etc. didn't distill.

For those things the difference is like a cliff.

tornikeo wrote 22 hours 13 min ago:
That's a baseless claim that borderline reads like shilling. Do you
have any proof of that you wrote there?

gruez wrote 1 day ago:
Your question relies on the premise that Chinese companies continue
releasing free models. What's "the moat" for them continuing to do
that?

LUmBULtERA wrote 1 day ago:
API token price is one thing, but subscriptions on Claude are a good
value. Weirdly everyone says that Claude subscriptions are
subsidized because of the API price, even though (1) no one actually
knows Claude's cost of inference, and (2) Chinese providers are also
able to provide cheap inference, so why do they think Claude can't?

I also wonder if Enterprises have deals for other API pricing that is
not posted publicly, so all we see is a high API sticker price.

mnicky wrote 20 hours 53 min ago:
> no one actually knows Claude's cost of inference

There were some rumors stating that their margin is around 70%. So
they could go much cheaper probably, talking inference only. The
other thing is R&D cost...

wuliwong wrote 22 hours 9 min ago:
I only have knowledge of one enterprise deal but there is no
discount. Which I found surprising.

smoe wrote 1 day ago:
I reckon right now the Enterprise concern is more FOMO around the AI
wave and how to retrain or replace up to hundreds of thousands of
employees. I don't think cost is the main concern right now.

But if AI doesn't lead quickly to vast large scale replacement of
workers as promised, I could definitely see the C-suits and their
gaggle of consultants starting to ask questions about token pricing.

efromvt wrote 1 day ago:
I think the perception is that it is not 'only marginally better';
whether or not you specifically agree that perceived quality gap lets
them differentiate on price.

I'd further say that there are probably enough rational actors
running evals out there that the marginally better is not pure vibes
for the cases where people are spending lots of money, but I only
have direct line of sight to some of those eval suites. Maybe
everyone is irrational and anthropic is exploiting that!

khuey wrote 1 day ago:
I think most people who've tried them both would tell you Anthropic's
models are more than marginally better than Kimi. Kimi and the other
open source models may score well on SWE-bench or whatever but the
gap is noticeable IMHO once you actually try to use them.

Bnjoroge wrote 1 day ago:
It depends on what your task is and how precise your prompts are.
Planning with fable or 4.8 and laying out the plan in step by step
process and coding with mimo v2.5 pro or dsv4pro or qwen 3.7 max
and doing a final review with 5.5 has worked really well for me for
infra stuff.

mnicky wrote 20 hours 50 min ago:
Coding with sufficiently precise plan takes almost all real work
from the implementator, doesn't it? So it's not a fair
comparison...

nullbio wrote 1 day ago:
I think none of them having a defacto and high quality English
focused cli is a big part of it. None of the Chinese models I've
tried have worked well in opensource cli's. Granted, I've only tried
a few, but still...

saratogacx wrote 19 hours 35 min ago:
I've been using charm's Crush with GLM for several months and it's
been working great. I've only seen it shift to non-english once
and it was already in a wonky state when it flipped.

Bnjoroge wrote 1 day ago:
huh? They all work great in omp/opencode unless you mean their own
native clis like kimi code

freigeist79 wrote 1 day ago:
i use github copilot cli + openrouter + qwen 3.7 max and it's
really much better than i expected (used to opus 4.7 at work)

DCKing wrote 1 day ago:
The moat right now is model performance and what that means for how
many tokens and additional time you spend.

I say this as a relatively frequent user of Kimi models and generally
a big fan. But on not-yet-gamed benchmarks like DeepSWE, Kimi K2.6 is
beaten soundly by Claude Sonnet 4.6 ($3 / $15) and even slightly by
GPT 5.4 Mini ($0.75 / $4.50).

There's no question Kimi models are very good for a lot of code
tasks. They're the best quality open weight model. But to get similar
overall outcomes as on Sonnet/Opus, on average you'll spend many more
tokens and will have to do more managing of the model. You shouldn't
look at price per token, you should look at how much you pay for the
entire process.

Bnjoroge wrote 1 day ago:
I personally dont put any weight to DeepSWE. Other than 5.5 being
directionally the best model, it gets the others pretty wrong in my
experience. FrontierCode from cognition looks interesting

esperent wrote 1 day ago:
I'm more interested in how much effort I have to put in, at least
while I'm paying in the range of current subscriptions (so
~â¬100-â¬200 a month or so). If the prices go up much more than
that I'll have to switch to caring more about token efficiency. But
at current pricing the bottleneck is my attention, not model
efficiency. As such, even a small improvement in model quality -
and hence, a decrease in how much attention I have to spend on it -
makes a big difference.

papersail wrote 1 day ago:
I'm not sure I would put too much weight on DeepSWE as a benchmark,
given that GPT-5.4-mini ended up close to Opus 4.6 there.

DCKing wrote 1 day ago:
Any benchmark is iffy and has weird results, but this is the best
we got at the moment. Most people working with Opus and Kimi
would likely tell you they're much further apart than the numbers
that were quoted for Kimi K2.6, and DeepSWE seems to capture that
gap better.

One major thing DeepSWE has going for it is that all other
benchmarks (including those quoted by MoonshotAI on this page)
don't: the other benchmarks that are completely gamed. The
benchmark answers are public and part of each model's training
data. This benchmark may still be iffy, but at least it's not
gamed.

WarmWash wrote 1 day ago:
Somehow the internet has also forgot that cheating to get ahead
in China is basically a norm and expected behavior.

DCKing wrote 1 day ago:
American labs also use gamed and cherry-picked benchmarks
extensively. Anthropic used them in their Fable announcement
and avoided DeepSWE because it doesn't beat GPT-5.5 in that
one. Google's numbers for Gemini 3.5 Flash recently did not
at all line up with people's subjective experience using
these models, and this also happened with Gemini 3.1 Pro
before it.

Everybody has incentives to manipulate benchmark results to
show their models in the best light.

re-thc wrote 1 day ago:
> My theory is that US enterprise just can't send data to Chinese

Lots of US providers are hosting these âopen sourceâ models so
doubt thatâs the problem.

yababa_y wrote 1 day ago:
I want Opus to be only marginally better, but I do mostly research
engineering and its ability to not fuck up my projects is absent.
Every time my credits lapse I let kimi and composer2.5 have some play
and itâs basically just an excuse for me to keep playing computer
because when the oai/ant credits refresh I always need to spend hours
recovering from the other models either misconceptions or boneheaded
eng practices. Even when I only let it touch my web gamesâ¦

greenavocado wrote 11 hours 39 min ago:
You have to revert to Opus 4.5 and 4.6. I bet you'll see a massive
improvement based on what you're describing

DIR <- back to front page