        _______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                              on Gopher (unofficial)
       
       
       COMMENT PAGE FOR:
  HTML   Show HN: Moonshine Open-Weights STT models – higher accuracy than WhisperLargev3
       
       
        Ross00781 wrote 4 hours 13 min ago:
        Open-weight STT models hitting production-grade accuracy is huge for
        privacy-sensitive deployments. Whisper was already impressive, but
        having competitive alternatives means we're not locked into a single
        model family. The real test will be multilingual performance and edge
        device efficiency—has anyone benchmarked this on M-series or Jetson?
       
        fudged71 wrote 4 hours 59 min ago:
        If it's using ONNX, can this be ported to Transformers.js?
       
        fittingopposite wrote 9 hours 49 min ago:
         Which programs support it with streaming? I'm currently using
         Spokenly and Parakeet but would like to transition to a model that
         streams instead of transcribing chunk-wise.
       
        regularfry wrote 11 hours 20 min ago:
        Oh this is fantastic. I'm most interested to see if this reaches down
        to the raspberry pi zero 2, because that's a whole new ballgame if it
        does.
       
        dSebastien wrote 11 hours 58 min ago:
        I've been using Moonshine since V1 and the results are really great.
        I'd say on par with Parakeet V3 while working really well with CPU
        only.
       
        T0mSIlver wrote 13 hours 11 min ago:
        Congrats on the results. The streaming aspect is what I find most
        exciting here.
        
        I built a macOS dictation app ( [1] ) on top of Voxtral Realtime, and
        the UX difference between streaming and offline STT is night and day.
        Words appearing while you're still talking completely changes the
        feedback loop. You catch errors in real time, you can adjust what
        you're saying mid-sentence, and the whole thing feels more natural.
        Going back to "record then wait" feels broken after that.
        
        Curious how Moonshine's streaming latency compares in practice. Do you
        have numbers on time-to-first-token for the streaming mode? And on the
        serving side, do any of the integration options expose an OpenAI
        Realtime-compatible WebSocket endpoint?
        
  HTML  [1]: https://github.com/T0mSIlver/localvoxtral
       
        sourcetms wrote 14 hours 53 min ago:
        I'm offering support for this in Resonant - Already set up and running
        this week.
        
        It's incredible for a live transcription stream - the latency is WOW.
        [1] For the open source folks, that's also set up in handy, I think.
        
  HTML  [1]: https://www.onresonant.com/
       
          admiralrohan wrote 13 hours 44 min ago:
           Is this an alternative to Wispr Flow?
       
        binome wrote 15 hours 30 min ago:
        I vibe-trained moonshine-tiny on amateur radio morse code last weekend,
        and was surprised at the ~2% CER I was seeing in evals and over the air
        performance was pretty acceptable for a couple hour run on a 4090.
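        
         For readers unfamiliar with the metric: character error rate (CER) is
         usually computed as the character-level edit distance between the
         reference and the model output, divided by the reference length. The
         sketch below is a minimal illustration of that definition, not the
         evaluation code used in the comment above; the function names are
         made up for this example.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref to hyp[:j]
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

         Under this definition, a ~2% CER means roughly one wrong character
         per fifty reference characters.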
       
        Ross00781 wrote 16 hours 15 min ago:
        The streaming architecture looks really promising for edge deployments.
        One thing I'm curious about: how does the caching mechanism handle
        multiple concurrent audio streams? For example, in a meeting
        transcription scenario with 4-5 speakers, would each stream maintain
        its own cache, or is there shared state that could create bottlenecks?
       
        dagss wrote 17 hours 7 min ago:
        Very exciting stuff!
        
         > hear about what people might build with it
        
        My startup is making software for firefighters to use during missions
        on tablets, excited to see (when I get the time) if we can use this as
        a keyboard alternative on the device. It's a use case where avoiding
        "clunky" is important and a perfect usecase for speech-to-text.
        
        Due to the sector being increasingly worried about "hybrid threats" we
        try to rely on the cloud as little as possible and run things either on
        device or with the possibility of being self-hosted/on-premise. I
        really like the direction your company is going in in this respect.
        
        We'd probably need custom training -- we need Norwegian, and there's
        some lingo, e.g., "bravo one two" should become "B-1.2". While that can
        perhaps also be done with simple post-processing rules, we would also
        probably want such examples in training for improved recognition? Have
        no VC funding, but looking forward to getting some income so that we
        can send some of it in your direction :)
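        
         The "simple post-processing rules" idea can be sketched as a regex
         pass over the transcript. This is only an illustration under stated
         assumptions: the lookup tables are hypothetical and use English
         words, whereas the use case above would need Norwegian vocabulary,
         and the "letter, digit, digit" pattern is guessed from the single
         "bravo one two" example.

```python
import re

# Hypothetical lookup tables; a real deployment would use its own vocabulary.
LETTERS = {"alfa": "A", "alpha": "A", "bravo": "B", "charlie": "C", "delta": "D"}
DIGITS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
          "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9"}

# Matches a spelled-out letter followed by two spelled-out digits.
_PATTERN = re.compile(
    r"\b(%s)\s+(%s)\s+(%s)\b"
    % ("|".join(LETTERS), "|".join(DIGITS), "|".join(DIGITS)),
    re.IGNORECASE,
)


def normalize_callsigns(text: str) -> str:
    """Rewrite e.g. 'bravo one two' as 'B-1.2'; leave other text untouched."""
    def repl(match: re.Match) -> str:
        letter, d1, d2 = (group.lower() for group in match.groups())
        return f"{LETTERS[letter]}-{DIGITS[d1]}.{DIGITS[d2]}"
    return _PATTERN.sub(repl, text)


print(normalize_callsigns("send bravo one two to the north side"))
```

         A rule like this only helps when the recognizer already gets the
         individual words right, which is why training examples may still
         matter.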
       
          steinvakt2 wrote 13 hours 25 min ago:
          Interesting. Can we get in touch? I just sold my webapp/saas where I
          used NB-Whisper to transcribe Norwegian media (podcast, radio, TV)
          and offer alerts and search by indexing it using elasticsearch.
          
          Edit: It was [1] (I shut down the backend server yesterday so the
          functionality is disabled).
          
  HTML    [1]: https://muninai.eu
       
            dagss wrote 12 hours 12 min ago:
            Sure! I didn't find your contact info but drop me an email at
            dag@syncmap.no.
       
        RobotToaster wrote 17 hours 12 min ago:
        > Models for other languages are released under the Moonshine Community
        License, which is a non-commercial license.
        
        Weird to only release English as open weights.
       
          riedel wrote 15 hours 40 min ago:
           I find it even weirder that people working with speech or text
           models don't name, in the first paragraph, the language the model
           is meant for (and I do not mean the programming language bindings).
           How many native English speakers are there? 5% of the world
           population?
       
            RobotToaster wrote 14 hours 55 min ago:
            Approximately yes, although another 15% are non-native English
            speakers.  Chinese is a close second for total speakers.
       
        raybb wrote 19 hours 3 min ago:
        fyi the typepad link in your bio is broken
       
        oezi wrote 19 hours 31 min ago:
         Do you also support timestamps for each detected word, or even down
         to characters?
       
        guerython wrote 19 hours 44 min ago:
        Nice work. One metric I’d really like to see for streaming use cases
        is partial stability, not just final WER.
        
        For voice agents, the painful failure mode is partials getting
        rewritten every few hundred ms. If you can share it, metrics like
        median first-token latency, real-time factor, and "% partial tokens
        revised after 1s / 3s" on noisy far-field audio would make comparisons
        much more actionable.
        
        If those numbers look good, this seems very promising for local
        assistant pipelines.
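        
         One possible way to pin down "% partial tokens revised after N
         seconds" is sketched below. The definition is an assumption made for
         this example, not an established standard: a token position counts
         as revised if its text still changed more than N seconds after the
         position first appeared in a partial result. Positions that later
         disappear are not counted as changes here.

```python
from typing import Dict, List, Tuple

Partial = Tuple[float, List[str]]  # (timestamp in seconds, tokens so far)


def revised_after(partials: List[Partial], delay_s: float) -> float:
    """Fraction of token positions that changed later than delay_s seconds
    after they were first emitted."""
    first_seen: Dict[int, float] = {}
    last_change: Dict[int, float] = {}
    current: Dict[int, str] = {}
    for t, tokens in partials:
        for pos, tok in enumerate(tokens):
            if pos not in first_seen:
                first_seen[pos] = t
                last_change[pos] = t
                current[pos] = tok
            elif current[pos] != tok:
                last_change[pos] = t  # the token at this position was rewritten
                current[pos] = tok
    if not first_seen:
        return 0.0
    late = sum(
        1 for pos in first_seen
        if last_change[pos] - first_seen[pos] > delay_s
    )
    return late / len(first_seen)


# Made-up stream of partial hypotheses: "of" is corrected to "off" after 1.5 s.
example = [
    (0.0, ["turn"]),
    (0.5, ["turn", "of"]),
    (2.0, ["turn", "off", "the"]),
    (4.5, ["turn", "off", "the", "lights"]),
]
print(revised_after(example, 1.0), revised_after(example, 3.0))
```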
       
          regularfry wrote 9 hours 9 min ago:
          Tangentially, have you got any idea what the equivalent "partial
          tokens revised" rate for humans is?  I know I've consciously
          experienced backtracking and re-interpreting words before, and
          presumably it happens subconsciously all the time.  But that means
          there's a bound on how low it's reasonable to expect that rate to be,
          and I don't have an intuition for what it is.
       
        heftykoo wrote 19 hours 57 min ago:
        Claiming higher accuracy than Whisper Large v3 is a bold opening move.
        Does your evaluation account for Whisper's notorious hallucination
        loops during silences (the classic 'Thank you for watching!'), or is
        this purely based on WER on clean datasets? Also, what's the VRAM
        footprint for edge deployments? If it fits on a standard 8GB Mac
        without quantization tricks, this is huge.
       
        starkparker wrote 20 hours 13 min ago:
        Implemented this to transcribe voice chat in a project and the
        streaming accuracy in English on this was unusable, even with the
        medium streaming model.
       
        francislavoie wrote 21 hours 19 min ago:
        I've helped many Twitch streamers set up [1] to plug transcription &
        translation into their streams, mainly for German audio to English
        subtitles.
        
        I'd love a faster and more accurate option than Whisper, but streamers
        need something off-the-shelf they can install in their pipeline, like
        an OBS plugin which can just grab the audio from their OBS audio
        sources.
        
        I see a couple obvious problems: this doesn't seem to support
        translation which is unfortunate, that's pretty key for this usecase.
        Also it only supports one language at a time, which is problematic with
        how streamers will frequently code-switch while talking to their chat
        in different languages or on Discord with their gameplay partners.
        Maybe such a plugin would be able to detect which language is spoken
        and route to one or the other model as needed?
        
  HTML  [1]: https://github.com/royshil/obs-localvocal
       
        saltwounds wrote 21 hours 30 min ago:
        Streaming transcription is crazy fast on an M1. Would be great to use
        this as a local option versus Wispr Flow.
       
        fareesh wrote 22 hours 16 min ago:
         Accuracy is often presumed to mean English, which is fine, but
         "higher" is a vague thing to say: does it mean higher in English
         only? Higher in some subset of languages? Which ones?
        
        The minimum useful data for this stuff is a small table of language |
        WER for dataset
       
        999900000999 wrote 22 hours 18 min ago:
         Very cool. Any way to run this in WebAssembly? I have a project in
         mind.
       
        alexnewman wrote 22 hours 19 min ago:
        If only it did Doric
       
        nmstoker wrote 22 hours 20 min ago:
        Any plans regarding JavaScript support in the browser?
        
        There was an issue with a demo but it's missing now. I can't recall for
        sure but I think I got it working locally myself too but then found it
        broke unexpectedly and I didn't manage to find out why.
       
        Karrot_Kream wrote 23 hours 2 min ago:
        According to the OpenASR Leaderboard [1], looks like Parakeet V2/V3 and
        Canary-Qwen (a Qwen finetune) handily beat Moonshine. All 3 models are
        open, but Parakeet is the smallest of the 3. I use Parakeet V3 with
        Handy and it works great locally for me.
        
  HTML  [1]: https://huggingface.co/spaces/hf-audio/open_asr_leaderboard
       
          Imustaskforhelp wrote 9 hours 28 min ago:
          To this comment and all the other comments talking about handy below
          this comment. I tried handy right now and it's super amazing. I'm
          speaking this from Handy. This is so cool, man.
          
          And handy even takes care of all the punctuation, which is really
          nice.
          
          Thanks a lot for suggesting it to me. I actually wanted something
          like this, and I was using something like Google Docs, and it
          required me to use Chrome to get the speech to text version, and I
          actually ended up using Orion for that because Orion can actually
          work as a Chrome for some reason while still having both Firefox and
          Chrome extension support. So and I had it installed, but yeah.
          
          This is really amazing and actually a sort of lifesaver actually, so
          thanks a lot, man.
          
          Now I can actually just speak and this can convert this to text
          without having to go through any non-local model or Google Docs or
          whatever anything else.
          
          Why is this so good man? It's so good
          
          man, I actually now am thinking that I had like fully maxed out my
          typing speed to like hundred-120. But like this can actually write it
          faster. you know it's pretty amazing actually.
          
          Have a nice day, or as I abbreviate it, HAND, smiley face. :D
       
          d4rkp4ttern wrote 10 hours 25 min ago:
           Was a big fan of Handy until I found Hex, which, incredibly, has even
           faster transcription (with Parakeet V3). It's macOS only:
          
  HTML    [1]: https://github.com/kitlangton/Hex
       
            Imustaskforhelp wrote 9 hours 55 min ago:
            I tried this out but the brew command errors out saying it only
            works on macOS versions older than Sequoia.
            
            That's unfortunate. I think I can update my version but I have
            heard some bad things about performance from the newer update from
            my elder brother.
       
              ValentineC wrote 4 hours 55 min ago:
              > I tried this out but the brew command errors out saying it only
              works on macOS versions older than Sequoia.
              
              Newer than Sequoia, you mean?
              
              The brew recipe [1] says macOS >= 15.
              
              Anyway, I'm on Sequoia — it's mostly better than Ventura, which
              was what my M2 MacBook Pro came with. I'm holding off upgrading
              to Tahoe (macOS 26), hoping they fix liquid glAss.
              
  HTML        [1]: https://formulae.brew.sh/cask/kitlangton-hex
       
              d4rkp4ttern wrote 9 hours 6 min ago:
              works fine on my MacOS w Tahoe
       
          kardaj wrote 12 hours 21 min ago:
          I'm building a local-first transcription iOS app and have been on
          Whisper Medium, switching to Parakeet V3 based on this.
          
          One note for anyone using Handy with codex-cli on macOS: the default
          "Option + Space" shortcut inserts spaces mid-speech. "Left Ctrl + Fn"
          works cleanly instead. I'm curious to know which shortcuts you're
          using.
       
            bn-usd-mistake wrote 11 hours 55 min ago:
            I am looking for such an app. Main use case is transcribing voice
            notes received on Signal while preserving privacy. Please post when
            you launch :)
       
          agentifysh wrote 18 hours 43 min ago:
           hmmm looks like AssemblyAI is still unbeatable here in terms of
           cost/performance unless I'm mistaken
           
           edit: holy shit Parakeet is good.... Moonshine is impressive too and
           it has half the params
          
          Now if only there was something just as quick as Parakeet v3 for TTS
          ! Then I can talk to codex all day long!!!
       
            fittingopposite wrote 9 hours 55 min ago:
            Also running parakeet on my phone with [1] Very lightweight and
            good quality
            
  HTML      [1]: https://github.com/notune/android_transcribe_app
       
              agentifysh wrote 1 hour 46 min ago:
              This is actually pretty impressive. What kinda phone are you
               using? Are you noticing any battery drain or heat? Do you think
               it's possible to get this working with Flutter on iOS?
       
                fittingopposite wrote 47 min ago:
                2-3 years old Android flagship phone with 8 GB RAM. When I
                looked for an app for parakeet, I think I also came across iOS
                apps. Don't recall it since I use Android. 
                Seems light on the phone/battery. Don't observe any drain but I
                also only record shorter transcripts at once. 
                Side note: Parakeet is actually pretty nice to do meetings with
                oneself. Did that on a computer while driving for an hour
                (split in several transcript chunks). Processed the raw meeting
                notes afterwards with an LLM. Effective use of the time in the
                car...
       
                  agentifysh wrote 38 min ago:
                  Thank you for sharing ! 
                  What about the quality of the transcripts? Is it able to do
                  live streaming?
       
                    fittingopposite wrote 30 min ago:
                     Unfortunately, Parakeet doesn't support streaming like
                     Moonshine does (as far as I know). It would be perfect to
                     have something the size of Parakeet that supports
                     streaming. Still hope Nvidia releases a V4 with that
                     feature :)
                    Otherwise, I think STT is basically a solved problem
                    running locally on edge devices.
       
            Dayshine wrote 15 hours 45 min ago:
            What's wrong with piper?
       
            remuskaos wrote 18 hours 10 min ago:
            Parakeet doesn't require a GPU. I'm handily running it on my Ubuntu
            Linux laptop.
       
              namibj wrote 14 hours 43 min ago:
              I'm looking to switch from feeding the default android "recorder"
              app's .WAV into Gemini 3 Pro (via the app) with (usually just) a
              `Transcribe this please:` prompt; content is usually German voice
              instructions/explanation for how to do/approach some sysadmin
              stuff; there does tend to be some amount of interjecting
              (primarily for clarifications(-posing/-requesting)) by me to
              resolve ambiguity as early as possible/practical.
              
              If e.g. parakeet can be run on my phone in real time showing the
              transcript live:
              
              - with latency low enough to be "comfortable enough" for the
              instructor to keep an eye on and approve the transcribed
              instructions
              
               [not necessarily every word of the transcript, i.e., a commanded
               "edit" doesn't need to be applied in the outcome as long as its
              nature is otherwise clear enough to not add meaningful amounts of
              ambiguity to the final "written" instructions]
              
              by glancing at the screen while dictating the explanation (and
              blurting out any transcription complaints as soon as that's
              possible without breaking one's own string-of-thought or spoken
              grammar too much)
              
              , I'd very happily switch to that approach instead of what I was
              doing.
              
              Bonus if there's a no-bulky-or-expensive-hardware way to
              accommodate us both speaking over each other so I won't have to
              _interrupt_ his speaking just to put a clarifying comment (on
              what he just said) in the transcript for him to see and sign off,
               which at least "only" briefly interrupts his thoughts right
              while he actually reads my transcribed words (he doesn't have to
              hear them, and it's better if he won't; I can probably get him to
              put on earmuffs to not hear me louder than he hears his thoughts,
              and a sufficiently-smoothed SNR meter for specifically his voice
               should take care of him regulating his volume while the earmuffs
              mute it and I occasionally talk over him)...
       
              agentifysh wrote 18 hours 7 min ago:
              you are right i just downloaded it on handy and its working i
              can't believe it
              
               i was using AssemblyAI but this is fast and accurate and offline
              wtf!
       
          tuananh wrote 19 hours 32 min ago:
          Handy is amazing. Super quality app.
       
            agentifysh wrote 17 hours 26 min ago:
            It really is. It's kinda ridiculous that it's free.
       
              tuananh wrote 14 hours 22 min ago:
               I'm quite surprised to see that level of polish from an
              open-source project.
       
              alfiedotwtf wrote 14 hours 55 min ago:
               Are voice recordings or transcripts sent back to their servers?
               If so, you may be the product
       
                yorwba wrote 14 hours 42 min ago:
                No, it's just somebody's open source project:
                
  HTML          [1]: https://github.com/cjpais/handy
       
          tomr75 wrote 20 hours 36 min ago:
          why V3 over V2 (assuming English only)?
       
          syntaxing wrote 20 hours 52 min ago:
           How much VRAM does Parakeet take for you? For some reason it takes
           4GB+ for me using the ONNX version even though it's 600M parameters
       
          theologic wrote 21 hours 10 min ago:
          By the way, I've been using a Whisper model, specifically WhisperX,
          to do all my work, and for whatever reason I just simply was not
          familiar with the Handy app. I've now downloaded and used it, and
          what a great suggestion. Thank you for putting it here, along with
          the direct link to the leaderboard.
          
          I can tell that this is now definitely going to be my go-to model and
          app on all my clients.
       
            jasonjmcghee wrote 18 hours 10 min ago:
            I have to ask- I see this handy app running on Mac and you hold a
            key down and then it doesn't show until seemingly a while later.
            
            The one built in is much faster, and you only have to toggle it on.
            
            Are these so much more accurate? I definitely have to correct
            stuff, but pretty good experience.
            
            Also use speech to text on my iphone which seems to be the same
            accuracy.
       
          reitzensteinm wrote 21 hours 41 min ago:
          Parakeet V3 is over twice the parameter count of Moonshine Medium
          (600m vs 245m), so it's not an apples to apples comparison.
          
          I'm actually a little surprised they haven't added model size to that
          chart.
       
            bytesandbits wrote 16 hours 24 min ago:
            parakeet v3 has a much better RTFx than moonshine, it's not just
            about parameter numbers. Runs faster.
            
  HTML      [1]: https://huggingface.co/spaces/hf-audio/open_asr_leaderboar...
       
              SyneRyder wrote 4 hours 8 min ago:
              That was my experience when I tried Moonshine against Parakeet v3
              via Handy. Moonshine was noticeably slower on my 2018-era Intel
              i7 PC, and didn't seem as accurate either. I'm glad it exists,
              and I like the smaller size on disk (and presumably RAM too). But
              for my purposes with Handy I think I need the extra speed and
              accuracy Parakeet v3 is giving me.
       
              regularfry wrote 7 hours 33 min ago:
              It is about the parameter numbers if what you care about is edge
              devices with limited RAM.  Beyond a certain size your model just
              doesn't fit, it doesn't matter how good it is - you still can't
              run it.
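               
               A back-of-the-envelope version of that argument, using the
               parameter counts quoted above (600M and 245M). The precisions
               are assumptions for illustration; actual memory use also
               includes activations, caches and runtime overhead, so measured
               numbers will be higher than these weight-only estimates.

```python
def weight_memory_mb(params: int, bytes_per_param: int) -> float:
    """Approximate memory for the weights alone, in megabytes (10**6 bytes)."""
    return params * bytes_per_param / 1e6


# Parameter counts as quoted in the thread; precisions are illustrative.
MODELS = {"Parakeet V3": 600_000_000, "Moonshine Medium": 245_000_000}
PRECISIONS = {"fp32": 4, "fp16": 2, "int8": 1}

for name, params in MODELS.items():
    for precision, nbytes in PRECISIONS.items():
        size = weight_memory_mb(params, nbytes)
        print(f"{name:17s} {precision}: ~{size:.0f} MB")
```

               On a board with 512 MB of RAM, such as the Pi Zero 2 W, the
               600M-parameter weights alone (about 600 MB at int8) would not
               fit, while the 245M-parameter weights would be roughly 245 MB
               at the same precision.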
       
            agentifysh wrote 17 hours 24 min ago:
            So I'm kinda new to this whole parakeet and moonshine stuff, and
            I'm able to run parakeet on a low end CPU without issues, so I'm
            curious as to how much that extra savings on parameters is actually
            gonna translate.
            
            Oh and I type this in handy with just my voice and parakeet version
            three, which is absolutely crazy.
       
        asqueella wrote 23 hours 6 min ago:
        For those wondering about the language support, currently English,
        Arabic, Japanese, Korean, Mandarin, Spanish, Ukrainian, Vietnamese are
        available (most in Base size = 58M params)
       
        sroussey wrote 23 hours 10 min ago:
        onnx models for browser possible?
       
        pzo wrote 23 hours 16 min ago:
         Haven't tested it yet, but I'm wondering how it will behave when
         talking with a lot of IT jargon and tech acronyms. For that reason I
         mostly had to run an LLM after STT, but that was slowing down Parakeet
         inference. Otherwise it sometimes had problems properly detecting
         terms such as CoreML, int8, fp16, half float, ARKit, AVFoundation,
         ONNX, etc.
       
        g-mork wrote 23 hours 27 min ago:
        How does this compare to Parakeet, which runs wonderfully on CPU?
       
        ac29 wrote 23 hours 31 min ago:
        No idea why 'sudo pip install --break-system-packages moonshine-voice'
        is the recommended way to install on raspi?
        
        The authors do acknowledge this though and give a slightly too complex
         way to do this with uv in an example project (FYI, you don't need to
         source anything if you use uv run)
       
        armcat wrote 23 hours 33 min ago:
        This is awesome, well done guys, I’m gonna try it as my ASR component
        on the local voice assistant I’ve been building [1] . The tiny
        streaming latencies you show look insane
        
  HTML  [1]: https://github.com/acatovic/ova
       
        lostmsu wrote 1 day ago:
        How does it compare to Microsoft VibeVoice ASR [1] ?
        
  HTML  [1]: https://news.ycombinator.com/item?id=46732776
       
        cyanydeez wrote 1 day ago:
        No LICENSE no go
       
          altruios wrote 1 day ago:
          reading through readme.md
          "License
          This code, apart from the source in core/third-party, is licensed
          under the MIT License, see LICENSE in this repository.
          
          The English-language models are also released under the MIT License.
          Models for other languages are released under the Moonshine Community
          License, which is a non-commercial license.
          
          The code in core/third-party is licensed according to the terms of
          the open source projects it originates from, with details in a
          LICENSE file in each subfolder."
       
          bangaladore wrote 1 day ago:
          There is a license blurb in the readme.
          
          > This code, apart from the source in core/third-party, is licensed
          under the MIT License, see LICENSE in this repository.
          
          > The English-language models are also released under the MIT
          License. Models for other languages are released under the Moonshine
          Community License, which is a non-commercial license.
          
          > The code in core/third-party is licensed according to the terms of
          the open source projects it originates from, with details in a
          LICENSE file in each subfolder.
       
            mkl wrote 20 hours 51 min ago:
             The LICENSE file it refers to is missing.  There's one in the
            python folder, but not for the rest of the code.
       
              namibj wrote 15 hours 23 min ago:
              IANAL.
              
              Presuming (I haven't checked myself) the git author information
              supports this, it should be fine to treat this as licensing the
              code it specifies under MIT; based on that license name being (to
              my understanding) unambiguous and license application being based
               on contract law and contract law basically having at its very
              core the principle of "meeting of the minds" along with wilful
              infringement being really really hard to even argue for if the
              only thing that's separating it from being 100% clearly licensed
              in all proper ways being not copying in an MIT `LICENSE` template
              with date and author name pasted into it.
       
       