codevoid.de/1/hn/comments_48498421.gph

        _______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                             on Gopher (inofficial)
       
       
       COMMENT PAGE FOR:
  HTML   Removing 'um' from a recording is harder than it sounds
       
       
        ternaryoperator wrote 10 hours 14 min ago:
        Take this difficulty and make the desired sound piano and put the whole
        thing into 1960s technology and you can see why recording studios were
        never able to remove Glenn Gould's humming from his recordings.
       
        t0bia_s wrote 11 hours 0 min ago:
        Great approach. Please, do other languages. I would appreciate Czech!
       
        neves wrote 21 hours 37 min ago:
        Does it work just for English?
       
        BugsJustFindMe wrote 1 day ago:
        I find the crusade against 'um' to be annoyingly misplaced. It
        frustrates the shit out of me that iOS speech-to-text dictation refuses
        to write my 'um's and 'uh's with no way to change that behavior. If a
        person asks to remove them, fine, but don't fucking alter my speech
        patterns when I'm sending messages to people.
       
        alyssamazz wrote 1 day ago:
        Doug is a friend, but I actually use this so figured I’d chime in.
        
        I make online course content and used to lose close to a full day
        cutting filler out of every hour or so of recording. This gets me maybe
        70% of that time back. On whether you should even cut them, I don’t
        think it’s clear cut. With non-native English speakers especially,
        the um is usually a real pause before they say something that matters,
        and cutting it makes them choppy or changes what they meant. Most of
        the time though it’s just padding. That matters more for courses than
        it sounds like it should, because a common complaint I get is how long
        courses are, so any dead air I can pull out is time I give back to
        people.
        
        Anyway this is in my workflow now. Still messing with the settings to
        get it right, but I like to mess with my stack and this focuses on this
        step for me.
       
        josefritzishere wrote 1 day ago:
        I used to do this with a razor and an aluminum cutting block.
       
        AaronAPU wrote 1 day ago:
        I accidentally learned how disgusting people’s mouth noises are while
        developing an audio leveler. The lip smacking and snot noises between
        sentences are the stuff of nightmares if you don’t do anything to
        exclude them from amplification.
        
        The best approach I could come up with was to maintain a sliding
        histogram of loudness and exclude the low-level outliers.
        
        You can do more in the noise/frequency domain but those were outside
        the scope of this tool.
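
        A minimal sketch of the sliding-histogram idea described above, not
        the commenter's actual code. It assumes per-frame loudness values in
        dB are already available; the function name, window size, bin width
        and drop fraction are all hypothetical choices for illustration.

```python
from collections import Counter, deque
import math


def low_level_outliers(levels_db, window=200, bin_db=1.0, drop_fraction=0.2):
    """Flag frames that sit in the quiet tail of a sliding loudness histogram.

    levels_db: per-frame loudness in dB (assumed precomputed).
    Returns one bool per frame; True means "exclude from gain calculation".
    """
    recent = deque()   # histogram bins of the frames currently in the window
    hist = Counter()   # bin index -> number of frames in the window
    flags = []
    for level in levels_db:
        b = math.floor(level / bin_db)
        recent.append(b)
        hist[b] += 1
        if len(recent) > window:
            old = recent.popleft()
            hist[old] -= 1
            if hist[old] == 0:
                del hist[old]

        # Walk the histogram from the quietest bin upward. Bins stay in the
        # "outlier" tail as long as they hold at most drop_fraction of frames.
        budget = drop_fraction * len(recent)
        cumulative = 0
        threshold = None
        for bin_index in sorted(hist):
            cumulative += hist[bin_index]
            if cumulative > budget:
                break
            threshold = bin_index
        flags.append(threshold is not None and b <= threshold)
    return flags
```

        A leveler built this way would skip the flagged frames when computing
        gain, so a short quiet lip smack between sentences does not get
        boosted to speech level.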
       
          stavros wrote 1 day ago:
          Misophonia sufferers unite!
       
        ralferoo wrote 1 day ago:
        The title of the article is wrong. It's not that removing 'um' from a
        recording is hard, it's that not removing everything else in the
        recording while doing so is.
       
          dougcalobrisi wrote 1 day ago:
          You’re right. I may borrow that if I do a follow up at some point
          :-)
       
        __mharrison__ wrote 1 day ago:
        Interesting. I make a bunch of video content and I went another way.
        
        When I want to redo a section, I say it again. But I have a magic word,
        "mistake", that I insert before. Previously I transcribed and
        just removed the sentence (or section) before mistake.
        
        I recently automated this and used AI to determine what to cut and to
        drive davinci resolve to make the edit. Saves a lot of time in my
        workflow.
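
        A minimal sketch of the "magic word" pass described above, assuming
        the transcript is already split into (start, end, text) sentences.
        The function name and data shape are hypothetical, and it assumes one
        marker per flubbed take; it does not drive DaVinci Resolve.

```python
def drop_retakes(sentences, marker="mistake"):
    """Return the sentences to keep, dropping each flubbed take.

    sentences: list of (start_seconds, end_seconds, text) tuples.
    A sentence consisting only of the marker word removes itself and the
    sentence kept immediately before it.
    """
    kept = []
    for start, end, text in sentences:
        # Normalise "Mistake." / "mistake!" down to the bare marker word.
        if text.strip(" .,!?\"'").lower() == marker:
            if kept:
                kept.pop()  # discard the take that preceded the marker
            continue        # the marker itself is never kept
        kept.append((start, end, text))
    return kept
```

        The timestamps on the surviving sentences are what an editor (or a
        script) would then use to decide where to cut.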
       
        fragmede wrote 1 day ago:
        ...
        
        No, you run an entire second pass LLM over the output of Whisper. "no
        uhhh three no four." should just output four the numeral not even
        f.o.u.r.
        
        Hi, my name is fragmede. Judging by the date on my computer it's been
        four months since it's since I've t touched the transcription directory
        on computer and tried to improve on the state of wisprflow. Mines
        pretty good but it just doesn't... ah you can't drag me back in.
       
        slhck wrote 1 day ago:
        > Two small fixes, in order. First, each cut endpoint is allowed to
        slide a tiny bit (up to 60ms) to land in the quietest spot nearby. If
        there’s a momentary lull in the audio just before or after the
        original cut point, slide there. The slide is bounded so it can’t
        cross into a neighboring word, otherwise you’d chew off real speech.
        Second, from that quiet spot, the endpoint snaps to the nearest moment
        when the waveform is exactly crossing zero.
        
        Oh, Claudish striking again.
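
        The two fixes in the quoted passage can be sketched roughly as
        follows. This is an illustration under assumptions, not erm's code:
        the function name, the short energy window used to define "quietest",
        and the sample-index bounds lo/hi for the neighboring words are all
        hypothetical.

```python
def refine_cut(samples, sample_rate, cut, lo, hi, max_slide_ms=60, win_ms=5):
    """Slide a cut point to a nearby quiet spot, then snap to a zero crossing.

    samples: mono audio as floats; cut, lo, hi: sample indices, where lo and
    hi are the limits set by the neighboring words (inclusive).
    """
    max_slide = int(sample_rate * max_slide_ms / 1000)
    win = max(1, int(sample_rate * win_ms / 1000))
    start = max(lo, cut - max_slide)
    stop = min(hi, cut + max_slide)

    def energy(i):
        # Mean squared amplitude in a small window centred on i.
        a = max(0, i - win)
        b = min(len(samples), i + win + 1)
        return sum(s * s for s in samples[a:b]) / (b - a)

    # Step 1: bounded slide to the quietest spot within reach.
    quiet = min(range(start, stop + 1), key=energy)

    # Step 2: search outward for the nearest sign change, staying in bounds.
    for offset in range(hi - lo + 1):
        for i in (quiet - offset, quiet + offset):
            if lo <= i <= hi and i + 1 < len(samples):
                if (samples[i] < 0) != (samples[i + 1] < 0):
                    return i
    return quiet  # no zero crossing found inside the bounds
```

        Cutting where the waveform changes sign keeps the two joined pieces
        from meeting at a large amplitude jump, which is what tends to be
        heard as a click.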
       
          Retr0id wrote 1 day ago:
          I call it claudeslop but I suppose claudish is slightly less
          inflammatory.
       
        ghaff wrote 1 day ago:
        When I was doing podcasts regularly, it made me acutely aware of
        various people's speech mannerisms. (Somewhat similarly, recording a
        lot of videos during COVID made me very aware of a variety of my own
        mannerisms--especially overactive hand motions.)
       
        1317 wrote 1 day ago:
        Looks interesting, would be a nicer article though if there was a demo
        with before/after to show the results, and why the previous ideas
        didn't work
        
        for something dealing with audio you do need to play the audio really
       
        boodleboodle wrote 1 day ago:
        This resonates with our crusade to eradicate Ums once and for all.
        
        - Ums Considered Harmful: [1]
        - Related paper: [2]
        
  HTML  [1]: https://hamanlp.org/research/ums/
  HTML  [2]: https://hamanlp.org/SIGBOVIK_2026.pdf
       
        HeavyStorm wrote 1 day ago:
        What a very cool utility.
       
        monster_truck wrote 1 day ago:
        It takes about 30 seconds in Audacity and will give an infinitely
        better result. Also works on any other sound
       
          alyssamazz wrote 1 day ago:
          I’ve done this in Audacity many times, it doesn’t work as
          well. All the umm patterns don’t match exactly. I’ve had better
          overall results with erm. I haven’t used Audacity in years for
          this, maybe they improved the feature.
       
          HeavyStorm wrote 1 day ago:
          Doesn't sound true. Unless audacity already has a tool for this
          exactly... How would you do it in 30 seconds or less?
       
            ghaff wrote 1 day ago:
            It doesn't and ums aren't the only consistent tic you often want to
            clean up--"you know," long pauses, etc.
       
        rbbydotdev wrote 1 day ago:
        I wonder if with enough input data and transcription you could
        “fingerprint” where a speaker personality has habits of
        interjecting “ums” leading to more hardy analysis. Novel approach,
        but gets me thinking
       
        chrismorgan wrote 1 day ago:
        I think the “What it won’t touch” section shows why the entire
        concept is unsound. Here it is with a different first sentence, and
        (other than the third sentence no longer matching erm’s reality)
        it’s perfectly coherent:
        
        > It leaves um, uh, er and elongated versions (ummmm, uhhhhh) alone.
        Those sound like fillers but they’re doing real work in the sentence,
        and cutting them automatically would change what someone said. The rule
        erm follows: only remove things that are sound, not language.
        
        > It also doesn’t touch repeated words, false starts, or long
        thinking pauses. Those aren’t noise on top of the speech; they are
        the speech, just messier than the speaker would like. Cleaning them up
        is an editorial decision about which take to keep, and erm doesn’t
        have an opinion about that.
        
        Think about it. Cleaning these
        things-that-can-be-just-sounds-but-can-also-very-much-be-load-bearing
        up is an editorial decision. At the very least, you need to judge based
        on the surrounding content whether the removal of an um would change
        the meaning at all; and I don’t think text alone is adequate for
        that.
       
          thaumasiotes wrote 1 day ago:
          >> It leaves um, uh, er and elongated versions (ummmm, uhhhhh) alone.
          
          Something's already gone wrong here. Uh and er refer to the same
          sound. Uh is the American spelling. Er is British; to them a
          following "r" like that is just a kind of vowel.
       
            Izkata wrote 17 hours 57 min ago:
            "Er" is definitely distinct as an interjection, it's usually used
            instead of "um" to indicate a correction and does sound different.
       
            Silamoth wrote 1 day ago:
            Regardless of American vs. British spellings, those are not the
            same sound. Some British people may pronounce them the same.
            Americans definitely pronounce them differently, though. For
            instance, the word “water” has a hard “r” sound at the end;
            Americans don’t pronounce it “watuh” like some British people
            do.
       
              thaumasiotes wrote 21 hours 50 min ago:
              They are two names for the same sound. There is no particle "er"
              in American English. There could be one, theoretically, but there
              isn't.
       
            chrismorgan wrote 1 day ago:
            Um… no. Quite different vowel sounds.
            
            (Also, in case it wasn’t clear: I was quoting from the start of
            the article in that sentence.)
       
              thaumasiotes wrote 1 day ago:
              They're quite different vowel sounds in the same sense that
              "back" and "back" use "quite different vowel sounds" when
              pronounced by American vs British speakers.
              
              But not in any other sense.
              
              > in case it wasn’t clear: I was quoting from the start of the
              article in that sentence.
              
              You don't seem to be quoting from the article at all, actually.
              You've combined two different sentences in a way that grossly
              misrepresents what the article says. But that's not really
              relevant to the point here.
       
        cyberax wrote 1 day ago:
        BTW, any recommendations for AI tools that remove the laugh track? I
        don't even mind the awkward acting without the missing laughter.
       
        lavaman131 wrote 1 day ago:
        This is great, I've tried out automated podcast editing tools before
        and they cut too aggressively in my experience. What are you thinking
        about doing next with this now that you've gotten the alignment
        snapping working cleanly for 'um' and 'ah', are you thinking of
        expanding the tool?
       
        npodbielski wrote 1 day ago:
        I think it is harder to remove those from your own speech. I have been
        doing that for a few months now and I still fall back into it when I am
        in a hurry or stressed.
       
          ifwinterco wrote 1 day ago:
          In my experience native English speakers are particularly bad,
          generally when speaking a second language people are less likely to
          add random filler words.
          
          Also the type of filler word for some reason is often different
          between UK and US: British people tend to be "umm"-ers and Americans
          are more likely to add "you know" (although "umm" is also common).
          
          Once you notice it it's impossible to ignore and many, many native
          English speakers are actually terrible at speaking and add filler
          words to the point where it's very distracting
       
        supernes wrote 1 day ago:
        This approach seems kind of backwards to me. Why try to detect
        everything except the thing you're trying to remove instead of either
        sampling a few uhs and ums and treating them as noise to be silenced
        (with a sharp crossfade to the noise floor that doesn't interrupt
        speech flow) or finetuning a model to detect them specifically for full
        automation?
       
          pdpi wrote 1 day ago:
          > instead of either sampling a few uhs and ums and treating them as
          noise to be silenced
          
          If you're not paying ttention, ctting out specific sounds can easily
          cause more trouble. I for one would be quite pset if I couldn't hear
          the pire's reasoning for calling a foul.
       
        alok-g wrote 1 day ago:
        I would love to see support for videos and removal of custom filler
        words (I say 'basically' and 'like' a lot and have so far failed to
        improve myself on this).
       
          dougcalobrisi wrote 1 day ago:
          It does take videos (like mp4) as input but will only output the
          stripped audio track.
          
          I might add the custom filler word functionality and/or perhaps just
          make the filler word list configurable.
       
        wzdd wrote 1 day ago:
        It’s a nice engineering approach, but I’m interested in the
        motivation. Um and ah are distracting in a transcript, where you can
        naturally pause to take in information; in speech however it can serve
        as a focusing point to indicate the next part is important. See [1] for
        example. The weirdly obsessive zeal that orgs like Toastmasters have
        about eliminating them is weird.
        
        Disfluencies aren’t necessarily bad even if the word starts with
        “dis”!
        
  HTML  [1]: https://medium.com/better-humans/dont-worry-about-saying-um-ef...
       
          bongoman42 wrote 1 day ago:
          A part of saying something like um is to continue your speech and
          prevent the other person or someone else in the group from
          interjecting.
       
          goalieca wrote 1 day ago:
          The younger generation seems to love listening at 1.2x or faster. I
            think it’s a preference for a fast information dopamine hit. I may
            argue it’s even a shallow approach that prefers against pausing and
          time for careful reflection. Meanwhile, book reading is at an all
          time low seemingly because no one has a preference or patience for
          careful study and reflection.
       
            ordu wrote 1 day ago:
            > The younger generation seems to love listening at 1.2x or faster.
            
            I do not belong to the younger generation. I refused to watch
            videos because it takes too long comparing with reading. But now
            I'm watching them at 2x. You can watch a 40 min video in 20
            minutes. I'd like to compress it further to 10 min or so, but 3x is
            a paid option on youtube and I'm not sure I could digest English
            (which is a foreign language to me) at 3x.
            
            > Meanwhile, book reading is at an all time low seemingly because
            no one has a preference or patience for careful study and
            reflection.
            
            Oh, I read books too. But the content is different. You can't read
            some books at 2x. You can't listen to it on such a speed. In any
            book I think there are stretches of text you can consume at any
            speed, but sometimes you hit a dense packed information you need to
            think through. It happens with videos too. Like, try to watch
            Veritasium at 2x, you'll be forced to slow things down at least
            sometimes, because to get the message you need to learn how to
            think at 2x speed too, not just to listen.
            
            In any case the most of videos dilute their message over tens of
            minutes and you can speed up things and have plenty of time to
            think things through while watching.
       
            red-iron-pine wrote 1 day ago:
            i'm not a gen z but I routinely do that.  a habit picked up from
            grad school work and having to assimilate several frameworks and
            techniques quickly.
            
            arguably clickbait is the reason: i'm not here to listen to the
            video or all of the other fluff, i'm here to get the point as
            quickly as possible.  it's a 'meeting could have been an email'
            sort of thing where lots of videos could really just be several
            bulletpoints.
            
            AI YouTube summarizers are great in that regard.
       
            burkaman wrote 1 day ago:
            I listen to podcasts and videos at 2x speed or faster, I can still
            understand everything and it brings listening time about equal to
            what my reading time would be if I were reading an article or
            transcript. Average reading speed is generally about twice as fast
            as average speaking speed, and in produced media people tend to
            speak even slower. I realize it sounds insane to hear 2x speed
            audio if you aren't used to it, but I promise if you were to ramp
            up the speed over a couple weeks or so, you would have absolutely
            no trouble with it. There's no need to if you don't want to, I'm
            just saying that your first impression is not giving you an
            accurate experience of what it's actually like.
            
            For audiobooks I usually want to have time to hear and process
            every word, so I still speed it up but usually more like 1.5x, it
            depends on the narrator and the book. For podcasts I'm not there to
            appreciate the prose, so I go as fast as I can while still
            understanding them. I don't think it's about dopamine, I just find
            I don't gain anything by getting the same amount of information
            slower.
       
              dyauspitr wrote 1 day ago:
              That reminds me of the blind Microsoft developer that uses a
              screen reader at a very high speed to code
              
  HTML        [1]: https://youtu.be/wKISPePFrIs?is=K3nKVrpH-vOSem54
       
                tech_hutch wrote 1 day ago:
                In my limited experience, it seems a high reading speed is
                common among users of screen readers.
       
            landl0rd wrote 1 day ago:
            Podcasts and other media to which people often listen at faster
            speeds aren't produced with the professional fluency of a news
            broadcast from the fifties.  The bitrate of information is
            relatively low.  Of course many speed them up.
            
            The democratization of media created a lot of folks who've no idea
            how to disseminate information in a structured format and at an
            optimal rate.
       
            ralferoo wrote 1 day ago:
            I'm not in the younger generation, but I listen to most of youtube
            (apart from songs and comedy) at 2x speed, and wish it could be
            even faster most of the time (that's a feature of premium, but I'm
            not paying for that).
            
            The problem is that people are producing longer videos because that
            earns them more advertising revenue. Many creators now speak so
            mind-numbingly slowly, that even at 2x speed it feels like it's
            about a normal presentation speed.
            
            In almost all cases, even at 2x speed, it would be quicker to just
            read a transcript (if that was available). The problem is really
            that people are incentivised to make everything into at least a 10
            minute youtube video, when a short blog post that could have taken
            only a minute to read would have been sufficient to convey all the
            same information, and probably more useful as you could easily
            refer back to specific sections if you wanted.
       
              t0bia_s wrote 21 hours 48 min ago:
              It's a medium used in the wrong way. If you want to get information
              efficiently, read carefully written text. If you want an immersive
              story, watch a feature film. If you want dialogue, use audio.
              
              Instead we use audio for info, text for stories and video for
              dialogues.
       
              yummybrainz wrote 1 day ago:
              FYI NewPipe allows up to 4x playback; PipePipe up to 10x! And
              both block ads, while PipePipe also integrates Sponsorblock.
       
          bluebarbet wrote 1 day ago:
          The most popular academic theory (IIRC) is that "um" and "uh" are
          conversational placeholders that say, "don't talk, I'm not finished
          speaking yet". Which obviously serves no purpose in a monologue.
          
          To me they just indicate lack of confidence on the part of the
          speaker.
       
            skrebbel wrote 1 day ago:
            There's a correlation between speaking with confidence and
            bullshitting / corner cutting. Hard, nuanced questions require more
            thinking time to produce a nuanced answer. But a bullshitter will
            just confidently answer subtly wrong stuff. But they won't say
            "uh"! Is that really better?
       
              bluebarbet wrote 1 day ago:
              Sure, that figures. Much of this is surely subjective.
       
          amelius wrote 1 day ago:
          As with all things ... Don't be opinionated and make it an option
          for the user.
       
            saulpw wrote 21 hours 18 min ago:
            So are you saying that every podcast should ship two episodes, an
            "unedited" version and an "umless" version?  That's not really
            viable.
       
          NooneAtAll3 wrote 1 day ago:
          > in speech however it can serve as a focusing point to indicate the
          next part is important
          
          it's... exact opposite?
          
          the main (attempted) use for ummms is to keep continuation of speech
          despite the pause. And the main complaint is exactly that it ruins
          the focus and doesn't give respite
       
            RobotToaster wrote 1 day ago:
            It can be a focusing point when someone wants to highlight the
            deliberate use of euphemism, removing those would be, um, unwise.
            
            Although that is probably the less common use.
       
              latexr wrote 1 day ago:
              I think you’re both right. But you’re right regarding writing
              and your parent comment is right regarding speech.
       
          mrob wrote 1 day ago:
          >The weirdly obsessive zeal that orgs like Toastmasters have about
          eliminating them is weird.
          
          If you speak with disfluencies, you probably didn't sufficiently
          rehearse your speech. If you didn't rehearse enough, you probably
          didn't put much effort into writing it either, so why should I put
          much effort into listening? It's the same principle as AI slop.
       
            kaashif wrote 1 day ago:
            Not necessarily true, more rehearsal isn't the key to fluent
            oratory.
            
            Many people can speak off the cuff fluently and confidently,
            avoiding "like", "um", and other filler words. And even if you're
            not speaking fluently, leaving silences as punctuation is more
            effective, IMO.
            
            Many impressive speakers I've met actually cite Toastmasters! So
            their obsessive zeal actually does work.
            
            More rehearsal does work too sometimes, but it does sometimes lead
            to speeches "sounding too rehearsed".
       
              cubefox wrote 1 day ago:
              > Many people can speak off the cuff fluently and confidently,
              avoiding "like", "um", and other filler words.
              
              I don't think that's true, we usually just don't notice filler
              words in the same way we are surprised that people usually don't
              even talk in whole sentences, in contrast to written text or
              movies (which also use written text).
       
          toast0 wrote 1 day ago:
          Having heard radio interviews with and without 'internal editing' to
          remove ums and ahs, most of the time I'd rather the edited version.
          It's more concise and focused, and I find it easier to comprehend.
          Too many ums and ahs and my mind wanders, and if it's radio, I can't
          easily go back to try again. When I've listened to podcasts or
          audiobooks, I could never easily go back a little to try again
          either, and I gave up on them (even though I have some content I
          really want to listen to, it's too frustrating, so it's not
          happening). But I'm sure other people have different preferences.
          
          I also don't care for writing that could have been made a lot more
          concise. It's a lot of work to make things shorter, but I think it's
          worthwhile.
       
            keane wrote 11 hours 55 min ago:
            For a good example of this (maybe the one you heard?), see WNYC's
            On the Media segment (aired December 30 or 31, 2004) titled
            "Pulling Back the Curtain":
            
  HTML      [1]: https://wnyc.org/story/129437-pulling-back-the-curtain/
       
              loevborg wrote 10 hours 14 min ago:
              Thanks for the link. As a longtime listener, listening to Bob
              Garfield's voice brought a tear to my eye - I'm a big fan and was
              sad when he left OTM, as much as I admire Brooke.
       
            venzaspa wrote 1 day ago:
            It just goes to show that people have very different views. I think
            when I hear people thinking out loud (ums and ahs) it's a marker
            that they are actually engaging with the question, thinking through
            an answer and not bullshitting without thinking.
       
              fasterik wrote 18 hours 51 min ago:
              I think speaking fluidly while thinking out loud is a completely
              separate skill. Some people are really good at it, usually the
              ones who get a lot of practice at public speaking. I also suspect
              extroverts have an easier time with it than introverts. "Ums" and
              "ahs" aren't necessarily evidence that a person is thinking, but
              it's also true that a lot of very smart people are "inarticulate"
              in the conventional sense.
       
              td6 wrote 1 day ago:
              I agree with you, when it's in person.
              I think what you're describing is mostly the beginning of an
              answer.
              
              Just random "um"s in between because you're struggling to build
              sentences can get annoying both in person and online
       
                inopinatus wrote 1 day ago:
                Just sit there in silence whilst you cogitate.
       
                  gegtik wrote 1 day ago:
                  this is the move
       
                    macintux wrote 1 day ago:
                    Space fillers are sadly important for group settings where
                    you need to finish a thought before someone interjects.
                    
                    But hearing them from an interviewee drives me crazy, along
                    with "sort of", "kind of", etc. I once counted all of the
                    "sorta"s in an NPR interview, it was brutal.
       
                doubled112 wrote 1 day ago:
                "Ummm, I think I agree with this description" vs "I, think,
                umm, I agree with, umm, this description"
                
                The first one indicates something along the lines of "thinking,
                please stand by".  The second one is a struggle.
       
          siriaan wrote 1 day ago:
          Occasional ums and ahs are fine but when every other phrase starts
          with a long aaaaah it can be pretty unpleasant to listen to.
       
            netsharc wrote 16 hours 19 min ago:
            I saw a video where the speaker spoke his words quickly, but had
            long pauses every few words. Luckily NewPipe has a "fast forward
            during silence" option.
            
            Looking at it again he'd pause, probably trying to find the next
            word, doesn't find it, and goes "aaaah". So watching at >100% speed
            and with skip silences saved my sanity:
            
  HTML      [1]: https://www.youtube.com/watch?v=dCO633KE7RA
       
            sans_souse wrote 1 day ago:
            So, if this project's source Audio were Beavis and Butthead, you
            would be enthused?
       
        heroprotagonist wrote 1 day ago:
        Not to promote something, but Wispr Flow does that for me automatically
        if I trigger a setting for it..
        
        While it's a commercial product with a subscription, I spent a long
        time on the free tier not even hitting their limits until I started
        using it so extensively that I wanted to pay for it.
        
        And I've used Whisper in the past, mostly for tinkering.  I tried it
        for a couple of use cases but haven't touched the base project in a
        while.    But I do regularly use Faster-Whisper-XXL, an open source
        project based on Whisper, for subtitle generation.
        
        Though, for subtitle generation, I decided to support the project and
        mainly use the non-public build of Faster-Whisper-XXL Pro built for
        donators to the open source project.
        
        The extra features smooth out the subtitle editing process very
        substantially.    Toss in "--roformer_overlap 0.125  --roformer_vram 16
        --best_of 15 --ff_vocal_extract mb-roformer  --vad_method pyannote_v3"
        to the cli parameters  (and sometimes --realign) and you have much less
        work to do in SubtitleEdit or Tero Subtitler afterwards to clean it up.
       
          iib wrote 1 day ago:
          Surprisingly, it's the whisper model itself that does that. I find
          that it's also good with false starts, often correcting something
          like: "uhm, we could...we can go there" to just "we can go there", if
          spoken rapidly enough.
       
          dotancohen wrote 1 day ago:
          I'd love to hear more about subtitle generation. Specifically, can you
          label different speakers? I'd be using this for meeting
          transcription. Thank you.
       
            heroprotagonist wrote 14 hours 47 min ago:
            Yeah, that's in faster-whisper-xxl via the --diarize parameter with
            additional options to tweak how it works: [1] I haven't used it
            when subtitling, though, so I don't know much more.
            
  HTML      [1]: https://github.com/Purfview/whisper-standalone-win/discuss...
       
              dotancohen wrote 11 hours 24 min ago:
              Terrific, thank you.
       
        sublinear wrote 1 day ago:
        Disfluencies are not necessarily "filler". They can convey mood or
        hesitation. Cutting them can change the meaning.
        
        A trivial example is "umm... well... (sigh) okay" versus just "okay".
        Not okay!
       
        cryptoz wrote 1 day ago:
        Really cool stuff and definitely going to try it; I’m also finding it
        wild that Google put effort into adding ums and erms into their text to
        speech model a while back. AI puts it in, AI helps take it out.
       
        cadamsdotcom wrote 1 day ago:
        What an awesome tool and idea. I’d be keen to see if it can integrate
        with video editing tools.
        
        Ideally it would slice the video in the timeline without actually
        removing anything, so you can scrub through your video and try with and
        without each disfluency (thank you - awesome word) & decide case by
        case which to keep!
       
        sciencesama wrote 1 day ago:
        there is an aah counter in Toastmasters!! this is the software that
        helps!!
       
        rindalir wrote 1 day ago:
        This is fascinating! I'm going to try this on a certain clip from
        Jurassic Park.
       
        dougcalobrisi wrote 1 day ago:
        This post is mostly about how surprisingly hard it is to cut filler
        words out of speech cleanly. Apparently, stripping ums isn't a find and
        replace type thing, because Whisper's timestamps are off by up to a few
        hundred ms and cutting on them chops syllables or leaves stutters. So,
        I built a tool, erm, that starts from Whisper's guess, finds where each
        word actually starts and stops in the audio, and snaps the cuts to
        silence so there's no click, with ffmpeg doing the splicing.
        
  HTML  [1]: https://github.com/dougcalobrisi/erm
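
        For readers wondering what "ffmpeg doing the splicing" could look
        like, here is a rough sketch under assumptions. It is not taken from
        erm, so check the repository for the real approach. It turns a list
        of cut ranges into keep ranges and builds an ffmpeg command using the
        aselect and asetpts audio filters; the helper names and file names
        are hypothetical.

```python
def keep_ranges(cuts, duration):
    """Invert (start, end) cut ranges in seconds into the ranges to keep."""
    keeps = []
    position = 0.0
    for start, end in sorted(cuts):
        if start > position:
            keeps.append((position, start))
        position = max(position, end)  # tolerate overlapping cuts
    if position < duration:
        keeps.append((position, duration))
    return keeps


def ffmpeg_command(src, dst, keeps):
    """Build (but do not run) an ffmpeg command that keeps only `keeps`."""
    # aselect passes audio frames whose timestamp falls in any keep range;
    # asetpts then rewrites timestamps so the output has no gaps.
    expr = "+".join(f"between(t,{a:.3f},{b:.3f})" for a, b in keeps)
    audio_filter = f"aselect='{expr}',asetpts=N/SR/TB"
    return ["ffmpeg", "-i", src, "-af", audio_filter, dst]
```

        The resulting list can be handed to subprocess.run. Finding where the
        cut ranges should actually start and stop is the hard part the post
        describes, and this sketch takes them as given.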
       
       