Hacker News on Gopher (inofficial)
COMMENT PAGE FOR:
HTML Removing 'um' from a recording is harder than it sounds
ternaryoperator wrote 10 hours 14 min ago:
Take this difficulty and make the desired sound piano and put the whole
thing into 1960s technology and you can see why recording studios were
never able to remove Glenn Gould's humming from his recordings.
t0bia_s wrote 11 hours 0 min ago:
Great approach. Please, do other languages. I would appreciate Czech!
neves wrote 21 hours 37 min ago:
Does it work just for English?
BugsJustFindMe wrote 1 day ago:
I find the crusade against 'um' to be annoyingly misplaced. It
frustrates the shit out of me that iOS speech-to-text dictation refuses
to write my 'um's and 'uh's with no way to change that behavior. If a
person asks to remove them, fine, but don't fucking alter my speech
patterns when I'm sending messages to people.
alyssamazz wrote 1 day ago:
Doug is a friend, but I actually use this so figured I'd chime in.
I make online course content and used to lose close to a full day
cutting filler out of every hour or so of recording. This gets me maybe
70% of that time back. On whether you should even cut them, I don't
think it's clear cut. With non-native English speakers especially,
the um is usually a real pause before they say something that matters,
and cutting it makes them choppy or changes what they meant. Most of
the time though it's just padding. That matters more for courses than
it sounds like it should, because a common complaint I get is how long
courses are, so any dead air I can pull out is time I give back to
people.
Anyway this is in my workflow now. Still messing with the settings to
get it right, but I like to mess with my stack and this focuses on this
step for me.
josefritzishere wrote 1 day ago:
I used to do this with a razor and an aluminum cutting block.
AaronAPU wrote 1 day ago:
I accidentally learned how disgusting people's mouth noises are while
developing an audio leveler. The lip smacking and snot noises between
sentences are the stuff of nightmares if you don't do anything to
exclude them from amplification.
The best approach I could come up with was to maintain a sliding
histogram of loudness and exclude the low-level outliers.
You can do more in the noise/frequency domain but those were outside
the scope of this tool.
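A minimal sketch of the sliding-histogram idea described above, not the commenter's actual code. It assumes loudness has already been measured per frame in dB; the class name, the 1 dB bin width, the window size, the 30% cutoff fraction and the target level are all illustrative choices.

```python
import math
from collections import deque


class SlidingLoudnessHistogram:
    """Histogram of the most recent frame levels, in 1 dB bins (assumed layout)."""

    def __init__(self, window=50, min_db=-90, max_db=0):
        self.min_db = min_db
        self.counts = [0] * (max_db - min_db + 1)
        self.recent = deque()  # bin index of each frame still inside the window
        self.window = window

    def _bin(self, level_db):
        top = self.min_db + len(self.counts) - 1
        clamped = min(max(level_db, self.min_db), top)
        # Floor binning, so a bin's lower edge never exceeds the levels it holds.
        return int(math.floor(clamped - self.min_db))

    def add(self, level_db):
        b = self._bin(level_db)
        self.recent.append(b)
        self.counts[b] += 1
        if len(self.recent) > self.window:
            self.counts[self.recent.popleft()] -= 1

    def cutoff_db(self, fraction=0.3):
        """Lower edge of the bin where the cumulative count first exceeds `fraction`."""
        need = fraction * len(self.recent)
        seen = 0
        for b, count in enumerate(self.counts):
            seen += count
            if seen > need:
                return self.min_db + b
        return self.min_db


def leveler_gains(levels_db, target_db=-16.0, window=50, fraction=0.3):
    """Gain in dB per frame; frames below the sliding cutoff get no boost."""
    hist = SlidingLoudnessHistogram(window)
    gains = []
    for level in levels_db:
        hist.add(level)
        if level < hist.cutoff_db(fraction):
            gains.append(0.0)  # low-level outlier (breath, lip smack): leave it alone
        else:
            gains.append(target_db - level)
    return gains
```

With speech frames near -20 dB and a few -60 dB frames between sentences, the quiet frames fall below the cutoff and are passed through without amplification. A real leveler would also smooth the gain over time, which this sketch leaves out.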
stavros wrote 1 day ago:
Misophonia sufferers unite!
ralferoo wrote 1 day ago:
The title of the article is wrong. It's not that removing 'um' from a
recording is hard, it's that not removing everything else in the
recording while doing so is.
dougcalobrisi wrote 1 day ago:
You're right. I may borrow that if I do a follow up at some point
:-)
__mharrison__ wrote 1 day ago:
Interesting. I make a bunch of video content and I went another way.
When I want to redo a section, I say it again. But, I have a magic word,
"mistake", that I insert before. Previously I transcribed and
just removed the sentence (or section) before mistake.
I recently automated this and used AI to determine what to cut and to
drive davinci resolve to make the edit. Saves a lot of time in my
workflow.
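The manual version of that workflow (drop whatever came right before the spoken marker) can be sketched roughly as follows. It assumes the transcript is a list of segment dicts with a "text" field, which is a hypothetical shape and not any particular tool's output; `drop_retakes` is an illustrative name.

```python
def drop_retakes(segments, marker="mistake"):
    """Return the segments to keep.

    A segment whose text is just the marker word removes itself and the
    segment immediately before it (the take being redone).
    """
    kept = []
    for seg in segments:
        text = seg["text"].strip().strip(".!?,").lower()
        if text == marker:
            if kept:
                kept.pop()  # discard the flubbed take
            continue  # discard the marker itself
        kept.append(seg)
    return kept
```

This only handles the marker when it is transcribed as its own segment; a marker embedded in a longer sentence would need word-level timestamps.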
fragmede wrote 1 day ago:
...
No, you run an entire second pass LLM over the output of Whisper. "no
uhhh three no four." should just output four the numeral not even
f.o.u.r.
Hi, my name is fragmede. Judging by the date on my computer it's been
four months since it's since I've t touched the transcription directory
on computer and tried to improve on the state of wisprflow. Mines
pretty good but it just doesn't... ah you can't drag me back in.
slhck wrote 1 day ago:
> Two small fixes, in order. First, each cut endpoint is allowed to
slide a tiny bit (up to 60ms) to land in the quietest spot nearby. If
there's a momentary lull in the audio just before or after the
original cut point, slide there. The slide is bounded so it can't
cross into a neighboring word, otherwise you'd chew off real speech.
Second, from that quiet spot, the endpoint snaps to the nearest moment
when the waveform is exactly crossing zero.
Oh, Claudish striking again.
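For reference, the two steps in the quoted passage (a bounded slide to the quietest nearby spot, then a snap to a zero crossing) might look something like this. It is a sketch under assumptions, not erm's implementation: samples are a plain list of floats, the 32-sample energy window is arbitrary, and bounding the zero-crossing search by the neighboring words is my own choice. All function names are hypothetical.

```python
def quietest_index(samples, center, max_slide, lo_bound, hi_bound, win=32):
    """Index within max_slide of center, inside [lo_bound, hi_bound], with the lowest local energy."""
    start = max(lo_bound, center - max_slide, 0)
    stop = min(hi_bound, center + max_slide, len(samples) - 1)
    best, best_energy = center, float("inf")
    for i in range(start, stop + 1):
        a = max(0, i - win // 2)
        b = min(len(samples), i + win // 2 + 1)
        energy = sum(x * x for x in samples[a:b]) / (b - a)
        if energy < best_energy:
            best, best_energy = i, energy
    return best


def nearest_zero_crossing(samples, index, lo_bound, hi_bound):
    """Closest index to `index` where the signal is zero or changes sign."""
    lo = max(lo_bound, 1)
    hi = min(hi_bound, len(samples) - 1)
    best, best_dist = index, None
    for i in range(lo, hi + 1):
        crosses = samples[i] == 0 or (samples[i - 1] < 0) != (samples[i] < 0)
        if crosses and (best_dist is None or abs(i - index) < best_dist):
            best, best_dist = i, abs(i - index)
    return best


def refine_cut(samples, rate, cut_time, prev_word_end, next_word_start, max_slide_s=0.060):
    """Refine one cut endpoint (times in seconds) and return a sample index."""
    center = round(cut_time * rate)
    lo = round(prev_word_end * rate)   # do not slide into the previous word
    hi = round(next_word_start * rate)  # or into the next one
    quiet = quietest_index(samples, center, round(max_slide_s * rate), lo, hi)
    return nearest_zero_crossing(samples, quiet, lo, hi)
```

The brute-force loops are fine for a handful of cut points; a real tool would likely vectorize the energy search.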
Retr0id wrote 1 day ago:
I call it claudeslop but I suppose claudish is slightly less
inflammatory.
ghaff wrote 1 day ago:
When I was doing podcasts regularly, it made me acutely aware of
various people's speech mannerisms. (Somewhat similarly, recording a
lot of videos during COVID made me very aware of a variety of my own
mannerisms--especially overactive hand motions.)
1317 wrote 1 day ago:
Looks interesting, would be a nicer article though if there was a demo
with before/after to show the results, and why the previous ideas
didn't work
for something dealing with audio you do need to play the audio really
boodleboodle wrote 1 day ago:
This resonates with our crusade to eradicate Ums once and for all.
- Ums Considered Harmful: [1]
- Related paper: [2]
HTML [1]: https://hamanlp.org/research/ums/
HTML [2]: https://hamanlp.org/SIGBOVIK_2026.pdf
HeavyStorm wrote 1 day ago:
What a very cool utility.
monster_truck wrote 1 day ago:
It takes about 30 seconds in Audacity and will give an infinitely
better result. Also works on any other sound
alyssamazz wrote 1 day ago:
I've done this in audacity many times, it doesn't work as
well. All the umm patterns don't match exactly. I've had better
overall results with erm. I haven't used audacity in years for
this, maybe they improved the feature.
HeavyStorm wrote 1 day ago:
Doesn't sound true. Unless audacity already has a tool for this
exactly... How would you do it on 30 seconds or less?
ghaff wrote 1 day ago:
It doesn't and ums aren't the only consistent tic you often want to
clean up--"you know," long pauses, etc.
rbbydotdev wrote 1 day ago:
I wonder if with enough input data and transcription you could
"fingerprint" where a speaker personality has habits of
interjecting "ums" leading to more hardy analysis. Novel approach,
but gets me thinking
chrismorgan wrote 1 day ago:
I think the "What it won't touch" section shows why the entire
concept is unsound. Here it is with a different first sentence, and
(other than the third sentence no longer matching erm's reality)
it's perfectly coherent:
> It leaves um, uh, er and elongated versions (ummmm, uhhhhh) alone.
Those sound like fillers but they're doing real work in the sentence,
and cutting them automatically would change what someone said. The rule
erm follows: only remove things that are sound, not language.
> It also doesn't touch repeated words, false starts, or long
thinking pauses. Those aren't noise on top of the speech; they are
the speech, just messier than the speaker would like. Cleaning them up
is an editorial decision about which take to keep, and erm doesn't
have an opinion about that.
Think about it. Cleaning these
things-that-can-be-just-sounds-but-can-also-very-much-be-load-bearing
up is an editorial decision. At the very least, you need to judge based
on the surrounding content whether the removal of an um would change
the meaning at all; and I don't think text alone is adequate for
that.
thaumasiotes wrote 1 day ago:
>> It leaves um, uh, er and elongated versions (ummmm, uhhhhh) alone.
Something's already gone wrong here. Uh and er refer to the same
sound. Uh is the American spelling. Er is British; to them a
following "r" like that is just a kind of vowel.
Izkata wrote 17 hours 57 min ago:
"Er" is definitely distinct as an interjection, it's usually used
instead of "um" to indicate a correction and does sound different.
Silamoth wrote 1 day ago:
Regardless of American vs. British spellings, those are not the
same sound. Some British people may pronounce them the same.
Americans definitely pronounce them differently, though. For
instance, the word "water" has a hard "r" sound at the end;
Americans don't pronounce it "watuh" like some British people
do.
thaumasiotes wrote 21 hours 50 min ago:
They are two names for the same sound. There is no particle "er"
in American English. There could be one, theoretically, but there
isn't.
chrismorgan wrote 1 day ago:
Um... no. Quite different vowel sounds.
(Also, in case it wasn't clear: I was quoting from the start of
the article in that sentence.)
thaumasiotes wrote 1 day ago:
They're quite different vowel sounds in the same sense that
"back" and "back" use "quite different vowel sounds" when
pronounced by American vs British speakers.
But not in any other sense.
> in case it wasn't clear: I was quoting from the start of the
article in that sentence.
You don't seem to be quoting from the article at all, actually.
You've combined two different sentences in a way that grossly
misrepresents what the article says. But that's not really
relevant to the point here.
cyberax wrote 1 day ago:
BTW, any recommendations for AI tools that remove the laugh track? I
don't even mind the awkward acting without the missing laughter.
lavaman131 wrote 1 day ago:
This is great, I've tried out automated podcast editing tools before
and they cut too aggressively in my experience. What are you thinking
about doing next with this now that you've gotten the alignment
snapping working cleanly for 'um' and 'ah', are you thinking of
expanding the tool?
npodbielski wrote 1 day ago:
I think it is harder to remove those from your own speech. I have been
doing that for a few months now and I still fall back into it when I am
in a hurry or stressed.
ifwinterco wrote 1 day ago:
In my experience native English speakers are particularly bad,
generally when speaking a second language people are less likely to
add random filler words.
Also the type of filler word for some reason is often different
between UK and US: British people tend to be "umm"-ers and Americans
are more likely to add "you know" (although "umm" is also common).
Once you notice it it's impossible to ignore and many, many native
English speakers are actually terrible at speaking and add filler
words to the point where it's very distracting
supernes wrote 1 day ago:
This approach seems kind of backwards to me. Why try to detect
everything except the thing you're trying to remove instead of either
sampling a few uhs and ums and treating them as noise to be silenced
(with a sharp crossfade to the noise floor that doesn't interrupt
speech flow) or finetuning a model to detect them specifically for full
automation?
pdpi wrote 1 day ago:
> instead of either sampling a few uhs and ums and treating them as
noise to be silenced
If you're not paying ttention, ctting out specific sounds can easily
cause more trouble. I for one would be quite pset if I couldn't hear
the pire's reasoning for calling a foul.
alok-g wrote 1 day ago:
I would love to see support for videos and removal of custom filler
words (I say 'basically' and 'like' a lot and have so far failed to
improve myself on this).
dougcalobrisi wrote 1 day ago:
It does take videos (like mp4) as input but will only output the
stripped audio track.
I might add the custom filler word functionality and/or perhaps just
make the filler word list configurable.
wzdd wrote 1 day ago:
It's a nice engineering approach, but I'm interested in the
motivation. Um and ah is distracting in a transcript, where you can
naturally pause to take in information; in speech however it can serve
as a focusing point to indicate the next part is important. See [1] for
example. The weirdly obsessive zeal that orgs like Toastmasters have
about eliminating them is weird.
Disfluencies aren't necessarily bad even if the word starts with
"dis"!
HTML [1]: https://medium.com/better-humans/dont-worry-about-saying-um-ef...
bongoman42 wrote 1 day ago:
A part of saying something like um is to continue your speech and
prevent the other person or someone else in the group from
interjecting.
goalieca wrote 1 day ago:
The younger generation seems to love listening at 1.2x or faster. I
think it's a preference for a fast information dopamine hit. I may
argue it's even a shallow approach that prefers against pausing and
time for careful reflection. Meanwhile, book reading is at an all
time low seemingly because no one has a preference or patience for
careful study and reflection.
ordu wrote 1 day ago:
> The younger generation seems to love listening at 1.2x or faster.
I do not belong to the younger generation. I refused to watch
videos because it takes too long comparing with reading. But now
I'm watching them at 2x. You can watch a 40 min video in 20
minutes. I'd like to compress it further to 10 min or so, but 3x is
a paid option on youtube and I'm not sure I could digest English
(which is a foreign language to me) at 3x.
> Meanwhile, book reading is at an all time low seemingly because
no one has a preference or patience for careful study and
reflection.
Oh, I read books too. But the content is different. You can't read
some books at 2x. You can't listen to it on such a speed. In any
book I think there are stretches of text you can consume at any
speed, but sometimes you hit a dense packed information you need to
think through. It happens with videos too. Like, try to watch
Veritasium at 2x, you'll be forced to slow things down at least
sometimes, because to get the message you need to learn how to
think at 2x speed too, not just to listen.
In any case the most of videos dilute their message over tens of
minutes and you can speed up things and have plenty of time to
think things through while watching.
red-iron-pine wrote 1 day ago:
i'm not a gen z but I routinely do that. a habit picked up from
grad school work and having to assimilate several frameworks and
techniques quickly.
arguably clickbait is the reason: i'm not here to listen to the
video or all of the other fluff, i'm here to get the point as
quickly as possible. it's a 'meeting could have been an email'
sort of thing where lots of videos could really just be several
bulletpoints.
AI YouTube summarizers are great in that regard.
burkaman wrote 1 day ago:
I listen to podcasts and videos at 2x speed or faster, I can still
understand everything and it brings listening time about equal to
what my reading time would be if I were reading an article or
transcript. Average reading speed is generally about twice as fast
as average speaking speed, and in produced media people tend to
speak even slower. I realize it sounds insane to hear 2x speed
audio if you aren't used to it, but I promise if you were to ramp
up the speed over a couple weeks or so, you would have absolutely
no trouble with it. There's no need to if you don't want to, I'm
just saying that your first impression is not giving you an
accurate experience of what it's actually like.
For audiobooks I usually want to have time to hear and process
every word, so I still speed it up but usually more like 1.5x, it
depends on the narrator and the book. For podcasts I'm not there to
appreciate the prose, so I go as fast as I can while still
understanding them. I don't think it's about dopamine, I just find
I don't gain anything by getting the same amount of information
slower.
dyauspitr wrote 1 day ago:
That reminds me of the blind Microsoft developer that uses a
screen reader at a very high speed to code
HTML [1]: https://youtu.be/wKISPePFrIs?is=K3nKVrpH-vOSem54
tech_hutch wrote 1 day ago:
In my limited experience, it seems a high reading speed is
common among users of screen readers.
landl0rd wrote 1 day ago:
Podcasts and other media to which people often listen at faster
speeds aren't produced with the professional fluency of a news
broadcast from the fifties. The bitrate of information is
relatively low. Of course many speed them up.
The democratization of media created a lot of folks who've no idea
how to disseminate information in a structured format and at an
optimal rate.
ralferoo wrote 1 day ago:
I'm not in the younger generation, but I listen to most of youtube
(apart from songs and comedy) at 2x speed, and wish it could be
even faster most of the time (that's a feature of premium, but I'm
not paying for that).
The problem is that people are producing longer videos because that
earns them more advertising revenue. Many creators now speak so
mind-numbingly slowly, that even at 2x speed it feels like it's
about a normal presentation speed.
In almost all cases, even at 2x speed, it would be quicker to just
read a transcript (if that was available). The problem is really
that people are incentivised to make everything into at least a 10
minute youtube video, when a short blog post that could have taken
only a minute to read would have been sufficient to convey all the
same information, and probably more useful as you could easily
refer back to specific sections if you wanted.
t0bia_s wrote 21 hours 48 min ago:
It's a medium used in the wrong way. If you want to get information
efficiently, read carefully written text. If you want an immersive
story, watch a feature film. If you want dialogue, use audio.
Instead we use audio for info, text for stories and video for
dialogues.
yummybrainz wrote 1 day ago:
FYI NewPipe allows up to 4x playback; PipePipe up to 10x! And
both block ads, while PipePipe also integrates Sponsorblock.
bluebarbet wrote 1 day ago:
The most popular academic theory (IIRC) is that "um" and "uh" are
conversational placeholders that say, "don't talk, I'm not finished
speaking yet". Which obviously serves no purpose in a monologue.
To me they just indicate lack of confidence on the part of the
speaker.
skrebbel wrote 1 day ago:
There's a correlation between speaking with confidence and
bullshitting / corner cutting. Hard, nuanced questions require more
thinking time to produce a nuanced answer. But a bullshitter will
just confidently answer subtly wrong stuff. But they won't say
"uh"! Is that really better?
bluebarbet wrote 1 day ago:
Sure, that figures. Much of this is surely subjective.
amelius wrote 1 day ago:
As with all things ... Don't be opinionated and make it an option
for the user.
saulpw wrote 21 hours 18 min ago:
So are you saying that every podcast should ship two episodes, an
"unedited" version and an "umless" version? That's not really
viable.
NooneAtAll3 wrote 1 day ago:
> in speech however it can serve as a focusing point to indicate the
next part is important
it's... exact opposite?
the main (attempted) use for ummms is to keep continuation of speech
despite the pause. And the main complaint is exactly that it ruins
the focus and doesn't give respite
RobotToaster wrote 1 day ago:
It can be a focusing point when someone wants to highlight the
deliberate use of euphemism, removing those would be, um, unwise.
Although that is probably the less common use.
latexr wrote 1 day ago:
I think you're both right. But you're right regarding writing
and your parent comment is right regarding speech.
mrob wrote 1 day ago:
>The weirdly obsessive zeal that orgs like Toastmasters have about
eliminating them is weird.
If you speak with disfluencies, you probably didn't sufficiently
rehearse your speech. If you didn't rehearse enough, you probably
didn't put much effort into writing it either, so why should I put
much effort into listening? It's the same principle as AI slop.
kaashif wrote 1 day ago:
Not necessarily true, more rehearsal isn't the key to fluent
oratory.
Many people can speak off the cuff fluently and confidently,
avoiding "like", "um", and other filler words. And even if you're
not speaking fluently, leaving silences as punctuation is more
effective, IMO.
Many impressive speakers I've met actually cite Toastmasters! So
their obsessive zeal actually does work.
More rehearsal does work too sometimes, but it does sometimes lead
to speeches "sounding too rehearsed".
cubefox wrote 1 day ago:
> Many people can speak off the cuff fluently and confidently,
avoiding "like", "um", and other filler words.
I don't think that's true, we usually just don't notice filler
words in the same way we are surprised that people usually don't
even talk in whole sentences, in contrast to written text or
movies (which also use written text).
toast0 wrote 1 day ago:
Having heard radio interviews with and without 'internal editing' to
remove ums and ahs, most of the time I'd rather the edited version.
It's more concise and focused, and I find it easier to comprehend.
Too many ums and ahs and my mind wanders, and if it's radio, I can't
easily go back to try again. When I've listened to podcasts or
audiobooks, I could never easily go back a little to try again
either, and I gave up on them (even though I have some content I
really want to listen to, it's too frustrating, so it's not
happening). But I'm sure other people have different preferences.
I also don't care for writing that could have been made a lot more
concise. It's a lot of work to make things shorter, but I think it's
worthwhile.
keane wrote 11 hours 55 min ago:
For a good example of this (maybe the one you heard?), see WNYC's
On the Media segment (aired December 30 or 31, 2004) titled
"Pulling Back the Curtain":
HTML [1]: https://wnyc.org/story/129437-pulling-back-the-curtain/
loevborg wrote 10 hours 14 min ago:
Thanks for the link. As a longtime listener, listening to Bob
Garfield's voice brought a tear to my eye - I'm a big fan and was
sad when he left OTM, as much as I admire Brooke.
venzaspa wrote 1 day ago:
It just goes to show that people have very different views. I think
when I hear people thinking out loud (ums and ahs) it's a marker
that they are actually engaging with the question, thinking through
an answer and not bullshitting without thinking.
fasterik wrote 18 hours 51 min ago:
I think speaking fluidly while thinking out loud is a completely
separate skill. Some people are really good at it, usually the
ones who get a lot of practice at public speaking. I also suspect
extroverts have an easier time with it than introverts. "Ums" and
"ahs" aren't necessarily evidence that a person is thinking, but
it's also true that a lot of very smart people are "inarticulate"
in the conventional sense.
td6 wrote 1 day ago:
I agree with you, when it's in person.
I think what you're describing is mostly the beginning of an
answer.
Just random "um"s in between because you're struggling to build
sentences can get annoying both in person and online
inopinatus wrote 1 day ago:
Just sit there in silence whilst you cogitate.
gegtik wrote 1 day ago:
this is the move
macintux wrote 1 day ago:
Space fillers are sadly important for group settings where
you need to finish a thought before someone interjects.
But hearing them from an interviewee drives me crazy, along
with "sort of", "kind of", etc. I once counted all of the
"sorta"s in an NPR interview, it was brutal.
doubled112 wrote 1 day ago:
"Ummm, I think I agree with this description" vs "I, think,
umm, I agree with, umm, this description"
The first one indicates something along the lines of "thinking,
please stand by". The second one is a struggle.
siriaan wrote 1 day ago:
Occasional ums and ahs are fine but when every other phrase starts
with a long aaaaah it can be pretty unpleasant to listen to.
netsharc wrote 16 hours 19 min ago:
I saw a video where the speaker spoke his words quickly, but had
long pauses every words. Luckily NewPipe has a "fast forward during
silence option".
Looking at it again he'd pause, probably trying to find the next
word, doesn't find it, and goes "aaaah". So watching at >100% speed
and with skip silences saved my sanity:
HTML [1]: https://www.youtube.com/watch?v=dCO633KE7RA
sans_souse wrote 1 day ago:
So, if this project's source Audio were Beavis and Butthead, you
would be enthused?
heroprotagonist wrote 1 day ago:
Not to promote something, but Wispr Flow does that for me automatically
if I trigger a setting for it..
While it's a commercial product with a subscription, I spent a long
time on the free tier not even hitting their limits until I started
using it so extensively that I wanted to pay for it.
And I've used Whisper in the past, mostly for tinkering. I tried it
for a couple of use cases but haven't touched the base project in a
while. But I do regularly use Faster-Whisper-XXL, an open source
project based on Whisper, for subtitle generation.
Though, for subtitle generation, I decided to support the project and
mainly use the non-public build of Faster-Whisper-XXL Pro built for
donators to the open source project.
The extra features smooth out the subtitle editing process very
substantially. Toss in "--roformer_overlap 0.125 --roformer_vram 16
--best_of 15 --ff_vocal_extract mb-roformer --vad_method pyannote_v3"
to the cli parameters (and sometimes --realign) and you have much less
work to do in SubtitleEdit or Tero Subtitler afterwards to clean it up.
iib wrote 1 day ago:
Surprisingly, it's the whisper model itself that does that. I find
that it's also good with false starts, often correcting something
like: "uhm, we could...we can go there" to just "we can go there", if
spoken rapidly enough.
dotancohen wrote 1 day ago:
I'd love to hear more about subtitle generation. Specifically, can you
label different speakers? I'd be using this for meeting
transcription. Thank you.
heroprotagonist wrote 14 hours 47 min ago:
Yeah, that's in faster-whisper-xxl via the --diarize parameter with
additional options to tweak how it works: [1] I haven't used it
when subtitling, though, so I don't know much more.
HTML [1]: https://github.com/Purfview/whisper-standalone-win/discuss...
dotancohen wrote 11 hours 24 min ago:
Terrific, thank you.
sublinear wrote 1 day ago:
Disfluencies are not necessarily "filler". They can convey mood or
hesitation. Cutting them can change the meaning.
A trivial example is "umm... well... (sigh) okay" versus just "okay".
Not okay!
cryptoz wrote 1 day ago:
Really cool stuff and definitely going to try it; I'm also finding it
wild that Google put effort into adding ums and erms into their text to
speech model a while back. AI puts it in, AI helps take it out.
cadamsdotcom wrote 1 day ago:
What an awesome tool and idea. I'd be keen to see if it can integrate
with video editing tools.
Ideally it would slice the video in the timeline without actually
removing anything, so you can scrub through your video and try with and
without each disfluency (thank you - awesome word) & decide case by
case which to keep!
sciencesama wrote 1 day ago:
there is an aah counter in Toastmasters !! this is the software that
helps !!
rindalir wrote 1 day ago:
This is fascinating! I'm going to try this on a certain clip from
Jurassic Park.
dougcalobrisi wrote 1 day ago:
This post is mostly about how surprisingly hard it is to cut filler
words out of speech cleanly. Apparently, stripping ums isn't a find and
replace type thing, because Whisper's timestamps are off by up to a few
hundred ms and cutting on them chops syllables or leaves stutters. So,
I built a tool, erm, that starts from Whisper's guess, finds where each
word actually starts and stops in the audio, and snaps the cuts to
silence so there's no click, with ffmpeg doing the splicing.
HTML [1]: https://github.com/dougcalobrisi/erm
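As a rough illustration of the final splicing step, here is one way a list of refined cut intervals could be turned into an ffmpeg filter graph. This is a sketch under assumptions, not how erm does it: the function names are hypothetical, cuts are assumed to be (start, end) pairs in seconds, and the graph uses ffmpeg's `atrim`, `asetpts` and `concat` audio filters.

```python
def keep_intervals(duration, cuts):
    """Invert (start, end) cut spans into the spans of audio to keep."""
    keeps, cursor = [], 0.0
    for start, end in sorted(cuts):
        start, end = max(start, 0.0), min(end, duration)
        if start > cursor:
            keeps.append((cursor, start))
        cursor = max(cursor, end)  # overlapping cuts merge here
    if cursor < duration:
        keeps.append((cursor, duration))
    return keeps


def concat_filter(keeps):
    """Build a -filter_complex graph that trims each kept span and joins them."""
    parts = []
    for i, (start, end) in enumerate(keeps):
        parts.append(
            f"[0:a]atrim=start={start:.3f}:end={end:.3f},asetpts=PTS-STARTPTS[a{i}]"
        )
    labels = "".join(f"[a{i}]" for i in range(len(keeps)))
    parts.append(f"{labels}concat=n={len(keeps)}:v=0:a=1[out]")
    return ";".join(parts)
```

The resulting string would be passed along the lines of `ffmpeg -i in.wav -filter_complex "<graph>" -map "[out]" out.wav`, where the file names are placeholders.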