Local YouTube transcription pipeline
====================================

I have a love-hate relationship with long YouTube videos. On one hand,
they're jam-packed with niche information often not found elsewhere on the
Internet. On the other, I don't find it easy to digest long videos, and I
rarely have the time to watch them anyway.

In more innocent times, YouTube made it easy to extract and read a video's
subtitles, whether uploaded by the video's creator or generated by YT's
built-in speech-to-text recognition. There even existed websites that
automated this process. Unfortunately, my favorite of these sites broke
after YouTube API changes, and I was left without a way to watch/read new
videos.

Whisper to the rescue. Whisper is a speech recognition tool that can be
run locally. I set up a few interlocking shell scripts gluing together
speech recognition and YouTube video retrieval. We'll walk through them
below:

--- /usr/local/bin/whisper-pipe ---

#!/bin/bash
if [[ "$*" == "--help" ]]
then
    echo 'whisper-pipe packages whisper and ffmpeg for turnkey use.'
    echo 'any further arguments are passed along to whisper-cli.'
    echo '  example: whisper-pipe < my_recording.mp3'
else
    ffmpeg -loglevel error \
        -i - -ar 16000 -ac 1 -c:a pcm_s16le -f wav - \
        | whisper-cli - --no-prints -m ~/apps/whisper/ggml-medium.en.bin "$@"
fi

whisper-cpp is a standalone C++ inference engine compatible with Whisper
speech-to-text models. This script sets up whisper-cpp for use in a
pipeline. It is broken into three parts:

- Help text, because I like supplying that for nontrivial scripts.
- An ffmpeg command that converts the incoming audio stream into the
  highly specific input format whisper-cli demands: 16 kHz, mono, 16-bit
  PCM WAV.
- whisper-cli itself. The --no-prints option stops it from oversharing
  status information that has no place in a pipeline. -m specifies the
  path to the Whisper model to use. Small, medium, and large models are
  available for download; the medium model strikes a good balance between
  execution speed and transcription accuracy.

My build of whisper-cli has CUDA support, allowing my fancy GPU to
transcribe a minute of audio in a few seconds.

--- /usr/local/bin/yt-dlp-audio-pipe ---

#!/bin/bash
yt-dlp --progress-delta 1 -f worstaudio -o - "$@"

yt-dlp is a YouTube downloader. This script retrieves the audio from a
given YouTube video. The video URL is specified as a command-line
argument; the audio is sent to stdout.

- --progress-delta 1 limits the download-progress meter to being redrawn
  at most once per second. This is necessary to keep the script responsive
  on vintage serial consoles like my HP 700/96.
- -f worstaudio selects the lowest-bitrate audio-only stream available, to
  conserve bandwidth.

--- /usr/local/bin/yt-transcribe ---

#!/bin/bash
yt-dlp-audio-pipe "$1" | whisper-pipe | fmt -w76 -s | less +F --exit-follow-on-close

Runs the top-level conversion pipeline.

- yt-dlp-audio-pipe, whisper-pipe - as described above.
- fmt wraps lines to 76 characters instead of whatever width Whisper
  chooses.
- less +F --exit-follow-on-close displays the results in a pager for easy
  reading. These options cause less to automatically request more data
  from the pipe until it is drained, allowing whisper-cli to terminate.
  Without them, whisper-cli is held open (blocked attempting to write
  data to the pipe) and just sits there consuming memory until the user
  has scrolled to the end of the file.

The conversion process is structured as a pipeline to avoid tightly
coupling it to a specific video provider, such as YouTube.
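To make that decoupling concrete, here is a minimal sketch of an
alternative front end for local files. The name file-transcribe and the
.mp3 filename are hypothetical, not part of the pipeline above; anything
that feeds audio into whisper-pipe's stdin will do.

--- hypothetical: file-transcribe ---

#!/bin/bash
# Transcribe a local audio file instead of a YouTube video.
#   example: file-transcribe interview.mp3
whisper-pipe < "$1" | fmt -w76 -s | less +F --exit-follow-on-close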
The whisper-pipe script in particular is applicable to any audio stream.

This is the most useful script I've written in recent memory. Granted,
it's just glue logic sticking together existing programs, but it's very
useful glue. In the few days since I got it working, I've put a big dent
in my YouTube "to-watch" list. And I appreciate that it generalizes to
transcribing pretty much anything.
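A couple of examples of that generality; the URL and filenames below are
stand-ins, and the arecord line assumes an ALSA system:

# transcribe a podcast episode, no yt-dlp involved
curl -sL https://example.com/episode.mp3 | whisper-pipe > episode.txt

# transcribe 60 seconds of microphone input
arecord -q -d 60 -f S16_LE -r 16000 -c 1 -t wav | whisper-pipe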