Local YouTube transcription pipeline
====================================

I have a love-hate relationship with long YouTube videos. On one hand,
they're jam-packed with niche information often not found elsewhere on the
Internet. On the other, I don't find it easy to digest long videos, and I
rarely have the time to watch them anyway.

In more innocent times, YouTube made it easy to extract and read a video's
subtitles, whether uploaded by the video's creator or generated by YT's
built-in speech-to-text recognition. There even existed websites that
automated this process. Unfortunately, my favorite of these sites broke
after YouTube API changes, and I was left without a way to watch/read new
videos.

Whisper to the rescue. Whisper is a speech recognition tool that can be
run locally. I set up a few interlocking shell scripts gluing together
speech recognition and YouTube video retrieval. We'll walk through them
below:

--- /usr/local/bin/whisper-pipe ---

#!/bin/bash
if [[ "$*" == "--help" ]]
then
    echo 'whisper-pipe packages whisper and ffmpeg for turnkey use.'
    echo 'any further arguments are passed along to whisper-cli.'
    echo '  example: whisper-pipe < my_recording.mp3'
else
    ffmpeg -loglevel error \
        -i - -ar 16000 -ac 1 -c:a pcm_s16le -f wav - \
        | whisper-cli - --no-prints -m ~/apps/whisper/ggml-medium.en.bin "$@"
fi

whisper-cpp is a standalone C++ inference engine compatible with Whisper
speech-to-text models. This script sets up whisper-cpp for use in a
pipeline. It is broken into three parts:

- Help text, because I like supplying that for nontrivial scripts.
- An ffmpeg command that converts the incoming audio stream into the
  highly specific input format whisper-cli demands: 16 kHz, mono, 16-bit
  PCM WAV.
- whisper-cli itself. The --no-prints option stops it from oversharing
  status information that has no place in a pipeline. -m specifies the
  path to the Whisper model to use. Small, medium, and large models are
  available for download; the medium model strikes a good balance between
  execution speed and transcription accuracy.

My build of whisper-cli has CUDA support, allowing my fancy GPU to
transcribe a minute of audio in a few seconds.

--- /usr/local/bin/yt-dlp-audio-pipe ---

#!/bin/bash
yt-dlp --progress-delta 1 -f worstaudio -o - "$@"

yt-dlp is a YouTube downloader. This script retrieves the audio from a
given YouTube video. The video URL is specified as a command-line
argument; the audio is sent to stdout.

- --progress-delta 1 limits the download-progress meter to being redrawn
  at most once per second. This is necessary to keep the script responsive
  on vintage serial consoles like my HP 700/96.
- -f worstaudio selects the lowest-bitrate audio-only stream available, to
  conserve bandwidth.

--- /usr/local/bin/yt-transcribe ---

#!/bin/bash
yt-dlp-audio-pipe "$1" | whisper-pipe | fmt -w76 -s | less +F --exit-follow-on-close

Runs the top-level conversion pipeline.

- yt-dlp-audio-pipe, whisper-pipe - as described above.
- fmt wraps lines to 76 characters instead of whatever width Whisper
  chooses.
- less +F --exit-follow-on-close displays the results in a pager for easy
  reading. These options cause less to automatically request more data
  from the pipe until it is drained, allowing whisper-cli to terminate.
  Without them, whisper-cli is held open (blocked attempting to write
  data to the pipe) and just sits there consuming memory until the user
  has scrolled to the end of the file.

The conversion process is structured as a pipeline to avoid tightly
coupling it to a specific video provider, such as YouTube.
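To make that decoupling concrete, here is a minimal sketch of an
alternative front end for local files. The name file-transcribe and the
.mp3 filename are hypothetical, not part of the pipeline above; anything
that feeds audio into whisper-pipe's stdin will do.

--- hypothetical: file-transcribe ---

#!/bin/bash
# Transcribe a local audio file instead of a YouTube video.
#   example: file-transcribe interview.mp3
whisper-pipe < "$1" | fmt -w76 -s | less +F --exit-follow-on-close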
The whisper-pipe script in particular is applicable to any audio stream.

This is the most useful script I've written in recent memory. Granted,
it's just glue logic sticking together existing programs, but it's very
useful glue. In the few days since I got it working, I've put a big dent
in my YouTube "to-watch" list. And I appreciate that it generalizes to
transcribing pretty much anything.
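A couple of examples of that generality; the URL and filenames below are
stand-ins, and the arecord line assumes an ALSA system:

# transcribe a podcast episode, no yt-dlp involved
curl -sL https://example.com/episode.mp3 | whisper-pipe > episode.txt

# transcribe 60 seconds of microphone input
arecord -q -d 60 -f S16_LE -r 16000 -c 1 -t wav | whisper-pipe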