# YouTube Video Transcriber — Local, Free, GPU-Accelerated
A Python script to transcribe any YouTube video locally using OpenAI’s open-source Whisper model. No API keys, no tokens, no cloud services — everything runs on your own machine.
## Why
Sometimes you need a transcript of a YouTube video — for notes, research, or feeding into another tool. Most online transcription services either cost money, require API keys, or send your data to third-party servers. This script does it all locally using two excellent open-source tools: yt-dlp for downloading audio and Whisper for speech-to-text.
## Prerequisites

- Python 3 with a virtual environment
- `ffmpeg` installed on your system (`sudo apt install ffmpeg` on Debian/Ubuntu)
- A GPU with CUDA is optional but significantly speeds things up
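To confirm `ffmpeg` is actually on your `PATH` before running the script, a quick check from Python (this mirrors the shell lookup that yt-dlp's audio postprocessor relies on):

```python
import shutil

# shutil.which performs the same PATH search the shell does, so if this
# finds ffmpeg, yt-dlp's FFmpegExtractAudio postprocessor will find it too.
ffmpeg_path = shutil.which("ffmpeg")
print(ffmpeg_path if ffmpeg_path else "ffmpeg not found on PATH")
```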
## Setup
If you’re on a modern Debian-based system, pip will refuse to install packages system-wide. Use a virtual environment instead. The shared venv lives at /home/prox/runpython/venv and is accessible system-wide via the hawk-python wrapper (see /usr/local/bin/hawk-python).
Install the dependencies:
```bash
hawk-python -m pip install yt-dlp openai-whisper
```
### Optional: GPU Support
Whisper uses GPU automatically if PyTorch has CUDA support. To install the GPU-enabled version of PyTorch:
```bash
# Check your CUDA version first
nvidia-smi

# For CUDA 12.x:
hawk-python -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# For CUDA 11.8:
hawk-python -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```
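After installing, you can verify what Whisper will actually run on. A minimal check (the helper name `cuda_status` is just for illustration):

```python
def cuda_status() -> str:
    """Return a one-line report of the device Whisper would use."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if torch.cuda.is_available():
        return f"CUDA available: {torch.cuda.get_device_name(0)}"
    return "CUDA not available; Whisper will fall back to CPU"

print(cuda_status())
```

If this reports CPU even though you installed a `cu1xx` wheel, the usual culprit is a CPU-only `torch` already cached in the venv — reinstall with `--force-reinstall`.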
## Usage
Basic usage — provide a YouTube URL and the script handles the rest:
```bash
hawk-python youtube_transcriber.py "https://youtu.be/VIDEO_ID"
```
Full example with all options:
```bash
hawk-python ~/HawkVault/Clawbrain/scripts/youtube_transcriber.py \
    https://youtu.be/Nt03hgxv5TE \
    --output ~/HawkVault/Inbox/ytt.md \
    --model medium \
    --device cuda
```
## Options

| Flag | Values | Default | Description |
|---|---|---|---|
| `--model` | tiny, base, small, medium, large | base | Larger models are more accurate but slower |
| `--output` | any file path | `<video_title>_transcript.txt` | Where to save the transcript |
| `--device` | auto, cuda, cpu | auto | Force a specific device or let it auto-detect |
## Model Sizes
| Model | Download Size | Speed | Accuracy |
|---|---|---|---|
| tiny | ~39 MB | Fastest | Basic |
| base | ~74 MB | Fast | Good |
| small | ~244 MB | Moderate | Better |
| medium | ~769 MB | Slow | Great |
| large | ~1.5 GB | Slowest | Best |
The model weights are downloaded once on first use and cached locally. The medium model is a solid choice — good balance of accuracy and speed, especially with a GPU.
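To see which models you already have, you can list the cache directory. The path below is the default used by `whisper.load_model` as I understand it (`XDG_CACHE_HOME` is honoured if set, with `~/.cache` as the fallback) — adjust if you pass a custom `download_root`:

```python
import os

# Default Whisper cache: $XDG_CACHE_HOME/whisper, falling back to ~/.cache/whisper
cache_home = os.getenv("XDG_CACHE_HOME", os.path.join(os.path.expanduser("~"), ".cache"))
model_dir = os.path.join(cache_home, "whisper")

if os.path.isdir(model_dir):
    for name in sorted(os.listdir(model_dir)):
        size_mb = os.path.getsize(os.path.join(model_dir, name)) / 1e6
        print(f"{name}: {size_mb:.0f} MB")
else:
    print(f"No models cached yet ({model_dir} does not exist)")
```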
## Notes on YouTube URLs
YouTube share links include a tracking parameter like `?si=_R10SfW1LGZymUTd`. This won’t cause issues, but it’s not needed — the video ID alone is sufficient. Both of these work:

```
https://youtu.be/Nt03hgxv5TE?si=_R10SfW1LGZymUTd
https://youtu.be/Nt03hgxv5TE
```

Full youtube.com URLs work too:

```
https://www.youtube.com/watch?v=Nt03hgxv5TE
```
## Output Format
The transcript file contains two sections:
- Full Text — the complete transcription as a single block of text
- Timestamped Segments — the same content broken into timestamped chunks
```
Transcript: Video Title Here
============================

--- Full Text ---

The entire transcription appears here as continuous text...

--- Timestamped Segments ---

[00:00] First segment of speech.
[00:15] Next segment continues here.
[01:02] And so on through the video.
```
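The timestamped section is built from Whisper's `result["segments"]` list, where each segment carries `start`, `end`, and `text` keys. If you'd rather have subtitles than a plain transcript, the same data converts to SRT; a minimal sketch (`segments_to_srt` is not part of the script, just an illustration of the segment structure):

```python
def segments_to_srt(segments: list[dict]) -> str:
    """Render Whisper-style segments as SubRip (SRT) subtitle blocks."""
    def fmt(t: float) -> str:
        # SRT timestamps are HH:MM:SS,mmm
        h, rem = divmod(int(t), 3600)
        m, s = divmod(rem, 60)
        ms = int((t - int(t)) * 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    blocks = []
    for i, seg in enumerate(segments, 1):
        blocks.append(f"{i}\n{fmt(seg['start'])} --> {fmt(seg['end'])}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)
```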
## The Script

Save this as `youtube_transcriber.py`:
```python
#!/usr/bin/env python3
"""
YouTube Video Transcriber
=========================

Downloads audio from a YouTube video and transcribes it using OpenAI's Whisper model.

Usage:
    python youtube_transcriber.py <youtube_url> [--model <model_size>] [--output <output_file>]

Requirements:
    pip install yt-dlp openai-whisper

Examples:
    python youtube_transcriber.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
    python youtube_transcriber.py "https://youtu.be/dQw4w9WgXcQ" --model medium --output transcript.txt
"""

import argparse
import os
import sys
import tempfile
import time


def check_dependencies():
    """Check that required packages are installed."""
    missing = []
    try:
        import yt_dlp  # noqa: F401
    except ImportError:
        missing.append("yt-dlp")
    try:
        import whisper  # noqa: F401
    except ImportError:
        missing.append("openai-whisper")
    if missing:
        print("❌ Missing dependencies. Install them with:")
        print(f"   pip install {' '.join(missing)}")
        sys.exit(1)


def download_audio(url: str, output_dir: str) -> tuple[str, str]:
    """Download audio from a YouTube video; return (audio_path, title)."""
    import yt_dlp

    output_template = os.path.join(output_dir, "audio.%(ext)s")
    ydl_opts = {
        "format": "bestaudio/best",
        "postprocessors": [
            {
                "key": "FFmpegExtractAudio",
                "preferredcodec": "mp3",
                "preferredquality": "192",
            }
        ],
        "outtmpl": output_template,
        "quiet": False,
        "no_warnings": True,
    }

    print(f"\n🔽 Downloading audio from: {url}")
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(url, download=True)
        title = info.get("title", "Unknown")
        duration = info.get("duration", 0)
        print(f"   Title: {title}")
        print(f"   Duration: {duration // 60}m {duration % 60}s")

    audio_path = os.path.join(output_dir, "audio.mp3")
    if not os.path.exists(audio_path):
        # Sometimes the extension varies
        for ext in ["mp3", "m4a", "wav", "webm", "opus"]:
            candidate = os.path.join(output_dir, f"audio.{ext}")
            if os.path.exists(candidate):
                audio_path = candidate
                break

    if not os.path.exists(audio_path):
        raise FileNotFoundError("Failed to download audio file.")

    return audio_path, title


def get_device(requested: str = "auto") -> str:
    """Determine the best available device."""
    import torch

    if requested != "auto":
        return requested
    if torch.cuda.is_available():
        gpu_name = torch.cuda.get_device_name(0)
        print(f"   🎮 GPU detected: {gpu_name}")
        return "cuda"
    print("   💻 No GPU detected, using CPU")
    return "cpu"


def transcribe_audio(audio_path: str, model_size: str = "base", device: str = "auto") -> dict:
    """Transcribe audio using Whisper and return the result."""
    import whisper

    device = get_device(device)
    print(f"\n🧠 Loading Whisper model: {model_size} (on {device})")
    model = whisper.load_model(model_size, device=device)

    print("📝 Transcribing audio (this may take a while)...")
    start = time.time()
    # fp16 only helps on GPU; forcing it on CPU triggers a warning
    result = model.transcribe(audio_path, verbose=False, fp16=(device == "cuda"))
    elapsed = time.time() - start
    print(f"   Transcription completed in {elapsed:.1f}s")
    return result


def format_timestamp(seconds: float) -> str:
    """Convert seconds to HH:MM:SS (or MM:SS under an hour)."""
    h = int(seconds // 3600)
    m = int((seconds % 3600) // 60)
    s = int(seconds % 60)
    if h > 0:
        return f"{h:02d}:{m:02d}:{s:02d}"
    return f"{m:02d}:{s:02d}"


def save_transcript(result: dict, output_path: str, title: str = ""):
    """Save the transcript to a text file with timestamps."""
    with open(output_path, "w", encoding="utf-8") as f:
        if title:
            f.write(f"Transcript: {title}\n")
            # "Transcript: " is 12 characters, so the underline matches the header
            f.write("=" * (len(title) + 12) + "\n\n")

        # Write full text
        f.write("--- Full Text ---\n\n")
        f.write(result["text"].strip() + "\n\n")

        # Write timestamped segments
        f.write("--- Timestamped Segments ---\n\n")
        for seg in result.get("segments", []):
            ts = format_timestamp(seg["start"])
            text = seg["text"].strip()
            f.write(f"[{ts}] {text}\n")

    print(f"\n💾 Transcript saved to: {output_path}")


def main():
    parser = argparse.ArgumentParser(description="Transcribe YouTube videos using Whisper")
    parser.add_argument("url", help="YouTube video URL")
    parser.add_argument(
        "--model",
        default="base",
        choices=["tiny", "base", "small", "medium", "large"],
        help="Whisper model size (default: base). Larger = more accurate but slower.",
    )
    parser.add_argument(
        "--device",
        default="auto",
        choices=["auto", "cuda", "cpu"],
        help="Device to use (default: auto). Auto picks GPU if available.",
    )
    parser.add_argument(
        "--output",
        default=None,
        help="Output file path (default: <video_title>_transcript.txt)",
    )
    args = parser.parse_args()

    check_dependencies()

    with tempfile.TemporaryDirectory() as tmp_dir:
        # Download audio
        audio_path, title = download_audio(args.url, tmp_dir)

        # Transcribe
        result = transcribe_audio(audio_path, args.model, args.device)

        # Determine output path
        if args.output:
            output_path = args.output
        else:
            safe_title = "".join(c if c.isalnum() or c in " -_" else "" for c in title)
            safe_title = safe_title.strip().replace(" ", "_")[:80]
            output_path = f"{safe_title}_transcript.txt"

        # Save
        save_transcript(result, output_path, title)

        # Print preview
        text = result["text"].strip()
        preview = text[:500] + "..." if len(text) > 500 else text
        print(f"\n📄 Preview:\n{preview}")

    print("\n✅ Done!")


if __name__ == "__main__":
    main()
```
## Troubleshooting
**`externally-managed-environment` error when installing with pip:** Your system Python is protected. Use a virtual environment — either activate it first with `source ~/path/to/venv/bin/activate` or call the venv’s pip directly: `~/path/to/venv/bin/pip install ...`

**`required file not found` when running venv pip/python:** The venv was likely built with a Python version that has since been upgraded or removed. Rebuild it with `python3 -m venv ~/path/to/venv --clear` and reinstall your packages.

**Slow transcription:** Use a smaller model (`--model base` or `--model tiny`), or install GPU-enabled PyTorch for CUDA acceleration. The medium model on CPU can take several minutes for a long video.

**`ffmpeg` not found:** Install it with `sudo apt install ffmpeg` (Debian/Ubuntu) or `brew install ffmpeg` (macOS).