YouTube Video Transcriber — Local, Free, GPU-Accelerated

A Python script to transcribe any YouTube video locally using OpenAI’s open-source Whisper model. No API keys, no tokens, no cloud services — everything runs on your own machine.

Why

Sometimes you need a transcript of a YouTube video — for notes, research, or feeding into another tool. Most online transcription services either cost money, require API keys, or send your data to third-party servers. This script does it all locally using two excellent open-source tools: yt-dlp for downloading audio and Whisper for speech-to-text.

Prerequisites

  • Python 3 with a virtual environment
  • ffmpeg installed on your system (sudo apt install ffmpeg on Debian/Ubuntu)
  • A GPU with CUDA is optional but significantly speeds things up

Setup

If you’re on a modern Debian-based system, pip will refuse to install packages into the system Python (the externally-managed-environment protection from PEP 668). Use a virtual environment instead. The shared venv lives at /home/prox/runpython/venv and is accessible system-wide via the hawk-python wrapper (see /usr/local/bin/hawk-python).

Install the dependencies:

hawk-python -m pip install yt-dlp openai-whisper
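
To confirm the install worked, a quick check like this (run it with hawk-python) reports anything missing. Note that the yt-dlp package imports under the module name yt_dlp:

```python
import importlib.util

# Optional sanity check: confirm both packages are importable before
# running the script. (yt-dlp installs as the module yt_dlp.)
missing = [m for m in ("yt_dlp", "whisper") if importlib.util.find_spec(m) is None]
print("all set" if not missing else "missing: " + ", ".join(missing))
```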

Optional: GPU Support

Whisper will use the GPU automatically if your PyTorch build has CUDA support. To install a GPU-enabled build of PyTorch:

# Check your CUDA version first
nvidia-smi

# For CUDA 12.x:
hawk-python -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# For CUDA 11.8:
hawk-python -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
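
After installing, you can confirm which build you got. A small sketch that degrades gracefully if torch isn’t installed (torch.version.cuda is None on CPU-only builds):

```python
import importlib.util

# Report which PyTorch build is installed, without assuming CUDA is present.
if importlib.util.find_spec("torch") is None:
    print("torch is not installed")
else:
    import torch
    print("torch", torch.__version__, "| CUDA build:", torch.version.cuda)
    print("GPU available:", torch.cuda.is_available())
```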

Usage

Basic usage — provide a YouTube URL and the script handles the rest:

hawk-python youtube_transcriber.py "https://youtu.be/VIDEO_ID"

Full example with all options:

hawk-python ~/HawkVault/Clawbrain/scripts/youtube_transcriber.py \
  https://youtu.be/Nt03hgxv5TE \
  --output ~/HawkVault/Inbox/ytt.md \
  --model medium \
  --device cuda

Options

Flag       Values                             Default                        Description
--model    tiny, base, small, medium, large   base                           Larger models are more accurate but slower
--output   any file path                      <video_title>_transcript.txt   Where to save the transcript
--device   auto, cuda, cpu                    auto                           Force a specific device or let it auto-detect

Model Sizes

Model     Download Size   Speed      Accuracy
tiny      ~39 MB          Fastest    Basic
base      ~74 MB          Fast       Good
small     ~244 MB         Moderate   Better
medium    ~769 MB         Slow       Great
large     ~1.5 GB         Slowest    Best

The model weights are downloaded once on first use and cached locally. The medium model is a solid choice — good balance of accuracy and speed, especially with a GPU.
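
You can see what’s already been downloaded. Whisper’s default cache directory is ~/.cache/whisper (XDG_CACHE_HOME overrides it if set) — a quick listing, assuming that default location:

```python
from pathlib import Path

# List any Whisper model weights already cached locally. The default cache
# directory is ~/.cache/whisper; adjust if XDG_CACHE_HOME is set.
cache = Path.home() / ".cache" / "whisper"
if cache.is_dir():
    for f in sorted(cache.glob("*.pt")):
        print(f"{f.name}: {f.stat().st_size / 1e6:.0f} MB")
else:
    print("no cached models yet -- weights download on first use")
```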

Notes on YouTube URLs

YouTube share links include a tracking parameter like ?si=_R10SfW1LGZymUTd. This won’t cause issues, but it’s not needed — the video ID alone is sufficient. Both of these work:

  • https://youtu.be/Nt03hgxv5TE?si=_R10SfW1LGZymUTd
  • https://youtu.be/Nt03hgxv5TE

Full youtube.com URLs work too:

  • https://www.youtube.com/watch?v=Nt03hgxv5TE
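
All three forms resolve to the same 11-character video ID. If you ever need to normalize URLs yourself, a small helper could look like this (illustrative only — not part of the script below):

```python
from urllib.parse import urlparse, parse_qs

def video_id(url: str) -> str:
    """Extract the video ID from youtu.be and youtube.com/watch URLs."""
    parsed = urlparse(url)
    if parsed.netloc.endswith("youtu.be"):
        # Short links carry the ID in the path; the ?si=... tracker is ignored.
        return parsed.path.lstrip("/")
    return parse_qs(parsed.query)["v"][0]

print(video_id("https://youtu.be/Nt03hgxv5TE?si=_R10SfW1LGZymUTd"))  # Nt03hgxv5TE
```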

Output Format

The transcript file contains two sections:

  1. Full Text — the complete transcription as a single block of text
  2. Timestamped Segments — the same content broken into timestamped chunks

The file looks like this:
Transcript: Video Title Here
============================

--- Full Text ---

The entire transcription appears here as continuous text...

--- Timestamped Segments ---

[00:00] First segment of speech.
[00:15] Next segment continues here.
[01:02] And so on through the video.
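
The timestamped section is easy to post-process. A sketch of a reader for that format (handles both the MM:SS and HH:MM:SS stamps the script emits):

```python
import re

# Matches "[MM:SS] text" and "[HH:MM:SS] text" lines from the transcript.
SEGMENT = re.compile(r"\[(?:(\d+):)?(\d+):(\d+)\]\s*(.*)")

def parse_segment(line: str):
    """Return (seconds, text) for one timestamped transcript line."""
    h, m, s, text = SEGMENT.match(line).groups()
    return int(h or 0) * 3600 + int(m) * 60 + int(s), text

print(parse_segment("[01:02] And so on through the video."))
# (62, 'And so on through the video.')
```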

The Script

Save this as youtube_transcriber.py:

#!/usr/bin/env python3
"""
YouTube Video Transcriber
=========================
Downloads audio from a YouTube video and transcribes it using OpenAI's Whisper model.

Usage:
    python youtube_transcriber.py <youtube_url> [--model <model_size>] [--output <output_file>]

Requirements:
    pip install yt-dlp openai-whisper

Examples:
    python youtube_transcriber.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
    python youtube_transcriber.py "https://youtu.be/dQw4w9WgXcQ" --model medium --output transcript.txt
"""

import argparse
import os
import sys
import tempfile
import time


def check_dependencies():
    """Check that required packages are installed."""
    missing = []
    try:
        import yt_dlp
    except ImportError:
        missing.append("yt-dlp")
    try:
        import whisper
    except ImportError:
        missing.append("openai-whisper")

    if missing:
        print("❌ Missing dependencies. Install them with:")
        print(f"   pip install {' '.join(missing)}")
        sys.exit(1)


def download_audio(url: str, output_dir: str) -> tuple[str, str]:
    """Download audio from a YouTube video; return (audio_path, title)."""
    import yt_dlp

    output_template = os.path.join(output_dir, "audio.%(ext)s")

    ydl_opts = {
        "format": "bestaudio/best",
        "postprocessors": [
            {
                "key": "FFmpegExtractAudio",
                "preferredcodec": "mp3",
                "preferredquality": "192",
            }
        ],
        "outtmpl": output_template,
        "quiet": False,
        "no_warnings": True,
    }

    print(f"\n🔽 Downloading audio from: {url}")
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(url, download=True)
        title = info.get("title", "Unknown")
        duration = info.get("duration") or 0  # can be None for live/unfinished videos
        print(f"   Title: {title}")
        print(f"   Duration: {duration // 60}m {duration % 60}s")

    audio_path = os.path.join(output_dir, "audio.mp3")
    if not os.path.exists(audio_path):
        # Sometimes the extension varies
        for ext in ["mp3", "m4a", "wav", "webm", "opus"]:
            candidate = os.path.join(output_dir, f"audio.{ext}")
            if os.path.exists(candidate):
                audio_path = candidate
                break

    if not os.path.exists(audio_path):
        raise FileNotFoundError("Failed to download audio file.")

    return audio_path, title


def get_device(requested: str = "auto") -> str:
    """Determine the best available device."""
    import torch

    if requested != "auto":
        return requested

    if torch.cuda.is_available():
        gpu_name = torch.cuda.get_device_name(0)
        print(f"   🎮 GPU detected: {gpu_name}")
        return "cuda"
    else:
        print("   💻 No GPU detected, using CPU")
        return "cpu"


def transcribe_audio(audio_path: str, model_size: str = "base", device: str = "auto") -> dict:
    """Transcribe audio using Whisper and return the result."""
    import whisper

    device = get_device(device)
    print(f"\n🧠 Loading Whisper model: {model_size} (on {device})")
    model = whisper.load_model(model_size, device=device)

    print("📝 Transcribing audio (this may take a while)...")
    start = time.time()
    result = model.transcribe(audio_path, verbose=False, fp16=(device == "cuda"))
    elapsed = time.time() - start
    print(f"   Transcription completed in {elapsed:.1f}s")

    return result


def format_timestamp(seconds: float) -> str:
    """Convert seconds to HH:MM:SS format."""
    h = int(seconds // 3600)
    m = int((seconds % 3600) // 60)
    s = int(seconds % 60)
    if h > 0:
        return f"{h:02d}:{m:02d}:{s:02d}"
    return f"{m:02d}:{s:02d}"


def save_transcript(result: dict, output_path: str, title: str = ""):
    """Save the transcript to a text file with timestamps."""
    with open(output_path, "w", encoding="utf-8") as f:
        if title:
            f.write(f"Transcript: {title}\n")
            f.write("=" * (len(title) + 12) + "\n\n")

        # Write full text
        f.write("--- Full Text ---\n\n")
        f.write(result["text"].strip() + "\n\n")

        # Write timestamped segments
        f.write("--- Timestamped Segments ---\n\n")
        for seg in result.get("segments", []):
            ts = format_timestamp(seg["start"])
            text = seg["text"].strip()
            f.write(f"[{ts}] {text}\n")

    print(f"\n💾 Transcript saved to: {output_path}")


def main():
    parser = argparse.ArgumentParser(description="Transcribe YouTube videos using Whisper")
    parser.add_argument("url", help="YouTube video URL")
    parser.add_argument(
        "--model",
        default="base",
        choices=["tiny", "base", "small", "medium", "large"],
        help="Whisper model size (default: base). Larger = more accurate but slower.",
    )
    parser.add_argument(
        "--device",
        default="auto",
        choices=["auto", "cuda", "cpu"],
        help="Device to use (default: auto). Auto picks GPU if available.",
    )
    parser.add_argument(
        "--output",
        default=None,
        help="Output file path (default: <video_title>_transcript.txt)",
    )

    args = parser.parse_args()

    check_dependencies()

    with tempfile.TemporaryDirectory() as tmp_dir:
        # Download audio
        audio_path, title = download_audio(args.url, tmp_dir)

        # Transcribe
        result = transcribe_audio(audio_path, args.model, args.device)

        # Determine output path
        if args.output:
            output_path = args.output
        else:
            safe_title = "".join(c if c.isalnum() or c in " -_" else "" for c in title)
            safe_title = safe_title.strip().replace(" ", "_")[:80]
            output_path = f"{safe_title}_transcript.txt"

        # Save
        save_transcript(result, output_path, title)

        # Print preview
        text = result["text"].strip()
        preview = text[:500] + "..." if len(text) > 500 else text
        print(f"\n📄 Preview:\n{preview}")

    print("\n✅ Done!")


if __name__ == "__main__":
    main()

Troubleshooting

externally-managed-environment error when installing with pip: Your system Python is protected. Use a virtual environment — either activate it first with source ~/path/to/venv/bin/activate or call the venv’s pip directly: ~/path/to/venv/bin/pip install ...

required file not found when running venv pip/python: The venv was likely built with a Python version that has since been upgraded or removed. Rebuild it with python3 -m venv ~/path/to/venv --clear and reinstall your packages.

Slow transcription: Use a smaller model (--model base or --model tiny), or install GPU-enabled PyTorch for CUDA acceleration. The medium model on CPU can take several minutes for a long video.

ffmpeg not found: Install it with sudo apt install ffmpeg (Debian/Ubuntu) or brew install ffmpeg (macOS).