How to Scrape YouTube in 2026: The Methods That Actually Still Work

AJ Tate
January 17, 2025

YouTube is one of the largest sources of public web data on the internet — over 800 million videos, billions of comments, transcripts for most uploads, and structured metadata that powers everything from market research to AI training. It’s also one of the most aggressively defended properties online, and the methods that worked for scraping it three years ago mostly don’t anymore.

The post you’re reading replaces an older guide that was technically accurate in 2022. It isn’t accurate now. YouTube tightened its anti-bot stack significantly through 2025: Proof of Origin tokens now gate many endpoints, visitor data cookies are required for most video page loads, and datacenter IP ranges trigger CAPTCHA challenges within a few hundred requests. The CSS selectors that older tutorials use don’t match YouTube’s current markup. Even the “dislikes” data those tutorials reference hasn’t been publicly accessible since November 2021.

Here’s what does work in 2026, with code that actually runs.

What you can scrape (and what you can’t)

Public data, freely available:

Video metadata — title, description, view count, upload date, channel, duration, tags
Channel data — name, description, subscriber count (approximate), video list, About page info
Comments — including replies, like counts, timestamps
Transcripts and captions — for any video with captions enabled
Search results — full SERPs for any query
Playlists — full contents and metadata
Live chat replays — for streams that have ended

Off-limits regardless of method:

Private videos
Unlisted videos you don’t have the URL for
Anything behind login the account doesn’t own
Engagement data on individual viewers
Exact subscriber counts (YouTube only displays rounded numbers publicly)

The legal layer matters too: YouTube’s Terms of Service restrict automated access. Scraping public data is generally legal in most jurisdictions, but ToS violations can result in account bans and, in some cases, civil action. Don’t scrape behind login. Don’t collect personally identifiable information beyond what’s publicly displayed. Don’t redistribute video content. If you’re collecting for AI training, document the source and respect creator rights — the EU AI Act and similar regulations are increasingly active here.

The four methods worth knowing

There are really four approaches in 2026 — picking the right one depends on what you’re after and at what scale.

Method	Best for	Auth needed	Scale
YouTube Data API v3	Structured queries, small to medium volume	API key	Limited (10K quota units/day free)
yt-dlp	Everything else: metadata, comments, transcripts, batch	None (cookies for some videos)	Medium to high with proxies
youtube-transcript-api	Transcripts only	None	Medium
Custom scraping (requests + InnerTube)	Specialized cases the above don’t cover	None	High with infrastructure

In practice, most serious operations use yt-dlp as the workhorse and reach for the others when yt-dlp can’t cover the case.

Method 1: YouTube Data API v3

The cleanest option when it fits. Google’s official API returns structured JSON for videos, channels, playlists, search, and comments. There’s no anti-bot friction — the request either succeeds or returns a clear quota error.

The catch is quota. The free tier gives you 10,000 units per day. A videos.list call costs 1 unit; a search costs 100. That’s roughly 100 search queries per day, or 10,000 videos.list calls, each of which can carry up to 50 video IDs. Fine for many use cases; useless for anything at scale.
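To make the quota arithmetic concrete, here is a minimal sketch of a budget estimator. The unit costs are the figures quoted above; the batch size of 50 reflects the documented limit on the id parameter of videos.list. The function names (quota_needed, batch_ids) are illustrative and not part of any library.

```python
import math

DAILY_QUOTA = 10_000       # free-tier units per day, per the figures above
SEARCH_COST = 100          # units per search call
VIDEOS_LIST_COST = 1       # units per videos.list call
MAX_IDS_PER_CALL = 50      # videos.list accepts up to 50 comma-separated IDs


def quota_needed(num_searches, num_videos):
    """Estimate units for a job that batches video IDs 50 per call."""
    video_calls = math.ceil(num_videos / MAX_IDS_PER_CALL)
    return num_searches * SEARCH_COST + video_calls * VIDEOS_LIST_COST


def batch_ids(video_ids, size=MAX_IDS_PER_CALL):
    """Yield comma-joined ID strings suitable for the id= parameter."""
    for start in range(0, len(video_ids), size):
        yield ",".join(video_ids[start:start + size])


if __name__ == "__main__":
    units = quota_needed(num_searches=20, num_videos=1_000)
    print(f"{units} of {DAILY_QUOTA} units")
```

Running an estimate like this before a job starts shows whether the plan fits the free tier or whether search calls, the expensive part, need to be cut back.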

python

from googleapiclient.discovery import build

API_KEY = "YOUR_API_KEY"
youtube = build("youtube", "v3", developerKey=API_KEY)

# Get metadata for a specific video
response = youtube.videos().list(
    part="snippet,statistics,contentDetails",
    id="dQw4w9WgXcQ"
).execute()

video = response["items"][0]
print(video["snippet"]["title"])
print(video["statistics"]["viewCount"])
print(video["contentDetails"]["duration"])

Use the API when your volume fits the quota and you want clean, stable data. Apply for an increased quota if you have a legitimate business case — Google approves these for real applications, just not for “I’m building a scraper.”
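One detail of the example above: contentDetails.duration comes back as an ISO 8601 duration string such as PT3M33S, not a number. A small sketch for converting it to seconds follows. It assumes only days, hours, minutes, and seconds appear (week, month, and year designators are not handled), and the helper name is made up for illustration.

```python
import re

# Matches strings like "PT3M33S", "PT1H2M3S", or "P1DT2H"
_DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def iso8601_duration_to_seconds(value):
    """Convert an ISO 8601 duration (days/hours/minutes/seconds) to seconds."""
    match = _DURATION_RE.match(value)
    if not match:
        raise ValueError(f"Unrecognized duration: {value!r}")
    parts = {name: int(num) if num else 0
             for name, num in match.groupdict().items()}
    return (parts["days"] * 86400 + parts["hours"] * 3600
            + parts["minutes"] * 60 + parts["seconds"])
```

Converting at ingestion time keeps API durations comparable with the integer seconds that other tools report.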

Method 2: yt-dlp

yt-dlp is the de facto standard for everything the API doesn’t cover cleanly. It’s an actively maintained fork of youtube-dl, no API key required, and handles metadata, comments, transcripts, and downloads in one tool.

Install:

bash

pip install yt-dlp

Get metadata for a single video without downloading anything:

python

import yt_dlp

def get_video_metadata(url):
    opts = {
        "quiet": True,
        "skip_download": True,
        "no_warnings": True,
    }
    with yt_dlp.YoutubeDL(opts) as ydl:
        info = ydl.extract_info(url, download=False)
    return {
        "title": info.get("title"),
        "channel": info.get("uploader"),
        "views": info.get("view_count"),
        "likes": info.get("like_count"),
        "duration_sec": info.get("duration"),
        "upload_date": info.get("upload_date"),
        "description": info.get("description"),
        "tags": info.get("tags", []),
    }

data = get_video_metadata("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
print(data)
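
The dictionary returned above usually needs light cleanup before storage. The sketch below assumes upload_date arrives as a YYYYMMDD string, which is how yt-dlp reports it, and that tags may be None when a video has none. The function name normalize_metadata is illustrative.

```python
from datetime import datetime


def normalize_metadata(record):
    """Return a copy with upload_date as ISO 8601 and tags always a list."""
    cleaned = dict(record)
    raw_date = cleaned.get("upload_date")
    if raw_date:
        # yt-dlp reports upload_date as a YYYYMMDD string
        cleaned["upload_date"] = (
            datetime.strptime(raw_date, "%Y%m%d").date().isoformat()
        )
    # A present-but-None value would slip past info.get("tags", [])
    cleaned["tags"] = cleaned.get("tags") or []
    return cleaned
```

Usage would be normalize_metadata(get_video_metadata(url)), leaving the original dictionary untouched.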

Pull comments (including replies):

python

import yt_dlp

def get_comments(url, max_comments=500):
    opts = {
        "quiet": True,
        "skip_download": True,
        "getcomments": True,
        "extractor_args": {
            "youtube": {
                "max_comments": [str(max_comments)],
                "comment_sort": ["top"],
            }
        },
    }
    with yt_dlp.YoutubeDL(opts) as ydl:
        info = ydl.extract_info(url, download=False)
    return info.get("comments", [])

comments = get_comments("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
for c in comments[:5]:
    print(f"{c['author']}: {c['text']} ({c['like_count']} likes)")
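
The comments come back as a flat list, so replies have to be re-attached to their parents if thread structure matters. The sketch below assumes each comment dict carries an id field and a parent field set to "root" for top-level comments, which is how yt-dlp's YouTube extractor has reported them; check a sample from your installed version before relying on it. The function name is illustrative.

```python
from collections import defaultdict


def thread_comments(comments):
    """Group a flat comment list into [(top_level, [replies...]), ...]."""
    replies = defaultdict(list)
    top_level = []
    for comment in comments:
        # Assumption: parent is "root" for top-level comments
        parent = comment.get("parent", "root")
        if parent == "root":
            top_level.append(comment)
        else:
            replies[parent].append(comment)
    return [(c, replies.get(c["id"], [])) for c in top_level]
```

Note that a max_comments cap can cut off replies, so a thread with no replies in the output is not proof that none exist.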

Batch-scrape a list of video IDs from a search:

bash

yt-dlp --flat-playlist -j "ytsearch50:web scraping tutorial" | jq -r '.id'
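
With -j, yt-dlp prints one JSON object per line, so the output can be saved to a file and parsed in Python instead of piped through jq. The sketch below assumes each flat entry exposes id, title, and url keys; missing keys simply come back as None. The function name and the results.jsonl filename are hypothetical.

```python
import json


def parse_flat_playlist(lines):
    """Parse `yt-dlp --flat-playlist -j` output: one JSON object per line."""
    videos = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        entry = json.loads(line)
        videos.append({
            "id": entry.get("id"),
            "title": entry.get("title"),
            "url": entry.get("url"),
        })
    return videos
```

For example, redirect the command's output to results.jsonl, then call parse_flat_playlist(open("results.jsonl", encoding="utf-8")).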

A few practical notes that aren’t in most tutorials:

Update yt-dlp frequently. YouTube breaks the extractor regularly — at least monthly. Running an old version is the #1 reason scripts mysteriously return empty results. pip install -U yt-dlp should be in your maintenance schedule.
Age-restricted videos require cookies. Export cookies from a browser session to a Netscape-format cookies.txt file and pass its path via the cookiefile option in the opts dict.
High volume needs proxies. Without them, you’ll hit rate limits and IP blocks fast. More on this below.
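The cookie and proxy notes above can be folded into one options builder. This is a minimal sketch: cookiefile and proxy are standard YoutubeDL options, while the function name, the cookies.txt path, and the proxy URL are placeholders.

```python
def build_ydl_opts(cookie_path=None, proxy=None):
    """Build a YoutubeDL options dict with optional cookies and proxy."""
    opts = {"quiet": True, "skip_download": True, "no_warnings": True}
    if cookie_path:
        opts["cookiefile"] = cookie_path  # Netscape-format cookies.txt
    if proxy:
        opts["proxy"] = proxy  # e.g. "http://USER:PASS@proxy.example.com:8080"
    return opts


# Usage (requires yt-dlp; the path below is a placeholder):
# import yt_dlp
# with yt_dlp.YoutubeDL(build_ydl_opts(cookie_path="cookies.txt")) as ydl:
#     info = ydl.extract_info(url, download=False)
```

Keeping the options in one place makes it easier to add the cookie file or proxy later without touching every call site.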

Method 3: youtube-transcript-api

If transcripts are all you need — and increasingly they are, since transcript text is the most useful input for LLM-based content analysis — youtube-transcript-api is lighter and faster than yt-dlp.

python

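from youtube_transcript_api import YouTubeTranscriptApi

def get_transcript(video_id, languages=("en",)):
    try:
        # youtube-transcript-api 1.x: create an instance and call fetch().
        # The static get_transcript() from pre-1.0 releases was deprecated
        # in 1.0 and is absent from newer releases.
        fetched = YouTubeTranscriptApi().fetch(video_id, languages=list(languages))
        return " ".join(snippet.text for snippet in fetched)
    except Exception as e:
        print(f"No transcript available: {e}")
        return None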

text = get_transcript("dQw4w9WgXcQ")
print(text[:500] if text else "None")

Transcripts pair naturally with sentiment analysis, RAG pipelines, and any LLM-based workflow that needs text content from video. This is one of the fastest-growing scraping use cases in 2026.
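For RAG-style use, the joined text is usually less useful than timestamped chunks. The sketch below assumes caption entries shaped like dicts with text and start keys (the raw-data form the library can produce) and groups them into chunks of roughly max_chars characters. The function name and the chunk size are arbitrary choices for illustration.

```python
def chunk_transcript(entries, max_chars=1000):
    """Group caption entries into chunks of roughly max_chars characters.

    Each chunk keeps the start time of its first entry so a retrieved
    passage can be linked back to a position in the video.
    """
    chunks = []
    current_texts, current_len, current_start = [], 0, None
    for entry in entries:
        text = entry["text"].strip()
        if not text:
            continue
        # Flush the current chunk if adding this entry would exceed the limit
        if current_texts and current_len + len(text) + 1 > max_chars:
            chunks.append({"start": current_start,
                           "text": " ".join(current_texts)})
            current_texts, current_len, current_start = [], 0, None
        if current_start is None:
            current_start = entry["start"]
        current_texts.append(text)
        current_len += len(text) + 1
    if current_texts:
        chunks.append({"start": current_start, "text": " ".join(current_texts)})
    return chunks
```

Splitting on caption boundaries rather than at a fixed character offset keeps each chunk aligned with a real timestamp.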

Method 4: InnerTube and ytInitialData

For specialized cases — channel About data, specific continuation tokens, anything the above tools don’t cleanly expose — you can hit YouTube’s internal endpoints directly. The frontend uses a private API at /youtubei/v1/, and most video pages embed a ytInitialData JSON object in a <script> tag that contains the rendered page state.

This is more brittle than the other methods — YouTube changes the structure periodically — but it’s also the most flexible:

python

import requests
import re
import json

def extract_initial_data(video_url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/120.0.0.0 Safari/537.36",
    }
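    response = requests.get(video_url, headers=headers, timeout=15)
    response.raise_for_status()
    # The object normally sits on one line; DOTALL guards against embedded newlines
    match = re.search(r"var ytInitialData = ({.*?});</script>", response.text, re.DOTALL)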
    if not match:
        return None
    return json.loads(match.group(1))

data = extract_initial_data("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
# data is the full page state — navigate it for whatever you need

Use this approach only when the other methods don’t cover the case. The structure of ytInitialData is enormous and changes regularly; navigating it requires browser-devtools spelunking each time.
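Two helpers can reduce some of that spelunking. The sketch below is one possible approach, not YouTube-specific tooling: it decodes the embedded object with the standard library's JSONDecoder.raw_decode rather than a regex terminator, then searches the nested structure for a key by name. The marker string matches the one used above; the function names are hypothetical.

```python
import json


def extract_json_after(html, marker="var ytInitialData = "):
    """Decode the JSON object that follows `marker`, or return None."""
    start = html.find(marker)
    if start == -1:
        return None
    brace = html.find("{", start + len(marker))
    if brace == -1:
        return None
    try:
        # raw_decode parses one JSON value and ignores whatever follows it
        obj, _end = json.JSONDecoder().raw_decode(html, brace)
    except json.JSONDecodeError:
        return None
    return obj


def find_key(node, key):
    """Yield every value stored under `key` anywhere in nested dicts/lists."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            yield from find_key(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)
```

Something like set(find_key(data, "videoId")) then lists the video IDs referenced on a page without hard-coding the path to them, which tends to survive layout changes better than a fixed chain of keys.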

The proxy layer

Past a few hundred requests per day from a single IP, YouTube starts pushing back. The pattern in 2026 is consistent: first a 429 Too Many Requests, then a soft block where requests succeed but return degraded data, then a hard block where every request gets a “sign in to confirm you’re not a bot” wall.
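A scraper benefits from telling these stages apart so it can react differently to each. The following is a rough sketch under stated assumptions: it treats HTTP 429 as rate limiting and looks for the bot-check phrase quoted above in the response body. The labels and function name are invented for illustration, and the soft-block stage is not detected here because it requires checking the returned fields themselves.

```python
def classify_response(status_code, body):
    """Map a response to 'rate_limited', 'bot_wall', or 'ok'."""
    if status_code == 429:
        return "rate_limited"
    # Normalize curly apostrophes so the phrase matches either form
    lowered = body.lower().replace("\u2019", "'")
    if "confirm you're not a bot" in lowered:
        return "bot_wall"
    return "ok"
```

A caller might slow down or switch IPs on "rate_limited" and retire the IP entirely on "bot_wall".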

Three things determine whether scraping at scale works:

IP type. Datacenter IPs are flagged within a few hundred requests. Residential or ISP proxies route around this — to YouTube, traffic from a residential IP looks like a normal user on a home connection.

Rotation pattern. For yt-dlp, rotating IPs per video request is the standard. For session-based scraping (paginating comments, browsing a channel), sticky sessions (the same IP for 10–30 minutes) look more natural than rotation mid-session.

Geographic distribution. YouTube serves different content to different regions. If you’re collecting region-specific data — trending lists, localized search results, regional video availability — your proxies need to live in those regions.
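The two rotation patterns can be sketched as a small client-side pool. This is an illustration only: many providers implement sticky sessions on their side, in which case this bookkeeping is unnecessary. The class name, the 600-second default, and the injectable clock are assumptions made for the example.

```python
import itertools
import time


class ProxyPool:
    """Hands out proxies per request or pinned to a session key."""

    def __init__(self, proxies, sticky_seconds=600, clock=time.monotonic):
        if not proxies:
            raise ValueError("proxies must not be empty")
        self._cycle = itertools.cycle(proxies)
        self._sticky_seconds = sticky_seconds
        self._clock = clock
        self._sessions = {}  # session key -> (proxy, expiry time)

    def rotating(self):
        """Next proxy in round-robin order (one per video request)."""
        return next(self._cycle)

    def sticky(self, session_key):
        """Same proxy for a session until sticky_seconds elapse."""
        now = self._clock()
        proxy, expires = self._sessions.get(session_key, (None, 0.0))
        if proxy is None or now >= expires:
            proxy = next(self._cycle)
            self._sessions[session_key] = (proxy, now + self._sticky_seconds)
        return proxy
```

Use rotating() for independent video fetches and sticky("some-channel-id") while paginating one channel or comment thread.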

Connecting a proxy to yt-dlp is straightforward:

bash

yt-dlp --proxy "http://USER:PASS@proxy.example.com:8080" \
       --skip-download --write-info-json \
       "https://www.youtube.com/watch?v=dQw4w9WgXcQ"

Or in Python:

python

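import yt_dlp

opts = {
    "quiet": True,
    "skip_download": True,
    "proxy": "http://USER:PASS@proxy.example.com:8080",
}
# Every request made through this instance goes via the proxy
with yt_dlp.YoutubeDL(opts) as ydl:
    info = ydl.extract_info("https://www.youtube.com/watch?v=dQw4w9WgXcQ", download=False)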

For requests-based scraping (Method 4 above):

python

import requests

# video_url and headers are the same values used in Method 4
proxies = {
    "http": "http://USER:PASS@proxy.example.com:8080",
    "https": "http://USER:PASS@proxy.example.com:8080",
}
response = requests.get(video_url, headers=headers, proxies=proxies, timeout=15)

IPBurger’s residential and ISP proxies are built for this kind of work — clean IPs, country-level targeting, sticky sessions when you need them — and the broader principle applies regardless of provider: at the volumes that matter, the proxy layer is the load-bearing piece. The scraping script is 50 lines; the infrastructure decides whether it runs to completion or stalls at 30%.
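One practical detail with the proxy URLs shown above: credentials that contain characters such as @ or : must be percent-encoded, or the URL will be parsed incorrectly. A minimal sketch using only the standard library follows; the function names and the example host are placeholders.

```python
from urllib.parse import quote


def build_proxy_url(user, password, host, port, scheme="http"):
    """Build a proxy URL with credentials percent-encoded."""
    return (f"{scheme}://{quote(user, safe='')}:"
            f"{quote(password, safe='')}@{host}:{port}")


def requests_proxies(proxy_url):
    """Proxy mapping in the shape the requests library expects."""
    return {"http": proxy_url, "https": proxy_url}
```

The same URL string works for the yt-dlp proxy option and for the requests mapping, so building it once avoids drift between the two.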

A reasonable default workflow

If you’re starting a YouTube scraping project today, this is the path that scales:

1. Define the data you actually need. Most projects don’t need everything; narrowing the scope avoids quota waste and reduces the surface for blocks.
2. Try the YouTube Data API first if it fits. If your volume is under 10,000 units/day and the API exposes what you need, this is the most reliable path.
3. Reach for yt-dlp when the API doesn’t fit. It covers comments, transcripts, batch operations, and anything search-result-based.
4. Add a residential proxy layer once you start hitting 429s. Don’t try to outsmart the rate limiter with delays alone: once your IP is flagged, it’s flagged.
5. Use youtube-transcript-api for transcript-heavy work. Lighter than yt-dlp, faster for that specific job.
6. Custom requests + InnerTube only when nothing else covers it. Worth knowing about; not worth starting with.
7. Build for breakage. Whatever you ship today will break within six months. YouTube changes its frontend, the extractor catches up, you update the dependency. Plan for this.
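Building for breakage can start with a cheap health check on each batch, since extractor breakage often shows up as empty or missing fields rather than as exceptions. The sketch below is one way to do that; the required field list and the function names are illustrative choices, not a standard.

```python
REQUIRED_FIELDS = ("title", "channel", "views", "upload_date")  # illustrative


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or None in a record."""
    return [name for name in required if record.get(name) is None]


def breakage_rate(records, required=REQUIRED_FIELDS):
    """Fraction of records missing at least one required field."""
    if not records:
        return 1.0  # an empty batch is itself a sign something broke
    broken = sum(1 for r in records if missing_fields(r, required))
    return broken / len(records)
```

Alerting when breakage_rate crosses a threshold (say, a few percent) gives an early signal that a dependency needs updating before a full run is wasted.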

What’s actually changed since 2022

For anyone migrating from an older scraping setup, the substantive changes worth flagging:

Dislikes data is gone. YouTube removed public dislike counts in November 2021. No method recovers them; third-party “dislike” estimates are guesses.
Proof of Origin tokens now gate many streaming and detailed metadata endpoints. yt-dlp handles this internally when it can; manual scraping has to deal with it explicitly.
Visitor data cookies are required for most video page loads. A fresh IP without a warmed-up session often hits a consent wall instead of the video.
CSS selectors from old tutorials don’t match. YouTube reorganized its frontend significantly through 2023 and 2024. Any tutorial referencing classes like yt-uix-tile-link is referencing a markup version that hasn’t existed for years.
Quota costs increased for some endpoints. Search is still 100 units, but some other endpoints have shifted. Check the current quota guide before estimating.

The honest summary

YouTube scraping in 2026 is more achievable than it sounds — the tooling has gotten much better since 2022 (yt-dlp is genuinely excellent), and the official API is reliable for jobs that fit. The hard part isn’t the scraping logic; it’s the infrastructure underneath it. Run residential or ISP proxies, update your tools weekly, and design for the inevitable breakages, and the data is yours.
