How to Scrape YouTube Data in 2026: The Methods That Still Work

AJ Tait
January 17, 2025

YouTube is one of the largest sources of public web data on the internet — over 800 million videos, billions of comments, transcripts for most uploads, and structured metadata that powers everything from market research to AI training. It’s also one of the most aggressively defended properties online, and the methods that worked for scraping it three years ago mostly don’t anymore.

The post you’re reading replaces an older guide that was technically accurate in 2022. It isn’t accurate now. YouTube tightened its anti-bot stack significantly through 2025: Proof of Origin tokens now gate many endpoints, visitor data cookies are required for most video page loads, and datacenter IP ranges trigger CAPTCHA challenges within a few hundred requests. The CSS selectors that older tutorials use don’t match YouTube’s current markup. Even the “dislikes” data those tutorials reference hasn’t been publicly accessible since November 2021.

Here’s what does work in 2026, with code that actually runs.


What you can scrape (and what you can’t)

Public data, freely available:

Video metadata — title, description, view count, upload date, channel, duration, tags
Channel data — name, description, subscriber count (approximate), video list, About page info
Comments — including replies, like counts, timestamps
Transcripts and captions — for any video with captions enabled
Search results — full SERPs for any query
Playlists — full contents and metadata
Live chat replays — for streams that have ended

Off-limits regardless of method:

Private videos
Unlisted videos you don’t have the URL for
Anything behind a login on an account you don’t own
Engagement data on individual viewers
Exact subscriber counts (YouTube only displays rounded numbers publicly)

The legal layer matters too: YouTube’s Terms of Service restrict automated access. Scraping public data is generally legal in most jurisdictions, but ToS violations can result in account bans and, in some cases, civil action. Don’t scrape behind login. Don’t collect personally identifiable information beyond what’s publicly displayed. Don’t redistribute video content. If you’re collecting for AI training, document the source and respect creator rights — the EU AI Act and similar regulations are increasingly active here.

The four methods worth knowing

There are really four approaches in 2026 — picking the right one depends on what you’re after and at what scale.

Method	Best for	Auth needed	Scale
YouTube Data API v3	Structured queries, small to medium volume	API key	Limited (10K quota units/day free)
yt-dlp	Everything else: metadata, comments, transcripts, batch	None (cookies for some videos)	Medium to high with proxies
youtube-transcript-api	Transcripts only	None	Medium
Custom scraping (requests + InnerTube)	Specialized cases the above don’t cover	None	High with infrastructure

In practice, most serious operations use yt-dlp as the workhorse and reach for the others when yt-dlp can’t cover the case.

Method 1: YouTube Data API v3

The cleanest option when it fits. Google’s official API returns structured JSON for videos, channels, playlists, search, and comments. There’s no anti-bot friction — the request either succeeds or returns a clear quota error.

The catch is quota. The free tier gives you 10,000 units per day. A videos.list call costs 1 unit; a search costs 100. That’s roughly 100 search queries per day, or 10,000 video metadata fetches. Fine for many use cases; useless for anything at scale.
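
The quota arithmetic above is easy to check before a job starts. The sketch below is a rough budget estimate using the unit costs quoted in this section. The constant names and helper functions are illustrative, and the batch size of 50 IDs per videos.list call is an assumption to verify against the current API reference.

```python
import math

DAILY_QUOTA_UNITS = 10_000   # free tier, as quoted in the text
SEARCH_COST = 100            # units per search call
VIDEOS_LIST_COST = 1         # units per videos.list call
IDS_PER_VIDEOS_CALL = 50     # assumption: IDs batched per call, check the docs


def quota_needed(search_calls, video_ids):
    """Estimate the units consumed by a mix of searches and metadata fetches."""
    videos_calls = math.ceil(video_ids / IDS_PER_VIDEOS_CALL)
    return search_calls * SEARCH_COST + videos_calls * VIDEOS_LIST_COST


def fits_daily_quota(search_calls, video_ids):
    """True when the estimated cost stays within one day of free quota."""
    return quota_needed(search_calls, video_ids) <= DAILY_QUOTA_UNITS
```

Running an estimate like this up front shows quickly that search calls, not metadata fetches, are what exhaust the free tier.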

python

from googleapiclient.discovery import build

API_KEY = "YOUR_API_KEY"
youtube = build("youtube", "v3", developerKey=API_KEY)

# Get metadata for a specific video
response = youtube.videos().list(
    part="snippet,statistics,contentDetails",
    id="dQw4w9WgXcQ"
).execute()

video = response["items"][0]
print(video["snippet"]["title"])
print(video["statistics"]["viewCount"])
print(video["contentDetails"]["duration"])

Use the API when your volume fits the quota and you want clean, stable data. Apply for an increased quota if you have a legitimate business case — Google approves these for real applications, just not for “I’m building a scraper.”
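
One detail worth handling early: the contentDetails duration comes back as an ISO 8601 string such as PT3M33S rather than a number of seconds. A minimal sketch of a converter follows. The function name is illustrative, and it only covers the day, hour, minute, and second designators, which is an assumption about the values you will encounter.

```python
import re

# Matches durations such as PT3M33S, PT1H2M3S, or P1DT2H; weeks and months are not handled
_DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def iso8601_duration_to_seconds(value):
    """Convert an ISO 8601 duration string into whole seconds."""
    match = _DURATION_RE.match(value)
    if not match:
        raise ValueError(f"Unsupported duration format: {value!r}")
    parts = {name: int(num) if num else 0 for name, num in match.groupdict().items()}
    return (
        parts["days"] * 86_400
        + parts["hours"] * 3_600
        + parts["minutes"] * 60
        + parts["seconds"]
    )
```

Storing durations as integers makes them directly comparable with the duration field that yt-dlp reports in seconds.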

Method 2: yt-dlp

yt-dlp is the de facto standard for everything the API doesn’t cover cleanly. It’s an actively maintained fork of youtube-dl, no API key required, and handles metadata, comments, transcripts, and downloads in one tool.

Install:

bash

pip install yt-dlp

Get metadata for a single video without downloading anything:

python

import yt_dlp

def get_video_metadata(url):
    opts = {
        "quiet": True,
        "skip_download": True,
        "no_warnings": True,
    }
    with yt_dlp.YoutubeDL(opts) as ydl:
        info = ydl.extract_info(url, download=False)
    return {
        "title": info.get("title"),
        "channel": info.get("uploader"),
        "views": info.get("view_count"),
        "likes": info.get("like_count"),
        "duration_sec": info.get("duration"),
        "upload_date": info.get("upload_date"),
        "description": info.get("description"),
        "tags": info.get("tags", []),
    }

data = get_video_metadata("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
print(data)

Pull comments (including replies):

python

import yt_dlp

def get_comments(url, max_comments=500):
    opts = {
        "quiet": True,
        "skip_download": True,
        "getcomments": True,
        "extractor_args": {
            "youtube": {
                "max_comments": [str(max_comments)],
                "comment_sort": ["top"],
            }
        },
    }
    with yt_dlp.YoutubeDL(opts) as ydl:
        info = ydl.extract_info(url, download=False)
    # "comments" can be missing or None (for example when comments are disabled)
    return info.get("comments") or []

comments = get_comments("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
for c in comments[:5]:
    print(f"{c['author']}: {c['text']} ({c['like_count']} likes)")

Batch-scrape a list of video IDs from a search:

bash

yt-dlp --flat-playlist -j "ytsearch50:web scraping tutorial" | jq -r '.id'
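
If the rest of the pipeline is in Python, the same command can be driven from a script instead of jq. The sketch below assumes the yt-dlp executable is on PATH and that each output line is one JSON object with an id field, as the jq filter above implies. Both function names are illustrative.

```python
import json
import subprocess


def parse_flat_playlist(jsonl_text):
    """Extract unique video IDs, in order, from JSON-lines output."""
    ids = []
    seen = set()
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON line
        video_id = entry.get("id") if isinstance(entry, dict) else None
        if video_id and video_id not in seen:
            seen.add(video_id)
            ids.append(video_id)
    return ids


def search_video_ids(query, count=50):
    """Run the search command shown above and return the matching video IDs."""
    result = subprocess.run(
        ["yt-dlp", "--flat-playlist", "-j", f"ytsearch{count}:{query}"],
        capture_output=True, text=True, check=True,
    )
    return parse_flat_playlist(result.stdout)
```

Keeping the parsing separate from the subprocess call makes it possible to test the parser on saved output without touching the network.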

A few practical notes that most tutorials skip:

Update yt-dlp frequently. YouTube breaks the extractor regularly — at least monthly. Running an old version is the #1 reason scripts mysteriously return empty results. pip install -U yt-dlp should be in your maintenance schedule.
Age-restricted videos require cookies. Export cookies from a logged-in browser session and pass the file path via the cookiefile option in the opts dict.
High volume needs proxies. Without them, you’ll hit rate limits and IP blocks fast. More on this below.
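
The first two notes can be wired into a script. The following is a minimal sketch, assuming yt-dlp keeps its date-based version strings (for example 2025.01.15). The helper names and the 30-day threshold are illustrative choices, not anything yt-dlp prescribes.

```python
from datetime import date


def version_age_days(version, today=None):
    """Days elapsed since a date-based version string such as '2025.01.15'."""
    year, month, day = (int(part) for part in version.split(".")[:3])
    today = today or date.today()
    return (today - date(year, month, day)).days


def build_opts(cookie_path=None):
    """Base metadata-only options, with an optional exported cookies file."""
    opts = {"quiet": True, "skip_download": True, "no_warnings": True}
    if cookie_path:
        opts["cookiefile"] = cookie_path  # needed for age-restricted videos
    return opts


def warn_if_stale(max_age_days=30):
    """Print a reminder when the installed yt-dlp looks outdated."""
    from yt_dlp.version import __version__  # imported lazily; requires yt-dlp
    age = version_age_days(__version__)
    if age > max_age_days:
        print(f"yt-dlp {__version__} is {age} days old; run pip install -U yt-dlp")
    return age
```

Calling warn_if_stale() at the start of a scheduled job turns a silent failure mode (empty results from an old extractor) into a visible log line.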

Method 3: youtube-transcript-api

If transcripts are all you need — and increasingly they are, since transcript text is the most useful input for LLM-based content analysis — youtube-transcript-api is lighter and faster than yt-dlp.

python

from youtube_transcript_api import YouTubeTranscriptApi

def get_transcript(video_id, languages=("en",)):
    try:
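        # youtube-transcript-api 1.x: create an instance and call fetch();
        # the static get_transcript() from 0.x releases is no longer available.
        fetched = YouTubeTranscriptApi().fetch(
            video_id, languages=list(languages)
        )
        return " ".join(snippet.text for snippet in fetched)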
    except Exception as e:
        print(f"No transcript available: {e}")
        return None

text = get_transcript("dQw4w9WgXcQ")
print(text[:500] if text else "None")

Transcripts pair naturally with sentiment analysis, RAG pipelines, and any LLM-based workflow that needs text content from video. This is one of the fastest-growing scraping use cases in 2026.
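
For RAG pipelines, a single joined string is usually less useful than chunks that keep their timestamps. The sketch below groups transcript entries into size-limited chunks. It assumes entries are available as dicts with text and start keys. The function name and the 1,000-character default are illustrative.

```python
def chunk_transcript(entries, max_chars=1000):
    """Group transcript entries into chunks of at most max_chars characters.

    Each chunk records the start time of its first entry so a retrieved
    passage can link back to the matching moment in the video. A single
    entry longer than max_chars is kept whole rather than split.
    """
    chunks = []
    texts, length, start = [], 0, None
    for entry in entries:
        text = entry["text"].strip()
        if not text:
            continue
        added = len(text) + (1 if texts else 0)  # +1 for the joining space
        if texts and length + added > max_chars:
            chunks.append({"start": start, "text": " ".join(texts)})
            texts, length, start = [], 0, None
            added = len(text)
        if start is None:
            start = entry["start"]
        texts.append(text)
        length += added
    if texts:
        chunks.append({"start": start, "text": " ".join(texts)})
    return chunks
```

Chunking on entry boundaries rather than at fixed character offsets avoids cutting a caption line in half, at the cost of slightly uneven chunk sizes.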

Method 4: InnerTube and ytInitialData

For specialized cases — channel About data, specific continuation tokens, anything the above tools don’t cleanly expose — you can hit YouTube’s internal endpoints directly. The frontend uses a private API at /youtubei/v1/, and most video pages embed a ytInitialData JSON object in a <script> tag that contains the rendered page state.

This is more brittle than the other methods — YouTube changes the structure periodically — but it’s also the most flexible:

python

import requests
import re
import json

def extract_initial_data(video_url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/120.0.0.0 Safari/537.36",
    }
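    # A timeout keeps one stalled connection from hanging the whole run
    response = requests.get(video_url, headers=headers, timeout=30)
    # re.DOTALL keeps the match working if the JSON ever spans multiple lines
    match = re.search(
        r"var ytInitialData = ({.*?});</script>", response.text, re.DOTALL
    )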
    if not match:
        return None
    return json.loads(match.group(1))

data = extract_initial_data("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
# data is the full page state — navigate it for whatever you need

Use this approach only when the other methods don’t cover the case. The structure of ytInitialData is enormous and changes regularly; navigating it requires browser-devtools spelunking each time.
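
One way to reduce the devtools spelunking is to search the parsed object by key name instead of hard-coding a path through it. The helper below is a generic sketch that walks any nested JSON structure. The function name is illustrative, and the keys used in the usage note are assumptions to confirm in your own captures.

```python
def find_key(node, key):
    """Yield every value stored under `key` anywhere in a nested JSON object."""
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                yield value
            # Keep descending: the same key can also appear deeper down
            yield from find_key(value, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)
```

For example, list(find_key(data, "videoId")) collects every video ID on the page without depending on the exact nesting, which tends to survive layout changes better than a fixed chain of dictionary lookups.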

The proxy layer

Past a few hundred requests per day from a single IP, YouTube starts pushing back. The pattern in 2026 is consistent: first a 429 Too Many Requests, then a soft block where requests succeed but return degraded data, then a hard block where every request gets a “sign in to confirm you’re not a bot” wall.
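
Detecting which stage you are in lets a scraper react before an IP is burned. A minimal sketch of that classification follows. The function names, the marker string, and the required fields are assumptions drawn from the description above, and soft blocks in particular can only be approximated by checking the returned data for gaps.

```python
BOT_WALL_MARKER = "confirm you're not a bot"


def classify_response(status_code, body_text):
    """Map one HTTP response onto the escalation stages described above."""
    if status_code == 429:
        return "rate_limited"
    # Normalize curly apostrophes so either spelling of the message matches
    normalized = body_text.lower().replace("\u2019", "'")
    if BOT_WALL_MARKER in normalized:
        return "hard_block"
    return "ok"


def missing_fields(record, required=("title", "view_count")):
    """List required fields that are absent or None: a crude soft-block signal."""
    return [name for name in required if record.get(name) is None]
```

A run that suddenly produces records with missing fields while status codes still look healthy is worth treating as a soft block and pausing that IP.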

Three things determine whether scraping at scale works:

IP type. Datacenter IPs are flagged within a few hundred requests. Residential or ISP proxies route around this — to YouTube, traffic from a residential IP looks like a normal user on a home connection.

Rotation pattern. For yt-dlp, rotating IPs per video request is the standard. For session-based scraping (paginating comments, browsing a channel), sticky sessions (the same IP for 10–30 minutes) look more natural than rotation mid-session.

Geographic distribution. YouTube serves different content to different regions. If you’re collecting region-specific data — trending lists, localized search results, regional video availability — your proxies need to live in those regions.
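
The two rotation patterns can be sketched client-side as follows. This is an illustration under stated assumptions: the ProxyPool class and its method names are hypothetical, and many providers implement sticky sessions on their side (often through a session parameter in the proxy username), in which case their mechanism should be preferred.

```python
import itertools
import time


class ProxyPool:
    """Round-robin rotation plus time-limited sticky sessions."""

    def __init__(self, proxies, sticky_seconds=600, clock=time.monotonic):
        if not proxies:
            raise ValueError("at least one proxy URL is required")
        self._cycle = itertools.cycle(proxies)
        self._sticky_seconds = sticky_seconds
        self._clock = clock
        self._sessions = {}  # session key -> (proxy, expiry timestamp)

    def next_proxy(self):
        """Per-request rotation: a different proxy on every call."""
        return next(self._cycle)

    def sticky_proxy(self, session_key):
        """Same proxy for one session key until the sticky window expires."""
        now = self._clock()
        entry = self._sessions.get(session_key)
        if entry and entry[1] > now:
            return entry[0]
        proxy = self.next_proxy()
        self._sessions[session_key] = (proxy, now + self._sticky_seconds)
        return proxy
```

In use, a per-video metadata fetch would call next_proxy(), while paginating one channel or comment thread would call sticky_proxy() with the channel or video ID as the key.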

Connecting a proxy to yt-dlp is straightforward:

bash

yt-dlp --proxy "http://USER:PASS@proxy.example.com:8080" \
       --skip-download --write-info-json \
       "https://www.youtube.com/watch?v=dQw4w9WgXcQ"

Or in Python:

python

opts = {
    "quiet": True,
    "skip_download": True,
    "proxy": "http://USER:PASS@proxy.example.com:8080",
}

For requests-based scraping (Method 4 above):

python

proxies = {
    "http": "http://USER:PASS@proxy.example.com:8080",
    "https": "http://USER:PASS@proxy.example.com:8080",
}
response = requests.get(video_url, headers=headers, proxies=proxies)

IPBurger’s residential and ISP proxies are built for this kind of work — clean IPs, country-level targeting, sticky sessions when you need them — and the broader principle applies regardless of provider: at the volumes that matter, the proxy layer is the load-bearing piece. The scraping script is 50 lines; the infrastructure decides whether it runs to completion or stalls at 30%.

A reasonable default workflow

If you’re starting a YouTube scraping project today, this is the path that scales:

Define the data you actually need. Most projects don’t need everything — narrowing the scope avoids quota waste and reduces the surface for blocks.
Try the YouTube Data API first if it fits. If your volume is under 10,000 units/day and the API exposes what you need, this is the most reliable path.
Reach for yt-dlp when the API doesn’t fit. It covers comments, transcripts, batch operations, and anything search-result-based.
Add a residential proxy layer once you start hitting 429s. Don’t try to outsmart the rate limiter with delays alone — once your IP is flagged, it’s flagged.
Use youtube-transcript-api for transcript-heavy work. Lighter than yt-dlp, faster for that specific job.
Custom requests + InnerTube only when nothing else covers it. Worth knowing about; not worth starting with.
Build for breakage. Whatever you ship today will break within six months. YouTube changes its frontend, the extractor catches up, you update the dependency. Plan for this.
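
The ordering in this workflow, together with the advice to build for breakage, can be expressed as a simple fallback chain. The sketch below is one possible shape. The function name and the fetcher callables passed to it are hypothetical placeholders for whichever methods a project actually wires in.

```python
def first_success(fetchers, video_id):
    """Try each (name, fetch) pair in order and return the first usable result.

    Returns (name, result, errors); name and result are None when every
    method failed. Errors are collected so breakage stays visible in logs.
    """
    errors = []
    for name, fetch in fetchers:
        try:
            result = fetch(video_id)
        except Exception as exc:  # any failure moves on to the next method
            errors.append((name, repr(exc)))
            continue
        if result:
            return name, result, errors
        errors.append((name, "empty result"))
    return None, None, errors
```

Logging the collected errors on every run shows which method has started failing well before the last fallback stops working too.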

What actually changed since 2022

For anyone migrating from an older scraping setup, the substantive changes worth flagging:

Dislikes data is gone. YouTube removed public dislike counts in November 2021. No method recovers them; third-party “dislike” estimates are guesses.
Proof of Origin tokens now gate many streaming and detailed metadata endpoints. yt-dlp handles this internally when it can; manual scraping has to deal with it explicitly.
Visitor data cookies are required for most video page loads. A fresh IP without a warmed-up session often hits a consent wall instead of the video.
CSS selectors from old tutorials don’t match. YouTube reorganized its frontend significantly through 2023 and 2024. Any tutorial referencing classes like yt-uix-tile-link is referencing a markup version that hasn’t existed for years.
Quota costs increased for some endpoints. Search is still 100 units, but some other endpoints have shifted. Check the current quota guide before estimating.

The honest summary

YouTube scraping in 2026 is more achievable than it sounds — the tooling has gotten much better since 2022 (yt-dlp is genuinely excellent), and the official API is reliable for jobs that fit. The hard part isn’t the scraping logic; it’s the infrastructure underneath it. Run residential or ISP proxies, update your tools weekly, and design for the inevitable breakages, and the data is yours.
