YouTube is one of the largest sources of public web data on the internet — over 800 million videos, billions of comments, transcripts for most uploads, and structured metadata that powers everything from market research to AI training. It’s also one of the most aggressively defended properties online, and the methods that worked for scraping it three years ago mostly don’t anymore.
The post you’re reading replaces an older guide that was technically accurate in 2022. It isn’t accurate now. YouTube tightened its anti-bot stack significantly through 2025: Proof of Origin tokens now gate many endpoints, visitor data cookies are required for most video page loads, and datacenter IP ranges trigger CAPTCHA challenges within a few hundred requests. The CSS selectors that older tutorials use don’t match YouTube’s current markup. Even the “dislikes” data those tutorials reference hasn’t been publicly accessible since November 2021.
Here’s what does work in 2026, with code that actually runs.
What you can scrape (and what you can’t)
Public data, freely available:
- Video metadata — title, description, view count, upload date, channel, duration, tags
- Channel data — name, description, subscriber count (approximate), video list, About page info
- Comments — including replies, like counts, timestamps
- Transcripts and captions — for any video with captions enabled
- Search results — full SERPs for any query
- Playlists — full contents and metadata
- Live chat replays — for streams that have ended
Off-limits regardless of method:
- Private videos
- Unlisted videos you don’t have the URL for
- Anything behind a login that doesn’t belong to your own account
- Engagement data on individual viewers
- Exact subscriber counts (YouTube only displays rounded numbers publicly)
The legal layer matters too: YouTube’s Terms of Service restrict automated access. Scraping public data is generally legal in most jurisdictions, but ToS violations can result in account bans and, in some cases, civil action. Don’t scrape behind login. Don’t collect personally identifiable information beyond what’s publicly displayed. Don’t redistribute video content. If you’re collecting for AI training, document the source and respect creator rights — the EU AI Act and similar regulations are increasingly active here.
The four methods worth knowing
There are really four approaches in 2026 — picking the right one depends on what you’re after and at what scale.
| Method | Best for | Auth needed | Scale |
|---|---|---|---|
| YouTube Data API v3 | Structured queries, small to medium volume | API key | Limited (10K quota units/day free) |
| yt-dlp | Everything else — metadata, comments, transcripts, batch | None (cookies for some videos) | Medium to high with proxies |
| youtube-transcript-api | Transcripts only | None | Medium |
| Custom scraping (requests + InnerTube) | Specialized cases the above don’t cover | None | High with infrastructure |
In practice, most serious operations use yt-dlp as the workhorse and reach for the others when yt-dlp can’t cover the case.
Method 1: YouTube Data API v3
The cleanest option when it fits. Google’s official API returns structured JSON for videos, channels, playlists, search, and comments. There’s no anti-bot friction — the request either succeeds or returns a clear quota error.
The catch is quota. The free tier gives you 10,000 units per day. A videos.list call costs 1 unit; a search costs 100. That’s roughly 100 search queries per day, or 10,000 video metadata fetches. Fine for many use cases; useless for anything at scale.
python
from googleapiclient.discovery import build

API_KEY = "YOUR_API_KEY"
youtube = build("youtube", "v3", developerKey=API_KEY)

# Get metadata for a specific video
response = youtube.videos().list(
    part="snippet,statistics,contentDetails",
    id="dQw4w9WgXcQ",
).execute()

video = response["items"][0]
print(video["snippet"]["title"])
print(video["statistics"]["viewCount"])
print(video["contentDetails"]["duration"])  # ISO 8601 duration string
Use the API when your volume fits the quota and you want clean, stable data. Apply for an increased quota if you have a legitimate business case — Google approves these for real applications, just not for “I’m building a scraper.”
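To make the quota arithmetic concrete, here is a minimal sketch of a budget estimate. The unit costs are the figures quoted above; the batch size assumes the documented limit of 50 comma-separated IDs per videos.list call, and the helper names (chunk_ids, estimate_units) are made up for illustration.

```python
from math import ceil

DAILY_QUOTA = 10_000       # free-tier units per day
SEARCH_COST = 100          # units per search call
VIDEOS_LIST_COST = 1       # units per videos.list call
MAX_IDS_PER_CALL = 50      # assumed limit of IDs in one videos.list call


def chunk_ids(video_ids, size=MAX_IDS_PER_CALL):
    """Split a list of video IDs into batches that fit one videos.list call."""
    return [video_ids[i:i + size] for i in range(0, len(video_ids), size)]


def estimate_units(n_searches, n_video_ids):
    """Estimate quota units for a job of searches plus batched metadata fetches."""
    metadata_calls = ceil(n_video_ids / MAX_IDS_PER_CALL)
    return n_searches * SEARCH_COST + metadata_calls * VIDEOS_LIST_COST


if __name__ == "__main__":
    ids = [f"video{i:03d}" for i in range(120)]  # placeholder IDs
    print(len(chunk_ids(ids)), "videos.list calls for", len(ids), "IDs")
    print(estimate_units(n_searches=20, n_video_ids=len(ids)), "of", DAILY_QUOTA, "units")
```

Each batch can then be passed to videos().list as id=",".join(batch), which is why metadata fetches are far cheaper per video than searches.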
Method 2: yt-dlp
yt-dlp is the de facto standard for everything the API doesn’t cover cleanly. It’s an actively maintained fork of youtube-dl that needs no API key and handles metadata, comments, transcripts, and downloads in one tool.
Install:
bash
pip install yt-dlp
Get metadata for a single video without downloading anything:
python
import yt_dlp


def get_video_metadata(url):
    """Fetch metadata for one video without downloading any media."""
    opts = {
        "quiet": True,
        "skip_download": True,
        "no_warnings": True,
    }
    with yt_dlp.YoutubeDL(opts) as ydl:
        info = ydl.extract_info(url, download=False)
    return {
        "title": info.get("title"),
        "channel": info.get("uploader"),
        "views": info.get("view_count"),
        "likes": info.get("like_count"),
        "duration_sec": info.get("duration"),
        "upload_date": info.get("upload_date"),  # "YYYYMMDD" string
        "description": info.get("description"),
        "tags": info.get("tags", []),
    }


data = get_video_metadata("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
print(data)
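The raw values usually need light cleanup before storage. The sketch below assumes the dictionary shape returned by get_video_metadata above, with upload_date as a "YYYYMMDD" string and duration_sec as a number of seconds; normalize_metadata is a hypothetical helper, not part of yt-dlp.

```python
from datetime import datetime, timedelta


def normalize_metadata(raw):
    """Return a copy with an ISO upload date and an H:MM:SS duration string."""
    out = dict(raw)

    upload = raw.get("upload_date")
    if upload:
        # Convert "YYYYMMDD" to "YYYY-MM-DD"
        out["upload_date"] = datetime.strptime(upload, "%Y%m%d").date().isoformat()

    seconds = raw.get("duration_sec")
    out["duration_hms"] = (
        str(timedelta(seconds=int(seconds))) if seconds is not None else None
    )
    return out


if __name__ == "__main__":
    sample = {"title": "Example", "upload_date": "20091025", "duration_sec": 213}
    print(normalize_metadata(sample))
```

Keeping the original fields and adding derived ones makes it easier to re-run the cleanup later if the upstream format changes.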
Pull comments (including replies):
python
import yt_dlp


def get_comments(url, max_comments=500):
    """Fetch up to max_comments comments (replies included), sorted by top."""
    opts = {
        "quiet": True,
        "skip_download": True,
        "getcomments": True,
        "extractor_args": {
            "youtube": {
                "max_comments": [str(max_comments)],
                "comment_sort": ["top"],
            }
        },
    }
    with yt_dlp.YoutubeDL(opts) as ydl:
        info = ydl.extract_info(url, download=False)
    return info.get("comments", [])


comments = get_comments("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
for c in comments[:5]:
    print(f"{c['author']}: {c['text']} ({c['like_count']} likes)")
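The comment list comes back flat, with replies mixed in alongside top-level comments. A minimal sketch of regrouping them into threads follows. It assumes each comment dict carries an "id" and a "parent" field, with "root" marking top-level comments; that matches yt-dlp output as far as I know, but it is worth confirming against a real response before relying on it. group_replies is a hypothetical helper.

```python
from collections import defaultdict


def group_replies(comments):
    """Nest replies under their top-level comment using the 'parent' field."""
    replies = defaultdict(list)
    top_level = []
    for comment in comments:
        parent = comment.get("parent", "root")
        if parent == "root":
            top_level.append(comment)
        else:
            replies[parent].append(comment)
    # Attach each reply list to a copy of its parent comment
    return [{**c, "replies": replies.get(c.get("id"), [])} for c in top_level]


if __name__ == "__main__":
    sample = [
        {"id": "a", "parent": "root", "text": "First"},
        {"id": "a.1", "parent": "a", "text": "Reply to first"},
        {"id": "b", "parent": "root", "text": "Second"},
    ]
    for thread in group_replies(sample):
        print(thread["text"], "-", len(thread["replies"]), "replies")
```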
Batch-scrape a list of video IDs from a search:
bash
yt-dlp --flat-playlist -j "ytsearch50:web scraping tutorial" | jq -r '.id'
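If jq isn’t available, the same ID extraction can be done in Python. This sketch assumes the input is the one-JSON-object-per-line output that the -j flag produces; ids_from_json_lines is a hypothetical helper name.

```python
import json


def ids_from_json_lines(lines):
    """Extract unique video IDs, in order, from JSON-lines output."""
    seen = set()
    ids = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        video_id = json.loads(line).get("id")
        if video_id and video_id not in seen:
            seen.add(video_id)
            ids.append(video_id)
    return ids
```

Passing sys.stdin as the lines argument lets the function sit at the end of the same shell pipe in place of jq, and the de-duplication helps when several searches overlap.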
A few practical notes that aren’t in most tutorials:
- Update yt-dlp frequently. YouTube breaks the extractor regularly, at least monthly. Running an old version is the #1 reason scripts mysteriously return empty results. pip install -U yt-dlp should be in your maintenance schedule.
- Age-restricted videos require cookies. Export cookies from a browser session and pass them via the cookiefile key in the options dict (opts in the examples above).
- High volume needs proxies. Without them, you’ll hit rate limits and IP blocks fast. More on this below.
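The update advice can be turned into an automatic check. The sketch below assumes yt-dlp’s date-based version strings ("YYYY.MM.DD", sometimes with an extra numeric suffix) and treats anything older than 30 days as stale; the threshold and the helper names are illustrative choices, not anything yt-dlp defines.

```python
from datetime import date


def version_age_days(version, today=None):
    """Days since a 'YYYY.MM.DD[.suffix]' style version was released."""
    year, month, day = (int(part) for part in version.split(".")[:3])
    today = today or date.today()
    return (today - date(year, month, day)).days


def is_stale(version, max_age_days=30, today=None):
    """True if the version is older than the allowed age."""
    return version_age_days(version, today) > max_age_days


if __name__ == "__main__":
    try:
        from yt_dlp.version import __version__
    except ImportError:
        print("yt-dlp is not installed")
    else:
        if is_stale(__version__):
            print(f"yt-dlp {__version__} looks old; run pip install -U yt-dlp")
        else:
            print(f"yt-dlp {__version__} is recent enough")
```

Running a check like this at the start of a scheduled job gives an early warning before empty results start appearing.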
Method 3: youtube-transcript-api
If transcripts are all you need — and increasingly they are, since transcript text is the most useful input for LLM-based content analysis — youtube-transcript-api is lighter and faster than yt-dlp.
python
from youtube_transcript_api import YouTubeTranscriptApi


def get_transcript(video_id, languages=("en",)):
    """Return the transcript as one string, or None if unavailable."""
    try:
        # youtube-transcript-api 1.x uses an instance and fetch();
        # releases before 1.0 used the static get_transcript() instead.
        transcript = YouTubeTranscriptApi().fetch(video_id, languages=list(languages))
        return " ".join(snippet.text for snippet in transcript)
    except Exception as e:
        print(f"No transcript available: {e}")
        return None


text = get_transcript("dQw4w9WgXcQ")
print(text[:500] if text else "None")
Transcripts pair naturally with sentiment analysis, RAG pipelines, and any LLM-based workflow that needs text content from video. This is one of the fastest-growing scraping use cases in 2026.
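For RAG pipelines, a transcript is normally split into overlapping chunks before embedding. Here is a minimal sketch of word-based chunking; the chunk size and overlap are arbitrary starting values, and chunk_words is a hypothetical helper rather than part of any library mentioned here.

```python
def chunk_words(text, chunk_size=200, overlap=40):
    """Split text into overlapping word windows for embedding or retrieval."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


if __name__ == "__main__":
    transcript = " ".join(f"word{i}" for i in range(500))  # stand-in text
    pieces = chunk_words(transcript)
    print(len(pieces), "chunks; first chunk has", len(pieces[0].split()), "words")
```

The overlap keeps a sentence that straddles a boundary retrievable from either neighbouring chunk, at the cost of some duplicated text.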
Method 4: InnerTube and ytInitialData
For specialized cases — channel About data, specific continuation tokens, anything the above tools don’t cleanly expose — you can hit YouTube’s internal endpoints directly. The frontend uses a private API at /youtubei/v1/, and most video pages embed a ytInitialData JSON object in a <script> tag that contains the rendered page state.
This is more brittle than the other methods — YouTube changes the structure periodically — but it’s also the most flexible:
python
import json
import re

import requests


def extract_initial_data(video_url):
    """Return the ytInitialData object embedded in a video page, or None."""
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/120.0.0.0 Safari/537.36",
    }
    # A timeout keeps a stalled connection from hanging the whole job
    response = requests.get(video_url, headers=headers, timeout=10)
    match = re.search(r"var ytInitialData = ({.*?});</script>", response.text)
    if not match:
        return None
    return json.loads(match.group(1))


data = extract_initial_data("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
# data is the full page state; navigate it for whatever you need
Use this approach only when the other methods don’t cover the case. The structure of ytInitialData is enormous and changes regularly; navigating it requires browser-devtools spelunking each time.
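Because the nesting shifts between page versions, searching for a key wherever it appears tends to survive longer than a hard-coded path. The sketch below is a generic recursive lookup; the key names in the example (videoRenderer, videoId) are only illustrative and should be confirmed in devtools against the current page.

```python
def find_values(node, key):
    """Yield every value stored under `key` anywhere in nested dicts and lists."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            yield from find_values(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_values(item, key)


if __name__ == "__main__":
    # Toy structure standing in for a real ytInitialData object
    page = {
        "contents": {
            "items": [
                {"videoRenderer": {"videoId": "abc"}},
                {"shelf": [{"videoRenderer": {"videoId": "def"}}]},
            ]
        }
    }
    print([r["videoId"] for r in find_values(page, "videoRenderer")])
```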
The proxy layer
Past a few hundred requests per day from a single IP, YouTube starts pushing back. The pattern in 2026 is consistent: first a 429 Too Many Requests, then a soft block where requests succeed but return degraded data, then a hard block where every request gets a “sign in to confirm you’re not a bot” wall.
Three things determine whether scraping at scale works:
IP type. Datacenter IPs are flagged within a few hundred requests. Residential or ISP proxies route around this — to YouTube, traffic from a residential IP looks like a normal user on a home connection.
Rotation pattern. For yt-dlp, rotating IPs per video request is the standard. For session-based scraping (paginating comments, browsing a channel), sticky sessions (the same IP for 10–30 minutes) look more natural than rotation mid-session.
Geographic distribution. YouTube serves different content to different regions. If you’re collecting region-specific data — trending lists, localized search results, regional video availability — your proxies need to live in those regions.
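The difference between per-request rotation and sticky sessions can be sketched on the client side as below. This is an illustration under assumptions: many providers implement sticky sessions on their own gateway instead, the proxy URLs are placeholders, and ProxyPool is a hypothetical class.

```python
import itertools
import time


class ProxyPool:
    """Round-robin rotation, plus sticky sessions that pin one proxy per key."""

    def __init__(self, proxies, sticky_seconds=600, clock=time.monotonic):
        self._cycle = itertools.cycle(proxies)
        self._sticky_seconds = sticky_seconds
        self._clock = clock
        self._sessions = {}  # session key -> (proxy, expiry time)

    def rotate(self):
        """Return the next proxy in the pool (one per video request)."""
        return next(self._cycle)

    def sticky(self, session_key):
        """Return the same proxy for a session key until its window expires."""
        now = self._clock()
        proxy, expires = self._sessions.get(session_key, (None, 0.0))
        if proxy is None or now >= expires:
            proxy = self.rotate()
            self._sessions[session_key] = (proxy, now + self._sticky_seconds)
        return proxy


if __name__ == "__main__":
    pool = ProxyPool([
        "http://USER:PASS@proxy1.example.com:8080",
        "http://USER:PASS@proxy2.example.com:8080",
    ])
    print(pool.rotate())                    # fresh proxy for a single video
    print(pool.sticky("channel-comments"))  # pinned while paginating
    print(pool.sticky("channel-comments"))  # same proxy again
```

Keying sticky sessions by task (a channel, a comment thread) keeps pagination on one IP while unrelated tasks still spread across the pool.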
Connecting a proxy to yt-dlp is straightforward:
bash
yt-dlp --proxy "http://USER:PASS@proxy.example.com:8080" \
--skip-download --write-info-json \
"https://www.youtube.com/watch?v=dQw4w9WgXcQ"
Or in Python:
python
opts = {
    "quiet": True,
    "skip_download": True,
    "proxy": "http://USER:PASS@proxy.example.com:8080",
}
For requests-based scraping (Method 4 above):
python
# Reuses requests, video_url, and headers from the Method 4 example
proxies = {
    "http": "http://USER:PASS@proxy.example.com:8080",
    "https": "http://USER:PASS@proxy.example.com:8080",
}
response = requests.get(video_url, headers=headers, proxies=proxies)
IPBurger’s residential and ISP proxies are built for this kind of work — clean IPs, country-level targeting, sticky sessions when you need them — and the broader principle applies regardless of provider: at the volumes that matter, the proxy layer is the load-bearing piece. The scraping script is 50 lines; the infrastructure decides whether it runs to completion or stalls at 30%.
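One way to keep a run from stalling on a flagged IP is to switch proxies when a 429 comes back, rather than waiting on the same address. The sketch below takes the actual request as an injected callable so it stays independent of any HTTP library; fetch_with_rotation and the (status_code, body) return shape are assumptions made for illustration.

```python
def fetch_with_rotation(fetch, proxies, max_attempts=3):
    """Call fetch(proxy), moving to the next proxy after each 429 response.

    `fetch` is any callable that takes a proxy URL and returns a
    (status_code, body) tuple. Returns the first non-429 result, or the
    last result if every attempt was rate limited.
    """
    result = None
    for attempt in range(max_attempts):
        proxy = proxies[attempt % len(proxies)]
        result = fetch(proxy)
        if result[0] != 429:
            return result
    return result


if __name__ == "__main__":
    def fake_fetch(proxy):
        # Stand-in for a real request: the first proxy is "flagged"
        return (429, "") if proxy.endswith("proxy1") else (200, "page html")

    print(fetch_with_rotation(fake_fetch, ["http://proxy1", "http://proxy2"]))
```

With requests, the callable can wrap requests.get with proxies={"http": proxy, "https": proxy} and return (response.status_code, response.text).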
A reasonable default workflow
If you’re starting a YouTube scraping project today, this is the path that scales:
- Define the data you actually need. Most projects don’t need everything — narrowing the scope avoids quota waste and reduces the surface for blocks.
- Try the YouTube Data API first if it fits. If your volume is under 10,000 units/day and the API exposes what you need, this is the most reliable path.
- Reach for yt-dlp when the API doesn’t fit. It covers comments, transcripts, batch operations, and anything search-result-based.
- Add a residential proxy layer once you start hitting 429s. Don’t try to outsmart the rate limiter with delays alone — once your IP is flagged, it’s flagged.
- Use youtube-transcript-api for transcript-heavy work. Lighter than yt-dlp, faster for that specific job.
- Custom requests + InnerTube only when nothing else covers it. Worth knowing about; not worth starting with.
- Build for breakage. Whatever you ship today will break within six months. YouTube changes its frontend, the extractor catches up, you update the dependency. Plan for this.
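Building for breakage mostly means noticing it quickly. A minimal sketch of a post-run check follows: it flags records with missing fields so that an extractor change shows up as a rising breakage rate instead of silently empty data. The field names follow the metadata dictionary from the yt-dlp example; the helpers and any alert threshold are assumptions to tune per project.

```python
REQUIRED_FIELDS = ("title", "channel", "views", "duration_sec", "upload_date")


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent, None, or empty in a record."""
    return [name for name in required if record.get(name) in (None, "", [])]


def breakage_rate(records, required=REQUIRED_FIELDS):
    """Fraction of records with at least one missing required field."""
    if not records:
        return 1.0  # an empty batch is itself a sign something broke
    broken = sum(1 for r in records if missing_fields(r, required))
    return broken / len(records)


if __name__ == "__main__":
    batch = [
        {"title": "A", "channel": "C", "views": 10, "duration_sec": 60, "upload_date": "20240101"},
        {"title": "", "channel": "C", "views": 5, "duration_sec": 30},
    ]
    print(f"{breakage_rate(batch):.0%} of records are incomplete")
```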
What’s actually changed since 2022
For anyone migrating from an older scraping setup, the substantive changes worth flagging:
- Dislikes data is gone. YouTube removed public dislike counts in November 2021. No method recovers them; third-party “dislike” estimates are guesses.
- Proof of Origin tokens now gate many streaming and detailed metadata endpoints. yt-dlp handles this internally when it can; manual scraping has to deal with it explicitly.
- Visitor data cookies are required for most video page loads. A fresh IP without a warmed-up session often hits a consent wall instead of the video.
- CSS selectors from old tutorials don’t match. YouTube reorganized its frontend significantly through 2023 and 2024. Any tutorial referencing classes like yt-uix-tile-link is referencing a markup version that hasn’t existed for years.
- Quota costs increased for some endpoints. Search is still 100 units, but some other endpoints have shifted. Check the current quota guide before estimating.
The honest summary
YouTube scraping in 2026 is more achievable than it sounds — the tooling has gotten much better since 2022 (yt-dlp is genuinely excellent), and the official API is reliable for jobs that fit. The hard part isn’t the scraping logic; it’s the infrastructure underneath it. Run residential or ISP proxies, update your tools weekly, and design for the inevitable breakages, and the data is yours.
