What’s a headless browser? Do they help or hinder the data harvesting expedition?
A headless browser is a web browser without a graphical user interface (GUI). It doesn’t have any visible window or user-interface element.
You just use scripts to handle the page content.
This means you can get up to all sorts of no good.
But we’re only going to talk about the whitehat stuff today.
What’s so special about a headless browser?
- Faster interactions with web content – which means you can get more of the data you need.
- Doesn’t require as much bandwidth or time to load a page – which means you don’t have to wait.
- It sounds cool in a Sleepy Hollow sort of way – which is badass.
What’s a headless browser used for?
Headless browsers provide easy access to the webpage DOM (document object model), which is helpful for front-end developers using JavaScript frameworks such as AngularJS.
They’re also commonly used to load and run functional webpages on a headless server during automated testing. This is called headless browser testing.
Headless browser testing
Headless testing is an automation technique for web page validation: a program controls a web browser through scripts or other automation tools, without rendering graphics or displaying visible UI components. The objectives of headless browser testing may include:
- Checking the validity of markup, content, and style on an HTML/XHTML page.
- Validating a web or login form.
- Evaluating JavaScript for event handlers and AJAX operations.
These checks verify that dynamic content renders on screen with the correct positioning and layout.
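As a rough illustration, here is a minimal sketch of a headless form check using Puppeteer (introduced later in this article); the URL and the selectors #username, #password, and .welcome are hypothetical placeholders:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch(); // headless by default
  const page = await browser.newPage();
  await page.goto('https://example.com/login'); // hypothetical login page
  // Fill in the form fields (selectors are placeholders)
  await page.type('#username', 'testuser');
  await page.type('#password', 'secret');
  // Submit the form and wait for the resulting navigation
  await Promise.all([
    page.waitForNavigation(),
    page.click('button[type="submit"]'),
  ]);
  // Validate that the expected element appeared after login
  const ok = (await page.$('.welcome')) !== null;
  console.log(ok ? 'Login form works' : 'Login form failed');
  await browser.close();
})();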
Headless browsers work well with other software too. Besides testing and web development, they can help facilitate data harvesting.
Headless browser scraping
Headless scraping is an automated data extraction technique you can use in conjunction with a web scraper.
A browser in headless mode:
- Scrapes websites and stores web data in a local directory on a disk.
- Retrieves multiple pages from most modern websites.
- Imitates a user-agent profile and executes JavaScript rendering, which many modern sites require.
- Scrapes more efficiently with command-line arguments.
For example, headless browsers are commonly used to scrape data from online catalogs, pricing information from e-commerce websites, or social media widgets/icons embedded in a company’s website.
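For a quick taste of the command-line side, Chrome itself can fetch a page’s fully rendered DOM without any script at all; --headless, --disable-gpu, and --dump-dom are real Chrome flags, and the URL is just an example:

chrome --headless --disable-gpu --dump-dom https://example.com > page.html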
What are some popular headless browsers?
Headless Chrome
The headless version of Google’s Chrome browser.
Headless Firefox
Firefox’s built-in headless mode, a Mozilla project with support for WebGL and JavaScript.
HtmlUnit
A “GUI-less” browser for Java, commonly used by functional testing frameworks to load and validate web pages.
SimpleBrowser
A headless browser built on .NET 4 that can carry out browser automation activities. It does not support JavaScript, but you can modify the user agent, referrer, request headers, form values, and more before submission or navigation.
SlimerJS
A scriptable browser, similar to PhantomJS, that uses Gecko (the browser engine behind Firefox) as its core. This lets it support the same JavaScript and HTML5 APIs that Firefox itself does.
ZombieJS
Zombie provides a JavaScript API for Node.js that simulates a browser environment (it is built on jsdom rather than a real browser engine). It is mainly used to automatically test DOM APIs and website behavior.
How do you use headless browsing to web scrape?
Headless browsers are built for automation (functional tests, web development tasks, and so on). They also make great tools for web scraping thanks to their ease of use and their ability to run unattended, without human intervention.
When scraping with a headless browser, you feed it a list of URLs and wait for each page to load. Because a headless browser accepts commands from the command line, this process can be automated, giving you complete control over when and how URLs are pulled into the browser.
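To make that concrete, here is a minimal sketch (using Puppeteer, covered below) that pulls a hard-coded list of URLs through a headless browser and stores each page’s HTML in a local directory; the URLs and file names are placeholders:

const fs = require('fs');
const puppeteer = require('puppeteer');

// Placeholder list of pages to scrape
const urls = ['https://example.com', 'https://example.org'];

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  for (const [i, url] of urls.entries()) {
    await page.goto(url, { waitUntil: 'networkidle2' });
    const html = await page.content(); // the fully rendered HTML
    fs.writeFileSync(`page-${i}.html`, html); // save it to disk
  }
  await browser.close();
})();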
To scrape websites with a headless browser, you will need to add a library to your application that can communicate with the browser. This communication can take place via the command line or over a connection to the browser (for example, Chrome’s remote debugging interface).
The most common libraries are:
Requests
A popular Python library for sending HTTP requests to web servers.
jsdom
A pure-JavaScript implementation of the DOM for Node.js. It is often paired with a package called jsdom-global, which creates the global objects (window, document, and so on) that browser-oriented code expects.
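As a small sketch of what jsdom does, the snippet below parses an HTML string and queries it with standard DOM methods, with no real browser involved:

const { JSDOM } = require('jsdom');

// Build a DOM from a plain HTML string
const dom = new JSDOM('<p id="greeting">Hello, world</p>');

// Query it with the usual DOM API
const text = dom.window.document.querySelector('#greeting').textContent;
console.log(text); // "Hello, world"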
Puppeteer
The Puppeteer library makes it easier to control Chrome and Chromium than alternatives such as Selenium/WebDriver. You can install it with npm or yarn in Node.js applications to run tests or scrape data from web pages, and it provides methods to specify the URL, download resources, handle cookies, and more.
Keep in mind that Puppeteer is a promise-based library: its methods make asynchronous calls to the headless Chrome instance under the hood, so you will usually write Puppeteer code with async/await.
Nightmare
A high-level browser automation library built on Electron. It provides automation tools that an application can use to drive Electron’s browser process via a remote connection.
Selenium
A handful of libraries are available for headless JavaScript automation via Selenium bindings, such as WebDriverJS and selenium-webdriver; see the sketch below.
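Here is a rough sketch using the selenium-webdriver Node.js bindings to launch Chrome in headless mode; the target URL is a placeholder:

const { Builder, By } = require('selenium-webdriver');
const chrome = require('selenium-webdriver/chrome');

(async () => {
  // Pass Chrome's headless flag through the options object
  const options = new chrome.Options().addArguments('--headless');
  const driver = await new Builder()
    .forBrowser('chrome')
    .setChromeOptions(options)
    .build();
  try {
    await driver.get('https://example.com'); // placeholder URL
    const heading = await driver.findElement(By.css('h1')).getText();
    console.log(heading);
  } finally {
    await driver.quit();
  }
})();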
You may be thinking that it must be hard to set up a library, but it’s not. Let’s take a look at one example.
Basic Puppeteer Tutorial
First, you need to install Puppeteer into the project directory.
Installing Puppeteer
To use Puppeteer in a headless browser, run:
npm i puppeteer
# or "yarn add puppeteer"
Or for a lite version of Puppeteer, run:
npm i puppeteer-core
# or "yarn add puppeteer-core"
As mentioned above, all Puppeteer code runs in Node.js. Requiring the module gives you an object with methods to control Chrome: you can create new browser instances via the launch method, navigate to URLs, handle events, and so on.
The following snippet is a minimal Puppeteer script that navigates to https://example.com and saves a screenshot as example.png.
Save the file as example.js:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await page.screenshot({ path: 'example.png' });
  await browser.close();
})();
Execute the script on the command line:
node example.js
If you want to create a PDF instead, save the following file as hn.js:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://news.ycombinator.com', {
    waitUntil: 'networkidle2',
  });
  await page.pdf({ path: 'hn.pdf', format: 'a4' });
  await browser.close();
})();
Piece of cake, right?
The above example is just one code snippet. You can find an expanded Puppeteer tutorial for more information here.
What are some limitations of using a web scraper with a headless browser?
You generally have less control with a headless browser, since many tasks rely on additional plugins or configuration that a standard web browser handles for you. For example, some headless browsers lag behind on newer CSS selectors, which can make extracting data from the DOM more difficult.
Even though a headless browser is efficient, web blocks may slow you down. Depending on how many pages you plan to scrape, you should consider using a proxy service, which helps protect your IP address from a potential block.
The best proxies to use in headless mode
If all your API calls and HTTP requests come from the same IP address, the target site is likely to block it and shut your entire process down.
The best proxies to use with web scrapers are rotating proxies. This way, each browser instance makes its requests from a different IP address.
Rotating residential proxies, or backconnect proxies, are ideal for automation testing and web applications that harvest data.
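As a sketch of how this fits together with Puppeteer, you can route a headless Chrome instance through a proxy with Chrome’s --proxy-server flag; the proxy address and credentials below are placeholders, and a rotating proxy service would hand each launch a different exit IP:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    // Route all of this instance's traffic through a proxy (placeholder address)
    args: ['--proxy-server=http://proxy.example.com:8000'],
  });
  const page = await browser.newPage();
  // If the proxy requires credentials, authenticate first (placeholder values)
  await page.authenticate({ username: 'user', password: 'pass' });
  await page.goto('https://example.com');
  await browser.close();
})();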
Want to know more about proxies for web scraping with a headless browser? Check out this article on how many proxies you need.
In this article, we talked about what headless browsers are and why they might be helpful for those who would like to use them as a scraping tool.
If you have any additional questions about headless browsers or anything related to web scraping, feel free to comment below!
FAQs
How can I start?
Simply choose a headless browser that works well for your web automation needs. There are a variety of headless browsers available, such as Headless Chromium and SlimerJS. You will also need a library. If you plan to use Chrome, for example, you will need to install the Puppeteer library.
Why the name ‘headless’ browser?
A headless browser doesn’t have a graphical user interface (GUI); the only thing you’ll see when it’s running is a command-line interface (CLI). Web developers use this to run automated tasks like functional tests and web scraping.
The CLI of your headless browser allows you to load websites and interact with them, among other things. It essentially acts as a highly configurable intermediary between you and the websites you load. If you’re new to web scraping or headless browsers, this article covers the basics to help you get started.
Who created Headless Chrome?
Headless Chrome is an open-source project led by Google.
Can I use it for purposes other than automated web scraping?
Yes! You can use it to run automated and functional testing. You should consider all of your options before deciding which headless browser best meets your needs.
Since their purpose is automation, headless browsers make great scraping tools, especially when paired with a command-line interface.
Can I run headless browsers from the command line?
Yes. Many headless browsers, such as Chrome, can be run from their respective CLI or controlled through a web UI. In some cases, you may have to use both to get complete control over your headless browser.
How do I know if a headless browser runs on my machine?
You can look up which architecture your local system runs on and compare it with the architectures a given headless browser supports. Some headless browsers are cross-platform or support multiple architectures, but not all of them do, so check before using one.
Is it true that headless browsers don’t support CSS selectors?
This is true for some (but not all) headless browsers; notably, PhantomJS’s aging WebKit engine lags behind on newer CSS selectors, while Headless Chrome stays current. The developers of each headless browser may add or remove features at any time, which is why you should always check before assuming anything.
Do headless browsers work with javascript code?
In most cases, headless browsers have no problem executing JavaScript code. Many web developers use headless browsers to automate their tasks, which requires full JavaScript support.
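For example, Puppeteer’s page.evaluate method runs a JavaScript function inside the page and returns the result to Node.js (the URL is a placeholder):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com'); // placeholder URL
  // Run JavaScript in the page context and bring the result back
  const title = await page.evaluate(() => document.title);
  console.log(title);
  await browser.close();
})();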
That said, not all headless browsers support JavaScript, so you need to research each one and determine whether it will meet your specific needs.
Is it true that some headless browsers have APIs?
Yes! Some headless browsers have APIs that let you use them in ways beyond simply loading websites through a proxy server. You can check what types of APIs a given headless browser offers by reading its documentation. This is a good way to determine whether it will work with your web scraper, especially if a proxy API is available.
Which headless browsers support CSS selectors?
Headless Chrome supports current CSS selectors, while PhantomJS’s older WebKit engine lags behind on newer ones. For more information on system dependencies, you can check out this website.
Why aren’t all web scraping tools considered headless browsers?
Web scraping tools such as Scrapy mimic a regular web browser’s requests (for example, its user-agent string) to scrape data from websites. However, because these tools don’t actually render pages in a browser engine, we don’t classify them as headless browsers.
Is it true you can’t use headless browsers to scrape mobile websites?
Headless browsers aren’t guaranteed to work on every page, but they can run most sites just fine. Mobile-specific sites, however, tend not to load correctly in headless browsers, even when the browser can technically render the page.
Can I test my own headless browsers?
You should be able to, but there’s no guarantee that the headless browser will work. Since each headless browser has different functionality and features, you may have to modify your own tool before it runs correctly.
For a basic example, Headless Chrome and PhantomJS differ in their JavaScript support and in how they handle details like CORS headers, so a page that loads correctly in one may fail in the other.
What are some best practices I can use when scraping with a headless browser?
The goal of web scraping is usually to extract data from pages that would be tedious or impossible to collect manually, without typing anything into forms or clicking buttons by hand.
If you’d like to learn more about how web scraping works, check out this article.
You must use your headless browser ethically. Extracting data from hard-to-reach pages isn’t necessarily a bad thing, but it can be if the site owner doesn’t want you doing so.
Make sure you double-check all Terms of Service and Privacy Policy agreements before you scrape any website page because these terms may change without notice.
What headless browsers are best for web scraping?
Headless Chrome and PhantomJS are great options for web scraping because they’re easy to use and relatively fast, though note that PhantomJS is no longer actively maintained.
Why do some developers prefer using a headless browser over a regular one?
Sometimes it may be necessary or more convenient to use a headless browser instead of a regular one, depending on the project’s specifics. For example, suppose you want to scrape data from websites that require JavaScript to work correctly. In that case, you’ll have an easier time with Headless Chrome than with regular Google Chrome.
Headless Chrome also offers better debugging and tracing capabilities than many similar tools, which is helpful when working across multiple projects.