Web scraping is legal, for the most part.
So is driving a car…until you break the law.
However, the rules around web scraping aren’t as clear-cut as they are with something like traffic laws.
This article will cover what you need to know about web scraping, including its legalities, how it works, and some common misconceptions associated with web scraping.
What’s web scraping?
Web scraping is a data harvesting technique you can use to extract information from the internet.
For a simple introduction to web scraping, this blog post lays the groundwork.
It works by fetching a page’s HTML source code, extracting the unstructured data, and parsing it into structured data. The web crawler follows instructions on how each web page should be traversed, which elements need to be extracted, and where those results should go within its own application.
In essence, this means writing instructions in a programming language and understanding which parts of an HTML document contain specific types of content for extraction, such as text strings, numbers, dates, currency values, and social media links.
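As a minimal sketch of that idea, here is how unstructured HTML can be turned into structured data using only Python’s standard library (real projects often use dedicated parsing libraries, and the inline page here is a stand-in for a fetched HTTP response):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Walks an HTML document and collects every anchor's href value."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

# An inline snippet stands in for a page fetched over HTTP.
page = '<html><body><a href="/about">About</a><a href="/pricing">Pricing</a></body></html>'
extractor = LinkExtractor()
extractor.feed(page)
print(extractor.links)  # unstructured HTML turned into a structured list
```

The same pattern scales up: you tell the parser which elements matter, and it turns free-form markup into rows you can load into a spreadsheet or database.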
Why use web scrapers?
Data that helps you make decisions is valuable. If I were to list all the reasons for automated data collection, it would take a long time, but here are a few prevalent examples:
- Extract data about competitors’ websites or online services to get an edge over them
- Improve search engine rankings through link analysis. For example, web crawlers can be set up to follow links between social media posts that provide valuable insights into how users respond to specific topics.
- Organize large amounts of unstructured text into a structured form such as spreadsheets that make it easier to analyze.
- Create web portals for users to search and browse the web by pulling together various kinds of content from many different web pages into one place.
- Collect data that’s not available via APIs or forms, such as video, audio, and images
- Monitor web pages of a particular topic or competitor for changes and automatically update data in other applications
Why the bad rep?
Web scraping is harmless if the data extraction happens without breaking any rules or laws that govern the targets. However, that’s not always the case. Nefarious characters or hackers deliberately exploit web scraping all the time. Among all the violations, data theft is the most widespread.
You don’t have to be a hacker to tick off the site owner.
In the web scraping process, you send many requests to a website to obtain information, far more than a typical user would. Without any regard for the site, that load could crash a server in some cases.
Which can be expensive.
DDoS attacks work by overloading servers, so it’s no surprise that request-happy web scrapers are frowned upon.
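One common way to avoid hammering a server is to pause between requests. A hypothetical sketch (the delay value and the `fetch` callable are assumptions for illustration, not fixed rules):

```python
import time

POLITE_DELAY = 0.5  # seconds between requests; tune to the target site's tolerance

def fetch_all(urls, fetch, delay=POLITE_DELAY):
    """Fetch each URL in turn, sleeping between requests so the
    scraper never hits the server with a burst of traffic."""
    results = []
    for i, url in enumerate(urls):
        results.append(fetch(url))
        if i < len(urls) - 1:
            time.sleep(delay)  # spread the load over time
    return results

# A stand-in fetch function; a real scraper would do an HTTP GET here.
pages = fetch_all(["/a", "/b", "/c"], lambda url: f"<html>{url}</html>", delay=0.01)
print(pages)
```

Even a modest delay like this keeps a scraper’s request rate closer to that of a human visitor than to a denial-of-service attack.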
While web scraping can be very useful, it’s crucial to stay within legal boundaries, so you don’t risk violating any laws!
However, we’re still waiting on a final ruling (at least in the U.S.) as to whether web scraping software constitutes copyright infringement. Some courts have ruled against it while other courts favor its legality.
So until this matter is settled, you should be cautious.
Web scraping court cases
Rulings from court cases set the legal precedent for future cases. As of now, the legality of web scraping seems to be a little ambiguous, but it’s good to be aware of what decisions have been made.
I’ll focus on the flagship scraping cases that set the stage for future scraping legal claims such as copyright infringement or the Computer Fraud and Abuse Act (CFAA).
Facebook v. Power Ventures (2011)
This was one of many highly controversial disputes over a social network’s privacy policies. Facebook sued Power Ventures for collecting its users’ data and displaying it on Power Ventures’ own website.
The ruling went to Facebook, who had filed a complaint that Power Ventures violated the CAN-SPAM Act, CFAA, the DMCA, and copyright laws.
Associated Press v. Meltwater (2013)
In 2012, the Associated Press sued a digital media monitoring company called Meltwater, which used web crawling technology to search for news stories.
The A.P. claimed it was not getting paid for its work, as Meltwater’s duplication gave its clients access to A.P. content for free.
In this case, the court ruled against the scraper because it undermined the value of the A.P.’s work by making it available for free.
Ryanair v. PR Aviation (2015)
P.R. Aviation is a flight price aggregation service that used screen scraping to capture prices from Ryanair’s site. On January 15, 2015, the European Union’s Court of Justice released a decision with the potential to significantly influence both website database operators and those that conduct “screen scraping” (such as price comparison sites).
The ruling suggests that site owners can enforce their website’s terms through contractual agreements. This means that even publicly available data can be protected.
HiQ Labs v. LinkedIn (2019)
HiQ Labs collected data from public LinkedIn profiles to offer businesses tools for understanding their employees’ perspectives. After LinkedIn sent cease-and-desist letters and applied technical blocking measures, HiQ sought an injunction in court. The injunction was granted, forcing LinkedIn to stop blocking HiQ’s access.
LinkedIn appealed, arguing that the scraping violated the CFAA, but the Ninth Circuit upheld the injunction in 2019, finding that scraping publicly available data likely does not violate the act. The ruling favored scraping companies and reinforced the courts’ emerging practice regarding the act’s applicability.
Can you really get into trouble scraping data?
The short answer is yes! There are laws protecting companies that own content on their websites against unauthorized access by third parties like scraping bots or other automated software programs.
The long answer depends on where you live, but generally, there are at least five legal issues you should be aware of:
- Copyright infringement
- Defamation of character or business practices
- Right to privacy/publicity rights
- Misappropriation (theft) of web content
- Hacking techniques to access web content
These are the most critical legal issues you need to be aware of when you pursue data collection. However, this is not an exhaustive list but rather a general summary that can vary depending upon where you live and who owns the website in question.
For more detailed information about your geographic location, please consult with an attorney specializing in internet law within your jurisdiction. This article does NOT constitute professional legal advice!
To avoid potentially violating any of these laws, you should determine which information is public versus private and how the site owner wants data harvesting performed on their website. Whether through a web form or an API key, for example.
Websites often post legal notices like this one:
“This site may contain copyrighted material which has been used with permission from its owners.” If you see such a notice, it means that this page’s owner does not allow web scraping without prior written consent or an agreement between the parties involved.
The same goes if a site makes no mention of scraper bots at all: its webmasters may still forbid scraping their data. In such cases, you should not attempt it without written permission from the owner(s). It’s always best practice to ask for permission first!
The laws around web scraping
We’ve covered some court cases and how specific laws can arise from them. Here is a summary of infractions you might consider before you begin your next web scraping projects:
- The Digital Millennium Copyright Act (DMCA) is a U.S. law that can make scraping copyrighted content from websites you don’t own illegal. For example, news sites or any site with user-generated content such as Facebook groups; however, this does not apply if your use falls under fair use.
- The Computer Fraud and Abuse Act (CFAA) is a U.S. law that makes web scraping illegal if you circumvent security measures or intentionally access a computer without authorization. However, this does not apply to using open-source, publicly available, noncommercial tools that let you pull web data for free. These kinds of web scraping tools fall under fair use, so they’re perfectly legal to use on websites with user-generated content such as Facebook groups.
- Trespass to chattels is a legal term for wrongfully interfering with someone else’s property, which can include using a web scraper to harvest data without permission.
- Terms of service/privacy policies may prohibit web scraping on specific pages, so always check these before you decide to scrape data.
- Content owners might claim copyright infringement because they believe their work has been copied without permission.
- Web scrapers may be blocked by ISPs (Internet Service Providers) if their activity looks abusive or illegal.
- The website owner may file a lawsuit against any company whose high crawl rate crashes its server or infringes its intellectual property. Make sure you don’t inflict damage of either kind.
Can websites legally restrict data scraping? Quite possibly. There’s nothing stopping website operators from drawing up binding contracts governing access to their content.
Will these provisions actually prove enforceable? The legal theory behind contract enforceability is rather complex, but it’s worth taking a look at some agreements in circulation.
Browsewrap agreements can usually be found on the homepage or in a pop-up window. Courts have generally discounted the legal value of such contracts (and not everyone allows pop-ups anyway).
However, there are notable cases in which courts have ruled in favor of browsewrap agreements.
Clickwrap, by contrast, is the kind of straightforward, reasonable contract courts are willing to enforce. This type of agreement is widespread in online stores and sign-up forms; a clickwrap agreement requires an explicit action by the user rather than mere browsing.
As the Ryanair case shows, courts are ready to enforce such agreements.
So is web scraping legal?
6 Questions to ask yourself before you scrape
Ask yourself these 6 practical questions about your web scraping ethics to be more compliant.
Are you scraping copyrighted data?
Much of the internet’s content is subject to some kind of copyright. Music, news, blogs, dissertations, pictures, magazines, databases, and logos are all potentially protected.
Using copied material or scraped data irresponsibly infringes the owner’s copyright, and it may well be illegal in many jurisdictions, particularly if you redistribute the copied data. Some situations call for scraping copyrighted content for analysis purposes; in such cases, you must consider carefully how you use it.
Are you scraping non-public data?
Websites generally keep most of their information freely accessible. Publicly accessible data is okay to scrape as long as you collect it responsibly.
Non-public data is something that is not accessible for everybody on the web. If the data comes from pages you need logins to access, then it’s not publicly accessible.
Are you scraping personal data?
Different jurisdictions have different regulations regarding access to and usage of personal data. While it might be okay to scrape personal data in some U.S. states, you could be in a bit of trouble in California. The E.U. is very sensitive about personal information, so you might want to review the General Data Protection Regulation (GDPR) before scraping such data.
Is the crawling rate tolerable?
Scraping can overload a website’s servers and crash them. Many websites set a “crawl-delay” directive in their robots.txt file. If the site does not specify a crawl delay, a conservative rule of thumb is to wait around 20 seconds between requests.
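Python’s standard library can read a crawl-delay directive for you. A minimal sketch, feeding in the robots.txt content directly (a real crawler would first fetch it from the site’s `/robots.txt` URL):

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt asking all robots to wait 10 seconds
# between requests.
robots_txt = """\
User-agent: *
Crawl-delay: 10
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

delay = rp.crawl_delay("*")  # returns None when no directive is present
print(delay)  # 10
```

When `crawl_delay` returns a value, honoring it is the simplest way to keep your request rate within the site’s stated tolerance.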
Are you complying with the terms of use?
ToU agreements may be either browsewrap or clickwrap agreements. Clickwrap agreements are those the user accepts by clicking a button; browsewrap agreements don’t require any user action.
If you follow all the terms set out, you should have no problems with your web scraping activities.
Are you compliant with the robots.txt file?
The robots exclusion protocol is the web standard for instructing web robots. A site’s robots.txt file tells you which parts of the website you may crawl and index, and which should be excluded.
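Checking compliance programmatically is straightforward with Python’s standard library. A minimal sketch with a made-up robots.txt (the bot name and URLs are illustrative assumptions):

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that blocks one directory for all robots.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Ask before you crawl: is this URL allowed for our bot?
print(rp.can_fetch("MyScraper", "https://example.com/public/page"))   # True
print(rp.can_fetch("MyScraper", "https://example.com/private/page"))  # False
```

Calling `can_fetch` before every request is a cheap way to make sure your scraper stays inside the boundaries the site has published.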