The Ultimate Guide To Ethical Web Scraping

FindDataLab.com
2 min read · Mar 31, 2020
Photo by Gautam Krishnan on Unsplash

Web scraping can be one of the best tools for quickly collecting large amounts of data, letting you analyze events and get information about them almost as they happen.

Scraping one page is pretty simple. Problems usually arise when we want to collect a large amount of information in a short period of time. Writing a crude script with simple settings is quite easy, but if you scrape a website with such a script, you may end up blocked or banned by the webmaster.

In the article linked below, we discuss the five main things to keep in mind when scraping: the Terms of Use and robots.txt, APIs, identifying yourself when sending requests, time-outs and responsive delays, and simulating a real-world user.

Before starting anything related to scraping, you should familiarize yourself with the site's Terms of Use. It is important to find out whether the data is copyrighted or subject to any other restrictions. Most popular websites have a robots.txt file, which can specify a crawl delay and the pages that should not be scraped. The Terms of Use page may also contain specific limitations on scraping the site.
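As a rough illustration of the robots.txt check described above, here is a minimal sketch using Python's standard `urllib.robotparser` module. The robots.txt content, the `example.com` URLs, and the bot name are made up for the example; a real scraper would load the file from the target site instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used so the example runs offline.
# A real scraper would fetch https://<site>/robots.txt, for instance with
# RobotFileParser.set_url() followed by RobotFileParser.read().
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

# Hypothetical name for our scraper.
USER_AGENT = "example-research-bot"


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set we can query."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(EXAMPLE_ROBOTS_TXT)

# Ask whether our bot may fetch each URL, and how long to wait between requests.
allowed_public = rules.can_fetch(USER_AGENT, "https://example.com/articles/1")
allowed_private = rules.can_fetch(USER_AGENT, "https://example.com/private/data")
delay = rules.crawl_delay(USER_AGENT)  # None if the file sets no Crawl-delay

print(allowed_public, allowed_private, delay)
```

Note that robots.txt only covers crawling rules; it does not replace reading the Terms of Use.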

Some websites, especially large ones like Twitter, Facebook, or Google Maps, make their APIs available to users. APIs (Application Programming Interfaces) let you request a site's data directly and use it in your own projects. Collecting data through the API is website friendly by design, since you are using an access route the site itself provides, which spares you some headaches. Keep in mind, though, that APIs sometimes provide outdated information.

Imitating an ordinary user can be considered an ethical gray zone. A scraper that is able to imitate a real user is not so easy to block. Several aspects go into making a scraper look like a real user: the user-agent string, the IP address, the time-outs between requests, and the request rate.
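Two of these aspects, the user-agent string and the delay between requests, can also be used transparently rather than for disguise. The sketch below is one possible way to identify yourself and to apply a responsive delay that grows when the server answers slowly. The bot name, contact address, function names, and the delay parameters are all assumptions chosen for illustration, not recommended values.

```python
import time
import urllib.request

# Hypothetical identity: names the bot and tells the webmaster how to reach you.
USER_AGENT = "example-research-bot/0.1 (+mailto:you@example.com)"


def build_request(url: str) -> urllib.request.Request:
    """Create a request that openly identifies the scraper."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def next_delay(response_seconds: float, min_delay: float = 1.0,
               factor: float = 2.0, max_delay: float = 60.0) -> float:
    """Responsive delay: wait longer when the server responds slowly.

    The defaults are illustrative assumptions, not values from any standard.
    """
    return min(max(min_delay, factor * response_seconds), max_delay)


def polite_fetch(urls, fetch=None, sleep=time.sleep, clock=time.monotonic):
    """Fetch URLs one at a time, pausing after every request."""
    if fetch is None:
        def fetch(url):
            # Time-out so a stalled connection does not hang the scraper.
            with urllib.request.urlopen(build_request(url), timeout=10) as resp:
                return resp.read()
    pages = []
    for url in urls:
        started = clock()
        pages.append(fetch(url))
        sleep(next_delay(clock() - started))
    return pages
```

Calling `polite_fetch(["https://example.com/"])` would download the page with the identifying header and then pause for at least one second. The `fetch`, `sleep`, and `clock` parameters exist only so the pacing logic can be tried without touching the network.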

Read the full article for more details on legal web scraping.

FindDataLab.com

Turn any website into data and have it delivered directly to you in any format.