I deal with a lot of data in my day job, as well as in many of my side projects, and much of that data comes from web scraping. Please note that I'm not condoning the use of scraping bots, and I'm well aware of the legal grey area that scraping currently sits in.
What is web scraping?
Web scraping is a software technique for extracting information from websites. For more detail on the many forms scraping can take, check out the Wikipedia article on web scraping.
Tools of the scraping trade
Although there are hundreds of scraping utilities and tools out there, I find myself always coming back to Scrapy. For this tutorial series I'll be concentrating on Scrapy and Scrapyd only.
Scrapy is a Python application framework for building spiders. It's an open-source project that uses XPath and CSS selectors to extract 'items' from the pages you want to scrape. As the official documentation puts it:

Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
Scrapyd is a service that allows you to schedule and run deployed 'spiders' and control them via an API and web front end. In the words of its documentation:

Scrapyd is a service for running Scrapy spiders. It allows you to deploy your Scrapy projects and control their spiders using a HTTP JSON API.
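That JSON API is plain HTTP, so you can drive it from anything. As a sketch, here's how you might schedule a deployed spider and list its jobs from Python using only the standard library, assuming Scrapyd is running on its default port (6800) and that "myproject" and "quotes" are placeholder names for your own deployed project and spider:

```python
import json
import urllib.parse
import urllib.request

SCRAPYD_URL = "http://localhost:6800"  # Scrapyd's default address and port


def schedule_spider(project, spider, base_url=SCRAPYD_URL):
    """POST to Scrapyd's schedule.json endpoint and return its JSON reply."""
    payload = urllib.parse.urlencode({"project": project, "spider": spider}).encode()
    with urllib.request.urlopen(f"{base_url}/schedule.json", data=payload) as resp:
        # A successful reply looks like {"status": "ok", "jobid": "..."}
        return json.load(resp)


def list_jobs(project, base_url=SCRAPYD_URL):
    """Query listjobs.json for a project's pending, running and finished jobs."""
    query = urllib.parse.urlencode({"project": project})
    with urllib.request.urlopen(f"{base_url}/listjobs.json?{query}") as resp:
        return json.load(resp)
```

The web front end on the same port gives you a read-only view of the same information, which is handy for keeping an eye on long-running crawls.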
Why the cloud?
Sometimes it's great to be able to spin up an EC2 instance and quickly run a scrape from a fresh IP address. Amazon provides excellent infrastructure for developers, and if you create a new AWS account, the Free Usage Tier covers most of what you'll need: an EC2 instance, perhaps some S3 storage, and an RDS database. But more on that later.