Deploying scrapy on EC2

Welcome to part 3 of my guide to using AWS for scraping.

If you haven’t already, make sure you check out the first two parts, here and here. We’re going to continue using the same EC2 instance you created in part two.

Some assumptions before we begin

I’m going to assume a few things before we begin.

  1. You’ve installed scrapy and scrapyd on an EC2 instance
  2. You’ve saved the key pair you generated while creating the instance
  3. You’ve added a security group as per the last post, allowing SSH (22) and scrapyd (6800)
  4. You’ve got scrapy and setuptools installed on your local or development system (see the example commands just after this list)
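
If your local system still needs these, something along the following lines should work. This is just a sketch, assuming a pip-based setup; on newer Scrapy releases the deploy command has moved out of Scrapy itself and into the scrapyd-client package, so you may want that as well.

# Install scrapy and setuptools on the local/development machine (sketch)
pip install scrapy setuptools
# Only needed on newer Scrapy versions, where "scrapy deploy" has been
# replaced by the scrapyd-deploy command from scrapyd-client
pip install scrapyd-client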

Personally I tend to use a vagrant image for my scrapy development work. This allows me to keep my local system clean and avoids issues with installing packages or dependencies. If you’re looking for an image that’s already set up with scrapy installed, you can always use the Portia image as a starting point, although it doesn’t have scrapyd, setuptools or curl installed.

Confirming scrapyd is working

First things first, let’s confirm that scrapyd is indeed running on the EC2 instance. SSH to your instance and check that the service is running.
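
For example, any of the following should tell you whether it’s up. This is a sketch: the exact service command depends on how you installed scrapyd, and the daemonstatus.json endpoint is only available on more recent scrapyd releases.

# Check the scrapyd service (if it was installed as a system service)
sudo service scrapyd status
# Or simply look for the process
ps aux | grep scrapyd
# Or, on newer scrapyd versions, ask the daemon directly
curl http://localhost:6800/daemonstatus.json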

Then, using your browser, go to http://public-address-of-your-ec2.com:6800 and you should see the scrapyd page as below.

(Screenshot: the scrapyd web front end)

Configuring your local system

OK, so scrapyd is up and running on the remote EC2 instance. Now you’ll need to configure your local development system so it can deploy to the instance.

On your local system go to the root of the crawler project and edit the scrapy.cfg file that has been created. You’ll want to add a new deploy target as well as edit the local target. Below is an example of what it should look like, obviously replacing the EC2 address with your own:
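
(A sketch, assuming your project is called myproject; the target names local and aws are placeholders.)

[settings]
default = myproject.settings

[deploy:local]
url = http://localhost:6800/
project = myproject

[deploy:aws]
url = http://public-address-of-your-ec2.com:6800/
project = myproject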

You can list the available targets by running scrapy deploy -l from within your project root directory.
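
With a configuration like the sketch above, the output should look roughly like this (format illustrative):

$ scrapy deploy -l
local                http://localhost:6800/
aws                  http://public-address-of-your-ec2.com:6800/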

Deploying your spider

To deploy your spider you simply use scrapy deploy:

scrapy deploy [ <target:project> | -l <target> | -L ]
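
For example, deploying to the aws target from the scrapy.cfg sketch above should produce output roughly like the following (the version number, project name and spider count are illustrative):

$ scrapy deploy aws
Packing version 1428485305
Deploying to project "myproject" in http://public-address-of-your-ec2.com:6800/addversion.json
Server response (200):
{"status": "ok", "project": "myproject", "version": "1428485305", "spiders": 1}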

Verify that the project was installed correctly by accessing the scrapyd API:
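
For instance, the listprojects.json endpoint should now include your project (hostname and project name are the placeholders from before):

$ curl http://public-address-of-your-ec2.com:6800/listprojects.json
{"status": "ok", "projects": ["myproject"]}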

If you refresh the web front end you’ll also see the project now listed under available projects, like so:

(Screenshot: the deployed project listed under available projects)

And that’s it! Put your feet up and crack open a cold beer, you’ve just deployed your first project to AWS.

Scheduling your spider

Now that you’ve deployed your spider, the real fun can begin. Again using the API, it’s very simple to schedule your spider:
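
A minimal example using the placeholders from earlier, with a hypothetical spider called myspider (the jobid in the response is illustrative):

$ curl http://public-address-of-your-ec2.com:6800/schedule.json -d project=myproject -d spider=myspider
{"status": "ok", "jobid": "26d1b1a6d6f111e0be5c001e648c57f8"}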

For this very simple example we’re just passing the project name and spider name. If you need to pass more arguments to the spider, allowed domains or start URLs for example, you just add more -d arguments, as in the sketch below.
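
Scrapyd passes any additional -d parameter through as a spider argument, so a hypothetical start_url argument (which your spider would need to accept in its constructor) could be supplied like this:

$ curl http://public-address-of-your-ec2.com:6800/schedule.json -d project=myproject -d spider=myspider -d start_url=http://example.com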

You’ll be able to monitor the job by going to the jobs page on scrapyd.

(Screenshot: the scrapyd jobs page)

You can also use the API to cancel a running job, or to delete a version or an entire project. The web front end also lets you view scraped items, a great way to test basic spiders before you deploy them to production.
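
For example (same placeholders as before; the job ID comes from the schedule response or the listjobs.json endpoint):

# Cancel a running job
$ curl http://public-address-of-your-ec2.com:6800/cancel.json -d project=myproject -d job=26d1b1a6d6f111e0be5c001e648c57f8
# Remove the project (and all its versions) from scrapyd
$ curl http://public-address-of-your-ec2.com:6800/delproject.json -d project=myproject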

So that’s it! Next up: integrating scrapy with databases, storing scraped data in a database and using a database to store parameters for your spiders.

About This Author

Big Data and Python geek. Writes an eclectic mix of news from the world of Big Data and Telecommunications interspersed with articles on Python, Hadoop, E-Commerce and my continual attempts to learn Russian!

3 Comments



  • Hi Guy,

    I just stumbled upon this post (and the two that preceded it) and found them very helpful! I’d like to run a set of crawls (~100 to start, but potentially up to 10,000) simultaneously once a day on EC2, and I was wondering if you could answer the following questions:

    (1) to run >100 crawls simultaneously, would I get better performance spreading them over multiple smaller instances or running them all on one larger instance? If it’s the former, how do I tell EC2 which crawlers go to which instance?

    (2) assuming the crawls are finished after 3-4 hours and don’t need to run for another 20-21 hours, is there a way to automatically shut off the EC2 instance(s) so I don’t get billed for the idle time? (I’d be writing the crawler output to persistent storage, so I’m not concerned about data loss).

    (3) To have all the crawls run at the same time each day, do I just write a cron job that runs scrapyd for each single crawler, or is there a more elegant way to do that?

    Any insight from you would be appreciated!

    Thanks,
    Beata

    Beata, 3 years ago


  • Thanks for such a good tutorial.
    Kindly post the next tutorial on connecting with a database, to save data in a database after deploying your spiders in scrapyd.

    farooq, 3 years ago


  • Please upload the next tutorial.

    farooq, 3 years ago

