Installing scrapy and scrapyd on AWS EC2

See the updated version for installing scrapy 1.0 and above here.

This post will cover the basics of getting started with Amazon AWS: creating an account, creating an EC2 instance, installing scrapy and scrapyd, and finally making sure you do it all for free!

Getting Started

Keeping It Free!

Before you go any further and start creating instances here and there, make sure you take the time to set up billing alerts. The last thing you want is to get hit with a huge bill because you forgot to turn something off or misconfigured an instance. Setting an alert is easy and gives you the peace of mind to test out AWS without denting your credit card.

First of all, you do have an AWS account eligible for the Free Tier, don’t you? If you don’t, the first thing to do is head over to AWS now and sign up. New users get 12 months of access to the Free Tier, which essentially allows you to use AWS, within reason, for free.

Once you’ve got your account signed up take a few moments to familiarise yourself with the basic tools, the dashboard, IAM and EC2 in particular. Read the IAM best practices and make sure you know how to create a new user.

Create a new user

Best practice is not to do anything with your main AWS account. Don’t use it for development at all. It’s the account that was created when you signed up and is the account that holds your credit card details. So the first step is to create a new user account and assign the new user admin rights. You’ll use this user for all your development work.

Once the user is created, save the security credentials that are generated: your Access Key ID and Secret Access Key. You’ll need those later.

Create a new security group

Log in to your AWS account and go to the EC2 dashboard. The first thing you want to do is create a new security group. Security groups are essentially firewall rules, and you want to make sure you allow only the services you’re going to use. So click on Security Groups and then create a new group. At the bottom of the dialog you’ll see tabs for inbound and outbound rules. Set up the following rules:

Inbound

  • SSH
  • Custom TCP Rule, Port 6800

Set the source on both to custom IP and only allow your own IP address if you have a static address. If you don’t, I’d advise using a VPN with a static IP and setting it to that. Leaving it open to ‘anywhere’ just creates too many security problems.

Outbound

  • HTTP
  • HTTPS
  • Custom TCP Rule, Port 6800

In the case of outbound rules it’s fine to leave the destination as anywhere as you’ll want to be able to scrape any address.

Launch a new instance

OK so now you’re ready to launch a new instance. Head over to the EC2 dashboard and click on the launch instance button. You’ll be presented with a list of AMI base instances to choose from. Select the Ubuntu Server (at the time of writing it’s 14.04). Make sure it says ‘Free Tier Eligible’ before you select the instance.

You’ll then be asked to choose the instance type. Again you’re looking for the ‘Free Tier Eligible’ instance, the t2.micro.

Tick the t2.micro and then click Review and Launch. There’s no need to customise anything else at the moment. On the instance review screen you’ll be shown a warning that the security group is open to the world! Don’t panic, this is expected. Click Edit Security Groups and choose the security group you created earlier. This will lock down your new instance to the rules you specified before.

Clicking launch again will bring up a dialog box to create a key pair. This is essential to be able to connect to your new instance. Choose create a new key pair from the drop down and give it a name. Clicking download will save a copy of the private key, as a text file, to your computer.

Double check you’ve downloaded the key successfully and then click launch instance and your new instance will spring to life somewhere on the cloud!

Connecting to your new instance

You’ll be communicating with your new instance using SSH, as we defined in the security group earlier. The first thing you need to do is find the private key you downloaded and change its permissions.

A couple of things to note: chmod 400 your private key so that only you can read it, and be sure to add ubuntu@instance_public_address when connecting, or you’ll get a permission denied error. Check the documentation if you have any problems at this point.
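
As a minimal sketch, the two steps described above might look like this. The key file name my-key-pair.pem and its location in ~/Downloads are assumptions for illustration, and instance_public_address stands in for the public DNS name or IP address of your own instance.

```shell
# Assumed key file name and location; substitute the key pair you downloaded.
chmod 400 ~/Downloads/my-key-pair.pem

# Connect as the ubuntu user. Replace instance_public_address with the
# public DNS name or IP address shown for your instance in the EC2 dashboard.
ssh -i ~/Downloads/my-key-pair.pem ubuntu@instance_public_address
```

If the connection times out rather than being refused, it is worth checking that the inbound SSH rule in your security group matches the IP address you are connecting from.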

Installing Scrapy on EC2 Ubuntu

OK so you’ve managed to create an instance and connect successfully via SSH. Next up you’ll want to install scrapy. In this tutorial I’m just going to install scrapy directly on the system and not use virtualenv. You can choose to do it either way, but if you’re planning on using the same instance for a while then it’s probably a good idea to separate your scrapy install from any other Python development you might be doing on the same instance.

The first step is to install pip along with the python development tools or you’ll get an error when installing twisted.

Use apt-get update to download the latest package lists from the repositories, so that apt knows about the newest versions of packages and their dependencies. Notice also that all commands are run using sudo, as you’re logged in as the ubuntu user rather than root.
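
On the instance, that step would look something like this:

```shell
# Refresh the package lists so apt knows about the latest package versions.
sudo apt-get update
```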

Install the build-essential and python-dev packages. You need these to compile twisted when you install scrapy.
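
A sketch of that install, using the two package names mentioned above:

```shell
# Compiler toolchain and Python headers, needed to build twisted's C extensions.
sudo apt-get install build-essential python-dev
```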

Install pip, the Python package management system, using apt-get.
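
Something like the following should do it. The package name python-pip is an assumption based on the Ubuntu 14.04 repositories; other releases may name it differently.

```shell
# Install pip from the Ubuntu repositories (package name assumed for 14.04).
sudo apt-get install python-pip
```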

Now you can use pip to install scrapy.
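
A minimal sketch of the install, followed by a quick check that the scrapy command is available. The version check is an addition for illustration, not part of the original steps.

```shell
# Install scrapy system-wide (no virtualenv, as discussed above).
sudo pip install scrapy

# Optional check: print the installed scrapy version.
scrapy version
```

If the build fails on lxml, you may be missing the libxml2 and libxslt development headers on your instance.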

And that’s it, you now have scrapy installed.

Installing scrapyd

I find it’s easiest to use apt-get to install scrapyd. For this to work you need to add the scrapy repo:
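
The exact commands aren’t reproduced here, so treat the following as a sketch. The signing key ID and repository URL are assumptions, taken from the instructions the Scrapy project published for its Ubuntu packages around the time of writing, and the archive may no longer be available.

```shell
# Import the GPG key used to sign the scrapy packages (key ID is an assumption).
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 627220E7

# Register the scrapy repository with apt (URL is an assumption).
echo 'deb http://archive.scrapy.org/ubuntu scrapy main' | sudo tee /etc/apt/sources.list.d/scrapy.list
```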

You can then run apt-get install scrapyd to install scrapyd on the system.
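
As a sketch, assuming the scrapy repository has been added successfully:

```shell
# Pick up the package lists from the newly added repository, then install scrapyd.
sudo apt-get update
sudo apt-get install scrapyd

# Check whether the scrapyd service is running, and start it if it isn't.
service scrapyd status
sudo service scrapyd start
```

The service commands assume the package registers a service named scrapyd, which is what the status output on Ubuntu 14.04 suggests.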

Finishing Up

OK so you’re done. You now have a new Ubuntu EC2 instance running both scrapy and scrapyd. In the next post I’ll cover deploying your spiders to your new instance and controlling them via scrapyd.

About This Author

Big Data and Python geek. Writes an eclectic mix of news from the world of Big Data and Telecommunications interspersed with articles on Python, Hadoop, E-Commerce and my continual attempts to learn Russian!

3 Comments


  • I had to install the following requirements for lxml (a scrapy dependency) to build:

    sudo apt-get install libxml2-dev libxslt-dev python-dev lib32z1-dev

    Tim, 3 years ago


  • Hi,

    Thank you for your tutorial. I followed your instructions and now have an Ubuntu EC2 instance running both scrapy and scrapyd.

    Next, I clicked your link for the next post: http://neuralfoundry.com/deploying-scrapy-ec2/

    After attempting your 1st step to check if scrapyd is running:

    ubuntu@ip-xxx-xx-xx-xxx:~$ service scrapyd status

    I receive this:

    scrapyd stop/waiting

    Question: Can you tell me how I start scrapyd so I can complete your deploying-scrapy-ec2 tutorial?

    Thank you!!!

    Robert, 2 years ago

