Installing scrapy and scrapyd on AWS EC2 – Updated for Scrapy 1.0

With the release of Scrapy 1.0 I thought it was time to update this post, as the older instructions probably won’t work anymore.

Getting Started

Keeping It Free!

Before you go any further and start creating instances here and there, make sure you take the time to set up billing alerts. The last thing you want is to get hit with a huge bill because you forgot to turn something off or misconfigured an instance. Setting an alert is easy and gives you the peace of mind to test out AWS without denting your credit card.

First of all, you do have an AWS account eligible for the Free Tier, don’t you? If you don’t, the first thing you need to do is head over to AWS now and sign up. New users get 12 months of access to the Free Tier, which essentially allows you to use AWS, within reason, for free.

Once you’ve got your account signed up take a few moments to familiarise yourself with the basic tools, the dashboard, IAM and EC2 in particular. Read the IAM best practices and make sure you know how to create a new user.

Create a new user

Best practice is not to do anything with your main AWS account. Don’t use it for development at all. It’s the account that was created when you signed up and is the account that holds your credit card details. So the first step is to create a new user account and assign the new user admin rights. You’ll use this user for all your development work.

[Screenshot: creating a new user in IAM]

Once the user is created, save the security credentials that are generated (your Access Key ID and Secret Access Key). You’ll need those later.

Create a new security group

Log in to your AWS account and go to the EC2 dashboard. The first thing you want to do is create a new security group. Security groups are essentially firewall rules, so make sure you allow only the services you’re going to use. Click on Security Groups and then create a new group. At the bottom of the dialog you’ll see tabs for inbound and outbound rules. Set up the following rules as shown in the screenshots below:

Inbound

  • SSH
  • Custom TCP Rule, Port 6800

Set the source on both to custom IP and, if you have a static address, only allow your own IP address. If you don’t, I’d advise using a VPN with a static IP and setting the source to that. Leaving it open to ‘anywhere’ just creates too many security problems.
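
If you’d rather script the inbound rules than click through the console, something like the sketch below should be close. It assumes you have the AWS CLI installed and configured with the access keys of the user you created earlier, and that the group lives in your default VPC. The group name and IP address are made-up placeholders, so swap in your own.

```shell
# Assumes the AWS CLI is installed and set up (via "aws configure") with the
# Access Key ID and Secret Access Key of the user created earlier.
# The group name and IP address below are hypothetical placeholders.
GROUP_NAME="scrapyd-sg"
MY_IP="203.0.113.10"   # replace with your own static IP address

# Create the security group in your default VPC.
aws ec2 create-security-group \
    --group-name "$GROUP_NAME" \
    --description "SSH and scrapyd access from my IP only"

# Inbound rule: SSH (port 22) from your IP only.
aws ec2 authorize-security-group-ingress \
    --group-name "$GROUP_NAME" --protocol tcp --port 22 --cidr "$MY_IP/32"

# Inbound rule: scrapyd (port 6800) from your IP only.
aws ec2 authorize-security-group-ingress \
    --group-name "$GROUP_NAME" --protocol tcp --port 6800 --cidr "$MY_IP/32"
```

The /32 suffix restricts each rule to that single address, which is the command-line equivalent of choosing a custom IP as the source.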

[Screenshot: inbound rules]

Outbound

  • HTTP
  • HTTPS
  • Custom TCP Rule, Port 6800

In the case of outbound rules it’s fine to leave the destination as anywhere as you’ll want to be able to scrape any address.

[Screenshot: outbound rules]

Launch a new instance

OK so now you’re ready to launch a new instance. Head over to the EC2 dashboard and click on the launch instance button. You’ll be presented with a list of AMI base instances to choose from. Select the Ubuntu Server (at the time of writing it’s 14.04). Make sure it says ‘Free Tier Eligible’ before you select the instance.

[Screenshot: choosing the Ubuntu Server AMI]

You’ll then be asked to choose the instance type. Again you’re looking for the ‘Free Tier Eligible’ instance, the t2.micro.

[Screenshot: choosing the t2.micro instance type]

You can tick the t2.micro and then click Review and Launch. No need to customise anything else at the moment. On the instance review screen you’ll be shown a warning that the security group is open to the world! Don’t panic, this is expected. Click Edit Security Groups and choose the security group you created earlier. This will lock down your new instance to the rules you specified before.

Clicking launch again will bring up a dialog box to create a key pair. This is essential to be able to connect to your new instance. Choose create a new key pair from the drop down and give it a name. Clicking download saves a text copy of the private key to your computer.

[Screenshot: creating a new key pair]

Double check you’ve downloaded the key successfully and then click launch instance and your new instance will spring to life somewhere on the cloud!

Connecting to your new instance

You’ll be communicating with your new instance using SSH, as we defined in the security group earlier. The first thing you need to do is find the private key you downloaded and change its permissions.

A couple of things to note: chmod 400 your private key, and be sure to connect as ubuntu@instance_public_address or you’ll get a permission denied error. Check the documentation if you have any problems at this point.
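
As a rough sketch, those two steps can be wrapped in a small shell helper. The function name is my own invention, and the key path and address you pass in are whatever you downloaded and whatever the EC2 dashboard shows for your instance.

```shell
# Hypothetical helper: lock down the private key, then connect as the
# default "ubuntu" user on the Ubuntu AMI.
#   $1 = path to the private key file downloaded from AWS
#   $2 = public DNS name or IP address of the instance
connect_to_instance() {
    key_file="$1"
    instance_address="$2"

    # SSH refuses private keys that other users can read, so make the
    # key read-only for its owner.
    chmod 400 "$key_file"

    # Log in as "ubuntu"; leaving the user off makes SSH try your local
    # username, which ends in a permission denied error.
    ssh -i "$key_file" "ubuntu@$instance_address"
}
```

You’d call it with something like connect_to_instance ~/Downloads/scrapy-key.pem ec2-203-0-113-10.compute-1.amazonaws.com, where both arguments are placeholders for your own key file and public address.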

Installing Scrapy on EC2 Ubuntu

OK so you’ve managed to create an instance and connect successfully via SSH. Next up you’ll want to install scrapy. In this tutorial I’m just going to install scrapy directly on the system and not use virtualenv. You can choose to do it either way and if you’re planning on using the same instance for a while then it’s probably a good idea to separate your scrapy install from any other python development you might be doing on the same instance.

Ubuntu ships its own official Scrapy packages, but Scrapinghub publishes apt-gettable packages which are generally fresher than those in Ubuntu, and more stable too, since they’re continuously built from the GitHub repo (master and stable branches) and so contain the latest bug fixes.

To use the packages:

  1. Import the GPG key used to sign Scrapy packages into APT keyring:

  2. Create /etc/apt/sources.list.d/scrapy.list file using the following command:

  3. Update package lists and install the scrapy package:
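
The three steps above look roughly like this on the command line. The key ID and repository URL are as I recall them from the Scrapy 1.0 installation docs, so treat them as assumptions and check them against the current documentation before running anything.

```shell
# 1. Import the GPG key used to sign the Scrapy packages
#    (key ID is an assumption taken from the Scrapy 1.0 docs).
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 627220E7

# 2. Create /etc/apt/sources.list.d/scrapy.list pointing at the repository
#    (URL is an assumption taken from the Scrapy 1.0 docs).
echo 'deb http://archive.scrapy.org/ubuntu scrapy main' | sudo tee /etc/apt/sources.list.d/scrapy.list

# 3. Update the package lists and install the scrapy package.
sudo apt-get update && sudo apt-get install scrapy
```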

And that’s it! Really very simple.

Installing scrapyd

You can also install scrapyd with apt-get from the same repositories. However, at the moment there’s a problem with the version in the repositories: even though the install looks like it works, that version doesn’t work with Scrapy 1.0.

At the time of writing, the easiest way to solve this is to install pip first and then install scrapyd using pip.
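
A minimal sketch of those two steps, assuming the stock Ubuntu 14.04 image where pip comes from the python-pip package:

```shell
# Install pip from the Ubuntu repositories
# (python-pip is the package name on Ubuntu 14.04).
sudo apt-get install python-pip

# Install scrapyd system-wide with pip rather than apt-get.
sudo pip install scrapyd
```

Installing system-wide with sudo matches the no-virtualenv approach used in this tutorial. If you went the virtualenv route, drop the sudo and run pip inside the environment instead.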

 

Finishing Up

OK so you’re done. You’ll now have a new Ubuntu EC2 instance running both the latest scrapy 1.0.1 and scrapyd 1.1.0.
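
If you want to double check, a quick sanity test along these lines should do it. This is only a sketch: it assumes curl is available on the instance and that scrapyd is listening on its default port, 6800.

```shell
# Confirm which version of Scrapy ended up on the PATH.
scrapy version

# Start scrapyd in the background, logging to a file in the current directory.
scrapyd > scrapyd.log 2>&1 &

# Give the daemon a few seconds to start, then query its JSON API.
sleep 5
curl http://localhost:6800/listprojects.json
```

A JSON response with an ok status and an empty project list means scrapyd is up and waiting for you to deploy something to it. If nothing comes back, scrapyd.log is the first place to look.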
