I’ve recently moved all my AWS instances over to Amazon Linux and wanted to write a short update on installing Scrapy, as the process differs slightly from Ubuntu.
Why Amazon Linux?
Amazon Linux is a distribution that evolved from Red Hat Enterprise Linux (RHEL) and CentOS. It is available for use within Amazon EC2: it comes with all the tools needed to interact with Amazon APIs, is optimally configured for the Amazon Web Services ecosystem, and Amazon provides ongoing support and updates.
- An AWS account, obviously
- Create a Role in IAM. I use S3 to store the results of scraping, so I create a Role and attach the AmazonS3FullAccess policy.
- Assign the IAM Role you created above when you create the instance. This lets you grant the server access to other AWS services without embedding your access key and secret key on the server or in scripts. Note that you can only assign an IAM Role while the instance is being created; you can change the Role’s permissions afterwards to add new services, but the Role itself must be attached at launch. If you have an existing instance and want to add a Role, the easiest way is to shut it down, take a snapshot, and launch again with the new Role.
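With the Role attached, the AWS CLI and SDKs on the instance pick up temporary credentials from the instance metadata service automatically, so no keys need to live on the server. A minimal sketch, assuming the AmazonS3FullAccess policy above; the bucket and file names are placeholders:

```shell
# No 'aws configure' needed: credentials come from the attached instance Role.
# 'my-scrape-results' and output/items.json are placeholders for illustration.
aws s3 cp output/items.json s3://my-scrape-results/$(date +%F)/items.json
```

Keying the S3 path by date (`date +%F` gives YYYY-MM-DD) keeps each day’s scrape separate.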
If you’re using one of the smaller instance types such as t2.nano or t2.micro, you’ll need to add some swap space before installing, otherwise the lxml compilation will fail when the compiler runs out of memory.
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
echo "/swapfile swap swap sw 0 0" | sudo tee -a /etc/fstab
sudo swapon /swapfile
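A quick way to confirm the swap file is active before kicking off the install (the figure below assumes the 4G file created above):

```shell
# Both should show the new swap; /proc/meminfo reports SwapTotal in kB.
swapon -s                      # lists /swapfile as an active swap device
grep SwapTotal /proc/meminfo   # roughly 4 GB worth of kB for the file above
```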
Then it’s as easy as:
sudo yum update -y
sudo yum install python-pip -y
sudo yum install python-devel -y
sudo yum install gcc gcc-devel -y
sudo yum install libxml2 libxml2-devel -y
sudo yum install libxslt libxslt-devel -y
sudo yum install openssl openssl-devel -y
sudo yum install libffi libffi-devel -y
The CFLAGS="-O0" on the lxml install disables compiler optimisation, which keeps memory usage down during the build:

sudo CFLAGS="-O0" pip install lxml
sudo pip install scrapy
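A quick smoke test, assuming the installs above finished without errors — both commands should print version information rather than an ImportError:

```shell
# Confirms lxml compiled correctly and Scrapy is importable.
python -c "import lxml.etree, scrapy; print(scrapy.__version__)"
scrapy version
```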
At this point you might want to take a snapshot or create an AMI so you can quickly spin up Scrapy instances in the future.
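Creating the AMI can itself be scripted from anywhere the CLI has EC2 permissions; the instance ID below is a placeholder:

```shell
# Image the configured instance so future scrapers launch pre-built.
# i-0123456789abcdef0 is a placeholder instance ID.
aws ec2 create-image --instance-id i-0123456789abcdef0 \
  --name "scrapy-base-$(date +%Y%m%d)" \
  --description "Amazon Linux with Scrapy installed"
```

Date-stamping the AMI name makes it easy to tell builds apart as you rebuild the base image over time.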