Scrapy and DynamoDB on AWS


Amazon DynamoDB is a fully managed, proprietary NoSQL database service offered by Amazon as part of the Amazon Web Services portfolio.

If you’ve considered using MongoDB to store your scraped results, and, like me, you’re doing your scraping from the cloud anyway, why not make use of DynamoDB instead?

For this example I’m using the Scrapy example project dirbot and the AWS Python SDK, boto3.

First, you will need to create a new AWS user and download its credentials; you can follow the IAM tutorials on the AWS site.

Once you have your user credentials at hand, one of the easiest ways to use them is to create a credentials file yourself. By default, boto3 looks for it at ~/.aws/credentials.
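A minimal credentials file looks something like this, with the two placeholder values swapped for the keys you downloaded (boto3 also needs a default region, which can go in ~/.aws/config or be passed in code):

```ini
[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
```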

You’ll then want to create a new DynamoDB table; for this example I used url as the primary key.

(Screenshot: Create Table)
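If you’d rather script that step than click through the console, something along these lines should work with boto3 (the table name, region and throughput values here are just placeholders):

```python
import boto3

# Region and table name are placeholders; use your own.
dynamodb = boto3.resource("dynamodb", region_name="us-east-1")

# Create a table keyed on the scraped page URL.
table = dynamodb.create_table(
    TableName="dirbot",
    KeySchema=[{"AttributeName": "url", "KeyType": "HASH"}],
    AttributeDefinitions=[{"AttributeName": "url", "AttributeType": "S"}],
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
)

# Block until the table is actually ready to accept writes.
table.meta.client.get_waiter("table_exists").wait(TableName="dirbot")
```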

You’ll then want to assign or create a policy that will allow your new user to load and query the table you just created.

(Screenshot: Policy)
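The policy only needs a handful of DynamoDB actions on that one table; a scoped-down sketch might look like this (the region, account ID and table name are placeholders):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:PutItem",
        "dynamodb:GetItem",
        "dynamodb:Query",
        "dynamodb:Scan"
      ],
      "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/dirbot"
    }
  ]
}
```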

Once that’s done, it’s a simple enough matter of creating a new pipeline to store the item results in the table.
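A minimal pipeline along these lines should do the job; it assumes the table is called dirbot and that, as in dirbot’s Website item, the url, name and description fields come back from the spider as lists of strings:

```python
# pipelines.py
import boto3


class DynamoDBPipeline(object):
    """Write each scraped item to a DynamoDB table."""

    table_name = "dirbot"  # assumed table name, keyed on "url"

    def open_spider(self, spider):
        # boto3 picks the credentials up from ~/.aws/credentials;
        # the region here is a placeholder.
        self.table = boto3.resource(
            "dynamodb", region_name="us-east-1"
        ).Table(self.table_name)

    def process_item(self, item, spider):
        record = {}
        for key, value in dict(item).items():
            # dirbot's XPath extraction yields lists, so flatten each field
            # to its first value; "url" must be a plain string because it
            # is the table's hash key.
            if isinstance(value, (list, tuple)):
                value = value[0].strip() if value else None
            # DynamoDB rejects empty string attributes, so skip blanks.
            if value:
                record[key] = value
        self.table.put_item(Item=record)
        return item
```

Since put_item overwrites any existing item with the same url, re-running the crawl simply refreshes the table rather than duplicating rows.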

Don’t forget to enable the new pipeline in settings.py.
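Assuming the class above lives in dirbot/pipelines.py, that’s a one-line addition (merge it with any pipelines you already have enabled):

```python
# settings.py
ITEM_PIPELINES = {
    "dirbot.pipelines.DynamoDBPipeline": 300,
}
```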

Then simply run your newly edited spider: scrapy crawl dmoz

(Screenshot: Results)
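To check the same thing from Python rather than the console, a quick get_item or scan against the table works (the region, table name and URL below are just example values):

```python
import boto3

table = boto3.resource("dynamodb", region_name="us-east-1").Table("dirbot")

# Fetch one stored page by its primary key.
print(table.get_item(Key={"url": "http://www.example.com/"}).get("Item"))

# Count how many items the crawl wrote.
print(table.scan(Select="COUNT")["Count"])
```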

