Jupyter, Plotly, Pandas, SciPy, NumPy and SciKit-Learn on AWS EC2


There are times when having your data processing compute power in the cloud really pays off. Store your data on S3? Use Elastic MapReduce? Need the occasional boost in CPU or RAM? Need a Spark cluster? All great reasons to have access to your favourite Python data tools in the cloud.

The beauty of the AWS environment is the ability to set up your EC2 environment just how you like it and then spin up various instance sizes depending on the workload, whenever and wherever you need them. On-demand data science computing made easy.

Here’s a list of my current tools of choice:

  • Jupyter – the IPython spin-off notebook that ships with the IPython kernel and supports kernels for many other languages.
  • Plotly.js – the beautiful visualisation engine now open source.
  • NumPy – the fundamental package for scientific computing with Python, contains N-dimensional array objects.
  • SciPy – scientific computing routines (linear algebra, optimisation, statistics and more) built on top of NumPy.
  • Pandas – the amazing data analysis toolset.
  • scikit-learn – machine learning and data mining toolkit.

There are a number of options for installing these tools, ranging from using your OS package management through to full source builds. Another popular choice is a fully packaged install such as Anaconda, which comes with many of the libraries already installed. If you’re anything like me and you want specific packages or versions, then installing them yourself isn’t too tricky and gives you a lot more control.

For this I’ll be using Python 2.7; it’s still the dominant installed version and, to be honest, I still much prefer it to 3.

Prerequisites

  • An AWS account – if you still don’t have an AWS account I’m not sure where you’ve been for the last few years, but head on over to here and enter your details. You’ll need a credit card, but with the free tier you’ll be able to run this for free.

The EC2 Instance

I use the Amazon Linux AMI. It’s an EBS-backed, AWS-supported image. The default image includes AWS command line tools, Python, Ruby, Perl, and Java. Most of the options during instance creation I leave at their defaults, apart from the following:

  • IAM Role – I have a role configured, or you can create one, that gives access to other AWS services. My default IAM Role gives access to S3, for example. This allows you to use the AWS command line tools without having to store your access key ID or secret key on the server. I’ve seen too many examples of an access key ID and secret key being accidentally committed to GitHub. Anyone with access to those can run up a huge AWS bill on your account as well as access your data.
  • Security Group – I create a new security group that allows SSH (port 22), HTTP (port 80) and a custom rule for port 8888 for Jupyter. I set the source for all of these to ‘My IP’. This effectively locks down your instance so only you can access it, and means you don’t strictly need a password for Jupyter or HTTPS. For production usage I’d recommend HTTPS as well as a private VPC with VPN access and a public VPC with NAT. Setting that up is beyond the scope of this article, but it gives you a private cloud with no access from the internet.
  • Swap space – If you’re using a t2.nano or t2.micro you’ll need to assign some swap space or the compilation of SciPy will fail. Check the first part of my guide to installing Scrapy on Amazon Linux for how to add swap space, or use the quick sketch just after this list.
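For reference, a minimal sketch of adding 1 GB of swap (the file path and size here are just examples; the Scrapy guide has the full walkthrough):

sudo dd if=/dev/zero of=/swapfile bs=1M count=1024
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile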

Getting Started

Once your new instance is running, SSH into it using the private key you created.
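On the Amazon Linux AMI the default user is ec2-user; the key path and address below are placeholders for your own:

ssh -i ~/.ssh/your-key.pem ec2-user@your-instance-public-ip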

Type yes to add the new host fingerprint and you’ll be greeted by the Amazon Linux login banner.

First things first, to fix the locale error run these commands:
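A minimal sketch, assuming the fix is applied system-wide via /etc/environment (other approaches, such as adjusting sshd’s AcceptEnv, also work):

sudo vi /etc/environment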

add these lines:
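Assuming a UTF-8 US English locale (adjust to taste), these are the usual settings:

LANG=en_US.UTF-8
LC_ALL=en_US.UTF-8

Log out and back in for the change to take effect.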

Then run a yum update to make sure everything is up to date.
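That’s simply:

sudo yum update -y

The -y flag skips the confirmation prompt; drop it if you’d rather review the package list first.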

Amazon Linux comes with Python as well as the AWS command line tools and boto already installed, so let’s get started with the other packages:
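Something like the following should cover it (the python27-devel package name is an assumption based on the Python 2.7 that ships with the Amazon Linux AMI):

sudo yum groupinstall -y "Development Tools"
sudo yum install -y python27-devel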

This will install all the development tools you need to compile some of the packages later along with the Python headers.

There are a couple of other packages we need to install at this point: tmux, the terminal multiplexer we’ll use to keep Jupyter running when we log out, and the ATLAS and LAPACK libraries, which NumPy and SciPy both require.
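The yum packages for these (atlas-sse3-devel and lapack-devel are the names referenced later in this guide):

sudo yum install -y tmux atlas-sse3-devel lapack-devel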

We’re now ready to install Jupyter and plotly. As we already have Python installed, that’s as easy as just using pip:
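Assuming pip is already available on the instance (install python27-pip via yum if it isn’t):

sudo pip install jupyter plotly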


Great! You’re almost done. Let’s make sure all the other packages are correctly installed and then we can return and configure Jupyter.

Pandas, NumPy, SciPy Oh My!

Now we’re getting to the meat of the install. Let’s get NumPy working:

sudo pip install numpy

This will take a while to run. Once the install is complete, make sure that NumPy is using the ATLAS libraries.
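One way to check is to print NumPy’s build configuration:

python -c 'import numpy; numpy.show_config()'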

You’ll get output listing the BLAS and LAPACK libraries NumPy was built against.

If you don’t see output for atlas_threads_info, blas_opt_info, atlas_blas_threads_info, or lapack_opt_info then NumPy did not find the ATLAS libraries.

If those entries are missing or reported as NOT AVAILABLE, NumPy is installed but will not use the ATLAS libraries, so it will be a LOT slower than it should be. At this point it’s best to start over and make sure atlas-sse3-devel and lapack-devel are installed before force reinstalling NumPy.
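Once the ATLAS development packages are in place, something along these lines should force the rebuild:

sudo pip install --upgrade --force-reinstall numpy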

Next up it’s SciPy. That’s an easy install using pip again. Note the prerequisite for swap space above if you get an error installing SciPy.
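As before, it’s a single pip command (expect a long compile):

sudo pip install scipy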

Almost done; let’s install Pandas and scikit-learn.
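Both install straight from pip:

sudo pip install pandas scikit-learn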

One final step is to install nose and run some tests to make sure everything is working 100%.
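A sketch of that final check; in these versions numpy.test() and scipy.test() use nose under the hood, so expect them to take a few minutes:

sudo pip install nose
python -c 'import numpy; numpy.test()'
python -c 'import scipy; scipy.test()'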

And that’s it. We just need to finish the configuration of Jupyter and optionally configure a static IP address and domain name to make accessing our notebooks easier.

Jupyter Configuration

First off we need to create a new Jupyter configuration file. You can do this by running:
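The command for this is:

jupyter notebook --generate-config

which creates ~/.jupyter/jupyter_notebook_config.py.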

You’ll then need to edit the config file and change a couple of settings.

To get Jupyter running there are two lines you need to edit.
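In jupyter_notebook_config.py, uncomment and set the following (use '0.0.0.0' instead of '*' if your Jupyter version complains about the wildcard):

c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False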

Make sure you remove all leading whitespace from these lines. The first allows Jupyter to listen on all IPs, not just 127.0.0.1, and the second stops the automatic launching of the browser when you start Jupyter.

You can also set up a password and SSL access in this config file if you need to. If you’ve followed my advice about using your own IP (and you have a static IP) you can skip this. The documentation has all the info needed to set up a password.

That’s it for the installs; you’ve now got a fully functioning Python-based data science workstation in the cloud. I’d advise shutting down the instance and making an image at this point. This will allow you to spin up different instance sizes with everything preinstalled and configured at any time.

Tmux Basics

Tmux, much like screen, gives you the ability to create sessions that keep running once you log out. There are a number of great guides to get you started, or if you use something like iTerm on the Mac there are even menu options to navigate the windows and sessions.

You’ll want to start a session and a new window, and you can then start the notebook using the jupyter notebook command.
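A minimal sketch (the session name is just an example):

tmux new -s jupyter
jupyter notebook

Detach with Ctrl-b d and the notebook keeps running; tmux attach -t jupyter gets you back in.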

You can then access your notebook using the public IP address listed on the AWS console: http://publicip:8888/

Elastic IPs and Route53

This step is completely optional, but if you’re going to be accessing your notebooks regularly you can assign an Elastic IP to the instance to give it a static public IP address. Then, if you’re feeling fancy, you can use Route53 to set up a subdomain so you can access your notebooks from your own easy-to-remember URL such as jupyter.takeovertheworld.com.

Next Steps

Hope you found this guide useful. In the next instalment we’ll look at installing the Spark kernel in Jupyter, accessing a Spark cluster and doing some data analysis with Pandas and scikit-learn on data stored in S3.

