Previously I walked through running Spark locally for development, but one of the major challenges of learning to use distributed systems is understanding how the various components are installed and interact with each other in a production-like environment.
You can use Vagrant or virtual machine images to run a cluster on your local machine, but a much easier way, especially if you're already using other Amazon services such as S3, is to run a Spark cluster on Amazon Web Services (AWS) Elastic Compute Cloud (EC2).
In this guide we’ll cover:
- A couple of prerequisites
- A note on costs
- Launching a Cluster
- Running Applications
- Accessing data stored on S3
- Terminating your Spark Cluster
A couple of prerequisites
- You should have downloaded the latest version of Spark. At the time of writing that’s 1.6.0. You can take a look at my earlier blog post on installing Spark on El Capitan if you need help downloading and installing Spark.
- This guide assumes you’ve already got an AWS account with a credit card. Seriously, if you haven’t signed up yet go and take advantage of the free tier right now!
- The spark-ec2 script uses boto. If you haven't already installed it, don't worry: the script will download and install it when it's first run.
- For this exercise we’ll be using the spark-ec2 script that ships with Spark. This only supports launching a standalone cluster on EC2. Running Spark on top of Hadoop on EC2 or using YARN or Apache Mesos is outside the scope of this guide.
- Create an Amazon EC2 key pair for yourself. This can be done by logging into your Amazon Web Services account through the AWS console, clicking Key Pairs on the left sidebar, and creating and downloading a key (.pem). Make sure that you set the permissions for the private key file to 600 (i.e. only you can read and write it) so that ssh will work.
- Whenever you want to use the spark-ec2 script, set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY to your Amazon EC2 access key ID and secret access key. These can be obtained from the AWS homepage by clicking Account > Security Credentials > Access Credentials.
- I'll be running this tutorial on OS X; if you're using Windows you'll need to install and use something like the Cygwin terminal.
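The key pair permission step above can be done like this. Here `awskey.pem` is an assumed example filename; substitute the name of the key you actually downloaded:

```shell
# awskey.pem stands in for the key you downloaded from the AWS console.
touch awskey.pem          # stand-in for the downloaded key; skip this if you have the real file
chmod 600 awskey.pem      # owner read/write only, so ssh will accept the key
ls -l awskey.pem          # permissions should now read -rw-------
```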
A note on costs
Due to the memory requirements of Spark, the EC2 instances you'll need to use for the cluster are not part of the free tier, so if you follow along with this tutorial you will incur a small cost. The spark-ec2 script defaults to using m1.large instances, which come with 2 CPU cores, 7.5GB of RAM and 2x420GB disks. If you run a 4-node cluster for 4 hours using on-demand pricing, this tutorial will cost you around $2.80 in the east coast regions; this can be reduced to under $1 by using spot pricing.
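As a sanity check on those figures, the arithmetic works out as follows. The hourly rate is an assumption based on the historical m1.large on-demand price in us-east-1; check the current EC2 pricing page before relying on it:

```python
# Rough cost estimate for a 4-node cluster running for 4 hours.
# rate_per_hour is an assumed figure (historical m1.large on-demand, us-east-1).
nodes = 4
hours = 4
rate_per_hour = 0.175  # USD per instance-hour (assumed)

on_demand_cost = nodes * hours * rate_per_hour
print(f"On-demand estimate: ${on_demand_cost:.2f}")  # $2.80
```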
Whatever you do, make sure you shutdown or terminate your cluster after you’ve finished or you’ll find yourself with a nasty bill at the end of the month!
Launching a Cluster
Once you've downloaded and unpacked Spark you'll find the ec2 directory in the Spark directory. Change directory into the ec2 directory.
If you haven’t already you’ll need to set the AWS environment variables.
export AWS_ACCESS_KEY_ID=<your access key id>
export AWS_SECRET_ACCESS_KEY=<your secret access key>
You can then go ahead and run the spark-ec2 command to launch a cluster.
./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> launch <cluster-name>
Before we do that let’s take a look at an example command and some of the more common arguments.
./spark-ec2 --key-pair=awskey --identity-file=awskey.pem --region=us-east-1 --instance-type=c3.4xlarge -s 2 --copy-aws-credentials launch test-cluster
- --key-pair=<key pair> identifies which key pair to use.
- --identity-file=<pem file> is the SSH private key file to use for logging into instances. Make sure you've chmod 600 the .pem private key file before you run the spark-ec2 script or you'll get a permission error.
- --region=<ec2-region> specifies an EC2 region in which to launch instances. The default region is us-east-1.
- --instance-type=<instance-type> can be used to specify an EC2 instance type to use. For now, the script only supports 64-bit instance types, and the default type is m1.large (which has 2 cores and 7.5 GB RAM). Refer to the Amazon pages about EC2 instance types and EC2 pricing for information about other instance types.
- -s or --slaves=<slaves> sets the number of slaves to launch. The script will launch <slaves>+1 instances, including the master node.
- --copy-aws-credentials copies your AWS credentials to the cluster, allowing access to other AWS services such as S3.
- "test-cluster" is the name of the cluster in this example.
Some of the other common options are listed below:
- --zone=<ec2-zone> can be used to specify an EC2 availability zone to launch instances in. Sometimes you will get an error because there is not enough capacity in one zone, and you should try launching in another.
- --ebs-vol-size=<GB> will attach an EBS volume with the given amount of space to each node so that you can have a persistent HDFS cluster on your nodes across cluster restarts.
- --spot-price=<price> will launch the worker nodes as Spot Instances, bidding for the given maximum price (in dollars).
- --spark-version=<version> will pre-load the cluster with the specified version of Spark. The <version> can be a version number (e.g. "0.7.3") or a specific git hash. By default, a recent version will be used.
- --spark-git-repo=<repository url> will let you run a custom version of Spark that is built from the given git repository. By default, the Apache GitHub mirror will be used. When using a custom Spark version, --spark-version must be set to a git commit hash, such as 317e114, instead of a version number.
If one of your launches fails due to e.g. not having the right permissions on your private key file, you can run launch with the --resume option to restart the setup process on an existing cluster.
Remember you can always run spark-ec2 --help to get a list of all the available options.
Amazon EC2 Spot Instances allow you to bid on spare Amazon EC2 computing capacity. Since Spot Instances are often available at a discount compared to on-demand pricing, you can significantly reduce the cost of running your Spark cluster. To use Spot Instances you'll need to pass the --spot-price argument to the spark-ec2 script. Note that since Spot Instances can be reclaimed by Amazon at any time, the script only uses them for the worker nodes.
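For example, adding a maximum bid to the earlier launch command might look like this. The $0.10 bid is purely illustrative; pick a price based on the spot price history for your instance type and region:

```shell
./spark-ec2 --key-pair=awskey --identity-file=awskey.pem --region=us-east-1 -s 2 --spot-price=0.10 --copy-aws-credentials launch test-cluster
```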
The cluster installation takes about 7-10 minutes. When it is done, the host address of the master node is displayed at the end of the log output. At this point your Spark cluster has been installed successfully and you are ready to start exploring and analyzing your data.
If you get an ERROR:boto:403 Forbidden at this point, it's probably because you haven't attached the AmazonEC2FullAccess policy to the user account whose access key you're using.
You might also see some SSH connection errors; the script will keep retrying until the instances are SSH-ready, so don't worry if you see this message a couple of times.
Once the install is completed you can get the master hostname by running the following command:
./spark-ec2 -i <key_file> -k <key_pair> get-master test-cluster
Virtual Private Cloud
If you already have other AWS services running you are probably making use of a Virtual Private Cloud (VPC), which lets you provision a logically isolated section of the Amazon Web Services (AWS) Cloud where you can launch AWS resources in a virtual network that you define. If you want to launch your Spark cluster in the same VPC as your other services, run ./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> --vpc-id=<vpc-id> --subnet-id=<subnet-id> launch <cluster-name>, where <keypair> is the name of your EC2 key pair (that you gave it when you created it), <key-file> is the private key file for your key pair, <num-slaves> is the number of slave nodes to launch (try 1 at first), <vpc-id> is the ID of your VPC, <subnet-id> is the ID of your subnet, and <cluster-name> is the name to give to your cluster.
Running Applications
Go into the ec2 directory in the release of Spark you downloaded and run ./spark-ec2 -k <keypair> -i <key-file> login <cluster-name> to SSH into the cluster, where <keypair> and <key-file> are as above. (This is just for convenience; you could also use the EC2 console.) For example:
./spark-ec2 --key-pair=awskey --identity-file=awskey.pem login test-cluster
To deploy code or data within your cluster, you can log in and use the provided script ~/spark-ec2/copy-dir, which, given a directory path, rsyncs it to the same location on all the slaves.
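For instance, assuming you've uploaded your application to /root/myapp on the master (the path is hypothetical), you could replicate it to every slave with:

```shell
~/spark-ec2/copy-dir /root/myapp
```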
If your application needs to access large datasets, the fastest way to do that is to load them from Amazon S3 or an Amazon EBS device into an instance of the Hadoop Distributed File System (HDFS) on your nodes. The spark-ec2 script already sets up an HDFS instance for you. It's installed in /root/ephemeral-hdfs, and can be accessed using the bin/hadoop script in that directory. Note that the data in this HDFS goes away when you stop and restart a machine.
There is also a persistent HDFS instance in /root/persistent-hdfs that will keep data across cluster restarts. Typically each node has relatively little space for persistent data (about 3 GB), but you can use the --ebs-vol-size option to spark-ec2 to attach a persistent EBS volume to each node for storing the persistent HDFS.
Finally, if you get errors while running your application, look at the slaves' logs for that application inside the scheduler work directory (/root/spark/work). You can also view the status of the cluster using the web UI: http://<master-hostname>:8080.
Accessing data stored on Amazon S3
If you need access to data stored on S3 you need to remember to use the --copy-aws-credentials option of the spark-ec2 script. You can then access the data on your S3 bucket from your Spark applications:
x = sc.textFile("s3n://<yourbucketname>/*")
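A slightly fuller sketch, assuming a running cluster and a hypothetical bucket name, might look like this in the PySpark shell (sc is the SparkContext the shell provides; the bucket and paths are illustrative only):

```python
# Hypothetical bucket and key names, shown for illustration only.
lines = sc.textFile("s3n://my-example-bucket/logs/*")
print(lines.count())                      # number of lines across all matched objects
errors = lines.filter(lambda l: "ERROR" in l)
errors.saveAsTextFile("s3n://my-example-bucket/output/errors")
```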
Terminating your Spark cluster
Note that there is no way to recover data on EC2 nodes after shutting them down! Make sure you have copied everything important off the nodes before stopping them.
Go into the ec2 directory in the release of Spark you downloaded.
Run ./spark-ec2 --region=<ec2-region> destroy <cluster-name>. The script will ask for confirmation before it deletes anything!
The spark-ec2 script also supports pausing a cluster. In this case, the VMs are stopped but not terminated, so they lose all data on ephemeral disks but keep the data in their root partitions and their persistent-hdfs. Stopped machines will not cost you any EC2 cycles, but will continue to cost money for EBS storage.
To stop one of your clusters, go into the ec2 directory and run:
./spark-ec2 --region=<ec2-region> stop <cluster-name>
To restart it later, run:
./spark-ec2 -i <key-file> --region=<ec2-region> start <cluster-name>
To ultimately destroy the cluster and stop consuming EBS space, run ./spark-ec2 --region=<ec2-region> destroy <cluster-name> as described in the previous section.