Amazon just reminded me that my AWS free tier expires tomorrow. I’ve been meaning to write about my EMR experiments for some time. I worked on this a couple of months back when I got a chance to experiment with Hadoop; we used Twitter feeds at that time. My objective was to run the same analysis on a large log file from one of our products. In this post I’m going to show a very basic way of using EMR: storing the data in S3 and driving the EMR job with a handful of scripts.
As usual, I’m going to build the whole experiment over a number of steps. Much like programming, I believe it is easier to validate your approach in smaller steps: it is always easier to test a program as you build it than to figure out how it works after a few hundred thousand lines have been written.
Step 1: Upload your data and scripts
I’m going to use Amazon S3 as the storage for this example. There are other options, but I think S3 is a good choice for up to a few gigabytes of data. As you can see, the bucket pigdatbucket holds all the input and output data folders for this example.
The objective of this exercise is to run a sentiment analysis on a number of tweets from various states in the USA. The result will be placed in the output folder once the EMR job completes.
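If you prefer to script the upload rather than use the S3 console, a minimal sketch with boto3 looks like the following. The bucket name pigdatbucket and folder layout match the setup above; the local file names (tweets.txt, sentiment.pig) are placeholders for your own data and script.

    import boto3

    s3 = boto3.client("s3")
    bucket = "pigdatbucket"

    # Input data goes under input/, the Pig script under scripts/.
    # File names are placeholders; use your own tweet data and Pig script.
    s3.upload_file("tweets.txt", bucket, "input/tweets.txt")
    s3.upload_file("sentiment.pig", bucket, "scripts/sentiment.pig")

    # EMR will write results under output/ and its logs under Logs/.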
Step 2: Create EMR cluster
In this step, we create the EMR cluster. To start with, I leave logging on and use the S3 folder Logs as the location for the log files; I always find logging helpful for troubleshooting teething problems. I also disabled Termination Protection, as I couldn’t sufficiently debug script issues when it was enabled because the cluster terminates automatically.
Amazon provides Hadoop 1.0.3 or 2.2.0 and Pig 0.11.1.1 (as of this writing). The EMR cluster is launched on EC2 instances, optionally inside a VPC. Select an appropriate instance type based on your subscription level.
As this example needs only the basic Hadoop configuration, that is what I selected under Bootstrap Actions. The core of the setup is in the next step, where you add the Pig script you uploaded as the starting step.
Notice the S3 locations in the above image. Select the files from the appropriate S3 folders.
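The same cluster setup can also be done in code instead of through the console. Below is a minimal sketch using boto3’s run_job_flow. It assumes a recent EMR release label (which bundles Pig) rather than the exact Hadoop/Pig versions mentioned above, and the script path, instance types, step arguments and IAM role names are illustrative assumptions to adapt to your own account.

    import boto3

    emr = boto3.client("emr")

    response = emr.run_job_flow(
        Name="pig-sentiment-demo",
        LogUri="s3://pigdatbucket/Logs/",           # logging kept on, as described above
        ReleaseLabel="emr-5.36.0",                  # assumption: a recent release that bundles Pig
        Applications=[{"Name": "Pig"}],
        Instances={
            "MasterInstanceType": "m4.large",       # pick instance types to suit your subscription
            "SlaveInstanceType": "m4.large",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": False,   # let the cluster shut down after the step
            "TerminationProtected": False,          # Termination Protection disabled, as above
        },
        Steps=[{
            "Name": "Run sentiment Pig script",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                # Assumption: run the uploaded script directly; exact arguments vary by EMR release.
                "Args": ["pig", "-f", "s3://pigdatbucket/scripts/sentiment.pig"],
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",          # default EMR roles; create them first if needed
        ServiceRole="EMR_DefaultRole",
    )
    print(response["JobFlowId"])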
You will be able to monitor the cluster from your Cluster List once it is created. Select one of the clusters to view its status and other configuration details.
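The same status information shown in the Cluster List can be polled programmatically as well; a small sketch, assuming the cluster id is the one returned by the run_job_flow call above:

    import boto3

    emr = boto3.client("emr")
    cluster_id = "j-XXXXXXXXXXXXX"  # placeholder; use the id returned when the cluster was created

    # Overall cluster state (STARTING, RUNNING, WAITING, TERMINATED, ...).
    cluster = emr.describe_cluster(ClusterId=cluster_id)
    print(cluster["Cluster"]["Status"]["State"])

    # Per-step status, e.g. for the Pig script step submitted above.
    for step in emr.list_steps(ClusterId=cluster_id)["Steps"]:
        print(step["Name"], step["Status"]["State"])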
This example uses just a basic Pig script, which I modified for the Pig 0.11.1.1 that Amazon provides. You may need to call an external program from your Pig script to work on the data; Amazon provides a way to upload an additional JAR for this purpose.
Preparation
I would recommend testing your Pig script locally on test data before uploading it to EMR; EMR takes a while to start up and produce output, and the cycle repeats if there are any errors. I used the Hortonworks Hadoop VM for testing my data and scripts. Hortonworks provides the entire Hadoop stack as a preconfigured sandbox which is very easy to use, and it also includes Apache Ambari for complete system monitoring. They have a number of easy-to-follow tutorials to help anyone get started quickly with Hadoop, Pig and Hive.
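For that local testing, Pig’s local mode (pig -x local) is handy since it runs entirely against the local filesystem. A small sketch that shells out to a locally installed pig binary is below; the file names are placeholders, and on the Hortonworks sandbox you could simply run the same command in a shell.

    import subprocess

    # Run the script in Pig local mode against a small sample before uploading to S3/EMR.
    # File names are placeholders; -x local keeps input and output on the local filesystem.
    subprocess.run(["pig", "-x", "local", "-f", "sentiment.pig"], check=True)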
The initial data and scripts for this example came from Manaranjan.