Amazon just reminded me that my AWS free tier expires tomorrow. I’ve been meaning to write about my EMR experiments for some time. I worked on this a couple of months back when I got a chance to experiment with Hadoop; we used Twitter feeds at that time. My objective was to run the same analysis on a large log file from one of our products. In this post I’m going to show a very basic way of using EMR: storing the data in S3 and driving the EMR job with a handful of scripts.
As usual, I’m going to build the whole experiment over a number of steps. Much like programming, I believe it is easier to validate your approach in smaller steps: it is always easier to test a program as you build it than to figure out how it works after a few hundred thousand lines have been written.
Step 1: Upload your data and scripts
I’m going to use Amazon S3 as the storage for this example. There are other options, but I think S3 is a good choice for up to a few gigabytes of data. As you can see, the bucket pigdatbucket holds all the input and output data folders for this example.
The objective of this exercise is to run a sentiment analysis on a number of tweets from various states in the USA. The result will be placed in the output folder once the EMR job completes.
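If you prefer to script the upload rather than use the S3 console, a minimal sketch with boto3 looks like the following. The bucket name pigdatbucket and folder layout match the setup above; the local file names (tweets.txt, sentiment.pig) are placeholders for your own data and script.

    import boto3

    s3 = boto3.client("s3")
    bucket = "pigdatbucket"

    # Input data goes under input/, the Pig script under scripts/.
    # File names are placeholders; use your own tweet data and Pig script.
    s3.upload_file("tweets.txt", bucket, "input/tweets.txt")
    s3.upload_file("sentiment.pig", bucket, "scripts/sentiment.pig")

    # EMR will write results under output/ and its logs under Logs/.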
Step 2: Create EMR cluster
In this step, we create the EMR cluster. To start with, I leave logging on and use the S3 folder Logs as the location for the log files; I always find logging helpful for troubleshooting teething problems. I also disabled Termination Protection, as I couldn’t sufficiently debug script issues when it was enabled because the cluster terminates automatically.
Amazon provides Hadoop 1.0.3 or 2.2.0 and Pig 0.11.1.1 (as of this writing). The EMR cluster is launched on EC2 instances, optionally inside a VPC. Select an appropriate instance type based on your subscription level.
As this example needs only the basic Hadoop configuration, that is what I selected under Bootstrap Actions. The core of the setup is in the next step, where you add the Pig script you uploaded as the starting step.
Notice the S3 locations in the above image. Select the files from the appropriate S3 folders.
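The same cluster setup can also be done in code instead of through the console. Below is a minimal sketch using boto3’s run_job_flow. It assumes a recent EMR release label (which bundles Pig) rather than the exact Hadoop/Pig versions mentioned above, and the script path, instance types, step arguments and IAM role names are illustrative assumptions to adapt to your own account.

    import boto3

    emr = boto3.client("emr")

    response = emr.run_job_flow(
        Name="pig-sentiment-demo",
        LogUri="s3://pigdatbucket/Logs/",           # logging kept on, as described above
        ReleaseLabel="emr-5.36.0",                  # assumption: a recent release that bundles Pig
        Applications=[{"Name": "Pig"}],
        Instances={
            "MasterInstanceType": "m4.large",       # pick instance types to suit your subscription
            "SlaveInstanceType": "m4.large",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": False,   # let the cluster shut down after the step
            "TerminationProtected": False,          # Termination Protection disabled, as above
        },
        Steps=[{
            "Name": "Run sentiment Pig script",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                # Assumption: run the uploaded script directly; exact arguments vary by EMR release.
                "Args": ["pig", "-f", "s3://pigdatbucket/scripts/sentiment.pig"],
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",          # default EMR roles; create them first if needed
        ServiceRole="EMR_DefaultRole",
    )
    print(response["JobFlowId"])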
You will be able to monitor the cluster from your Cluster List once it is created. Select one of the clusters to view its status and other configuration details.
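The same status information shown in the Cluster List can be polled programmatically as well; a small sketch, assuming the cluster id is the one returned by the run_job_flow call above:

    import boto3

    emr = boto3.client("emr")
    cluster_id = "j-XXXXXXXXXXXXX"  # placeholder; use the id returned when the cluster was created

    # Overall cluster state (STARTING, RUNNING, WAITING, TERMINATED, ...).
    cluster = emr.describe_cluster(ClusterId=cluster_id)
    print(cluster["Cluster"]["Status"]["State"])

    # Per-step status, e.g. for the Pig script step submitted above.
    for step in emr.list_steps(ClusterId=cluster_id)["Steps"]:
        print(step["Name"], step["Status"]["State"])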
This example uses just a basic Pig script, which I modified for the Pig 0.11.1.1 that Amazon provides. You may need to call an external program from your Pig script to work on the data; Amazon provides a way to upload an additional JAR for this purpose.
Preparation
I would recommend testing your Pig script locally on test data before uploading it to EMR; EMR takes a while to start up and produce output, and the cycle repeats if there are any errors. I used the Hortonworks Hadoop VM for testing my data and scripts. Hortonworks provides the entire Hadoop stack as a preconfigured sandbox which is very easy to use, and it also includes Apache Ambari for complete system monitoring. They have a number of easy-to-follow tutorials to help anyone get started quickly with Hadoop, Pig and Hive.
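For that local testing, Pig’s local mode (pig -x local) is handy since it runs entirely against the local filesystem. A small sketch that shells out to a locally installed pig binary is below; the file names are placeholders, and on the Hortonworks sandbox you could simply run the same command in a shell.

    import subprocess

    # Run the script in Pig local mode against a small sample before uploading to S3/EMR.
    # File names are placeholders; -x local keeps input and output on the local filesystem.
    subprocess.run(["pig", "-x", "local", "-f", "sentiment.pig"], check=True)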
The initial data and scripts for this example came from Manaranjan.