Saturday, April 18, 2009

Amazon Elastic MapReduce

Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

Using Amazon Elastic MapReduce, you can instantly provision as much or as little capacity as you like to perform data-intensive tasks for applications such as web indexing, data mining, log file analysis, machine learning, financial analysis, scientific simulation, and bioinformatics research. Amazon Elastic MapReduce lets you focus on crunching or analyzing your data without having to worry about time-consuming set-up, management or tuning of Hadoop clusters or the compute capacity upon which they sit.

Amazon Elastic MapReduce Functionality

Amazon Elastic MapReduce automatically spins up a Hadoop implementation of the MapReduce framework on Amazon EC2 instances, sub-dividing the data in a job flow into smaller chunks so that they can be processed (the “map” function) in parallel, and eventually recombining the processed data into the final solution (the “reduce” function). Amazon S3 serves as the source for the data being analyzed, and as the output destination for the end results.
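To make the map and reduce steps concrete, here is a minimal single-machine sketch in Python. It is not how Hadoop or the service is implemented; it only mirrors the flow described above (split into chunks, map each chunk, recombine the results), using word counting as an assumed example task. All function names are illustrative.

```python
from collections import defaultdict
from itertools import chain


def split_into_chunks(records, chunk_size):
    """Sub-divide the input records into smaller chunks."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def map_chunk(chunk):
    """The "map" step: emit a (word, 1) pair for every word in the chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def reduce_pairs(pairs):
    """The "reduce" step: recombine the mapped pairs into per-word totals."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def word_count(records, chunk_size=2):
    chunks = split_into_chunks(records, chunk_size)
    # On a cluster each chunk could be mapped on a different instance;
    # in this sketch the chunks are simply processed one after another.
    mapped = chain.from_iterable(map_chunk(chunk) for chunk in chunks)
    return reduce_pairs(mapped)
```

Calling word_count(["a b", "b c", "c c"]) splits the three lines into two chunks, maps each chunk to (word, 1) pairs, and reduces them to one total per word.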

To use Amazon Elastic MapReduce, you simply:

  • Develop your data processing application authored in your choice of Java, Ruby, Perl, Python, PHP, R, or C++. There are several code samples available in the Getting Started Guide that will help you get up and running quickly.
  • Upload your data and your processing application into Amazon S3. Amazon S3 provides reliable, scalable, easy-to-use storage for your input and output data.
  • Log in to the AWS Management Console to start an Amazon Elastic MapReduce “job flow.” Simply choose the number and type of Amazon EC2 instances you want, specify the location of your data and/or application on Amazon S3, and then click the “Create Job Flow” button. Alternatively, you can start a job flow by specifying the same information via our Command Line Tools or APIs.
  • Monitor the progress of your job flow(s) directly from the AWS Management Console, Command Line Tools or APIs. And, after the job flow is done, retrieve the output from Amazon S3.
  • Pay only for the resources that you actually consume. Amazon Elastic MapReduce monitors your job flow, and unless you specify otherwise, shuts down your Amazon EC2 instances after the job completes.
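As a rough illustration of the first step, the sketch below shows what a small Python data processing application might look like for log file analysis. It assumes the Hadoop Streaming convention, in which the map and reduce steps exchange tab-separated key/value text lines and the reduce step receives its input sorted by key; the text above does not say which mechanism the service uses for Python, so treat that as an assumption. The log format ("<path> <status>") and the function names are hypothetical.

```python
from itertools import groupby


def mapper(lines):
    """Emit "status<TAB>1" for each log line (hypothetical format: "<path> <status>")."""
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:  # skip blank or malformed lines
            yield f"{fields[-1]}\t1"


def reducer(sorted_lines):
    """Sum the counts for each key; the input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Simulate map -> sort -> reduce on one machine before uploading anything."""
    return list(reducer(sorted(mapper(lines))))
```

Running run_locally on a handful of sample lines is a cheap way to check the logic before starting a job flow. In a real streaming job, the two functions would typically live in separate scripts that read standard input and print to standard output.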

Service Highlights

Elastic – Amazon Elastic MapReduce enables you to use as many or as few compute instances running Hadoop as you want. You can commission one, hundreds, or even thousands of instances to process gigabytes, terabytes, or even petabytes of data. And, you can run as many job flows concurrently as you wish. You can instantly spin up large Hadoop job flows which will start processing within minutes, not hours or days. When your job flow completes, unless you specify otherwise, the service automatically tears down your instances.

Easy to use – You don’t need to worry about setting up, running, or tuning the performance of Hadoop clusters; instead, you can concentrate on data analysis. We provide easy-to-use tools and sample data processing applications that let you get up and running without writing a single line of code. Once you start a job flow, Amazon Elastic MapReduce handles Amazon EC2 instance provisioning, security settings, Hadoop configuration and set-up, log collection, health monitoring, and other hardware-related complexities such as automatically removing faulty instances from your running job flow.

Reliable – Amazon Elastic MapReduce is built on Amazon’s highly reliable infrastructure, and has tuned Hadoop’s performance specifically for Amazon’s infrastructure environment. The service also monitors your job flow execution—retrying failed tasks and shutting down problematic instances.

Seamlessly integrated with other AWS services – Amazon Elastic MapReduce is designed to integrate easily with other AWS services such as Amazon S3 and EC2, providing the infrastructure for data processing applications. The service runs job flows in Amazon EC2 and stores input and output data in Amazon S3.

Secure – Amazon Elastic MapReduce automatically configures Amazon EC2 firewall settings that control network access to and between instances that run your job flows.

Inexpensive – Amazon Elastic MapReduce passes on to you the financial benefits of Amazon’s scale. You pay a very low rate for the resources you actually consume. Compare this with the significant up-front expenditures traditionally required to purchase and maintain hardware and to set up MapReduce clusters. This frees you from many of the complexities of capacity planning, transforms large capital expenditures into much smaller operating costs, and eliminates the need to over-buy capacity that is infrequently used. Amazon Elastic MapReduce is optimized to save you money by monitoring progress of your job flows and turning off resources when a job flow is completed.

Instance Types

To use Amazon Elastic MapReduce, you first select the type and quantity of Amazon EC2 instances you want. Amazon Elastic MapReduce works with any Amazon EC2 Linux/Unix instance type running in the US Region of Amazon EC2. It supports both On-Demand and Reserved Instances; if you have Reserved Instances, they will be used first by your job flows.
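The "Reserved Instances first" rule can be pictured with a small piece of arithmetic. The function below is only a sketch of that rule as stated above, not of how the service actually allocates capacity; its name and parameters are hypothetical.

```python
def split_capacity(requested, reserved_available):
    """Return (reserved_used, on_demand_used) for a job flow.

    Reserved Instances are consumed first; any remainder is On-Demand.
    """
    if requested < 0 or reserved_available < 0:
        raise ValueError("instance counts must be non-negative")
    reserved_used = min(requested, reserved_available)
    return reserved_used, requested - reserved_used
```

For example, a job flow asking for 10 instances when 4 Reserved Instances are available would use those 4 and run the remaining 6 as On-Demand.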

Standard Amazon EC2 Instances

Instances of this family are well suited for most applications.

  • Small Instance (Default) 1.7 GB of memory, 1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute Unit), 160 GB of instance storage, 32-bit platform
  • Large Instance 7.5 GB of memory, 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each), 850 GB of instance storage, 64-bit platform
  • Extra Large Instance 15 GB of memory, 8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each), 1690 GB of instance storage, 64-bit platform

High-CPU Amazon EC2 Instances

Instances of this family have proportionally more CPU resources than memory (RAM) and are well suited for compute-intensive applications.

  • High-CPU Medium Instance 1.7 GB of memory, 5 EC2 Compute Units (2 virtual cores with 2.5 EC2 Compute Units each), 350 GB of instance storage, 32-bit platform
  • High-CPU Extra Large Instance 7 GB of memory, 20 EC2 Compute Units (8 virtual cores with 2.5 EC2 Compute Units each), 1690 GB of instance storage, 64-bit platform

EC2 Compute Unit (ECU) – One EC2 Compute Unit (ECU) provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.
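When choosing between instance types, it can help to total up what a given cluster size provides. The sketch below encodes the figures listed above in a small Python table and computes aggregate memory, ECUs, and instance storage. The dictionary keys and function names are illustrative labels, not official identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    memory_gb: float
    ecu: float
    storage_gb: int
    bits: int


# Figures copied from the lists above; the keys are illustrative labels only.
INSTANCE_TYPES = {
    "small": InstanceType(1.7, 1, 160, 32),
    "large": InstanceType(7.5, 4, 850, 64),
    "extra_large": InstanceType(15, 8, 1690, 64),
    "high_cpu_medium": InstanceType(1.7, 5, 350, 32),
    "high_cpu_extra_large": InstanceType(7, 20, 1690, 64),
}


def cluster_totals(kind, count):
    """Aggregate memory, ECUs, and instance storage for `count` instances."""
    t = INSTANCE_TYPES[kind]
    return {
        "memory_gb": t.memory_gb * count,
        "ecu": t.ecu * count,
        "storage_gb": t.storage_gb * count,
    }


def ecu_per_gb(kind):
    """CPU capacity relative to memory; higher means more compute-heavy."""
    t = INSTANCE_TYPES[kind]
    return t.ecu / t.memory_gb
```

Ten Large instances, for instance, add up to 75 GB of memory, 40 ECUs, and 8500 GB of instance storage. The ecu_per_gb ratio shows why the High-CPU family suits compute-intensive work: a High-CPU Medium instance has five times the ECUs of a Small instance for the same 1.7 GB of memory.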

1 comment:

Mikayel said...

In addition to all of this, you will need a reliable monitoring service in order to be sure that your instance is up and running and not consuming too many resources.