Tuesday, November 25, 2008

Improving MapReduce Performance in Heterogeneous Environments

This paper addresses task scheduling for MapReduce on clusters of machines. MapReduce is a programming model that splits a job into many smaller map and reduce tasks so that it can scale to thousands of tasks executing simultaneously. A popular open source implementation, Hadoop, an Apache project developed largely at Yahoo, is commonly used to run MapReduce jobs on clusters. However, the Hadoop scheduler makes several inherent assumptions that cause it to perform poorly in certain environments, chief among them that the nodes in the cluster are roughly similar. As a result, when task progress is monitored to decide which tasks to speculatively execute on idle nodes, progress is estimated poorly in a heterogeneous cluster, and the resulting speculative copies waste node computing power.
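To make the estimation problem concrete, here is a minimal sketch of a progress-only straggler rule. It is an illustration rather than Hadoop's actual code: the function name, the task names, the progress values, and the 0.2 threshold are all assumptions chosen for the example.

```python
def naive_stragglers(progress, threshold=0.2):
    """Flag tasks whose progress score trails the mean by more than `threshold`.

    `progress` maps a (hypothetical) task name to a score between 0 and 1.
    The rule looks only at how far along a task is, not at how fast it moves.
    """
    mean = sum(progress.values()) / len(progress)
    return [task for task, score in progress.items() if score < mean - threshold]


# Hypothetical snapshot of a job running on a mixed cluster.
progress = {
    "late-start-fast": 0.3,  # started 10 s ago on a fast node
    "slow-node": 0.7,        # has been crawling for 700 s on a slow node
    "normal-1": 0.8,
    "normal-2": 0.8,
}

flagged = naive_stragglers(progress)
print(flagged)
```

With these numbers the rule flags only the task that started late on a fast node, which would finish soon anyway, and ignores the task stuck on the slow node. A duplicate launched for the flagged task is wasted work.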

This paper proposes a new scheduler for MapReduce, called LATE (Longest Approximate Time to End). The LATE scheduler looks for the slow tasks that will affect the job's response time, and speculatively re-executes only the tasks estimated to be farthest from finishing, running the duplicate copy on a faster node. This decreases the overall response time of the MapReduce job, and in a heterogeneous environment it gives better estimates and makes better use of the nodes. The paper reports that performance can be up to 2 times faster than with the original scheduler.
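The idea of picking the task farthest from finishing can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: it assumes each task reports a progress score between 0 and 1 and its elapsed running time, estimates the progress rate as progress divided by elapsed time, and leaves out the extra limits a real scheduler would need. The function and task names are hypothetical.

```python
def estimated_time_left(progress, elapsed_seconds):
    """Estimate remaining seconds, assuming the task keeps its average rate."""
    rate = progress / elapsed_seconds  # progress made per second so far
    return (1.0 - progress) / rate


def pick_speculative_task(tasks):
    """Return the running task with the longest estimated time to finish.

    `tasks` maps a (hypothetical) task name to (progress, elapsed_seconds).
    Tasks with no progress yet, or already complete, are skipped.
    """
    estimates = {
        name: estimated_time_left(progress, elapsed)
        for name, (progress, elapsed) in tasks.items()
        if 0.0 < progress < 1.0
    }
    if not estimates:
        return None
    return max(estimates, key=estimates.get)


# Hypothetical snapshot: (progress score, seconds running).
tasks = {
    "late-start-fast": (0.3, 10),  # about 23 s left
    "slow-node": (0.7, 700),       # about 300 s left
    "normal": (0.8, 400),          # about 100 s left
}

choice = pick_speculative_task(tasks)
print(choice)
```

Ranking by estimated time left, rather than by raw progress, selects the task on the slow node even though its progress score is higher than that of the task that started recently.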
