MapReduce has emerged as a model of choice for supporting modern data-intensive applications and is a key enabler for cloud computing. Setting up and operating a large MapReduce cluster entails careful evaluation of various design choices and run-time parameters to achieve high efficiency, yet this design space has not been explored in detail. In this talk, I will discuss a simulation approach to systematically understanding the performance of MapReduce setups. I will present MRPerf, a toolkit that captures such aspects of MapReduce setups as node, rack, and network configurations; disk parameters and performance; data layout; and application I/O characteristics, and uses this information to predict expected application performance. I will also discuss the challenges of obtaining realistic traces to drive our simulations, and present the tips and tricks we have used. The overall goal is a tool for optimizing existing MapReduce setups as well as designing new ones.
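For readers unfamiliar with the model, the MapReduce paradigm that MRPerf simulates can be sketched in a few lines. The following is an illustrative word-count example of the map/shuffle/reduce structure, not code from MRPerf or Hadoop:

```python
from collections import defaultdict

def map_phase(documents):
    """Emit (word, 1) pairs from each input document."""
    for doc in documents:
        for word in doc.split():
            yield word, 1

def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the cat sat", "the cat ran"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["the"])  # 2
```

In a real deployment, the map and reduce tasks run in parallel across the cluster, and the shuffle moves data over the network; it is precisely these distributed costs (disk, network, data placement) that a simulator like MRPerf models.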