Unit 3
Discussion 1
MapReduce was originally developed to make cost-efficient use of large clusters of commodity computers for scalable and reliable data processing. It applies two simple but powerful functions, Map and Reduce, in parallel across the cluster. Together with Hadoop, its open-source implementation, MapReduce has become one of the most popular and practical solutions for big data analytics tasks. However, like any technical solution, the original MapReduce and Hadoop have several weaknesses when applied to certain types of data processing applications. You therefore need to study the basic concepts of MapReduce and its Hadoop implementation thoroughly and understand their pros and cons, so that you can make the right decisions and achieve the desired results when applying them to big data analytics tasks.
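As a starting point for thinking about the Map and Reduce functions mentioned above, here is a minimal single-machine sketch of a word count in Python. It is not Hadoop code: the names `map_words`, `shuffle`, `reduce_counts`, and `word_count` are hypothetical, and the `shuffle` step only imitates the grouping by key that a real framework performs between the two phases.

```python
from collections import defaultdict
from itertools import chain


def map_words(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group intermediate values by key (done by the framework in practice)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word, counts):
    """Reduce: combine all values that were emitted for one key."""
    return word, sum(counts)


def word_count(lines):
    # Each line could be mapped on a different machine; here it runs serially.
    mapped = chain.from_iterable(map_words(line) for line in lines)
    return dict(reduce_counts(word, counts)
                for word, counts in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big clusters"]))
```

Because each call to `map_words` and each call to `reduce_counts` depends only on its own input, a framework is free to run them on different machines, which is where the parallelism comes from.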
Complete the reading assignment, and search the Library and Internet to find and study more references that discuss the concepts and applications of MapReduce and Hadoop as needed. Based on the results of your research, discuss the following questions:
· What are the basic concepts of MapReduce?
· What are the top 3 features of Hadoop?
· What are the pros and cons of MapReduce?
Discussion 2
Many data analytics tasks in commonly used Web applications, such as page ranking and social network analysis, are processed iteratively until the computation meets a given termination condition. However, the original MapReduce framework does not support iterative computation directly. Iterative tasks have to be implemented manually in separate software that chains multiple MapReduce jobs to emulate the iteration process. Data that did not change in the previous iteration is still reloaded and reprocessed in the next one. This wastes computing resources, because most of the data is unchanged between iterations and does not need to be reloaded and reprocessed in subsequent iterations. Another problem with the manual approach is that the termination condition has to be checked at each iteration. The check requires an extra MapReduce job, which adds scheduling overhead, I/O, and network traffic. A better solution is required to address these performance penalties.
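The manual approach described above can be sketched as a driver loop that submits one job per iteration. The Python below is only an illustration under stated assumptions: `run_job` is a hypothetical in-memory stand-in for submitting a MapReduce job, not a Hadoop API, and the example task (propagating the smallest node label through an undirected graph in which every node has its own adjacency entry) was chosen only because it is short.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Hypothetical in-memory stand-in for submitting one MapReduce job."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, values) for key, values in groups.items()]


def propagate_map(record):
    node, (label, neighbors) = record
    # The adjacency list never changes, yet it is re-emitted every iteration
    # so that the next job can read it again.
    yield node, ("edges", neighbors)
    yield node, ("label", label)
    for neighbor in neighbors:
        yield neighbor, ("label", label)


def propagate_reduce(node, values):
    neighbors = next(v for tag, v in values if tag == "edges")
    label = min(v for tag, v in values if tag == "label")
    return node, (label, neighbors)


def connected_components(graph, max_iterations=50):
    """Driver: chain jobs until the labels stop changing."""
    state = [(node, (node, tuple(nbrs))) for node, nbrs in graph.items()]
    iteration = 0
    for iteration in range(1, max_iterations + 1):
        new_state = run_job(state, propagate_map, propagate_reduce)
        # Termination check: on a cluster this comparison is an extra job.
        changed = dict(new_state) != dict(state)
        state = new_state
        if not changed:
            break
    return {node: label for node, (label, _) in state}, iteration


if __name__ == "__main__":
    graph = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
    print(connected_components(graph))
```

In this sketch every iteration passes the full adjacency lists through the map and reduce steps again, and the final iteration exists only to confirm that nothing changed. Both costs correspond to the penalties the paragraph above describes.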
Complete the reading assignment, and search the Library and Internet to find and study more references that discuss how to address the weaknesses of MapReduce and Hadoop in iterative computation. Based on the results of your research, discuss the following questions:
· What are the weaknesses of the initial MapReduce framework in iterative computation?
· What are the root causes of these weaknesses?
· What are the key technical steps to address these weaknesses?