Unit 4 – 1
Discussion
Most data analytic tasks in commonly used large-scale Web data processing applications, such as Web crawling and Web page indexing, are not iterative but incremental: the application usually runs once over the current data, as needed, and is run again later when the data have grown. The data involved in this kind of incremental computation share a common characteristic: most of the data do not change between two successive runs. This is a clear opportunity to improve data processing performance for MapReduce and its Hadoop implementation, because neither was designed with this characteristic in mind. For example, if 99% of a large-scale data set is unchanged, and there is a method that allows a MapReduce-based Web application to reuse the results already computed on that unchanged data in the next run instead of reprocessing it, processing performance on that data set will increase greatly. It is therefore important to acquire the knowledge and skills needed to achieve this with the existing MapReduce framework and Hadoop.
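To make the idea concrete, the following is a minimal, self-contained Java sketch of the memoization principle that incremental MapReduce-style approaches typically build on: map outputs for unchanged input partitions are cached under a content fingerprint and reused, so only changed partitions are re-mapped before the reduce step. This sketch is deliberately not tied to the Hadoop API, and all names in it (IncrementalWordCountSketch, partitionCache, runIncremental) are illustrative assumptions, not Hadoop classes; in a real deployment the cache would have to be persisted between runs (for example in HDFS) and the fingerprint derived from input block checksums.

```java
import java.util.*;

/** Illustrative sketch: partition-level memoization for a map-style computation. */
public class IncrementalWordCountSketch {

    // Cache of per-partition map outputs, keyed by a fingerprint of the partition's content.
    // A real system would persist this between runs; here it only lives for one JVM session.
    static final Map<Integer, Map<String, Integer>> partitionCache = new HashMap<>();

    // "Map" phase for one partition: local word counts.
    static Map<String, Integer> mapPartition(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines)
            for (String word : line.toLowerCase().split("\\s+"))
                if (!word.isEmpty())
                    counts.merge(word, 1, Integer::sum);
        return counts;
    }

    // "Reduce" phase: merge per-partition counts into one global result.
    static Map<String, Integer> reduce(Collection<Map<String, Integer>> partials) {
        Map<String, Integer> total = new HashMap<>();
        for (Map<String, Integer> p : partials)
            p.forEach((w, c) -> total.merge(w, c, Integer::sum));
        return total;
    }

    // Run the job, reusing cached map outputs for partitions whose content is unchanged.
    static Map<String, Integer> runIncremental(List<List<String>> partitions) {
        List<Map<String, Integer>> partials = new ArrayList<>();
        for (List<String> partition : partitions) {
            int fingerprint = partition.hashCode();   // stand-in for a content checksum
            Map<String, Integer> cached = partitionCache.get(fingerprint);
            if (cached == null) {                     // only changed partitions are re-mapped
                cached = mapPartition(partition);
                partitionCache.put(fingerprint, cached);
            }
            partials.add(cached);
        }
        return reduce(partials);
    }

    public static void main(String[] args) {
        List<List<String>> run1 = List.of(
                List.of("the web is large", "the web changes slowly"),
                List.of("crawlers revisit pages"));
        System.out.println("Run 1: " + runIncremental(run1));

        // Second run: only the second partition changed, so only it is re-mapped;
        // the first partition's map output is taken from the cache.
        List<List<String>> run2 = List.of(
                List.of("the web is large", "the web changes slowly"),
                List.of("crawlers revisit pages", "new pages appear"));
        System.out.println("Run 2: " + runIncremental(run2));
    }
}
```

The design choice to memoize at the partition level, rather than per record, mirrors how MapReduce already splits input into blocks, which is one reason the idea can be layered onto the existing framework rather than replacing it.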
Complete the reading assignment, and search the Library and the Internet for additional references that discuss how to implement incremental computation with the existing MapReduce framework and Hadoop. Based on the results of your research, discuss the following questions:
· What are the principles for using the original MapReduce framework and Hadoop to improve the performance of incremental computation?
· How can these principles be designed and implemented without requiring any major changes to the original MapReduce framework and Hadoop?