MapReduce is the processing layer of Hadoop. The MapReduce programming model is designed to process large volumes of data in parallel by dividing the work into a set of independent tasks. You express your business logic in the form MapReduce expects, and the framework takes care of the rest. The work (the complete job) that the user submits to the master is divided into smaller units (tasks) and assigned to slaves.
MapReduce programs are written in a particular style influenced by functional programming constructs, specifically idioms for processing lists of data. A MapReduce program takes a list as input and converts it into an output that is again a list. MapReduce is the heart of Hadoop: much of Hadoop's power and efficiency comes from the parallel processing it performs.
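The list-in, list-out style described above can be sketched in plain Python. This is only an illustration of the functional idiom, not Hadoop code; the input values and variable names are made up for the example.

```python
from functools import reduce

# Hypothetical input list; the values are invented for illustration.
readings = [3, 8, 1, 6]

# "Map": apply the same independent function to every element,
# producing a new list. Each element could be handled in parallel.
squared = list(map(lambda x: x * x, readings))

# "Reduce": fold the mapped list into a single summary value.
total = reduce(lambda acc, x: acc + x, squared, 0)

print(squared)  # [9, 64, 1, 36]
print(total)    # 110
```

Because the function passed to map never looks at any other element, the work on each element is independent, which is the property that lets a framework spread it across many machines.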
MapReduce programs work in two phases: Map and Reduce.
How Do Map and Reduce Work Together?
Input data given to the mapper is processed by a user-defined function written at the mapper. The complex business logic is implemented at the mapper level so that the heavy processing is done by the mappers in parallel, since the number of mappers is usually much larger than the number of reducers. The mapper produces intermediate data, and this output goes as input to the reducer.
This intermediate result is then processed by a user-defined function written at the reducer, which generates the final output. Usually, the reducer does only light processing. The final output is stored in HDFS and replicated as usual.
The data goes through the following phases
The input to a MapReduce job is divided into fixed-size pieces called input splits. An input split is a chunk of the input that is consumed by a single map task.
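As a rough sketch of the idea, the helper below cuts a list of records into fixed-size pieces. The function name `make_splits` and the sample lines are hypothetical, and the sketch sizes splits by record count for simplicity, whereas Hadoop sizes them in bytes.

```python
def make_splits(records, records_per_split):
    """Cut a list of records into consecutive chunks of at most
    records_per_split records each (the last chunk may be shorter)."""
    if records_per_split <= 0:
        raise ValueError("records_per_split must be positive")
    return [
        records[i:i + records_per_split]
        for i in range(0, len(records), records_per_split)
    ]

# Hypothetical input: five lines of text, two lines per split.
lines = ["line one", "line two", "line three", "line four", "line five"]
splits = make_splits(lines, 2)

print(len(splits))  # 3
print(splits[-1])   # ['line five']
```

Each element of `splits` would then be handed to its own map task.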
Mapping is the very first phase in the execution of a MapReduce program. In this phase, the data in each split is passed to a mapping function to produce output values. In our example, the job of the mapping phase is to count the number of occurrences of each word in the input splits (more details about input splits are given below) and prepare a list in the form of <word, frequency>.
The shuffling phase consumes the output of the mapping phase. Its task is to consolidate the relevant records from the mapping phase output. In our example, identical words are grouped together along with their respective frequencies.
In the reducing phase, output values from the shuffling phase are aggregated. This phase combines the values from the shuffling phase and returns a single output value (in our example, one total per word). In short, this phase summarizes the complete dataset.
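The mapping, shuffling, and reducing phases described above can be sketched as a single-process word count. This is a simulation for illustration only, not Hadoop's API; the function names and the sample sentences are assumptions made for the example.

```python
from collections import Counter, defaultdict

def map_phase(split):
    """Count each word within one split and emit (word, frequency) pairs."""
    counts = Counter(word.lower() for line in split for word in line.split())
    return list(counts.items())

def shuffle_phase(mapped_outputs):
    """Group the frequencies from all map outputs under their word."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for word, frequency in pairs:
            grouped[word].append(frequency)
    return dict(grouped)

def reduce_phase(grouped):
    """Aggregate each word's list of frequencies into one total."""
    return {word: sum(freqs) for word, freqs in grouped.items()}

# Hypothetical input, already divided into three splits of one line each.
splits = [["deer bear river"], ["car car river"], ["deer car bear"]]

mapped = [map_phase(split) for split in splits]  # one map task per split
grouped = shuffle_phase(mapped)
result = reduce_phase(grouped)

print(grouped["car"])  # [2, 1]
print(result)          # {'deer': 2, 'bear': 2, 'river': 2, 'car': 3}
```

In a real cluster the calls to `map_phase` would run on different nodes at the same time; here they run one after another, which changes the timing but not the result.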
The overall process in detail
One map task is created for each split, and it executes the map function for each record in the split.
It is always beneficial to have multiple splits, because the time taken to process one split is small compared to the time taken to process the whole input. When the splits are smaller, the processing is better load-balanced, since the splits are processed in parallel.
However, it is also not desirable to have splits that are too small. When splits are too small, the overhead of managing the splits and creating map tasks begins to dominate the total job execution time.
For most jobs, it is better to make the split size equal to the size of an HDFS block (64 MB by default in Hadoop 1.x, 128 MB by default in Hadoop 2.x and later).
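To make the trade-off concrete, the short calculation below shows how the number of map tasks grows as the split size shrinks. The 1 GB file size is a made-up figure, the split sizes are example values, and `num_map_tasks` is a hypothetical helper that assumes exactly one map task per split.

```python
MB = 1024 * 1024

def num_map_tasks(file_size_bytes, split_size_bytes):
    """Number of splits (and so map tasks) needed to cover the file,
    rounding up so a final partial split still gets its own task."""
    if split_size_bytes <= 0:
        raise ValueError("split_size_bytes must be positive")
    return (file_size_bytes + split_size_bytes - 1) // split_size_bytes

file_size = 1024 * MB  # hypothetical 1 GB input file

for split_mb in (128, 64, 1):
    tasks = num_map_tasks(file_size, split_mb * MB)
    print(f"{split_mb:>3} MB splits -> {tasks} map tasks")
# 128 MB splits -> 8 map tasks
#  64 MB splits -> 16 map tasks
#   1 MB splits -> 1024 map tasks
```

With 1 MB splits the same file needs 1024 map tasks, so the fixed cost of creating and tracking each task is paid 1024 times instead of 8 or 16.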
Map tasks write their output to a local disk on the node where they run, not to HDFS.
The reason for choosing the local disk over HDFS is to avoid the replication that takes place with every HDFS store operation.
Map output is intermediate output, which is processed by reduce tasks to produce the final output.
Once the job is complete, the map output can be thrown away, so storing it in HDFS with replication would be overkill.
If a node fails before its map output has been consumed by the reduce task, Hadoop reruns the map task on another node and re-creates the map output.
- The reduce task does not benefit from data locality. The output of every map task is fed to the reduce task, so map output is transferred to the machine where the reduce task is running.
- On this machine, the output is merged and then passed to the user-defined reduce function.
- Unlike the map output, the reduce output is stored in HDFS (the first replica is stored on the local node and the other replicas are stored on off-rack nodes). So, writing the reduce output consumes network bandwidth.
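The merge step in the list above can be sketched as follows. The sketch assumes each map output arrives already sorted by key, and it uses Python's standard `heapq.merge` and `itertools.groupby` in place of Hadoop's own merge machinery; `reduce_fn`, `run_reduce_task`, and the sample data are hypothetical.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def reduce_fn(key, values):
    """User-defined reduce function: here, sum the values for one key."""
    return key, sum(values)

def run_reduce_task(map_outputs):
    """Merge key-sorted map outputs into one sorted stream, group the
    stream by key, and call reduce_fn once per key."""
    merged = heapq.merge(*map_outputs, key=itemgetter(0))
    return [
        reduce_fn(key, [value for _, value in group])
        for key, group in groupby(merged, key=itemgetter(0))
    ]

# Hypothetical outputs of three map tasks, each sorted by key.
map_outputs = [
    [("bear", 1), ("deer", 1), ("river", 1)],
    [("car", 2), ("river", 1)],
    [("bear", 1), ("car", 1), ("deer", 1)],
]

final_output = run_reduce_task(map_outputs)
print(final_output)
# [('bear', 2), ('car', 3), ('deer', 2), ('river', 2)]
```

Merging sorted inputs means all values for a key arrive together, so the reduce function can be called once per key without holding the whole dataset in memory.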