MapReduce is the core component of Hadoop that provides data processing. It works by breaking the processing into two phases: the Map phase and the Reduce phase. The Map phase comes first and is where we specify the complex logic, business rules, and costly code. The Reduce phase comes second and is where we specify lightweight processing such as aggregation or summation.
In Hadoop, the MapReduce framework has certain elements, such as Counters, Combiners, and Partitioners, which play a key role in improving the performance of data processing.
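The two phases described above can be sketched with a small word count. This is a plain-Java model written for illustration only: it uses no Hadoop classes, and the names `TwoPhaseSketch`, `map`, `reduce`, and `run` are hypothetical. The grouping step inside `run` stands in for the shuffle that the framework performs between the two phases.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of the two phases; no Hadoop classes are used.
class TwoPhaseSketch {

    // Map phase: parse one input line and emit a (word, 1) pair per word.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Reduce phase: lightweight aggregation, summing the values for one key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Driver: map every line, group the pairs by key, then reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            result.put(entry.getKey(), reduce(entry.getKey(), entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("to be or not", "to be")));
    }
}
```

In a real job the framework calls the map and reduce functions on different nodes. The sketch runs them in one process only to show the division of work.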
Hadoop Counters Explained
Hadoop Counters provide a way to measure the progress of a MapReduce job or the number of operations that occur within it. They are a useful channel for gathering statistics about the job, whether for quality control or for application-level statistics. They are also useful for problem diagnosis.
Counters are global to the job and are defined either by the MapReduce framework or by applications. Each counter is named by a Java enum constant and holds a long value. Counters are bunched into groups, each comprising the counters from a particular enum class.
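To make this structure concrete, here is a minimal plain-Java model of it. It is not Hadoop's implementation and uses no Hadoop classes. The class `CounterModelSketch` and the enums `RecordQuality` and `Io` are hypothetical names chosen for illustration, and the sketch uses the enum's simple class name as the group name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java model of the counter structure; no Hadoop classes are used.
class CounterModelSketch {

    // Hypothetical enums: each enum class forms one group of counters.
    enum RecordQuality { VALID, MALFORMED }
    enum Io { LINES_READ }

    // group name -> (counter name -> long value)
    private final Map<String, Map<String, Long>> groups = new LinkedHashMap<>();

    // Add `amount` to the counter named by the enum constant.
    void increment(Enum<?> counter, long amount) {
        groups.computeIfAbsent(groupOf(counter), g -> new LinkedHashMap<>())
              .merge(counter.name(), amount, Long::sum);
    }

    // Current value, or 0 if the counter was never incremented.
    long value(Enum<?> counter) {
        Map<String, Long> group = groups.get(groupOf(counter));
        return group == null ? 0L : group.getOrDefault(counter.name(), 0L);
    }

    // The group is derived from the enum class that declares the constant.
    private static String groupOf(Enum<?> counter) {
        return counter.getDeclaringClass().getSimpleName();
    }

    // View of all groups and their counters, for reporting.
    Map<String, Map<String, Long>> snapshot() {
        return groups;
    }

    public static void main(String[] args) {
        CounterModelSketch counters = new CounterModelSketch();
        counters.increment(RecordQuality.VALID, 2);
        counters.increment(RecordQuality.MALFORMED, 1);
        counters.increment(Io.LINES_READ, 3);
        System.out.println(counters.snapshot());
    }
}
```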
Hadoop Counters let us validate that:
- The correct number of bytes was read and written.
- The correct number of tasks were launched and ran successfully.
- The amount of CPU and memory consumed is appropriate for our job and cluster nodes.
Types of Counters in MapReduce
There are two types of MapReduce counters:
- Built-in Counters
- User-Defined Counters/Custom counters
1. Built-in Counters in Hadoop MapReduce
Apache Hadoop maintains some built-in counters for every job. These counters report various metrics. For example, there are counters for the number of bytes and records processed, which allow us to confirm that the expected amount of input was consumed and the expected amount of output was produced.
Built-in counters are divided into several groups. Each group contains either task counters or job counters.
The groups of built-in counters in Hadoop are as follows:
a) MapReduce Task Counter
Task counters collect specific information about tasks during their execution, such as the number of records read and written.
For example, MAP_INPUT_RECORDS is a task counter. It counts the input records read by each map task, and the per-task values are aggregated over all map tasks in the job.
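The way a task counter such as MAP_INPUT_RECORDS turns into a job-level figure can be sketched as follows. This is a simplified plain-Java model, not Hadoop code: each inner list stands for the input split of one map task, and the names `TaskCounterAggregationSketch`, `runMapTask`, and `aggregate` are hypothetical.

```java
import java.util.List;

// Plain-Java model of task counter aggregation; no Hadoop classes are used.
class TaskCounterAggregationSketch {

    // One simulated map task: reads every record in its split and returns
    // its own MAP_INPUT_RECORDS value.
    static long runMapTask(List<String> split) {
        long mapInputRecords = 0;
        for (String record : split) {
            mapInputRecords++; // one increment per record read
        }
        return mapInputRecords;
    }

    // Job-level value: the sum of the values reported by all map tasks.
    static long aggregate(List<List<String>> splits) {
        long total = 0;
        for (List<String> split : splits) {
            total += runMapTask(split);
        }
        return total;
    }

    public static void main(String[] args) {
        List<List<String>> splits = List.of(
                List.of("r1", "r2", "r3"),
                List.of("r4", "r5"));
        System.out.println("MAP_INPUT_RECORDS = " + aggregate(splits));
    }
}
```

Comparing the aggregated total with the number of records we expected to feed into the job is one simple way to confirm that all of the input was consumed.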
b) File System Counters
These counters gather information such as the number of bytes read and written by the file system. The names and descriptions of the file system counters are as follows:
- FileSystem bytes read: The number of bytes read by the filesystem.
- FileSystem bytes written: The number of bytes written to the filesystem.
c) FileInputFormat Counters
d) FileOutputFormat Counters
e) Job Counters in MapReduce
Job counters measure job-level statistics, not values that change while a task is running. For example, TOTAL_LAUNCHED_MAPS counts the number of map tasks that were launched over the course of a job. Job counters are maintained by the application master, so they don't need to be sent across the network, unlike all other counters, including user-defined ones.
2. User-Defined Counters or Custom Counters in Hadoop MapReduce
In addition to built-in counters, Hadoop MapReduce permits user code to define its own set of counters and increment them as desired in the mapper or reducer. In Java, counters are defined with an enum.
A job may define an arbitrary number of enums, each with an arbitrary number of fields. The name of the enum is the group name, and the enum's fields are the counter names.
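As a rough illustration, the sketch below counts bad records while summing the good ones, which is a typical use for user-defined counters. It is a plain-Java stand-in rather than a real mapper: the enum `Temperature`, the class `UserCounterSketch`, and the method `sumValid` are hypothetical, and an `EnumMap` plays the role of the framework's counter storage. In a real mapper using the `org.apache.hadoop.mapreduce` API, the increment would go through the task context, for example `context.getCounter(Temperature.MISSING).increment(1)`.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in for a mapper that uses user-defined counters.
class UserCounterSketch {

    // Hypothetical user-defined counters: the enum is the group,
    // its fields are the counter names.
    enum Temperature { MISSING, MALFORMED }

    // Sums the valid readings and counts the records that had to be skipped.
    static long sumValid(List<String> records, Map<Temperature, Long> counters) {
        long sum = 0;
        for (String record : records) {
            String field = record.trim();
            if (field.isEmpty()) {
                counters.merge(Temperature.MISSING, 1L, Long::sum);
                continue;
            }
            try {
                sum += Long.parseLong(field);
            } catch (NumberFormatException e) {
                counters.merge(Temperature.MALFORMED, 1L, Long::sum);
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<Temperature, Long> counters = new EnumMap<>(Temperature.class);
        long sum = sumValid(List.of("12", "", "abc", "-3"), counters);
        System.out.println("sum=" + sum + " counters=" + counters);
    }
}
```

Because the counter names are enum constants, a misspelled counter is caught by the compiler rather than silently creating a new counter.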
a) Dynamic Counters in Hadoop
A Java enum's fields are defined at compile time, so we cannot create new counters at runtime using enums. Dynamic counters fill this gap: they are identified by a string group name and a string counter name instead of an enum constant, so they can be created while the job is running.
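The sketch below shows why this matters when counter names come from the data itself. It is a plain-Java model that uses no Hadoop classes. The class `DynamicCounterSketch`, the group name `HttpStatus`, and the log line format are assumptions made for illustration. In the `org.apache.hadoop.mapreduce` API, the corresponding call on the task context is `getCounter(String groupName, String counterName)`.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of dynamic counters; no Hadoop classes are used.
class DynamicCounterSketch {

    // group name -> (counter name -> value); both names are runtime strings.
    private final Map<String, Map<String, Long>> groups = new TreeMap<>();

    // Creates the counter on first use, then adds `amount` to it.
    void increment(String group, String name, long amount) {
        groups.computeIfAbsent(group, g -> new TreeMap<>())
              .merge(name, amount, Long::sum);
    }

    // Current value, or 0 if the counter was never incremented.
    long value(String group, String name) {
        Map<String, Long> counters = groups.get(group);
        return counters == null ? 0L : counters.getOrDefault(name, 0L);
    }

    // Counts log lines per status code. The last whitespace-separated field
    // is assumed to be the status, so counter names are not known up front.
    void countStatuses(List<String> logLines) {
        for (String line : logLines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            String[] fields = trimmed.split("\\s+");
            increment("HttpStatus", fields[fields.length - 1], 1);
        }
    }

    public static void main(String[] args) {
        DynamicCounterSketch counters = new DynamicCounterSketch();
        counters.countStatuses(List.of("GET /a 200", "GET /b 404", "GET /c 200"));
        System.out.println("200 -> " + counters.value("HttpStatus", "200"));
        System.out.println("404 -> " + counters.value("HttpStatus", "404"));
    }
}
```

The trade-off is that string names are not checked by the compiler, so enum counters remain the safer choice whenever the set of counters is known in advance.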