Hadoop Pig

Apache Pig is a high-level abstraction over MapReduce. We use it to analyze large data sets and to express that analysis as data flows. Pig generally runs on top of Hadoop, where it can perform most data manipulation operations.

In addition, Pig offers a high-level language for writing data analysis programs, called Pig Latin. One major advantage of this language is its rich set of built-in operators. Programmers can also develop their own functions for reading, writing, and processing data.
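To make the data-flow idea concrete, here is a minimal Python sketch (not Pig Latin itself) that mimics a typical load, filter, group, and count pipeline. The sample records and all function names are hypothetical; the comments note roughly which Pig Latin operator each step plays the role of.

```python
from collections import defaultdict

# Hypothetical sample records: (user, url, status)
RECORDS = [
    ("alice", "/home", 200),
    ("bob", "/home", 404),
    ("alice", "/cart", 200),
    ("carol", "/home", 200),
]

def load(records):
    # Roughly the role of LOAD: turn raw input into a relation of tuples.
    return list(records)

def filter_rows(relation, predicate):
    # Roughly the role of FILTER ... BY: keep tuples matching a condition.
    return [row for row in relation if predicate(row)]

def group_by(relation, key):
    # Roughly the role of GROUP ... BY: collect tuples under each key.
    groups = defaultdict(list)
    for row in relation:
        groups[key(row)].append(row)
    return dict(groups)

def count_per_group(groups):
    # Roughly the role of FOREACH ... GENERATE group, COUNT(...).
    return {key: len(rows) for key, rows in groups.items()}

# Each step consumes the previous step's output, forming a data flow.
ok = filter_rows(load(RECORDS), lambda r: r[2] == 200)
hits = count_per_group(group_by(ok, lambda r: r[1]))
print(hits)
```

In Pig, each of these steps would be one statement in a script, and Pig would decide how to run them as MapReduce jobs.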

Features of Pig

Apache Pig comes with the following features:

  • Rich Set of Operators
  • Ease of Programming
  • Optimization Opportunities
  • Extensibility
  • User-Defined Functions (UDFs)
  • Handling of All Types of Data


Architecture of Hadoop Pig

The major components are:

i. Parser

First, every Pig script is handled by the Parser. The Parser checks the syntax of the script, performs type checking, and runs other miscellaneous checks. Its output is a DAG (directed acyclic graph) that represents the Pig Latin statements and logical operators.

In this DAG (the logical plan), the logical operators of the script are the nodes and the data flows are the edges.
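A logical plan of this shape can be sketched in Python as a small dependency graph. This is only an illustration, not Pig's internal representation: the operator names below are hypothetical, and the standard-library graphlib module (Python 3.9+) stands in for whatever Pig uses internally.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each key is an operator node; the value lists the
# operators whose output flows into it (the incoming edges).
plan = {
    "load_logs": [],
    "load_users": [],
    "filter_ok": ["load_logs"],
    "join": ["filter_ok", "load_users"],
    "store": ["join"],
}

# Because the graph is acyclic, the operators can be listed so that every
# operator appears after all of its inputs. A cycle would raise CycleError.
order = list(TopologicalSorter(plan).static_order())
print(order)
```

Any valid ordering places both loads before the join and the join before the store, which is the property an execution plan depends on.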

ii. Optimizer

Next, the DAG is passed to the logical optimizer, which carries out logical optimizations such as projection and pushdown.
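As a rough illustration of pushdown, the Python sketch below moves a filter step ahead of a sort step so that the sort handles fewer rows. It is not Pig's optimizer: the plan representation (a list of operator and function pairs) and the function names are assumptions made for this example.

```python
def push_filters_early(plan):
    # Move each filter ahead of a sort that directly precedes it.
    # Filtering commutes with sorting, so the result is unchanged while
    # the sort sees fewer rows.
    plan = list(plan)
    changed = True
    while changed:
        changed = False
        for i in range(1, len(plan)):
            if plan[i][0] == "filter" and plan[i - 1][0] == "sort":
                plan[i - 1], plan[i] = plan[i], plan[i - 1]
                changed = True
    return plan

def run(plan, rows):
    # Apply each step of the plan to the rows, in order.
    for op, fn in plan:
        if op == "filter":
            rows = [r for r in rows if fn(r)]
        elif op == "sort":
            rows = sorted(rows, key=fn)
    return rows

rows = [5, 3, 8, 1, 9, 2]
original = [("sort", lambda r: r), ("filter", lambda r: r > 4)]
optimized = push_filters_early(original)

# Both plans give the same answer; the optimized one sorts 3 rows, not 6.
print([op for op, _ in optimized], run(optimized, rows))
```

The key point is that the rewrite changes the order of work without changing the result.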

iii. Compiler

The Compiler turns the optimized logical plan into a series of MapReduce jobs.
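For readers new to MapReduce, the shape of one such job can be sketched in plain Python. This single-process word count only mimics the map, shuffle, and reduce phases; it does not use the Hadoop API, and the function names are hypothetical.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (key, value) pair for every word in every line.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    # Shuffle: bring all values for the same key together.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: collapse each key's values into a single result.
    return {key: sum(values) for key, values in grouped.items()}

lines = ["pig runs on hadoop", "Pig compiles to MapReduce", "hadoop runs jobs"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["pig"], counts["hadoop"], counts["runs"])  # prints: 2 2 2
```

A grouping or aggregation step in a Pig script typically ends up as a job of this shape, with Hadoop distributing the phases across machines.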

iv. Execution Engine

Finally, the MapReduce jobs are submitted to Hadoop in sorted order. Hadoop executes them and produces the desired results.
 

Where can we use Pig?

Pig is a good fit in several scenarios, such as:

  • When data loads are time sensitive.
  • When processing data from various sources.
  • When we need analytical insights through sampling.


Where Not to Use Pig?

There are also some scenarios where Pig is not a good fit, such as:

  • When the data is completely unstructured, such as video, audio, and readable text.
  • When strict time constraints exist, since Pig is slower than hand-written MapReduce jobs.
  • When we need more power to optimize the code than Pig provides.

Applications of Pig

Data scientists generally use Apache Pig for tasks involving ad-hoc processing and quick prototyping. Its other applications include:

  1. Processing huge data sources such as weblogs.
  2. Performing data processing for search platforms.
  3. Processing time-sensitive data loads.