Distributed Cache is a facility provided by the Hadoop MapReduce framework. It caches files needed by applications: read-only text files, archives, jar files, etc. Once we have cached a file for our job, Hadoop makes it available on each data node where map/reduce tasks are running.
Thus, we can access the cached files from all the data nodes in our map and reduce tasks.
Working and Implementation of Distributed Cache in Hadoop
First of all, an application that needs the distributed cache to distribute a file:
- Should make sure that the file is available.
- Should also make sure that the file is accessible via a URL, which can be either hdfs:// or http://.
Now, if the file is present at one of the above URLs, the user registers it as a cache file with the distributed cache. The MapReduce job will copy the cache file to all the nodes before any tasks start on those nodes.
- Copy the requisite file to the HDFS:
$ hdfs dfs -put /user/dataflair/lib/jar_file.jar
- Set up the application’s JobConf by registering the cache file in the Driver class.
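The driver-side setup above might look like the following sketch, using the classic `JobConf`/`DistributedCache` API. The jar path matches the `hdfs dfs -put` example above; the `CacheDriver` class name and the `lookup.txt` file are hypothetical placeholders.

```java
import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class CacheDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(CacheDriver.class);
        conf.setJobName("distributed-cache-example");

        // Register the jar we copied to HDFS; it will be added to the
        // classpath of every task JVM on the data nodes.
        DistributedCache.addFileToClassPath(
                new Path("/user/dataflair/lib/jar_file.jar"), conf);

        // A plain read-only file can be cached as well; tasks read it
        // from their local working directory. (Hypothetical file.)
        DistributedCache.addCacheFile(
                new URI("/user/dataflair/cache/lookup.txt"), conf);

        // ... set mapper/reducer classes and submit the job ...
    }
}
```

With the newer `org.apache.hadoop.mapreduce.Job` API, `job.addCacheFile(uri)` and `job.addFileToClassPath(path)` serve the same purpose.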
Size of Distributed Cache in Hadoop
The size of the distributed cache can be controlled with a cache size property in mapred-site.xml. By default, the size of the Hadoop distributed cache is 10 GB.
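A minimal mapred-site.xml fragment might look like the following; `local.cache.size` is the classic property name (value in bytes, here 10 GB), though the exact key can vary between Hadoop versions.

```xml
<configuration>
  <property>
    <!-- Maximum size of the task-local distributed cache, in bytes -->
    <name>local.cache.size</name>
    <value>10737418240</value>
  </property>
</configuration>
```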
Benefits of Distributed Cache in Hadoop
Below are some advantages of the MapReduce Distributed Cache:
Store Complex Data
It distributes simple read-only text files as well as complex types like jars and archives. These archives are then un-archived at the slave nodes.
Data Consistency
Hadoop Distributed Cache tracks the modification timestamps of cache files, and it expects that a cached file should not change while a job is executing. Using a hashing algorithm, the cache engine can always determine the node on which a particular key-value pair resides. Since there is always a single state of the cache cluster, it is never inconsistent.
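The hashing idea above can be sketched as follows. This is a generic illustration of deterministic key-to-node mapping, not Hadoop's internal code; the `nodeFor` helper and the node names are hypothetical.

```java
import java.util.List;

public class CacheNodeLocator {
    // Deterministically map a cache key to one of the given nodes.
    // Because every client computes the same hash, they all agree on
    // which node holds a key -- a single consistent view of the cache.
    static String nodeFor(String key, List<String> nodes) {
        int idx = Math.floorMod(key.hashCode(), nodes.size()); // non-negative index
        return nodes.get(idx);
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("node-1", "node-2", "node-3");
        // The same key always resolves to the same node.
        System.out.println(nodeFor("user:42", nodes)
                .equals(nodeFor("user:42", nodes))); // prints "true"
    }
}
```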
No Single Point of Failure
A distributed cache runs as an independent process across many nodes, so the failure of a single node does not result in a complete failure of the cache.