Thursday, November 14, 2013

What is Hadoop?

Hadoop is an open-source, Java-based implementation of Google's MapReduce framework. It is designed for applications that can take advantage of massively parallel distributed processing, particularly on clusters built from unreliable hardware.

For example, suppose you have 10 terabytes of data that you need to process somehow, say by sorting it. On a single computer this could take a very long time; traditionally, a high-end supercomputer with exotic hardware would be required to finish in a reasonable amount of time.

Hadoop provides a framework for processing data of this size on a computing cluster built from normal, commodity hardware. It has two major components: the Hadoop Distributed File System (HDFS), which splits large files into blocks stored across multiple computers, and the MapReduce framework, an application framework for processing the large datasets stored on that file system.
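To make the processing model concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases, using word counting as the example job. This is a conceptual simulation in Python, not Hadoop's actual Java API; the function and variable names are illustrative only.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every input record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}

# Word count: the canonical MapReduce example.
def word_count_mapper(line):
    for word in line.split():
        yield (word, 1)

def word_count_reducer(word, counts):
    return sum(counts)

lines = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(lines, word_count_mapper)),
                      word_count_reducer)
print(counts)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

In real Hadoop, the map and reduce functions run as tasks spread across the cluster, each reading its share of the input blocks from the file system; the shuffle step is where the framework moves intermediate data between machines.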

Of course, there are many distributed-computing frameworks; what is particularly notable about Hadoop (and Google's MapReduce) is its built-in fault tolerance. Because it is designed to run on commodity hardware, it expects machines to fail frequently. The underlying file system is highly redundant (blocks of data are replicated across multiple computers), and the MapReduce framework automatically handles machine failures during a job by reassigning the failed work to another computer in the cluster.
