By Azeema Sadia and Abdul Haseeb Khan
This article is for readers who are struggling to store and manage the large amounts of data their business or organization generates, and for those who want to store big data efficiently. It will not go into the details of the Hadoop framework's infrastructure or mechanics; rather, it gives an overview of what Hadoop is and how it came into being. We will also briefly discuss some of the components of this newly emerging technology, which is already used by some of the biggest players in the big data industry.
“2013 is going to be the year when we see the Hadoop adoption go a lot more mainstream and turn into a tornado.” – Karmasphere CEO Gail Ennis
Big data! It's what everyone has on their minds right now. As businesses grow, it gets hard for them to store all their data in one place, and it becomes almost impossible to scale up over time. It simply gets too expensive to keep scaling up and maintaining all that data. Look at the giants of the internet, such as Yahoo, eBay, LinkedIn, and Facebook: they generate terabytes of data every day. Where does it all go? How do they manage it, day after day?
Let's go back in time, to the year 2002 to be exact, and fast forward to the present. Back in 2002 Doug Cutting was developing a search engine called Nutch, and it quickly emerged as a working crawler and search system. The team soon realized, however, that their system wouldn't scale to the billions of pages on the web. Luckily for them, in 2003 Google published a paper describing the architecture of the technology it was using to tackle the same problem: the Google File System, or GFS for short. In 2004 Google published another paper that introduced MapReduce to the world, and as these papers appeared, Doug Cutting and his team were implementing their own versions of the technologies inside the Nutch engine. In 2006 these components moved out of Nutch to become an independent project called Hadoop. Around that time Doug joined Yahoo, which remains the biggest contributor to the project today. By 2008, Hadoop was in use at companies beyond Yahoo, such as Last.fm, Facebook, and the New York Times.
And now, in 2012, Hadoop has become the heart of many companies and businesses, some of which are named above, and it is well known for handling and taking care of large amounts of data. Hadoop is a top-level Apache project and an open source framework.
How will Hadoop help you with your data problems? Hadoop has its own distributed file system, called HDFS (Hadoop Distributed File System). HDFS breaks large files into multiple data blocks and stores them on different machines (nodes) in the cluster. It then replicates each data block and stores the copies on different machines, so the failure of a single machine cannot make the data unavailable. In a Hadoop cluster, each machine that stores data blocks is called a data node. Data nodes don't have to be high-end, heavy-duty machines; they can be ordinary, everyday machines, also known as commodity hardware. That is the best part of the technology: it is very cost effective. One machine in the cluster, known as the name node, holds the metadata for the whole cluster. The name node keeps track of all the data nodes and of which node contains which piece of data. Before the latest version of Hadoop, only one name node was allowed in the cluster. So while all the data nodes could be commodity hardware, the name node had to be a beefy, high-spec machine, and this was the one weak point of Hadoop: if the name node failed, for any reason, the whole cluster failed with it. Now, however, multiple name nodes can be installed in the cluster, and they too can be ordinary commodity hardware. The real uniqueness of Hadoop is its simplified programming model, which lets you query large amounts of data efficiently.
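The splitting and replication described above can be illustrated with a toy sketch in plain Python. This is not real HDFS code; the block size, node names, and placement policy here are simplified assumptions made only to show the idea of a name-node-style table mapping each block to the data nodes that hold its copies.

```python
BLOCK_SIZE = 8     # real HDFS defaults to 128 MB; tiny here for the demo
REPLICATION = 3    # HDFS's default replication factor

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Break the raw data into fixed-size blocks, as HDFS does with files."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes, replication=REPLICATION):
    """Build a name-node-style table: block id -> the nodes holding copies."""
    placement = {}
    for idx, block in enumerate(blocks):
        # Simple round-robin placement; real HDFS also considers rack topology.
        chosen = [nodes[(idx + r) % len(nodes)] for r in range(replication)]
        placement[idx] = {"data": block, "replicas": chosen}
    return placement

data_nodes = ["node1", "node2", "node3", "node4"]  # hypothetical cluster
blocks = split_into_blocks(b"hello hadoop distributed file system")
for block_id, info in place_blocks(blocks, data_nodes).items():
    print(block_id, info["replicas"])
```

Because every block lives on three different nodes, losing any single machine still leaves two copies available, which is exactly why a data-node failure does not make data unavailable.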
The MapReduce programming model is designed to run in the Hadoop environment and to process large data sets in parallel by dividing the work into small tasks spread across multiple machines in the cluster. MapReduce has two basic components: the Mapper and the Reducer. The Mapper takes the list of input elements and transforms them into intermediate output elements, and the Reducer takes the Mapper's output for each key and combines it into a single result.
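The Mapper/Reducer split can be sketched in a few lines of plain Python, with no Hadoop required. This is a toy, single-machine model of the idea only: the mapper emits key-value pairs, the "framework" groups them by key, and the reducer combines each key's values; the function names are illustrative, and the example is the classic word count.

```python
from collections import defaultdict

def mapper(line):
    # Emit a (word, 1) pair for every word in the input line.
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # Combine all counts for one word into a single total.
    return (word, sum(counts))

def run_mapreduce(lines):
    grouped = defaultdict(list)
    for line in lines:                        # "map" phase
        for key, value in mapper(line):
            grouped[key].append(value)
    # "reduce" phase: one reducer call per distinct key
    return dict(reducer(k, v) for k, v in grouped.items())

print(run_mapreduce(["big data big cluster", "big data"]))
# {'big': 3, 'data': 2, 'cluster': 1}
```

In a real Hadoop cluster the grouping step (the "shuffle") happens across the network, and many mapper and reducer tasks run in parallel on different data nodes, but the programming model a developer writes against is just these two functions.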
There are multiple subprojects of the Hadoop framework, such as Hive, Pig, HBase, and ZooKeeper. Of course, Hadoop is not a small topic that can be covered in one article; there is a whole ecosystem with many components that we did not touch on here. The purpose of this article was to give you an introduction to a technology that is maturing very quickly and making an impact on many businesses.