A Brief History of Apache Hadoop
Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open source web search engine, itself a part of the Lucene project.
THE ORIGIN OF THE NAME “HADOOP”
The name Hadoop is not an acronym; it’s a made-up name. The project’s creator, Doug Cutting, explains how the name came about:
The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid’s term.
Projects in the Hadoop ecosystem also tend to have names that are unrelated to their function, often with an elephant or other animal theme (“Pig,” for example). Smaller components are given more descriptive (and therefore more mundane) names. This is a good principle, as it means you can generally work out what something does from its name. For example, the namenode[10] manages the filesystem namespace.
Building a web search engine from scratch was an ambitious goal, for not only is the software required to crawl and index websites complex to write, but it is also a challenge to run without a dedicated operations team, since there are so many moving parts. It’s expensive, too: Mike Cafarella and Doug Cutting estimated a system supporting a one-billion-page index would cost around $500,000 in hardware, with a monthly running cost of $30,000.[11] Nevertheless, they believed it was a worthy goal, as it would open up and ultimately democratize search engine algorithms.
Nutch was started in 2002, and a working crawler and search system quickly emerged. However, its creators realized that their architecture wouldn’t scale to the billions of pages on the Web. Help was at hand with the publication of a paper in 2003 that described the architecture of Google’s distributed filesystem, called GFS, which was being used in production at Google.[12] GFS, or something like it, would solve their storage needs for the very large files generated as a part of the web crawl and indexing process. In particular, GFS would free up time being spent on administrative tasks such as managing storage nodes. In 2004, Nutch’s developers set about writing an open source implementation, the Nutch Distributed Filesystem (NDFS).
In 2004, Google published the paper that introduced MapReduce to the world.[13] Early in 2005, the Nutch developers had a working MapReduce implementation in Nutch, and by the middle of that year all the major Nutch algorithms had been ported to run using MapReduce and NDFS.
NDFS and the MapReduce implementation in Nutch were applicable beyond the realm of search, and in February 2006 they moved out of Nutch to form an independent subproject of Lucene called Hadoop. At around the same time, Doug Cutting joined Yahoo!, which provided a dedicated team and the resources to turn Hadoop into a system that ran at web scale (see the following sidebar). This was demonstrated in February 2008 when Yahoo! announced that its production search index was being generated by a 10,000-core Hadoop cluster.[14]
HADOOP AT YAHOO!
Building Internet-scale search engines requires huge amounts of data and therefore large numbers of machines to process it. Yahoo! Search consists of four primary components: the Crawler, which downloads pages from web servers; the WebMap, which builds a graph of the known Web; the Indexer, which builds a reverse index to the best pages; and the Runtime, which answers users’ queries. The WebMap is a graph that consists of roughly 1 trillion (10¹²) edges, each representing a web link, and 100 billion (10¹¹) nodes, each representing distinct URLs. Creating and analyzing such a large graph requires a large number of computers running for many days. In early 2005, the infrastructure for the WebMap, named Dreadnaught, needed to be redesigned to scale up to more nodes. Dreadnaught had successfully scaled from 20 to 600 nodes, but required a complete redesign to scale out further. Dreadnaught is similar to MapReduce in many ways, but provides more flexibility and less structure. In particular, each fragment in a Dreadnaught job could send output to each of the fragments in the next stage of the job, but the sort was all done in library code. In practice, most of the WebMap phases were pairs that corresponded to MapReduce. Therefore, the WebMap applications would not require extensive refactoring to fit into MapReduce.
Eric Baldeschwieler (aka Eric14) created a small team, and we started designing and prototyping a new framework, written in C++ and modeled after GFS and MapReduce, to replace Dreadnaught. Although the immediate need was for a new framework for WebMap, it was clear that standardization of the batch platform across Yahoo! Search was critical and that by making the framework general enough to support other users, we could better leverage investment in the new platform.
At the same time, we were watching Hadoop, which was part of Nutch, and its progress. In January 2006, Yahoo! hired Doug Cutting, and a month later we decided to abandon our prototype and adopt Hadoop. The advantage of Hadoop over our prototype and design was that it was already working with a real application (Nutch) on 20 nodes. That allowed us to bring up a research cluster two months later and start helping real customers use the new framework much sooner than we could have otherwise. Another advantage, of course, was that since Hadoop was already open source, it was easier (although far from easy!) to get permission from Yahoo!’s legal department to work in open source. So, we set up a 200-node cluster for the researchers in early 2006 and put the WebMap conversion plans on hold while we supported and improved Hadoop for the research users.
— Owen O’Malley, 2009
In January 2008, Hadoop was made its own top-level project at Apache, confirming its success and its diverse, active community. By this time, Hadoop was being used by many other companies besides Yahoo!, such as Last.fm, Facebook, and the New York Times.
In one well-publicized feat, the New York Times used Amazon’s EC2 compute cloud to crunch through 4 terabytes of scanned archives from the paper, converting them to PDFs for the Web.[15] The processing took less than 24 hours to run using 100 machines, and the project probably wouldn’t have been embarked upon without the combination of Amazon’s pay-by-the-hour model (which allowed the NYT to access a large number of machines for a short period) and Hadoop’s easy-to-use parallel programming model.
In April 2008, Hadoop broke a world record to become the fastest system to sort an entire terabyte of data. Running on a 910-node cluster, Hadoop sorted 1 terabyte in 209 seconds (just under 3.5 minutes), beating the previous year’s winner of 297 seconds.[16] In November of the same year, Google reported that its MapReduce implementation sorted 1 terabyte in 68 seconds.[17] Then, in April 2009, it was announced that a team at Yahoo! had used Hadoop to sort 1 terabyte in 62 seconds.[18]
The trend since then has been to sort even larger volumes of data at ever faster rates. In the 2014 competition, a team from Databricks were joint winners of the Gray Sort benchmark. They used a 207-node Spark cluster to sort 100 terabytes of data in 1,406 seconds, a rate of 4.27 terabytes per minute.[19]
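The quoted rate follows directly from the two figures above; as a quick back-of-the-envelope check (a minimal Python sketch; the variable names are illustrative, not taken from the benchmark report):

    # Sanity check of the 2014 Gray Sort figures quoted above.
    data_sorted_tb = 100      # 100 terabytes sorted in the Databricks run
    elapsed_seconds = 1406    # reported elapsed time for the run

    rate_tb_per_minute = data_sorted_tb / (elapsed_seconds / 60)
    print(f"{rate_tb_per_minute:.2f} TB/min")  # prints 4.27, matching the reported rate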
Today, Hadoop is widely used in mainstream enterprises. Hadoop’s role as a general-purpose storage and analysis platform for big data has been recognized by the industry, and this fact is reflected in the number of products that use or incorporate Hadoop in some way. Commercial Hadoop support is available from large, established enterprise vendors, including EMC, IBM, Microsoft, and Oracle, as well as from specialist Hadoop companies such as Cloudera, Hortonworks, and MapR.