What’s in This Book?

The book is divided into five main parts: Parts I to III are about core Hadoop, Part IV covers related projects in the Hadoop ecosystem, and Part V contains Hadoop case studies. You can read the book from cover to cover, but there are alternative pathways through the book that allow you to skip chapters that aren’t needed to read later ones. See Figure 1-1.

Part I is made up of five chapters that cover the fundamental components in Hadoop and should be read before tackling later chapters. Chapter 1 (this chapter) is a high-level introduction to Hadoop. Chapter 2 provides an introduction to MapReduce. Chapter 3 looks at Hadoop filesystems, and in particular HDFS, in depth. Chapter 4 discusses YARN, Hadoop’s cluster resource management system. Chapter 5 covers the I/O building blocks in Hadoop: data integrity, compression, serialization, and file-based data structures.

Part II has four chapters that cover MapReduce in depth. They provide useful understanding for later chapters (such as the data processing chapters in Part IV), but could be skipped on a first reading. Chapter 6 goes through the practical steps needed to develop a MapReduce application. Chapter 7 looks at how MapReduce is implemented in Hadoop, from the point of view of a user. Chapter 8 is about the MapReduce programming model and the various data formats that MapReduce can work with. Chapter 9 is on advanced MapReduce topics, including sorting and joining data.

Part III concerns the administration of Hadoop: Chapters 10 and 11 describe how to set up and maintain a Hadoop cluster running HDFS and MapReduce on YARN.

Part IV of the book is dedicated to projects that build on Hadoop or are closely related to it. Each chapter covers one project and is largely independent of the other chapters in this part, so they can be read in any order.

The first two chapters in this part are about data formats. Chapter 12 looks at Avro, a cross-language data serialization library for Hadoop, and Chapter 13 covers Parquet, an efficient columnar storage format for nested data.

The next two chapters look at data ingestion, or how to get your data into Hadoop. Chapter 14 is about Flume, for high-volume ingestion of streaming data. Chapter 15 is about Sqoop, for efficient bulk transfer of data between structured data stores (like relational databases) and HDFS.

The common theme of the next four chapters is data processing, and in particular using higher-level abstractions than MapReduce. Pig (Chapter 16) is a data flow language for exploring very large datasets. Hive (Chapter 17) is a data warehouse for managing data stored in HDFS and provides a query language based on SQL. Crunch (Chapter 18) is a high-level Java API for writing data processing pipelines that can run on MapReduce or Spark. Spark (Chapter 19) is a cluster computing framework for large-scale data processing; it provides a directed acyclic graph (DAG) engine, and APIs in Scala, Java, and Python.

Chapter 20 is an introduction to HBase, a distributed column-oriented real-time database that uses HDFS for its underlying storage. And Chapter 21 is about ZooKeeper, a distributed, highly available coordination service that provides useful primitives for building distributed applications.

Finally, Part V is a collection of case studies contributed by people using Hadoop in interesting ways.

Supplementary information about Hadoop, such as how to install it on your machine, can be found in the appendixes.
