Relational Database Management Systems
Why can’t we use databases with lots of disks to do large-scale analysis? Why is Hadoop needed?
The answer to these questions comes from another trend in disk drives: seek time is improving more slowly than transfer rate. Seeking is the process of moving the disk’s head to a particular place on the disk to read or write data. It characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk’s bandwidth.
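To make the gap concrete, here is a back-of-the-envelope calculation in Python. The drive figures (a 10 ms seek, a 100 MB/s transfer rate) and the record size are illustrative assumptions, not measurements from any particular drive:

```python
# Back-of-the-envelope comparison of seek-dominated vs. streaming access.
# All drive figures below are illustrative assumptions.

SEEK_TIME_S = 0.010        # ~10 ms per seek (assumed)
TRANSFER_RATE_BPS = 100e6  # ~100 MB/s sustained transfer (assumed)
RECORD_SIZE_B = 100        # bytes per record (assumed)
DATASET_SIZE_B = 1e12      # 1 TB dataset

# Streaming: read the whole dataset sequentially at the transfer rate.
stream_time_s = DATASET_SIZE_B / TRANSFER_RATE_BPS

# Seeking: read just 1% of the records, paying one seek per record.
fraction = 0.01
n_reads = fraction * DATASET_SIZE_B / RECORD_SIZE_B
seek_time_s = n_reads * (SEEK_TIME_S + RECORD_SIZE_B / TRANSFER_RATE_BPS)

print(f"streaming the full dataset: {stream_time_s / 3600:.1f} hours")
print(f"seeking to 1% of records:   {seek_time_s / 3600:.1f} hours")
```

With these numbers, streaming the entire terabyte takes under three hours, while seeking to just 1% of the records takes about a hundred times as long (roughly 278 hours): the seek-dominated pattern loses even though it reads one hundredth of the data.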
If the data access pattern is dominated by seeks, it will take longer to read or write large portions of the dataset than to stream through it, which operates at the transfer rate. On the other hand, for updating a small proportion of records in a database, a traditional B-Tree (the data structure used in relational databases, which is limited by the rate at which it can perform seeks) works well. For updating the majority of a database, a B-Tree is less efficient than MapReduce, which uses Sort/Merge to rebuild the database.
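The contrast between the two update strategies can be sketched in a few lines of Python. This is a toy model rather than a real implementation: an in-memory sorted list stands in for a disk-resident B-Tree, a full re-sort of the merged records stands in for the MapReduce-style Sort/Merge rebuild, and all names are invented for the example.

```python
import bisect

def btree_style_update(sorted_records, updates):
    """Point updates: one search-and-insert (a 'seek') per updated record.
    Cheap when updates touch only a small fraction of the records."""
    for key, value in updates:
        i = bisect.bisect_left(sorted_records, (key,))
        if i < len(sorted_records) and sorted_records[i][0] == key:
            sorted_records[i] = (key, value)      # update in place
        else:
            sorted_records.insert(i, (key, value))
    return sorted_records

def sort_merge_rebuild(sorted_records, updates):
    """Rebuild: merge old records with updates in one pass, then sort.
    No per-record seeks, so it wins when most of the dataset changes."""
    merged = dict(sorted_records)
    merged.update(updates)            # later values win
    return sorted(merged.items())     # one global sort/merge
```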
In many ways, MapReduce can be seen as a complement to a Relational Database Management System (RDBMS). (The differences between the two systems are shown in Table 1-1.) MapReduce is a good fit for problems that need to analyze the whole dataset in a batch fashion, particularly for ad hoc analysis. An RDBMS is good for point queries or updates, where the dataset has been indexed to deliver low-latency retrieval and update times of a relatively small amount of data. MapReduce suits applications where the data is written once and read many times, whereas a relational database is good for datasets that are continually updated.[7]
Table 1-1. Traditional RDBMS compared to MapReduce

| | Traditional RDBMS | MapReduce |
|---|---|---|
| Data size | Gigabytes | Petabytes |
| Access | Interactive and batch | Batch |
| Updates | Read and write many times | Write once, read many times |
| Transactions | ACID | None |
| Structure | Schema-on-write | Schema-on-read |
| Integrity | High | Low |
| Scaling | Nonlinear | Linear |
However, the differences between relational databases and Hadoop systems are blurring. Relational databases have started incorporating some of the ideas from Hadoop, and from the other direction, Hadoop systems such as Hive are becoming more interactive (by moving away from MapReduce) and adding features like indexes and transactions that make them look more and more like traditional RDBMSs.
Another difference between Hadoop and an RDBMS is the amount of structure in the datasets on which they operate. Structured data is organized into entities that have a defined format, such as XML documents or database tables that conform to a particular predefined schema. This is the realm of the RDBMS. Semi-structured data, on the other hand, is looser, and though there may be a schema, it is often ignored, so it may be used only as a guide to the structure of the data: for example, a spreadsheet, in which the structure is the grid of cells, although the cells themselves may hold any form of data. Unstructured data does not have any particular internal structure: for example, plain text or image data. Hadoop works well on unstructured or semi-structured data because it is designed to interpret the data at processing time (so-called schema-on-read). This provides flexibility and avoids the costly data loading phase of an RDBMS, since in Hadoop it is just a file copy.
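Schema-on-read can be illustrated in a few lines of Python. The log format and field names below are invented for the example: the raw lines are stored as plain text (getting them into Hadoop is just a file copy), and structure is imposed only by the code that reads them.

```python
# Schema-on-read in miniature: plain text in, structure applied at
# processing time. The log format here is an illustrative assumption.
raw_lines = [
    '192.168.0.1 - - [10/Oct/2024:13:55:36] "GET /index.html HTTP/1.1" 200 2326',
    '10.0.0.7 - - [10/Oct/2024:13:55:39] "GET /logo.png HTTP/1.1" 404 512',
]

def parse(line):
    """Interpret one raw line; no schema was needed to store it."""
    parts = line.split()
    return {
        "host": parts[0],
        "path": parts[5],          # path component of the request line
        "status": int(parts[-2]),
        "bytes": int(parts[-1]),
    }

records = [parse(line) for line in raw_lines]
errors = [r for r in records if r["status"] >= 400]   # e.g. find 4xx/5xx hits
```

A different analysis could reinterpret the same bytes with a different parse function; nothing about the stored file fixes the schema.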
Relational data is often normalized to retain its integrity and remove redundancy. Normalization poses problems for Hadoop processing because it makes reading a record a nonlocal operation, and one of the central assumptions that Hadoop makes is that it is possible to perform (high-speed) streaming reads and writes.
A web server log is a good example of a set of records that is not normalized (for example, the client hostnames are specified in full each time, even though the same client may appear many times), and this is one reason that logfiles of all kinds are particularly well suited to analysis with Hadoop. Note that Hadoop can perform joins; it's just that they are not used as much as in the relational world.
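As a sketch of how a join looks in this model, here is a toy reduce-side join written in plain Python; the map/shuffle/reduce stages mimic what a Hadoop job would do, and the tables and field names are invented for the example.

```python
from collections import defaultdict

# Two "tables" to be joined on the host key (illustrative data).
log_records = [("host1", "/index.html"), ("host2", "/about.html"),
               ("host1", "/logo.png")]
host_info = [("host1", "US"), ("host2", "DE")]

# Map phase: tag each record with its source, keyed by the join key.
mapped = [(host, ("log", path)) for host, path in log_records]
mapped += [(host, ("info", country)) for host, country in host_info]

# Shuffle: group all tagged values sharing a key onto the same reducer.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: pair up the two sides for each key.
joined = []
for host, values in groups.items():
    countries = [v for tag, v in values if tag == "info"]
    paths = [v for tag, v in values if tag == "log"]
    for country in countries:
        for path in paths:
            joined.append((host, path, country))
# e.g. ('host1', '/index.html', 'US'), ('host2', '/about.html', 'DE'), ...
```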
MapReduce (and the other processing models in Hadoop) scales linearly with the size of the data. Data is partitioned, and the functional primitives (like map and reduce) can work in parallel on separate partitions. This means that if you double the size of the input data, a job will take twice as long to run. But if you also double the size of the cluster, a job will run as fast as the original one. This is not generally true of SQL queries.
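A minimal word count in plain Python shows the shape of the argument; the explicit partitions below stand in for Hadoop's input splits, which would normally be processed on separate machines, and the data is invented for the example.

```python
from collections import defaultdict

def map_phase(partition):
    """Runs independently on each partition (on its own node in Hadoop)."""
    for line in partition:
        for word in line.split():
            yield word, 1

def reduce_phase(word, counts):
    """Runs independently for each key, so reducers also parallelize."""
    return word, sum(counts)

partitions = [["a b a"], ["b b c"]]   # two input splits (illustrative)

shuffle = defaultdict(list)           # group intermediate values by key
for part in partitions:
    for word, n in map_phase(part):
        shuffle[word].append(n)

result = dict(reduce_phase(w, ns) for w, ns in shuffle.items())
# {'a': 2, 'b': 3, 'c': 1}
```

Doubling the data doubles the number of partitions, and each new partition is just as independent as the old ones, which is where the linear scaling comes from.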