Running a Distributed MapReduce Job
The same program will run, without alteration, on a full dataset. This is the point of MapReduce: it scales to the size of your data and the size of your hardware. Here’s one data point: on a 10-node EC2 cluster running High-CPU Extra Large instances, the program took six minutes to run.[21]
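To make the scaling claim concrete, here is a minimal single-process sketch in Python. It is not the program discussed above and it does not use Hadoop's API; the names `map_max`, `reduce_max`, and `run_job`, the `key,value` record format, and the sample records are all hypothetical, chosen only for illustration. It shows that the same map and reduce functions give the same answer however the input is split.

```python
from collections import defaultdict


def map_max(record):
    """Hypothetical mapper: parse a 'key,value' line and emit (key, value)."""
    key, value = record.split(",")
    yield key, int(value)


def reduce_max(key, values):
    """Hypothetical reducer: keep the largest value seen for each key."""
    return key, max(values)


def run_job(records, mapper, reducer, num_splits):
    """Simulate a job in one process: split the input, map, group by key, reduce."""
    # Deal the records into num_splits slices. On a real cluster, each
    # split would be handed to a separate map task.
    splits = [records[i::num_splits] for i in range(num_splits)]

    grouped = defaultdict(list)
    for split in splits:
        for record in split:
            for key, value in mapper(record):
                # Stand-in for the shuffle: collect all values for a key.
                grouped[key].append(value)

    return dict(reducer(key, values) for key, values in grouped.items())


# Made-up sample input, for illustration only.
records = ["1949,111", "1950,0", "1950,22", "1949,78", "1950,-11"]

single = run_job(records, map_max, reduce_max, num_splits=1)
many = run_job(records, map_max, reduce_max, num_splits=4)
print(single == many)
```

Only `num_splits` changes between the two runs; the mapper and reducer are untouched. The functions never depend on how the input is divided, which is why the work can be spread over more machines without altering the program.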
We’ll go through the mechanics of running programs on a cluster in Chapter 6.