Running a Distributed MapReduce Job

The same program will run, without alteration, on a full dataset. This is the point of MapReduce: it scales to the size of your data and the size of your hardware. Here’s one data point: on a 10-node EC2 cluster running High-CPU Extra Large instances, the program took six minutes to run.[21]
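One way to see why no alteration is needed is a small local simulation. The sketch below is not the program discussed in this chapter and does not use Hadoop. It assumes a made-up "key,value" record format and a max-per-key reduction, and the names `map_record`, `reduce_max`, and `run_job` are hypothetical. It shows only that the same map and reduce functions give the same answer whether the input is processed as one piece or divided into several splits.

```python
from itertools import chain


def map_record(line):
    """Map step: parse one 'key,value' line (assumed format) into a pair."""
    key, value = line.split(",")
    return key, int(value)


def reduce_max(pairs):
    """Reduce step: keep the largest value seen for each key."""
    result = {}
    for key, value in pairs:
        if key not in result or value > result[key]:
            result[key] = value
    return result


def run_job(lines, num_splits):
    """Simulate a job locally: split the input, process each split, merge."""
    # Deal the records round-robin into num_splits independent splits.
    splits = [lines[i::num_splits] for i in range(num_splits)]
    # Each split is mapped and reduced on its own, as a separate worker might.
    partials = [reduce_max(map(map_record, split)) for split in splits]
    # Merge the per-split results with the same reduce function.
    return reduce_max(chain.from_iterable(p.items() for p in partials))


# Made-up sample records, for illustration only.
records = ["a,3", "b,5", "a,7", "b,-2", "c,0"]

one_split = run_job(records, 1)
four_splits = run_job(records, 4)
print(one_split == four_splits)
```

The merge step can reuse `reduce_max` because taking a maximum is associative and commutative, so the order and grouping of the records do not affect the result. A reduction without that property would need a different merge.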

We’ll go through the mechanics of running programs on a cluster in Chapter 6.
