
Partitioner
The partitioner takes the intermediate key/value pairs from the mapper (or from the combiner, if one is used) and splits them into shards, one shard per reducer.
Each map function output is allocated to a particular reducer by the application's partition function for sharding purposes. The partition function is given the key and the number of reducers and returns the index of the desired reducer.
A typical default is to hash the key and take the hash value modulo the number of reducers:
partitionId = hash(key) mod R, where R is the number of reducers
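As a concrete illustration, here is a minimal sketch of such a hash-based partition function in Java. The class and method names are illustrative and not taken from any particular MapReduce framework; the sign-bit mask is only needed because Java hash codes can be negative.

public final class HashPartitioner {

    // Returns the index of the reducer (0 .. numReducers - 1) that should
    // receive the given key.
    public static int partition(String key, int numReducers) {
        // Mask off the sign bit so the modulo result is never negative,
        // since Java's hashCode() may return a negative value.
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int numReducers = 4;
        for (String key : new String[] {"apple", "banana", "apple", "cherry"}) {
            // Identical keys always map to the same reducer index.
            System.out.println(key + " -> reducer " + partition(key, numReducers));
        }
    }
}

Because the function depends only on the key and the reducer count, repeated occurrences of the same key, regardless of which mapper emitted them, land in the same shard.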
It is important to pick a partition function that gives an approximately uniform distribution of data per shard for load-balancing purposes; otherwise the MapReduce operation can be held up waiting for slow reducers to finish (that is, the reducers assigned the larger shares of the skewed data).
Between the map and reduce stages, the data is shuffled (sorted in parallel and exchanged between nodes) to move each intermediate record from the map node that produced it to the shard in which it will be reduced. The shuffle can sometimes take longer than the computation itself, depending on network bandwidth, CPU speed, the amount of data produced, and the time taken by the map and reduce computations.
By default, the partitioner computes a hash of each key (for example, an MD5 checksum) and distributes the keyspace approximately evenly over the reducers, while still ensuring that identical keys emitted by different mappers end up at the same reducer. This default behavior can be customized, for example to support sorting the output by key. The partitioned data is written to the local filesystem of each map task and waits there to be pulled by its corresponding reducer.
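As a hedged illustration of such customization, the sketch below replaces the hash with range partitioning: reducer 0 receives the smallest keys, reducer 1 the next range, and so on, so that concatenating the reducer outputs yields globally sorted data. The class name and cut points are hypothetical; in practice the cut points would be chosen by sampling the key distribution so the ranges are roughly equal in size.

import java.util.Arrays;

public final class RangePartitioner {

    // Hypothetical upper bounds (exclusive) of the key range assigned to
    // each reducer except the last one; four reducers in this sketch.
    private static final String[] CUT_POINTS = {"g", "n", "t"};

    public static int partition(String key) {
        // binarySearch returns the insertion point encoded as -(pos) - 1
        // when the key is not an exact cut point; either way the result
        // identifies the range (and thus the reducer) the key falls into.
        int pos = Arrays.binarySearch(CUT_POINTS, key);
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    public static void main(String[] args) {
        for (String key : new String[] {"apple", "melon", "zebra"}) {
            System.out.println(key + " -> reducer " + partition(key));
        }
    }
}

The trade-off of this design is load balance: a hash partitioner spreads skewed keys more evenly, whereas a range partitioner only balances well if the cut points reflect the actual key distribution.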