
.reduceByKey(...) transformation

The reduceByKey(f) transformation merges the values for each key of the RDD using the function f. The function f must be commutative and associative so the result can be computed correctly in parallel: partial reductions are performed locally on each partition before the partial results are combined across partitions.
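For example, addition is both commutative and associative, so partial sums computed on separate partitions can be merged in any order. The following minimal sketch (the SparkContext name sc and the toy data are assumptions for illustration) shows the behavior on a small pair RDD:

# A minimal sketch, assuming an active SparkContext named sc
rdd = sc.parallelize([('a', 1), ('b', 2), ('a', 3), ('b', 4)])

# Addition is commutative and associative, so per-partition partial
# sums can be safely merged into the final total for each key
rdd.reduceByKey(lambda x, y: x + y).collect()
# e.g. [('a', 4), ('b', 6)] (ordering of keys may vary)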

Look at the following code snippet:

# Determine delays by originating city
# - remove the header row via zipWithIndex()
#   and map()
(
    flights
    .zipWithIndex()                           # pair each row with its index
    .filter(lambda row_idx: row_idx[1] > 0)   # drop the header (index 0)
    .map(lambda row_idx: row_idx[0])          # keep the row, drop the index
    .map(lambda c: (c[3], int(c[1])))         # (origin airport, delay)
    .reduceByKey(lambda x, y: x + y)          # sum delays per origin
    .take(5)
)

This will generate the following output:

# Output
[('JFK', 387929),
 ('MIA', 169373),
 ('LIH', -646),
 ('LIT', 34489),
 ('RDM', 3445)]
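The flights RDD referenced above is assumed to have been created beforehand. One plausible way to load it is sketched below; the file name departuredelays.csv and the column order (date, delay, distance, origin, destination) are assumptions inferred from the indices c[3] and c[1] used in the snippet:

# A sketch of loading the flights RDD; the path and column layout
# are assumptions inferred from the example above
flights = (
    sc
    .textFile('departuredelays.csv')
    .map(lambda line: line.split(','))   # each row becomes a list of fields
)

Under this layout, c[3] is the origin airport code and c[1] is the delay in minutes, which is why the keys in the output are airport codes and the values are summed delays; a negative total (such as LIH's -646) presumably means the flights from that origin were, on aggregate, early.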