Frank Kane's Taming Big Data with Apache Spark and Python


Perform an action - count by value

Finally, we're going to perform an action on our RDD. So far we've transformed the RDD into the form that we want. We took our raw input data and created an RDD that contains nothing but ratings as its values. Now we can perform an action with this line of code:

result = ratings.countByValue()

What we're doing is taking our ratings RDD, which in our example contains just the rating values 3, 3, 1, 2, and 1, and calling an action method on it, countByValue. This is a very easy way to cheat and quickly create something like a histogram.

What it does is count up how many times each unique value in the RDD occurs. In this particular example, we know that the rating 3 occurs twice, the rating 1 occurs twice, and the rating 2 occurs only once, and this is the output we'll get. We get these pair values, these tuples if you will, of a rating and the number of times it occurred: (3, 2), (1, 2), and (2, 1).
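To make the counting concrete, here is a minimal plain-Python sketch of what countByValue computes for the five example ratings. It does not call Spark: collections.Counter stands in for the action, and the function name count_by_value is hypothetical, used only for illustration.

```python
from collections import Counter


def count_by_value(values):
    """Count how often each distinct value occurs in a local list.

    This mimics the result of RDD.countByValue() without needing Spark.
    """
    return dict(Counter(values))


# The five rating values from the example
ratings = [3, 3, 1, 2, 1]

result = count_by_value(ratings)
print(result)
```

Each key in the returned dictionary is a rating and each value is the number of times that rating appeared, which matches the rating and count pairs described above.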

These counts are what will end up in our result object. All that's left to do at this point is to print them out. Now, countByValue is an action, so what it returns is a plain old Python object (a dictionary that maps each rating to its count), not an RDD. That means we can use ordinary Python to sort those results, which is the final thing we do.
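One way to sort and print the result might look like the following sketch. The dictionary literal is an assumption: it stands in for the object countByValue would return for the example ratings, so the snippet runs without Spark. The variable name sorted_results is also chosen here for illustration.

```python
# Stand-in for the object returned by ratings.countByValue() in the example
result = {3: 2, 1: 2, 2: 1}

# Sort the (rating, count) pairs by rating so the histogram prints in order
sorted_results = sorted(result.items())

for rating, count in sorted_results:
    print(rating, count)
```

Because the result is already an ordinary Python object at this point, the built-in sorted function is enough; no further Spark operations are needed.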