Frank Kane's Taming Big Data with Apache Spark and Python

Parsing (mapping) the input data

Our first step is to parse our input data into the form we need. Nothing special here: we start by creating a lines RDD, as shown here, by calling textFile on our SparkContext object with the path to our source data. That gives us an RDD in which every line of the comma-separated file is an individual entry:

lines = sc.textFile("file:///SparkCourse/fakefriends.csv")

Now things get interesting. I'm going to transform my lines RDD into a new RDD called rdd (very creatively named) by calling map on it, passing in the parseLine function to actually conduct that mapping:

rdd = lines.map(parseLine)

So every line from my lines RDD will be passed into parseLine one at a time, and I'm going to parse it out, as shown here:

def parseLine(line): 
    fields = line.split(',') 
    age = int(fields[2]) 
    numFriends = int(fields[3]) 
    return (age, numFriends)

The first thing we're going to do is split the line on commas, and that'll bust out the different fields we need:

fields = line.split(',')
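To see what split hands back, here is a quick check in plain Python, with no Spark required. The sample line is made up for illustration: the ID and name are placeholders, and only the layout (user ID, username, age, number of friends) follows the description in the text.

```python
# A made-up line in the same layout as the source data:
# user ID, username, age, number of friends.
line = "0,SomeUser,33,385"

# split(',') breaks the string at every comma and returns a list of strings.
fields = line.split(',')

print(fields)     # ['0', 'SomeUser', '33', '385']
print(fields[2])  # 33 (still a string at this point, not a number)
```

Every element of the list is a string, including the ones that look like numbers.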

I will then extract the fields that I'm interested in. If I'm just trying to figure out the number of friends by age, all I care about is the number of friends and the age. The user IDs and the usernames are irrelevant, so I'm just going to discard those. I will extract the age from field number 2, which is actually the third field because, remember, we start counting from zero. It is important to note that I'm converting it to an integer value with int because I want to treat it as a number, and that allows me to do arithmetic operations on it later:

age = int(fields[2])

Now if I didn't do that, it would just keep treating it as a string, so I wouldn't be able to do things like add the values up and divide them, which I'm going to have to do if I want to get averages at the end of the day. Similarly, in the next line, I'm going to convert the number of friends to an integer value as well, by passing it to int in between the parentheses. fields[3] will give me back a string value of some number, and int will make sure Python knows that it's a number, so that it treats it as such and I can perform arithmetic on it:

numFriends = int(fields[3])
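A small plain-Python sketch shows why the int conversion matters. The two friend counts here are just sample values, and the variable names are made up for this illustration.

```python
# Two friend counts as they would come out of split(): strings, not numbers.
friends_a = "385"
friends_b = "2"

# With strings, + means concatenation, not addition.
print(friends_a + friends_b)   # 3852

# After int(), + is ordinary arithmetic, so sums and averages work.
total = int(friends_a) + int(friends_b)
print(total)                   # 387
print(total / 2)               # 193.5
```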

The last line of parseLine is where we actually transform things into a key/value RDD. Instead of returning a single value, I'm returning a key/value pair of the age and the number of friends:

return (age, numFriends)

Mapping the lines RDD with the parseLine function therefore creates a new key/value RDD, with a key of age and a value of numFriends.
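To get a feel for why the key matters, here is a rough sketch that runs parseLine over a few lines in plain Python, without Spark, and then groups the values by key. The input lines are hypothetical (the IDs and names are placeholders), the list comprehension only stands in for lines.map(parseLine), and the dictionary grouping is an illustration rather than how Spark does it internally.

```python
from collections import defaultdict

def parseLine(line):
    fields = line.split(',')
    age = int(fields[2])
    numFriends = int(fields[3])
    return (age, numFriends)

# Hypothetical input lines; IDs and names are placeholders.
sample_lines = [
    "0,UserA,33,385",
    "1,UserB,33,2",
    "2,UserC,55,221",
]

# Plain-Python stand-in for lines.map(parseLine).
pairs = [parseLine(line) for line in sample_lines]

# Each pair unpacks into a key (age) and a value (numFriends),
# so values that share a key can be collected together.
friends_by_age = defaultdict(list)
for age, numFriends in pairs:
    friends_by_age[age].append(numFriends)

print(dict(friends_by_age))  # {33: [385, 2], 55: [221]}
```

Because both 33-year-olds share a key, their friend counts end up side by side, which is exactly what we need in order to average by age later.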

Hope you are with me so far. For example, if we transform our original data, the output will be a key/value pair RDD that contains something like this:

(33, 385)
(33, 2)
(55, 221)
...

The first user in our data had an age of 33 and 385 friends, the second user had an age of 33 and 2 friends, the third user was 55 years old and had 221 friends, and so on and so forth. This is an important concept to grasp, so go over this as many times as you need in order to let it sink in. Let's move on.