The Applied Data Science Workshop

Our First Analysis – the Boston Housing Dataset

The dataset we'll be looking at in this section is the so-called Boston Housing dataset. It contains US census data concerning houses in various areas around the city of Boston. Each sample corresponds to a unique area and has 13 measured features plus a target value. We should think of samples as rows and measures as columns. This data was first published in 1978 and is quite small, containing just 506 samples.

Now that we know something about the context of the dataset, let's decide on a rough plan for the exploration and analysis stages. If applicable, this plan will accommodate the relevant questions under study. In this case, the goal is not to answer a question, but to show Jupyter in action and illustrate some basic data analysis methods.

Our general approach to this analysis will be to do the following:

  • Load the data into Jupyter using a pandas DataFrame
  • Quantitatively understand the features
  • Look for patterns and generate questions
  • Answer those questions

Before we start this analysis, let's take a moment to set up our notebook environment.

Exercise 2.01: Importing Data Science Libraries and Setting Up the Notebook Plotting Environment

In this exercise, we'll set the stage so that we're ready to load data into the notebook. Perform the following steps to complete this exercise:

  1. If you haven't done so already, start up one of the following platforms for running Jupyter Notebooks:

    JupyterLab (run jupyter lab)

    Jupyter Notebook (run jupyter notebook)

  2. Then, open your chosen platform in a web browser by copying and pasting the URL, as prompted in the Terminal.
  3. Import pandas and pull up the docstring for a pandas DataFrame, the main data structure you will use to store tables and run calculations on them. This should be familiar to you from Activity 1.01, Using Jupyter to Learn about Pandas DataFrames, from Chapter 1, Introduction to Jupyter Notebooks, where you learned about some basics of DataFrames:

    import pandas as pd

    pd.DataFrame?

    The output is as follows:

    Figure 2.1: The docstring for pd.DataFrame

  4. Import seaborn and pull up examples of the plot functions you have access to:

    import seaborn as sns

    sns.*?

    You'll see familiar charts in the list, such as barplot, boxplot, and scatterplot, as follows:

    Figure 2.2: The output of running sns.*?

  5. Load the library dependencies and set up the plotting environment for the notebook.

    Note

    You can type these libraries into new cells and play with the Jupyter tab completion features.

    The code for importing the libraries is as follows:

    # Common standard libraries

    import datetime

    import time

    import os

    # Common external libraries

    import pandas as pd

    import numpy as np

    import sklearn # scikit-learn

    import requests

    from bs4 import BeautifulSoup

    # Visualization libraries

    %matplotlib inline

    import matplotlib.pyplot as plt

    import seaborn as sns

    Libraries do not need to be imported at the top of the notebook, but it's good practice to do this in order to summarize the dependencies in one place. Sometimes, though, it makes sense to load things midway through the notebook, and that is completely fine.

    Note

    To access the source code for this specific section, please refer to https://packt.live/2N1riF2.

    You can also run this example online at https://packt.live/37DzuVK.

  6. Set the plot's appearance, as follows:

    %config InlineBackend.figure_format='retina'

    sns.set() # Revert to matplotlib defaults

    plt.rcParams['figure.figsize'] = (9, 6)

    plt.rcParams['axes.labelpad'] = 10

    sns.set_style("darkgrid")

    Note

    It's easy to run into issues of non-reproducibility when working in a notebook by running cells out of order. For this reason, you can keep an eye on the cell run count, which is listed to the left of the cell in square brackets and increments by one each time a cell is run.

    Consider, for example, if you import pandas (import pandas as pd) halfway down the notebook. Since Jupyter Notebooks share memory between cells, you can use pandas in any cell, even those above the import. Now, say you did this and wrote some pandas code above the import cell. In this case, if you were to rerun the notebook from scratch (or send it to someone else to run), it would throw a NameError exception and stop executing when it hits that cell, because the notebook would be attempting to execute pandas code before importing the library.

    A similar issue can appear when transforming data, such as rerunning charts on data that was transformed later in the notebook. In this case, the solution is to be careful about the order in which you run things. Try to rerun your notebook from scratch occasionally if time permits so that you can catch errors like this before they creep too far into your analysis.
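    As a tiny, hypothetical illustration (these cells are not part of the exercise), imagine a notebook whose cells appear in this order from top to bottom:

    # Cell 1 - uses pandas, relying on an import that only appears further down

    df_example = pd.DataFrame({'price': [1.0, 2.5, 3.0]})

    # Cell 2 - the import arrives too late for a clean top-to-bottom run

    import pandas as pd

    Run interactively out of order, this works because the cells share memory. Rerun from scratch with Restart Kernel and Run All, however, Cell 1 fails with a NameError because pd has not been defined yet.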

Now that we've loaded our external libraries, we can move on to the analysis. Please keep the notebook open and carry on to the next exercise to continue running through the notebook.

Loading the Data into Jupyter Using a pandas DataFrame

Often, data is stored in tables, which means it can be saved as a comma-separated values (CSV) file. This format, and many others, can be read into Python as a DataFrame object using the pandas library. Other common formats include tab-separated values (TSV), Structured Query Language (SQL) tables, and JavaScript Object Notation (JSON) data structures. Indeed, pandas has support for all of these.

We won't need to worry about loading data like this right now, since our first dataset is available directly through scikit-learn; however, we will see some examples of loading and saving these types of data structures later in this book.
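As a quick sketch of what loading those formats typically looks like (the file and table names here are hypothetical), pandas provides a family of read_* functions:

    import sqlite3

    import pandas as pd

    df_csv = pd.read_csv('data.csv') # comma-separated values

    df_tsv = pd.read_csv('data.tsv', sep='\t') # tab-separated values

    df_json = pd.read_json('data.json') # JSON records

    with sqlite3.connect('data.db') as conn: # SQL table via a database connection

        df_sql = pd.read_sql('SELECT * FROM my_table', conn)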

Note

An important step after loading data for analysis is to ensure that it's clean. For example, we would generally need to deal with missing data and ensure that all the columns have the correct data types. The dataset we'll be using in this section has already been cleaned, so we won't need to worry about this. However, we'll see an example of data that requires cleaning in Chapter 3, Preparing Data for Predictive Modeling, and explore techniques for dealing with it.

Exercise 2.02: Loading the Boston Housing Dataset

To get started with the Boston Housing dataset, you will load the table into a pandas DataFrame and look at its fields and some summary statistics. Perform the following steps to complete this exercise:

  1. Load the Boston Housing dataset from the sklearn.datasets module using the load_boston method:

    from sklearn import datasets

    boston = datasets.load_boston()

  2. Check the data structure type, as follows:

    type(boston)

    The output is as follows:

    sklearn.utils.Bunch

    The output indicates that it's a scikit-learn Bunch object, but you still need to get some more information about this to understand what you are dealing with.

  3. Import the base object from scikit-learn utils and print the docstring in your notebook:

    from sklearn.utils import Bunch

    Bunch?

    The output is as follows:

    Figure 2.3: The docstring for sklearn.utils.Bunch

  4. Print the field names (that is, the keys to the dictionary) as follows:

    boston.keys()

    The output is as follows:

    dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])

  5. Print the dataset description contained in boston['DESCR'] as follows:

    print(boston['DESCR'])

    Note that you explicitly print the field value in this call so that the notebook renders the content in a readable format, rather than the raw string representation you would get by simply typing boston['DESCR'] without wrapping it in a print statement. By doing this, we can see the dataset information as summarized previously:

    Figure 2.4: Description of the Boston dataset

    Briefly read through the feature descriptions, or summarize them in your own words. Of particular importance here are the feature descriptions (under Attribute Information), which you will use as a reference during your analysis. For the purposes of this book, the most important fields to understand are RM, AGE, LSTAT, and MEDV.

  6. Now, create a pandas DataFrame that contains the data. This is beneficial for a few reasons: all of our data will be contained in one object, there are useful and computationally efficient DataFrame methods we can use, and other libraries, such as seaborn, have tools that integrate nicely with DataFrames.

    In this case, you will create your DataFrame with the standard constructor method. As a reminder of how this is done, pull up the docstring one more time:

    import pandas as pd

    pd.DataFrame?

    The output is as follows:

    Figure 2.5: The docstring for pd.DataFrame

    The docstring reveals the DataFrame input parameters. You want to feed in boston['data'] for the data and use boston['feature_names'] for the headers.

  7. Print the data, as follows:

    boston['data']

    The output is as follows:

    Figure 2.6: Printing the data

  8. Print the shape of the data, as follows:

    boston['data'].shape

    The output is as follows:

    (506, 13)

  9. Print the feature names, as follows:

    boston['feature_names']

    The output is as follows:

    array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS',

           'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')

    Looking at the output, you can see that your data is in a 2D NumPy array. Running the boston['data'].shape command returns the length (number of samples) and the number of features as the first and second outputs, respectively.

  10. Load the data into a pandas DataFrame, df, by running the following code:

    # Load the data

    df = pd.DataFrame(data=boston['data'], \

                      columns=boston['feature_names'],)

    In machine learning, the variable that is being modeled is called the target variable; it's what you are trying to predict, given the features. For this dataset, the suggested target is MEDV, which is the median house value in thousands of dollars.

  11. Check the shape of the target, as follows:

    boston['target'].shape

    The output is (506,).

    You can see that it has the same length as the features, which is expected. Therefore, it can be added as a new column to the DataFrame.

  12. Add the target variable to df, as follows:

    df['MEDV'] = boston['target']

  13. Move the target variable to the front of df, as follows:

    y = df['MEDV'].copy()

    del df['MEDV']

    df = pd.concat((y, df), axis=1)

    This is done to distinguish the target from our features by storing it at the front of our DataFrame.

    Here, you've introduced a dummy variable, y, to hold a copy of the target column before we remove it from the DataFrame. Then, you used the pandas concatenation function to combine it with the remaining DataFrame along the 1st axis (as opposed to the 0th axis, which combines rows).

    Note

    You will often see dot notation used to reference DataFrame columns. For example, previously, we could have done y = df.MEDV.copy(). While this can work well for accessing column data, it cannot be used generally in place of the square bracket notation, such as when you have column names with spaces. Additionally, you cannot delete columns using dot notation. For example, del df.MEDV would raise an error. The correct way to delete the MEDV column would be to run del df['MEDV'].
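    As a small, hypothetical illustration of the difference (the deletion lines are left commented out so that you don't drop the column again here):

    y = df.MEDV.copy() # dot notation works for reading a simple column name

    y = df['MEDV'].copy() # equivalent, and works for any column name

    # del df.MEDV would raise an error; the correct form is:

    # del df['MEDV']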

  14. Implement df.head() or df.tail() to glimpse the data and len(df) to verify that the number of samples is what we expect. Run the next few cells to see the head, tail, and length of df:
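    For example, the cells might look like the following (passing 10 explicitly, since the figure shows the first and last 10 rows):

    df.head(10)

    df.tail(10)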

    Figure 2.7: The first 10 and last 10 rows of the Boston Housing dataset

  15. Verify that the number of samples is what you expect, as follows:

    len(df)

    The output is 506.

    Each row is labeled with an index value, as seen in bold on the left-hand side of the table. By default, these are a set of integers starting at 0 and incrementing by one for each row.

  16. Check the data type contained within each column, as follows:

    df.dtypes

    The output is as follows:

    Figure 2.8: Data types of each column in the Boston Housing dataset

    For this dataset, you can see that every field is a float, including the target. Looking at a sample of the target (MEDV) specifically, you can clearly see that it's distributed over a continuous range, as would be expected of a float data type. This makes sense, considering the field represents the median house value of each record. Given this information, you can determine that predicting this target variable is a regression problem.
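    For instance, a quick (hypothetical) cell like the following pulls a few target values at random so that you can see the continuous spread for yourself:

    df['MEDV'].sample(5, random_state=0)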

  17. You can use df.isnull() to check how much cleaning the dataset needs, since pandas automatically marks missing data as NaN values. To get the number of NaN values per column, execute the following code:

    df.isnull().sum()

    The output is as follows:

    Figure 2.9: Number of missing records in each column of the Boston Housing dataset

    df.isnull() returns a Boolean frame of the same length as df.

    For this dataset, you can see that there are no NaN values, which means you have no immediate work to do in terms of cleaning the data and can move on.

  18. Remove some columns, as follows:

    for col in ['ZN', 'NOX', 'RAD', 'PTRATIO', 'B']:

        del df[col]

    This is done to simplify the analysis. We will focus on the remaining columns in more detail when we dive into exploring the data.

    Note

    To access the source code for this specific section, please refer to https://packt.live/2N1riF2.

    You can also run this example online at https://packt.live/37DzuVK.

Data Exploration

Since this is an entirely new dataset that we've never seen before, the first goal here is to understand the data. We've already seen the textual description of the data, which is important for qualitative understanding. Now, we'll compute a quantitative description.

Exercise 2.03: Analyzing the Boston Housing Dataset

In this exercise, you will learn more about the dataset from a top-down perspective, starting with summary metrics and then digging into more details of particular columns. You will explore relationships between the various fields, plot out visualizations to increase our understanding, and look into questions that arise. Perform the following steps to complete this exercise:

  1. Check some statistics of the DataFrame, as follows:

    df.describe().T

    The output is as follows:

    Figure 2.10: Summary statistics for the Boston Housing dataset

    The preceding command computes various properties, including the mean, standard deviation, minimum, and maximum for each column. This table gives us a high-level idea of how everything is distributed. Note that you have taken the transpose of the result by adding a .T suffix to the output; this swaps the rows and columns.

    Going forward with the analysis, you will specify a set of columns to focus on.

  2. Define a specific set of columns, as follows:

    cols = ['RM', 'AGE', 'TAX', 'LSTAT', 'MEDV']

  3. Display this subset as the columns of the DataFrame, as follows:

    df[cols].tail()

    The output is as follows:

    Figure 2.11: A subset of columns from the Boston Housing dataset

    Recall what each of these columns represents. From the dataset documentation, you have the following:

    RM: Average number of rooms per dwelling

    AGE: Proportion of owner-occupied units built prior to 1940

    TAX: Full-value property tax rate per $10,000

    LSTAT: Percentage of the population that is classified as "low status"

    MEDV: Median value of owner-occupied homes in $1000s

    To look for patterns in this data, you can start by calculating the pairwise correlations using pd.DataFrame.corr.

  4. Calculate the pairwise correlations for your selected columns, as follows:

    df[cols].corr()

    The output of this command is as follows:

    Figure 2.12: Pairwise correlation scores between select columns

    This resulting table shows the correlation score between each pair of columns. Large positive scores indicate a strong positive (that is, in the same direction) correlation. As expected, you can see maximum values of 1 on the diagonal.

    By default, pandas calculates the standard correlation coefficient for each pair of columns, which is also called the Pearson coefficient. This is defined as the covariance between two variables, divided by the product of their standard deviations:

    Figure 2.13: The Pearson correlation coefficient equation

    Here, you should think of X as one column and Y as another. The standard deviations are calculated in the usual way: by averaging the squared differences between each data point and the mean of its column, and then taking the square root.

    The covariance, in turn, is defined as follows:

    Figure 2.14: The covariance equation

    Here, n is the number of records (that is, the number of rows in the table), xi and yi correspond to the individual values of each record being summed over, and x̄ and ȳ correspond to the average values of the records for columns X and Y, respectively.
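    For reference, the two equations shown in the preceding figures can be written out as follows (standard definitions matching the descriptions above; using n - 1 instead of n gives the sample versions and leaves the correlation unchanged):

    r_{XY} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \, \sigma_Y}

    \mathrm{cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})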

    Returning to the analysis, you will now visualize the correlation coefficients you just calculated using a heatmap, which is much easier to read at a glance than the raw table.

  5. Import the libraries that are required to plot a heatmap and set the appearance settings as follows:

    # Visualization libraries

    import matplotlib.pyplot as plt

    %matplotlib inline

    import seaborn as sns

    # Setting plot appearance

    %config InlineBackend.figure_format='retina'

    sns.set() # Revert to matplotlib defaults

    plt.rcParams['figure.figsize'] = (9, 6)

    plt.rcParams['axes.labelpad'] = 10

    sns.set_style("darkgrid")

  6. To create the heatmap using Seaborn, execute the following code:

    ax = sns.heatmap(df[cols].corr(), \

                     cmap=sns.cubehelix_palette(20, \

                                                light=0.95, \

                                                dark=0.15),)

    ax.xaxis.tick_top() # move labels to the top

    plt.savefig('../figures/chapter-2-boston-housing-corr.png', \

                bbox_inches='tight', dpi=300,)

    Note

    To save the image using the preceding code, you will need to ensure you have a figures folder set up in your working directory. Alternatively, you can edit the path in the code to save the image to a different folder of your choice.

    The following screenshot shows the heatmap:

    Figure 2.15: Heatmap of correlation scores

    Call sns.heatmap and pass the pairwise correlation matrix as input. You will use a custom color palette here to override the Seaborn default. The function returns a matplotlib.axes object, which is referenced by the ax variable.

    The final figure is then saved as a high-resolution PNG to the figures folder.

    Note

    Each chart is exported as a PNG file using the plt.savefig function. We set the bbox_inches='tight' and dpi=300 arguments in order to save a high-quality image.

    For the final step in our dataset exploration exercise, we'll visualize our data using seaborn's pairplot function.

  7. Visualize the DataFrame using Seaborn's pairplot function, as follows:

    sns.pairplot(df[cols], plot_kws={'alpha': 0.6}, \

                 diag_kws={'bins': 30},)

    plt.savefig('../figures/chapter-2-boston-housing-pairplot.png', \

                bbox_inches='tight', dpi=300,)

    Here's the output of the visualization:

Figure 2.16: Pairplot visualization of the Boston Housing dataset

Looking at the histograms on the diagonal, you can see the following:

  • RM and MEDV have the closest shape to normal distributions.
  • AGE is skewed to the left and LSTAT is skewed to the right (this may seem counterintuitive at first, but skewness refers to the direction of the longer tail, not where most of the mass lies).
  • For TAX, a large portion of the distribution is concentrated around 700. This is also evident from the scatter plots.

Taking a closer look at the MEDV histogram on the bottom right, we can actually see something similar to TAX, where there is a large upper-limit bin around $50,000. Recall that, when we used df.describe(), the min and max of MEDV were 5k and 50k, respectively. This suggests that MEDV (which represents the median house values for each community) has been capped at 5k on the low end and 50k on the high end. This was likely done for ethical reasons, to help protect the privacy of the outlier communities.
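If you want to verify the upper cap in the notebook, a quick (hypothetical) cell like the following counts how many communities sit exactly at the maximum MEDV value:

    (df['MEDV'] == df['MEDV'].max()).sum()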

Note

To access the source code for this specific section, please refer to https://packt.live/2N1riF2.

You can also run this example online at https://packt.live/37DzuVK.

Now that we have a good start regarding exploring the data, let's turn our attention to a modeling problem offered by the dataset.

Introduction to Predictive Analytics with Jupyter Notebooks

Continuing our analysis of the Boston Housing dataset, we can see that it presents us with a regression problem. In regression, we try to predict a numerical target variable, given a set of features. In this case, we'll be predicting the median house value (MEDV) using some features seen in the pairplot from the previous exercise.

We'll train models that take only one feature as input to make this prediction. This way, the models will be conceptually simple to understand, and we can focus more on the technical details of the scikit-learn API. Then, in the next chapter, you'll be more comfortable dealing with the relatively complicated models we are going to train there.

Exercise 2.04: Training Linear Models with Seaborn and scikit-learn

You will now visualize linear model fits with Seaborn and then train linear models explicitly with scikit-learn. Perform the following steps to do so:

  1. Take a look at the pairplot that you created in the last step of Exercise 2.03, Analyzing the Boston Housing Dataset. In particular, look at the scatter plots in the bottom-left corner:

    Figure 2.17: Scatter plots for MEDV and LSTAT

    Note how the average number of rooms per dwelling (RM) and the percentage of the population that is lower class (LSTAT) are highly correlated with the median house value (MEDV). Let's pose the following question: how well can we predict MEDV given these variables?

    To help answer this, visualize the relationships using seaborn. You will draw the scatter plots along with the line of best fit linear models.

  2. To use the sns.regplot function (which stands for "regression" plot), pull up that docstring, as follows:

    sns.regplot?

    The output is as follows:

    Figure 2.18: The docstring for sns.regplot

    Read about the first few arguments and notice how the function can accept a DataFrame as input.

  3. Draw scatter plots for the linear models, as follows:

    fig, ax = plt.subplots(1, 2)

    sns.regplot(x='RM', y='MEDV', data=df, \

                ax=ax[0], scatter_kws={'alpha': 0.4},)

    sns.regplot(x='LSTAT', y='MEDV', data=df, \

                ax=ax[1], scatter_kws={'alpha': 0.4},)

    plt.savefig('../figures/chapter-2-boston-housing-scatter.png', \

                bbox_inches='tight', dpi=300,)

    The output is as follows:

    Figure 2.19: Scatter plots with linear regression trends

    The line of best fit is calculated by minimizing the ordinary least squares error function, which is something seaborn does automatically when we call the regplot function. Also, take note of the shaded areas around the lines, which represent 95% confidence intervals.

    Note

    The 95% confidence intervals rendered by the sns.regplot function, as seen in the preceding plot, represent the uncertainty in the fitted line at each point along the x axis. In practice, seaborn calculates them by bootstrapping the data, a process where new datasets are created through random sampling with replacement; the model is refit on each bootstrap sample, and the interval is taken from the spread of the resulting fits. The number of bootstrapped samples can be set manually by passing the n_boot argument.
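    For example (a sketch using the same DataFrame and seaborn's ci and n_boot parameters), you could reduce the number of bootstrap iterations or turn the interval off entirely:

    sns.regplot(x='RM', y='MEDV', data=df, n_boot=500)

    sns.regplot(x='RM', y='MEDV', data=df, ci=None)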

  4. Plot the residuals using seaborn, as follows:

    fig, ax = plt.subplots(1, 2)

    ax[0] = sns.residplot(x='RM', y='MEDV', data=df, \

                          ax=ax[0], scatter_kws={'alpha': 0.4},)

    ax[0].set_ylabel(r'MEDV residuals $(y-\hat{y})$')

    ax[1] = sns.residplot(x='LSTAT', y='MEDV', data=df, \

                          ax=ax[1], scatter_kws={'alpha': 0.4},)

    ax[1].set_ylabel('')

    plt.ylim(-25, 40)

    plt.savefig('../figures/chapter-2-boston-housing-residuals.png',\

                bbox_inches='tight', dpi=300,)

    This will result in the following chart:

    Figure 2.20: Residual charts for linear regression models

    Each point on these residual plots is the difference between the observed value (y) and the linear model prediction (ŷ). Residuals greater than zero are data points that would be underestimated by the model. Likewise, residuals less than zero are data points that would be overestimated by the model.

    Patterns in these plots can indicate suboptimal modeling. In each preceding case, you can see diagonally arranged scatter points in the positive region. These are caused by the $50,000 cap on MEDV. Both residual charts show data that's largely clustered around 0, which is an indicator of good fit. However, LSTAT appears to cluster slightly below 0, indicating that a linear model may not be the best choice. This same fact can be seen by looking at the scatter chart above this, where the line of best fit for LSTAT can be seen passing above the bulk of the points.

  5. Define a function using scikit-learn that calculates the line of best fit and mean squared error (MSE):

    from sklearn.linear_model import LinearRegression

    from sklearn.metrics import mean_squared_error

    def get_mse(df, feature, target='MEDV'):

        # Get x, y to model

        y = df[target].values

        x = df[feature].values.reshape(-1,1)

        print('{} ~ {}'.format(target, feature))

        # Build and fit the model

        lm = LinearRegression()

        lm.fit(x, y)

        msg = ('model: y = {:.3f} + {:.3f}x' \

               .format(lm.intercept_, lm.coef_[0]))

        print(msg)
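        # A sketch of the remaining lines (see the note below for the full code):

        # predict the target, then compute and print the mean squared error

        y_pred = lm.predict(x)

        error = mean_squared_error(y, y_pred)

        print('mse = {:.2f}'.format(error))

        print()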

    Note

    The complete code for this step can be found at https://packt.live/2UIzwq8.

    In the get_mse function, you assign the y and x variables to the target MEDV and the chosen feature, respectively. These are cast as NumPy arrays by calling the values attribute. The feature array is reshaped to the format expected by scikit-learn; this step is only necessary when modeling a one-dimensional feature space. The model is then instantiated and fitted on the data. For linear regression, fitting consists of computing the model parameters using the ordinary least squares method (minimizing the sum of squared errors for each sample). Finally, after determining the parameters, you predict the target variable and use the results to calculate the MSE. The MSE is simply the average of the squared errors, where each error is the difference between the observed value and the prediction.

  6. Call the get_mse function for both RM and LSTAT, as follows:

    get_mse(df, 'RM')

    get_mse(df, 'LSTAT')

    The output of calling RM and LSTAT is as follows:

    MEDV ~ RM

    model: y = -34.671 + 9.102x

    mse = 43.60

    MEDV ~ LSTAT

    model: y = 34.554 + -0.950x

    mse = 38.48

    Comparing the MSE values, the error is slightly lower for LSTAT than for RM. Looking back at the scatter plots, however, it appears that we might have even better success using a polynomial model for LSTAT. In Activity 2.01, Building a Third-Order Polynomial Model, we will test this by computing a third-order polynomial model with scikit-learn.

    Note

    To access the source code for this specific section, please refer to https://packt.live/2N1riF2.

    You can also run this example online at https://packt.live/37DzuVK.

Now that we've had an introduction to modeling data with scikit-learn, let's return to our exploration of the Boston Housing dataset. In the next section, we'll introduce some ideas around categorical features in general, and then apply these concepts to our dataset in order to explore the variable relationships in more detail.

Using Categorical Features for Segmentation Analysis

Often, we find datasets where there is a mix of continuous and categorical fields. In such cases, we can learn about our data and find patterns by segmenting the continuous variables with the categorical fields.

As a specific example, imagine you are evaluating the return on investment from an ad campaign. The data you have access to contains measures of some calculated return on investment (ROI) metric. These values were calculated and recorded daily, and you are analyzing data from the previous year. You have been tasked with finding data-driven insights on ways to improve the ad campaign. Looking at the ROI daily time series, you see a weekly oscillation in the data. Segmenting by day of the week, you find the following ROI distributions (where 0 represents the first day of the week and 6 represents the last):

Figure 2.21: A violin plot with ROI on the vertical axis and the day of the week on the horizontal axis

Since we don't have any categorical fields in the Boston Housing dataset we are working with, we'll create one by effectively discretizing a continuous field. In our case, this will involve binning the data into "low", "medium", and "high" categories. It's important to note that we are not creating a categorical data field merely to illustrate the data analysis concepts in this section; as you will see, doing this can reveal insights in the data that would otherwise be difficult to notice or altogether unavailable.

Exercise 2.05: Creating Categorical Fields from Continuous Variables and Making Segmented Visualizations

Before you get started with this exercise, take another look at the final pairplot of Exercise 2.03, Analyzing the Boston Housing Dataset, where you compared MEDV, LSTAT, TAX, AGE, and RM:

Figure 2.22: Scatter charts comparing AGE to MEDV, LSTAT, TAX, and RM

Take a look at the panels containing AGE. As a reminder, this feature is defined as the proportion of owner-occupied units built prior to 1940. We are going to convert this feature into a categorical variable. Once it's been converted, you will be able to replot this figure with each panel segmented by color according to the age category. Perform the following steps to complete this exercise:

  1. Plot the AGE cumulative distribution, as follows:

    sns.distplot(df.AGE.values, bins=100, \

                 hist_kws={'cumulative': True}, \

                 kde_kws={'lw': 0},)

    plt.xlabel('AGE')

    plt.ylabel('CDF')

    plt.axhline(0.33, color='red')

    plt.axhline(0.66, color='red')

    plt.xlim(0, df.AGE.max())

    plt.savefig('../figures/chapter-2-boston-housing-age-cdf.png',\

                bbox_inches='tight', dpi=300,)

    The output is as follows:

    Figure 2.23: Plot for cumulative distribution of AGE

    Here, you use sns.distplot to plot the distribution and set hist_kws={'cumulative': True} in order to calculate and chart the cumulative distribution. In these types of distributions, each bar counts how many values lie in the current x-axis bin and every bin to the left, thereby showing the cumulative amount at that point.

    Note

    Various properties of Seaborn charts can be tuned using x_kws, where x represents an argument name such as hist. Often, these are used to set aesthetic elements of the chart. For example, in the preceding plot, we set kde_kws={'lw': 0} in order to bypass plotting the kernel density estimate, which would have been a smooth line corresponding to the density.

    Looking at the plot, there are very few samples with a low AGE value, whereas there are far more with a very high AGE. This is indicated by the steepness of the curve on the far right-hand side.

    The red lines indicate the 1/3 and 2/3 points in the cumulative distribution. Looking at where our distribution intercepts these horizontal lines, we can see that about a third of the samples have an AGE value below 55, while another third have an AGE value above 90. In other words, a third of the housing communities have fewer than 55% of their homes built prior to 1940. These would be considered relatively new communities. At the other end of the spectrum, another third of the housing communities have over 90% of their homes built prior to 1940. These would be considered very old. You will use the places where the red horizontal lines intercept the distribution as a guide to split the feature into categories: Relatively New, Relatively Old, and Very Old.

  2. Create a new categorical feature and set the segmentation points, as follows:

    def get_age_category(x):

        if x < 50:

            age = 'Relatively New'

        elif 50 <= x < 85:

            age = 'Relatively Old'

        else:

            age = 'Very Old'

        return age

    df['AGE_category'] = df.AGE.apply(get_age_category)

    Here, you are using the very handy pandas apply method, which applies a function to a given column or set of columns. The function being applied—in this case, get_age_category—should take one argument representing a row of data and return one value for the new column. In this case, the row of data being passed is just a single value, that is, the AGE of the sample.

    Note

    The apply method is great because it can solve a variety of problems and allows for easily readable code. Often, though, vectorized operations such as the pd.Series.str accessor methods can accomplish the same thing much faster. Therefore, it's advisable to avoid apply where possible, especially when working with large datasets. We'll see some examples of vectorized methods in Chapter 3, Preparing Data for Predictive Modeling.
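    For instance, here is a vectorized sketch of the same binning (note that pd.cut uses right-inclusive bin edges by default, so values exactly at 50 or 85 land in slightly different categories than with get_age_category):

    df['AGE_category'] = pd.cut(df.AGE, bins=[0, 50, 85, 100],

                                labels=['Relatively New', 'Relatively Old',

                                        'Very Old'])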

  3. Verify the number of samples you've grouped into each age category, as follows:

    df.groupby('AGE_category').size()

    The output is as follows:

    AGE_category

    Relatively New 147

    Relatively Old 149

    Very Old 210

    dtype: int64

    Looking at the result, you can see that two class sizes are nearly equal and that the Very Old group is about 40% larger. You are interested in keeping the classes comparable in size so that each is well represented and it's straightforward to make inferences from the analysis.

    Note

    It may not always be possible to assign samples to classes evenly, and in real-world situations, it's very common to find highly imbalanced classes. In such cases, it's important to keep in mind that it will be difficult to make statistically significant claims with respect to the under-represented class. Predictive analytics with imbalanced classes can be particularly difficult. The following blog post offers an excellent summary regarding the methods for handling imbalanced classes when performing machine learning: https://svds.com/learning-imbalanced-classes/.

    Now see how the target variable is distributed when segmented by the new feature, AGE_category.

  4. Construct a violin plot, as follows:

    sns.violinplot(x='MEDV', y='AGE_category', data=df, \

                   order=['Relatively New', 'Relatively Old', \

                          'Very Old'],)

    plt.xlim(0, 55)

    plt.savefig('../figures/chapter-2-boston-housing-'\

                'age-medv-violin.png', \

                bbox_inches='tight', dpi=300,)

    The output is as follows:

    Figure 2.24: Violin plot comparing MEDV distributions by age

    The preceding violin plot shows a kernel density estimate of the median house value distribution for each age category. You can compare the distribution of each categorical level with a normal distribution and observe that all three have some degree of skewness. The Very Old group contains the lowest median house value records and has a relatively large width, whereas the other groups are more tightly centered on their average. The Relatively New group is skewed to the high end, which is evident from the enlarged right half and the position of the white dot in the thick black bar within the body of the distribution.

    This white dot represents the median, and the thick black bar spans the middle 50% of the population (the interquartile range, from the first to the third quartile). The thin black line represents the boxplot whiskers, which extend to points within 1.5 times the interquartile range beyond the quartiles. This inner visualization can be modified to show the individual data points instead by passing inner='point' to sns.violinplot().

  5. Reconstruct the violin plot by adding the inner='point' argument to the sns.violinplot call. The code for this is as follows:

    sns.violinplot(x='MEDV', y='AGE_category', data=df, \

                   order=['Relatively New', 'Relatively Old', \

                          'Very Old'], \

                   inner='point',)

    plt.xlim(0, 55)

    plt.savefig('../figures/chapter-2-boston-housing-'\

                'age-medv-violin-points.png',\

                bbox_inches='tight', dpi=300,)

    The output is as follows:

    Figure 2.25: Including points for individual samples in the violin plot

    It's good to make plots like this for test purposes in order to see how the underlying data connects to the visual. You can see, for example, how there are no median house values lower than roughly $16,000 for the Relatively New segment, and therefore, the distribution tail actually contains no data. Due to the small size of our dataset (only about 500 rows), you can see that this is the case for each segment.

  6. Reconstruct the pairplot visualization from earlier, but now include color labels for each AGE category. This is done by simply passing the hue argument, as follows:

    cols = ['RM', 'AGE', 'TAX', 'LSTAT', 'MEDV', 'AGE_category']

    sns.pairplot(df[cols], hue='AGE_category',\

                 hue_order=['Relatively New', 'Relatively Old', \

                            'Very Old'], \

                 plot_kws={'alpha': 0.5},)

    plt.savefig('../figures/chapter-2-boston-housing-'\

                'age-pairplot.png', \

                bbox_inches='tight', dpi=300,)

    The output is as follows:

    Figure 2.26: Pairplot visualization of the Boston Housing dataset, with AGE segments

    Looking at the preceding histograms, you can see that the underlying distributions of each segment appear similar for RM and TAX. The LSTAT distributions, on the other hand, look more distinct. You can focus on them in more detail by using a violin plot.

  7. Reconstruct a violin plot to compare the LSTAT distributions for each AGE_category segment, as follows:

    sns.violinplot(x='LSTAT', y='AGE_category', data=df, \

                   order=['Relatively New', 'Relatively Old', \

                          'Very Old'],)

    plt.xlim(-5, 40)

    plt.savefig('../figures/chapter-2-boston-housing-'\

                'lstat-violin.png',\

                bbox_inches='tight', dpi=300,)

    The output is as follows:

Figure 2.27: Violin plot comparing LSTAT distributions by age

Unlike the MEDV violin plot, where each distribution had roughly the same width, here you can see the width increasing along with AGE. Communities with primarily old houses (the Very Old segment) contain anywhere from very few to many lower class residents, whereas Relatively New communities are much more likely to be predominantly higher class: over 95% of their samples have a smaller lower class percentage than is typical of the Very Old communities. This makes sense, because Relatively New neighborhoods would be more expensive.

Note

To access the source code for this specific section, please refer to https://packt.live/2N1riF2.

You can also run this example online at https://packt.live/37DzuVK.

Activity 2.01: Building a Third-Order Polynomial Model

Previously in this chapter, you used scikit-learn to create linear models for the median house value (MEDV) as a function of RM and LSTAT, independently. Now, you would like to build a third-order polynomial model and compare it against the linear one. Recall the actual problem we are trying to solve: predicting the median house value, given the lower class population percentage. This model could benefit a prospective Boston house buyer who cares about how much of their community would be lower class.

The aim here is to use scikit-learn to fit a polynomial regression model to predict the median house value (MEDV), given the LSTAT values, and to build a model that has a lower MSE. In order to achieve this, the following steps have to be executed:

  1. Load the necessary libraries and set up the plot settings for the notebook.

    Note

    While completing this activity, you will need to use many cells in the notebook. Please insert new cells as required.

  2. Load the Boston Housing dataset into a pandas DataFrame, as you did earlier in this chapter. Recall that you accessed the data from a built-in scikit-learn dataset. Don't forget to add the target variable column called MEDV.
  3. Pull out the feature and target variable from df. Assign the target to the y variable and the feature to the x variable. Make sure that x has the proper shape for training scikit-learn models. It should resemble something like this: [[x1], [x2], [x3], ...].
  4. Verify that x has the proper shape by printing the first three samples.
  5. Import the PolynomialFeatures class from scikit-learn's preprocessing module. Then, instantiate it with degree=3. After this has been done, display your instantiated object in the notebook.
  6. Transform the LSTAT feature (assigned to the x variable) by running the fit_transform method. Assign the newly created polynomial feature set to the x_poly variable.
  7. Verify what x_poly looks like by printing the first few items. They should each have a dimensionality of 4, in order to represent polynomials of degrees 0 through 3.
  8. Import the LinearRegression class from scikit-learn and train a linear regression model in the same way as you did when calculating the MSE earlier. When instantiating the model, set fit_intercept=False and assign it the name clf.
  9. Extract the model coefficients using the coef_ attribute of clf. Use these coefficients to write out the equation of our polynomial model (that is, y = a + bx + cx² + dx³).
  10. Determine the predicted values for each sample and then use these values to calculate the residuals.
  11. Print the first 10 residuals. You should see some greater than 0 and some less than 0.
  12. Calculate the MSE for your third-order polynomial model.
  13. Create a plot of the polynomial model, as a smooth line, overlaying a scatter chart of the samples.
  14. Plot the residuals as a scatter chart. Compare this to the linear model residual chart we plotted earlier for this feature. The polynomial model should yield a better fit.

    Note

    The detailed steps, along with the solution to this activity, are presented on page 266.