## The one where we correct overfitting

This article is a continuation of my series on linear regression and bootstrap and Bayesian statistics.

Previously I talked at length about linear regression, and now I am going to continue that topic. As I hinted at previously, I am going to bring up the topic of regularization. And what regularization does, is it simplifies the model. And the reason why you want to have a simpler model is that, usually, a simpler will model will perform better in most if not pretty much all of the tests that can be run.

N.B: It is not kosher to use training data as the test data. You only clearly see the full effects of regularization when you test your models with test data, which is data that the model has not yet seen.

My writing is primarily coming from the statistical modeling of things with less in the machine learning effectiveness. For simplicity, many of the following examples break this rule and we evaluate our models using training data and not test data.

As such,** the coefficients of determinations in our models are not truly representative of the models' effectiveness, rather they are only correlation coefficients**.

## Finding the best model

Model selection — sometimes referred to as feature selection, is when we select the best model for the data we have using different features/attributes/columns of available data. Choosing different features for your model essentially creates a different model.

Either term is fine. After choosing our model we will then regularize it. Regularize is to simplify it: **Regularization = Simplification**.

The reason why we want a simpler model is that we want to avoid a phenomenon called overfitting or over-parameterization. Overfitting or over parameterization is a phenomenon that occurs very often, especially if you have a lot of parameters.

There are a variety of regularization techniques to use, including:

- Stepwise
- Ridge regression
- Lasso
- PCR — PCA + regression (principal component analysis and regularization)

PCR is the most important technique to use for regularization.

## Model fitness

Broadly speaking, a good model “fits the data well.” However, there is nuance to this as there a**re different uses for different models**. If you have a data set and you want to know if this data set can be modeled; if you can use a mathematical formulation to explain the model, the data that you have — and this is all you want to do with it- then it is very good and legitimate to use that same data and evaluate your model using that data, or evaluating our model with our training data.

However, if you want to be able to do any kind of prediction and determine how powerful this model is at predicting something — and that is usually what people want to do with a model: they want to actually make use of it — then **we need to test the model with test data**.

A good model needs to generalize well. For example, it needs to generalize from your training data to your test data. Consider the following examples:

**Overfitting**: model works great for training data but does not generalize to real world**Underfitting**: model is too simple, generalizable but not meaningful. Less variance but more bias towards the wrong answers

Why do I call the left-most image a great fit, the one that is just right? Well, because when we eyeball this, the black line seems to explain those blue dots pretty well. If we were to do a simple linear regression, we might end up with something like the line on the center image. And we would be calling that underfitting. We would have a lack of variance in our model, and we would have too much bias in our model; that is what we would say.

Now, on the right-hand side, we are overfitting. Here, we have too much variance in the model. So the model variance is pretty clear because it’s going up and down, jumping all around. That variance is helping with the fitting of this particular data set. But we call this overfitting, we say it is fitting the data too well. You might ask well, how/why is that a bad thing? Well, it is a bad thing because if I were to come along and put in test data, the test data probably would fit worse than either of the other two models. Because the test data would not be exactly at these peaks or troughs of these maximas or minimas of this black line.

## Overfitting

Over-parameterization leads often to overfitting and it also carries some other problems with it:

- Number of parameters > dimensionality of data
- Linear dependency of input features
- Works great on the training data set but has poor (if not abysmal) performance in the real world (test data)

There is a problem with what we call linear dependency of input features. So if one of your input features can be explained by other input features in a linear manner, then we call that a linear dependency. Translated that means if you can linearly combine (that means to add some set of columns to come up with or determine another column that already is there, just through addition, subtraction, and multiplication) then that means that you have a linear dependency; that means one of the columns, one or more of the columns is just a combination of the other columns. And that leads to problems when you do calculations and determine your model.

## Prevent overfitting

The tool to prevent overfitting is regularization. There are different ways of doing regularization. From earlier, stepwise regression is one form of doing regularization. What is stepwise regression? Removing/adding features in a stepwise fashion. There are two flavors, forward and backward. Forward stepwise regression pretty much works like this: you choose what you think is your best feature. Say you have input features x_1, x_2, x_3, x_4, and so on; you choose the one that you think is best (there are a variety of ways that you could choose it.) And then you come, you will create a linear regression, based off of that feature, say it was x_2, which now is the input and y is the output. Then you select your next best, say x_4, and you try that one out with x_2. And if the model gets better, by some amount, then you keep the second x value. If it was not good, if the model performance does not improve, then you do not keep it and then you are back to having only x_2. Then you try out another one of these features. This is called forward stepwise regression.

Backward stepwise regression is the more common way of doing things. You take all of your values like x_1, x_2, x_3, x_4, and create a linear regression from those four inputs to come up with a function that explains y; that calculates/ predicts y. And then you remove one of the columns and see if the model has improved. And typically, you will fit things less well. But your model might improve based on criteria that are not just purely fitting, they might fit better, because you also put in a penalty for parameters or you use a measure called information criterion that says you need to gain so and so much information from any given input parameter. Continue removing features until the model performance stops improving. This is called backward stepwise regression.

Other regularization techniques are you put a penalty in for every column you have; a penalty for the coefficient that is associated with that column. Those are other regularization techniques.

## Measuring the model

How do we measure the model? There is the Akaike information criterion (AIC):

- k: number of estimated parameters
- L: likelihood

This is the general favorite metric to determine if a model has improved. As the AIC becomes smaller and smaller for comparable models you say that the AIC is improving and that your model is also improving. So by adding more columns, or by removing more columns, if your AIC is improving, then you know you are going in the right direction. If your AIC does not improve, that means non-decreasing, then you know that you took a step in the wrong direction; meaning the last addition of a column or the last removal of a column should be undone, you go back and try something else.

Another model metric is the Bayesian information criterion (BIC), but it is rarely used in this context — it is very similar to AIC:

- k: number of estimated parameters
- L: likelihood
- N: number of recorded measurements

Regardless of the criterion you choose, your model of best fit is the one that has the minimal AIC or BIC.

## Cost

When you determine any function, any machine learning algorithm has a so-called cost function. And this is a very fundamental concept. A cost function basically says, “how bad is your model?”

I discussed previously on linear regression, badness would be the square of the residuals. As a reminder, a residual is the difference between the predicted value and the actual value of y, of the output; then you square that, and then you sum up all those squares. This is a cost function but was not explicitly called one before.

N.B: In some literature, SSR is not the residual sum of squares, but the sum of squares regression — a different quantity. For the rest of the article, I stick with RSS to avoid confusion about the above quantity.

There are a few bells and whistles on this that we change here and there, we can divide this by the number of cases, which means the number of values that are being used — the number of data points. We can put in a constant here, but overall we want to minimize this:

That is probably one of the most common cost functions. There are others, especially for logistic regression. However, for the most part, this is our primary cost function — it might be modified in some minor ways.

A cost function calculates the penalty. So the penalty here would be the sum of those squares: calculating how bad things are. This is a penalty for a given trained modeling data set. And the idea is to minimize this penalty or to minimize this cost. And so you need a few things to watch out for:

- models need to be compared to each other
- a cost function determines the variance
- depending on the dataset, the target variable, and purpose (business question to be answered), selecting the right cost function is critical and essential

It is important that similar models have to be compared to each other. So you cannot compare the root mean square error (RMSE), which is very closely related to the cost function above, you cannot really compare that from one model to the next unless they are very, very similar models — unless there are a lot of things the same.

If you have just two different models from different sources doing different having different purposes,** comparing the RMSE will not mean anything**. These root mean squared errors, or these costs, only are understandable in terms of when they are compared to another cost of a very similar model.

## Variance and bias

A good cost function tries to balance variance and bias to make it the best combination of variance and bias.

What are variance and bias?

In the overfitting example on the right-hand side, there is a strong variance, there is too much variance; while in the center underfitting example there is a strong bias, there is too much bias.

## Cost in regression

The measure of fit is another way of saying minimizing the cost function. So far we have been minimizing the residual sum of squares. Very confusingly this can be known as SSR and RSS, I will use RSS. Another component of the total cost is the weight of our coefficients, just another way of saying coefficients.

To remind you, a coefficient in this particular situation where we, let us say we have two columns: column x_1and x_2. Well, then we would have:

Since the coefficients are part of our regression (we treated them as 1 earlier) we should include them in our total cost function. Actually, we add the square of those in, and we get another penalty for just having that coefficient.

Now if that coefficient goes to zero, it basically wipes out the column. So that way, we can actually get rid of columns. Or get rid of this penalty, or part of this penalty. Naturally, if we get rid of all the columns, and we have nothing left, then all we have is the mean of y itself.

So we need, when we add a column, we are going to say, “okay, we can add a column and we can use that information. But we are going to accept a penalty for adding that column.” **Whatever coefficient that column gets, is going to be our penalty**. That means if it is a pretty big, important column, it is going to have a big coefficient, so it had better be worth it. Because that coefficient is also going to count as a part of this so-called penalty, it will add to the total cost. And by putting this into our calculations what happens is we no longer actually look for the best fit anymore, **we look for a balance between the best fit and the smaller coefficients**. And a very, very small coefficient would be zero. Negative coefficients are also considered large.

**Total cost **= measure of fit + measure of the magnitude of coefficients:

- When measure of fit (RSS)is small = good fit
- When the magnitude of the coefficients (sum of squared error) is small = good fit

This weight (w²)is actually called the L_2 norm if you know your linear algebra. All that matters for this story is just thinking about that as the square of the coefficients, sometimes called the square of the weights, but it is a coefficient.

From a previous article, we introduced epsilon, the error term. The terms m and b are coefficients (slope and y_intercept)

It looks sort of like we have only one input column, namely x and one, coefficient b, but actually, depending on how we think about this, x could be a whole vector of columns, and could be a whole vector of coefficients.

## Feature selection

The process of selecting a subset of features that are good predictors of the target is known as feature selection. As mentioned above, it is another term for model selection. The reason the two are the same is if you have a different set of features you actually have a different model.

It is useful for:

- controlling complexity of the model
- improving model performance
- improving generalization capability
- speeding up model learning w/o reducing accuracy

My personal preference is to use all the features, as much as I can, and then once I have all the features I can whittle down and find only the important features. And so I want to get rid of as many features as possible. Why do I want to do this? I want to have a simpler model. Why do I want a simpler model? Well, because a simpler model generalizes better. What does generalization mean? **It means that you can actually use it in the real world**. The model does not just explain the training data but it also has a better chance of explaining data that we have not seen before. First your test data and then the data that you’ve never seen before, namely your operational data.

As mentioned earlier stepwise regression is an important regularization tool. Backward is the most common, as in starting with all the features and removing columns that explain the least variance until model performance peaks. Forward is starting from just the mean and adding features until model performance peaks. Then there is “both” — at each step check whether add a feature or remove a feature. Say you are going forward, you added a column. When you go backward. Now from that point, it may not be that the best column to remove would be the one you just added. It could be another one. The reason for that is, is that there are complex interactions between the columns that often cannot be predicted. This may not be the case in the kind of linear regressions that I am doing. But in other types of regressions: trees and neural nets, that certainly can be the case.

I venture to say everyone uses stepwise regression. Now, this does not mean you only use stepwise regression. Other types of regularization methods include ridge, lasso, and something called elastic net, which is a combination of ridge and lasso. And then I on top of that, I do the stepwise regression. So I think using a combination of these is the typical way that one progresses, that one works the best.

The following section walks through an example of stepwise regression using Galton’s height dataset, the same dataset I used in my bootstrap resampling article.

# importsimport pandas as pd

import numpy as np

import statsmodels.formula.api as sm

import seaborn as sns

import matplotlib.pyplot as plt%matplotlib inline# load data# I previously saved the Galton data as data

family_data = pd.read_csv(data, delimiter=’\t’)# column labels

family_data.columns = [“family”,”father”,”mother”,”gender”,”childHeight”, “kids”]# check data is well-imported

family_data.head()

`# check data types`

family_data.dtypes

# subset the data with a Boolean flag, capture male children

isMale = family_data.loc[:,”gender”] == “M”# create just the male children df

male_only = family_data[isMale].copy()

male_only.head()

`print(‘Number of rows: {}, Number of Males: {}’.format(len(family_data), len(male_only)))`

# create new df for new feature

male_df = male_only.copy()# add in squares of mother and father heights

male_df[‘father_sqr’] = male_df[‘father’] **2

male_df[‘mother_sqr’] = male_df[‘mother’] **2# drop columns family, gender, kids

Drop = [“family”, “gender”, “kids”]

for x in Drop:

male_df = male_df.drop(x, axis=1)# reset index

male_df=male_df.reset_index(drop=True)

male_df.head()

N.B: The childHeight feature is just for sons, we copied and captured just the sons earlier. We do PCR analysis with daughters later.

Next, we will z-score standardize the data. Why do we want to standardize the data? Larger values, like father_sqr, will have small coefficients, while smaller values will have large coefficients. This is not “fair”. To make it fair we scale the data to make them on par with each other — and one way to do that is z-score standardization whereby we subtract from the height the mean of childHeight and divide by the standard deviation of childHeight. This is the z-score:

# scale all columns but the individual height (childHeight)

# normalization function for a column in a pandas df

def scale(col):

mean_col = np.mean(col)

sd_col = np.std(col)

std = (col — mean_col) / sd_col

return std# add scaled x to data frame.

male_df[‘father’] = scale(male_df[‘father’])

male_df[‘mother’] = scale(male_df[‘mother’])

male_df[‘father_sqr’] = scale(male_df[‘father_sqr’])

male_df[‘mother_sqr’] = scale(male_df[‘mother_sqr’])

male_df.head()

Note that the first father's height is four. Four what? Four standard deviations away from the mean, which is pretty tall.

## Compute model with all features

The following code computes the height of all male children using all available features. Up to this point, this is all I have shown, but what is special about today is that now we can find features that we can toss out. Why do we want to find features we can toss out? We want to make the model simpler. Why do we want to make the model simpler? It turns out that simpler models are better models.

N.B: the ‘`+ 1`

’ in `ols_model`

means that there is going to be an offset. `sklearn`

does this automatically. There is always an intercept, another term for the offset. Since everything is scaled the intercept, in this case, is the mean height of the children.

ols_model = sm.ols(formula = ‘childHeight ~ father + mother + father_sqr + mother_sqr + 1’, data=male_df)results = ols_model.fit()

n_points = male_df.shape[0]

y_output = male_df[‘childHeight’].values.reshape(n_points, 1)

print(‘Intercept, Slopes : \n{}’.format(results.params))# print hypothesis test stats

print(‘Intercept t-value, Slope t-values: \n{}’.format(results.tvalues))

print(‘\nHypothesis test summary for each coefficient if they differ from zero:’)

print(results.pvalues)print(‘\nSSE, SST, SSR, and RMSE:’)

mean_y = np.mean(y_output)

sst = np.sum((y_output — mean_y)**2)

sse = sst — results.ssr

print(‘SSE: {}’.format(sse))

print(‘SST: {}’.format(sst))

print(‘SSR: {}’.format(results.ssr))

print(‘RMSE: {}’.format(np.sqrt(results.mse_resid))) # ESHx# print summary stats

print(results.summary())# plot a histogram of the residuals

sns.distplot(results.resid, hist=True)

plt.xlabel(‘Residual’)

plt.ylabel(‘Frequency’)

plt.title(‘Residual Histogram’)

plt.show()

Now we have a baseline model, can we improve it with regularization techniques?

## Apply stepwise regression

From earlier, I defined the metric to use to define the fit of our model: AIC

- k: number of estimated parameters
- L: likelihood

We can apply stepwise regression with this function, using backward feature selection:

def backward_selected(data, response):

“””Linear model designed by backward selection. Feature selection based on AICParameters:

— — — — — -

data : pandas dfwith all possible predictors and responseresponse: string, name of response column in dataReturns:

— — — —

model: an “optimal” fitted statsmodels linear model

with an intercept

selected by backward selection

“””

# begin with all factors and intercept

possible_factors = set(data.columns)

possible_factors.remove(response)

formula = “{} ~ {} + 1”.format(response, ‘ + ‘.join(possible_factors))

best_aic = sm.ols(formula, data).fit().aic

current_aic = best_aic# create a non-empty set of columns that will be labeled as “to remove and try”

to_try_remove = possible_factors# check which features remain

while to_try_remove and current_aic == best_aic:

aic_candidates = []

for candidate in to_try_remove:columns = possible_factors — set([candidate])

# removing the candidate column

formula = “{} ~ {} + 1”.format(response, ‘ + ‘.join(columns))

# print AIC

aic = sm.ols(formula, data).fit().aic# append tuple of the form (aic, response)

aic_candidates.append((aic, candidate))# sort all the pairs by the first entry of tuple

aic_candidates.sort()

# change sort/pop order!

best_new_aic, best_candidate = aic_candidates.pop(0)# check if we have something better:

if best_new_aic < current_aic:

# Remove the best candidate’s name from possible_factorspossible_factors.remove(best_candidate)

current_aic = best_new_aic# repeat the process with all the remaining candidate columns

# final formula

formula = “{} ~ {} + 1”.format(response, ‘ + ‘.join(possible_factors))

# model object

model = sm.ols(formula, data).fit()

return modelbackwards_model = backward_selected(male_df, ‘childHeight’)print(backwards_model.model.formula)print(‘Adjusted R-Squared: {}’.format(backwards_model.rsquared_adj))

print(‘AIC: {}’.format(backwards_model.aic))

The backward selected formula `childHeight ~ father + mother + mother_sqr + 1`

is now fed into a standard `sm`

template:

ols_model_forward = sm.ols(formula = ‘childHeight ~ father + mother + mother_sqr + 1’, data=male_df)

results = ols_model_forward.fit()

n_points = male_df.shape[0]

y_output = male_df[‘childHeight’].values.reshape(n_points, 1)# slope (m) and y-intercept (b)

print(‘Intercept, Slopes : \n{}’.format(results.params))# t-values (hypothesis test statistics)

print(‘Intercept t-value, Slope t-values: \n{}’.format(results.tvalues))# p-values for above t-value statistics

print(‘\nHypothesis test summary for each coefficient if they differ from zero:’)

print(results.pvalues)print(‘\nSSE, SST, SSR, and RMSE:’)

mean_y = np.mean(y_output)

sst = np.sum((y_output — mean_y)**2)

sse = sst — results.ssr

print(‘SSE: {}’.format(sse))

print(‘SST: {}’.format(sst))

print(‘SSR: {}’.format(results.ssr))

print(‘RMSE: {}’.format(np.sqrt(results.mse_resid))) # ESHx

print(results.summary())# plot a histogram of the residuals:

sns.distplot(results.resid, hist=True)

plt.xlabel(‘Residual’)

plt.ylabel(‘Frequency’)

plt.title(‘Residual Histogram’)

plt.show()

N.B: Stepwise regression appears to be a simple method for feature selection; be aware that **stepwise regression does not scale well**. As with any multiple comparison method, stepwise regression suffers from a high probability of false-positive results. In this case, a feature that should be dropped might not be, because of a low p-value or AIC.

There is a lot of linear algebra that underlies PCR that I have omitted for brevity. The trick is to do a PCA, a principal component analysis. The PCA will help you determine which of the principal components are the best. This is done by creating linear combinations of the initial variables and creating a new one that explains the most amount of variance. Then another one is created that explains slightly less variance, and so on and so forth.

From the models point of view, we have regularized it. No information has been thrown away, every PC contains linear combinations of all features. But the model sees new features that make it easy to determine which features should be thrown away.

N.B: This relies on SVD behind the scenes.

The transformations in PCA create linearly independent features with an optimal amount of independence between them. PC’s with small amounts of variance are likely unimportant to our model and thus we have made our model simpler by removing features with minuscule variance.

First I will demonstrate PCR before applying it to the Galton dataset:

from sklearn.decomposition import PCA

from sklearn.preprocessing import StandardScalernp.random.seed(21)# right-multiply a 2x2 matrix to a 2x100 matrix to transform the points

X = np.dot(np.random.randn(100, 2), np.random.rand(2, 2))

# scale data

X = StandardScaler().fit_transform(X)plt.scatter(X[:, 0], X[:, 1])

plt.grid()

plt.xlabel(‘x’)

plt.ylabel(‘y’)

plt.title(‘Generated x-y data with dependence’)

plt.show()

Now we will create linear combinations of variables to create PC1 and PC2 — principal component one and principal component two. PC1 will have the most amount of variance. Remember that there is a 1:1 mapping of input features to created principal components in PCA

pca = PCA(n_components=2)

pca_result = pca.fit_transform(X)

pca_df = pd.DataFrame(data = pca_result , columns = [‘pc1’, ‘pc2’])print(pca_df.head())

print(pca_df.shape)

Plot the output

`plt.axis(‘equal’) `

plt.scatter(pca_df.loc[:, ‘pc1’], pca_df.loc[:, ‘pc2’])

plt.ylabel(‘Component 2’)

plt.xlabel(‘Component 1’)

plt.title(‘Two Dimensional PCA’)

plt.grid()

plt.show()

PC1 and PC2 are linear combinations of the original features with some weightings attached in our new coordinate system (PC1 and PC2) as opposed to x and y.

`print(pca.explained_variance_)`

print(pca.components_)

exp_var = pca.explained_variance_

components = pca.components_# plot vector

# = plot scale * component direction * component length

v1 = 2 * components[0] * np.sqrt(exp_var[0])

v2 = 2 * components[1] * np.sqrt(exp_var[1])

c = (0, 0) # Center is at 0,0 because of our standardizationplt.scatter(X[:, 0], X[:, 1], alpha=0.5)

plt.annotate(‘’, c + v1, c, arrowprops={‘arrowstyle’: ‘->’, ‘shrinkA’: 0,

‘shrinkB’: 0, ‘linewidth’: 3})

plt.annotate(‘’, c + v2, c, arrowprops={‘arrowstyle’: ‘->’, ‘shrinkA’: 0,

‘shrinkB’: 0, ‘linewidth’: 3})

plt.axis(‘equal’)

plt.grid()

plt.title(‘F-vector functions (Principal Components)’)

plt.xlabel(‘x’)

plt.ylabel(‘y’)

plt.show()

From PCR we have created two linearly independent features, as shown in the graph above. Because PC1 explains so much more variance than PC2, this model can likely be simplified by removing PC2.

This can be extended to higher-dimensional datasets. By calculating the principal components we have removed any multi-collinearity issues in the data.

Now, let us use PCR on the Galton dataset — looking at daughters. Since there are four features in the df, there will be four principal components after PCA.

# subset the data with a Boolean flag to capture daughters

isFemale = family_data.loc[:,”gender”] == “F”# create just the female df

female_data = family_data[isFemale].copy()# create new df for new feature set

female_df = female_data.copy()# feature engineer squares of mother and father heights

female_df[‘father_sqr’] = female_df[‘father’] **2

female_df[‘mother_sqr’] = female_df[‘mother’] **2# drop columns for family, gender, kids

Drop= [“family”, “gender”, “kids”]

for x in Drop:

female_df = female_df.drop(x, axis=1)# reset the index

# add scaled x to df

female_df=female_df.reset_index(drop=True)

female_df[‘father’] = scale(female_df[‘father’])

female_df[‘mother’] = scale(female_df[‘mother’])

female_df[‘father_sqr’] = scale(female_df[‘father_sqr’])

female_df[‘mother_sqr’] = scale(female_df[‘mother_sqr’])

female_df.head()

# calculate all the principal components (4)

X = female_df[[‘father’, ‘mother’, ‘father_sqr’, ‘mother_sqr’]].values

y = female_df[‘childHeight’]

pca = PCA()

pca_result = pca.fit_transform(X)

pca_df = pd.DataFrame(data = pca_result, columns=[‘pc1’, ‘pc2’, ‘pc3’, ‘pc4’])# our data projected onto the four principal components.

print(pca_df.head())

print(pca_df.shape)pca_df[‘childHeight’] = female_df[‘childHeight’]

Scree plot of variance:

`plt.plot([i + 1 for i in range(4)], pca.explained_variance_)`

plt.title(‘Scree plot for PCA decoposition’)

plt.xlabel(‘Component’)

plt.ylabel(‘Variance explained’)

plt.show()

`pca.explained_variance_`

Now that we have all our data projected on our 4 total principal components, we can look at the explained variance. The first two PC’s explain the most variance and thus we can likely drop PC3 and PC4 since their explained variance is so low.

pcr_model = sm.ols(formula = ‘childHeight ~ pc1 + pc2 + pc3 + pc4’, data=pca_df)results = pcr_model.fit()

n_points = pca_df.shape[0]

y_output = pca_df[‘childHeight’].values.reshape(n_points, 1)

print(results.summary())# plot a histogram of the residuals

sns.distplot(results.resid, hist=True)

plt.xlabel(‘Residual’)

plt.ylabel(‘Frequency’)

plt.title(‘Residual Histogram’)

plt.show()

Here, you may see that the last two principal components are not needed for a significant fit. You could also have noticed that from the ‘explained variance’ for each component- and how the last two components explained nearly 4 orders of magnitude less variance.

Now, we can try the PCA regression with only two components; the components that explain most of the variance:

pcr_model = sm.ols(formula = ‘childHeight ~ pc1 + pc2’, data=pca_df)results = pcr_model.fit()

n_points = pca_df.shape[0]

y_output = pca_df[‘childHeight’].values.reshape(n_points, 1)

print(results.summary())# plot a histogram of the residuals

sns.distplot(results.resid, hist=True)

plt.xlabel(‘Residual’)

plt.ylabel(‘Frequency’)

plt.title(‘Residual Histogram’)

plt.show()

The p-values of PC3 and PC4 are large and thus not significant. Also, their respective 95% confidence intervals straddle zero. Thus they do not help the model explain daughter heights, which we suspected from the low variance they explained. If we had not performed PCR we likely could not have thrown out any of the initial features (father, mother, father_sqr, mother_sqr).

If you notice that the R² is not better on the improved model, remember that we committed a non-kosher act: we tested our model on training data. A big no-no. Almost certainly, if we had fed previously unseen test data into the last model, the one with only two PC’s, the R² would have increased. This is because this is a simpler model, one less prone to overfitting. If the data were train/tune/test split then the model improvements would be demonstrable.

The reduction in dimensionality has worked as expected! The adjusted R² and AIC are the lowest for this model. Regularization has helped us create better, simpler, more generalizable models.

If you are curious as to why parent heights are rather poor indicators of child heights, this is where the phrase “regression to the mean” comes from. Contrary to popular belief in the 18th century, Galton’s family analysis showed that tall parents tend to have children whose heights regress to the population mean.

Next, I will talk about regularization with Ridge, LASSO, and Elasticnet regressions!

Find me on Linkedin

*Physicist cum Data Scientist- Available for new opportunity | SaaS | Sports | Start-ups | Scale-ups*