Machine Learning with sklearn

sklearn is a best-in-breed machine learning library for Python that we will use extensively in this class. It also has one of the best API designs out there (there is even a paper written about the design) and is very modular and flexible. As such, it has a bit of a learning curve, but once you can think in the sklearn way for one algorithm/model, you can apply that general knowledge to any model.

In [31]:
import pandas as pd
import numpy as np

Getting Data

Typically you will be working with an external dataset, and even if it is clean you will need to manipulate/transform it to create features. As such, you will load your dataset with something like numpy or pandas.

We will be performing a simple linear regression on a Lending Club dataset of interest rates for individual loans. To start we will need to slightly prepare our data with pandas to get it ready for our model.

In [32]:
df = pd.read_csv('loanf.csv')

df.head()
Out[32]:
    Interest.Rate  FICO.Score  Loan.Length  Monthly.Income  Loan.Amount
6           15.31         670           36         4891.67         6000
11          19.72         670           36         3575.00         2000
12          14.27         665           36         4250.00        10625
13          21.67         670           60        14166.67        28000
21          21.98         665           36         6666.67        22000
In [33]:
np.sum(df.isnull())
Out[33]:
Interest.Rate     0
FICO.Score        0
Loan.Length       0
Monthly.Income    1
Loan.Amount       0
dtype: int64
In [34]:
df = df.dropna(axis=0)
In [35]:
np.sum(df.isnull())
Out[35]:
Interest.Rate     0
FICO.Score        0
Loan.Length       0
Monthly.Income    0
Loan.Amount       0
dtype: int64

Getting a feature matrix

Remember from lecture that for any machine learning model we have Features (or a feature matrix) and a Target (or response/dependent variable in statistics parlance). In the sklearn API we need to separate these from our initial data matrix.

NOTE: sklearn expects a numpy array/matrix as input. If you pass in a DataFrame, sklearn can usually convert/coerce it into a numpy array for you, but it is a best practice to do this conversion yourself.

In [36]:
features = df.iloc[:, 1:]
features.head()
Out[36]:
    FICO.Score  Loan.Length  Monthly.Income  Loan.Amount
6          670           36         4891.67         6000
11         670           36         3575.00         2000
12         665           36         4250.00        10625
13         670           60        14166.67        28000
21         665           36         6666.67        22000
In [37]:
labels = df.iloc[:, 0]
labels.head()
Out[37]:
6     15.31
11    19.72
12    14.27
13    21.67
21    21.98
Name: Interest.Rate, dtype: float64
In [38]:
# as_matrix() was removed from pandas; to_numpy() is the current equivalent
X = features.to_numpy()
y = labels.to_numpy()
In [39]:
print ("Features: \n", X)
print ("\n\nLabels: \n", y)
Features: 
 [[   670.       36.     4891.67   6000.  ]
 [   670.       36.     3575.     2000.  ]
 [   665.       36.     4250.    10625.  ]
 ..., 
 [   810.       36.     9250.    27000.  ]
 [   765.       36.     7083.33  25000.  ]
 [   740.       60.     8903.25  16000.  ]]


Labels: 
 [ 15.31  19.72  14.27 ...,   6.62  10.75  14.09]

The API

sklearn has a very object-oriented interface, and it is important to be aware of this when building models. (Almost) every model/transform/object in sklearn is an Estimator object. What is an Estimator?

In [40]:
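class Estimator(object):
    """Schematic of the interface only: some_fitting_method and
    make_prediction are placeholders for model-specific logic."""
  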
    def fit(self, X, y=None):
        """Fit model to data X (and y)"""
        self.some_attribute = self.some_fitting_method(X, y)
        return self
            
    def predict(self, X_test):
        """Make prediction based on passed features"""
        pred = self.make_prediction(X_test)
        return pred
    
model = Estimator()

The Estimator class defines a fit() method as well as a predict() method. For an instance of an Estimator stored in a variable model:

  • model.fit: fits the model to the training data passed in. For supervised models, it also accepts a second argument y that holds the labels (model.fit(X, y)). For unsupervised models, there are no labels, so you only need to pass in the feature matrix (model.fit(X)).

    Since the interface is very OO, the instance itself stores the results of the fit internally. As a result, you must always call fit() before you call predict() on the same object.

  • model.predict: predicts labels for any new data points passed in (model.predict(X_test)) and returns an array of predicted labels with one entry per row of the input.
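
To make the fit/predict contract concrete, here is a minimal sketch of a toy estimator. MeanRegressor is a hypothetical class written for this illustration, not something sklearn provides, and the tiny arrays are made up. It only shows the pattern: fit() stores what it learned on the instance, and predict() reads it back.

```python
import numpy as np

class MeanRegressor:
    """Toy estimator (hypothetical, not part of sklearn): always predicts the training mean."""

    def fit(self, X, y):
        # Store the result of fitting on the instance itself. The trailing
        # underscore follows sklearn's naming convention for fitted attributes.
        self.mean_ = float(np.mean(y))
        return self  # returning self allows chaining: model.fit(X, y).predict(...)

    def predict(self, X):
        # One prediction per row of X
        return np.full(len(X), self.mean_)

# Made-up data: three training rows, two new rows to predict
X_toy = np.array([[1.0], [2.0], [3.0]])
y_toy = np.array([10.0, 20.0, 30.0])

model = MeanRegressor()
model.fit(X_toy, y_toy)
preds = model.predict(np.array([[4.0], [5.0]]))
print(preds)  # [20. 20.]
```

Calling predict() on a fresh MeanRegressor() would fail with an AttributeError because mean_ does not exist yet, which is why fit() has to come first.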

There are 3(ish) types of subclass of estimator:

  • Supervised
  • Unsupervised
  • Feature Processing

Supervised

Supervised estimators in addition to the above methods typically also have:

  • model.predict_proba: For classifiers that have a notion of probability (or some measure of confidence in a prediction), this method returns those "probabilities". The label with the highest probability is what is returned by the model.predict() method from above.
  • model.score: For both classification and regression models, this method returns a measure of how well the model performs on the data passed in. By default this is typically R^2 for regression and accuracy for classification.
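
As a rough illustration of these two methods, the sketch below fits a LogisticRegression on a tiny made-up dataset (the data and the variable names are assumptions for this example only) and checks that predict() agrees with the most probable class from predict_proba(), and that score() matches a hand-computed accuracy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny synthetic dataset (made up for illustration): one feature, two classes
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf_toy = LogisticRegression()
clf_toy.fit(X_toy, y_toy)

# One row per sample, one column per class; each row sums to 1
proba = clf_toy.predict_proba(X_toy)

# Map the most probable column back to its class label
labels_from_proba = clf_toy.classes_[np.argmax(proba, axis=1)]
print(np.array_equal(labels_from_proba, clf_toy.predict(X_toy)))

# For classifiers, score() is the mean accuracy on the data passed in
accuracy = clf_toy.score(X_toy, y_toy)
manual_accuracy = np.mean(clf_toy.predict(X_toy) == y_toy)
print(np.isclose(accuracy, manual_accuracy))
```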

Unsupervised

Some estimators in the library implement what is referred to as the transformer interface. Unsupervised in this case refers to any method that does not need labels, including (but not limited to) unsupervised classifiers, preprocessing (like tf-idf), dimensionality reduction, etc.

The transformer interface defines (usually) two additional methods:

  • model.transform: Given an unsupervised model, transform the input into a new basis (or feature space). This accepts one argument (usually a feature matrix) and returns a matrix of the transformed input. Note: You need to fit() the model before you call transform() on it.
  • model.fit_transform: For some models you may not need to fit() and transform() separately. In these cases it is more convenient to do both at the same time. And that is precisely what fit_transform() does!
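
The two methods above can be sketched with StandardScaler, one of sklearn's preprocessing transformers. The small matrix is made up for this example; the point is only that fit() followed by transform() gives the same result as a single fit_transform() call.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix with two columns on very different scales
X_toy = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0]])

scaler = StandardScaler()
scaler.fit(X_toy)                    # learns the per-column mean and standard deviation
X_scaled = scaler.transform(X_toy)   # applies the learned scaling

# fit_transform() performs both steps in one call
X_scaled_again = StandardScaler().fit_transform(X_toy)

print(np.allclose(X_scaled, X_scaled_again))
print(X_scaled.mean(axis=0))  # each column is centred near 0 after scaling
```

Keeping fit() and transform() separate matters when you have a train/test split: you would fit the scaler on the training data only and then transform both sets with it.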

Let's see this in action!

We will be trying to predict the loan interest rate based on the FICO score, loan length, monthly income, and loan amount:

$$Interest.Rate = \beta_0 + \beta_1 \cdot FICO.Score + \beta_2 \cdot Loan.Length + \beta_3 \cdot Monthly.Income + \beta_4 \cdot Loan.Amount$$
In [41]:
from sklearn.linear_model import LinearRegression
# sklearn.cross_validation was removed; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split
In [42]:
X_train, X_test, y_train, y_test = train_test_split(X, y)
In [43]:
print ("The training split: \n")
print (len(X_train), len(y_train))
print ("\n\nThe testing split: \n")
print (len(X_test), len(y_test))
The training split: 

1874 1874


The testing split: 

625 625
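
The 1874/625 split above reflects train_test_split's default of holding out 25% of the rows for testing. A minimal sketch on made-up arrays, with test_size and random_state written out explicitly (the values chosen here are assumptions for illustration), shows how to control the split size and make it reproducible:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, where row i of X_toy is [2i, 2i + 1] and y_toy[i] is i
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

# test_size=0.25 matches the default; random_state fixes the shuffle
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)

print(X_tr.shape, X_te.shape)  # (75, 2) (25, 2)

# Features and labels are shuffled together, so rows stay aligned
print(np.array_equal(X_tr[:, 0], 2 * y_tr))
```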
In [44]:
# create an instance of an estimator
clf = LinearRegression()

# fit the estimator (notice I do not save any return value in a variable)
clf.fit(X_train, y_train)
Out[44]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
In [45]:
# predict (but only after we have trained!)
predictions = clf.predict(X_test)
print (len(predictions))
625
In [46]:
# The coefficients
print ('Coefficients: \n', clf.coef_)
# The mean squared error on the test set
print("\n\nMean squared error: %.2f"
      % np.mean((predictions - y_test) ** 2))

# R^2 score: 1 is perfect prediction
print('\n\nR^2 score: %.2f' % clf.score(X_test, y_test))
Coefficients: 
 [ -8.82565405e-02   1.41563070e-01  -1.78254703e-05   1.41033756e-04]


Mean squared error: 4.73


R^2 score: 0.73
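
To connect the fitted model back to the regression equation, the sketch below uses made-up, noiseless data (not the loan dataset) with known coefficients. It assumes nothing beyond LinearRegression's coef_ and intercept_ attributes: intercept_ plays the role of \beta_0, coef_ holds \beta_1 through \beta_4, and predict() is just the intercept plus the dot product of the features with the coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with known coefficients (made up for illustration)
rng = np.random.RandomState(0)
X_toy = rng.rand(50, 4)
true_beta = np.array([1.0, -2.0, 0.5, 3.0])
true_intercept = 4.0
y_toy = true_intercept + X_toy @ true_beta  # no noise added

reg = LinearRegression().fit(X_toy, y_toy)

# predict() is equivalent to beta_0 + X @ [beta_1, ..., beta_4]
manual = reg.intercept_ + X_toy @ reg.coef_
print(np.allclose(manual, reg.predict(X_toy)))

# With noiseless data the fit recovers the coefficients we started from
print(np.allclose(reg.coef_, true_beta), np.isclose(reg.intercept_, true_intercept))
```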