%matplotlib inline
import matplotlib
import seaborn as sns
sns.set()
matplotlib.rcParams['figure.dpi'] = 144
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
Scikit-learn is the most popular Python package for machine learning. It has a plethora of machine learning models and provides functions that are often needed for a machine learning workflow. As you will see, it has a nice and intuitive interface that makes creating complicated machine learning workflows easy. For this notebook, we will use the California housing data set, which contains the median house value for each census block group in California.
from sklearn.datasets import fetch_california_housing
# get data
data = fetch_california_housing()
X = data['data']
y = data['target']
#print(data['DESCR'])
data.feature_names
['MedInc',
'HouseAge',
'AveRooms',
'AveBedrms',
'Population',
'AveOccup',
'Latitude',
'Longitude']
Scikit-learn relies heavily on object-oriented programming principles. It implements machine learning algorithms as classes, and users create objects from these "recipes". For example, Ridge is a class representing the ridge regression model. To create a Ridge object, we simply create an instance of the class. In Python, the convention is that class names use CamelCase: the first letter of each word is capitalized. Scikit-learn adopts this convention, making it easy to distinguish what is a class.
from sklearn.linear_model import Ridge
ridge = Ridge(alpha=0.1)
In the above code, we set alpha=0.1. Here, alpha is a hyperparameter of the ridge model. Hyperparameters are model parameters that govern the learning process; in terms of hierarchy, they reside "above" the regular model parameters and control what values the model parameters take on after training. They are easy to identify: they are the parameters set prior to learning. In scikit-learn, hyperparameters are set when creating an instance of the class. The default values that scikit-learn uses are usually a good set of initial values, but this is not always the case. It is important to understand the hyperparameters available and how they affect model performance.
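Every estimator also exposes its hyperparameters through the get_params and set_params methods, which is convenient for inspecting or changing them after the object has been created. Below is a small illustrative sketch, not part of the original workflow, that reuses the ridge object defined above.
# inspect all hyperparameters and their current values
print(ridge.get_params())           # includes 'alpha': 0.1 among other settings
# change a hyperparameter without recreating the object
ridge.set_params(alpha=1.0)
print(ridge.get_params()['alpha'])  # 1.0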
Scikit-learn refers to machine learning algorithms as estimators. There are three different types of estimators: classifiers, regressors, and transformers. Programmatically, scikit-learn has a base class called BaseEstimator that all estimators inherit. The models also inherit an additional mixin class: RegressorMixin, ClassifierMixin, or TransformerMixin. The inherited mixin determines what type of estimator the model represents. We'll divide the estimators into two groups based on their interface: predictors and transformers.
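For example, we can verify which group an estimator belongs to by checking its inheritance of these mixin classes. This is a small illustrative check, not part of the original notebook.
from sklearn.base import RegressorMixin, ClassifierMixin, TransformerMixin
from sklearn.linear_model import Ridge, LogisticRegression
from sklearn.preprocessing import StandardScaler
# each estimator inherits exactly the mixin that matches its role
print(isinstance(Ridge(), RegressorMixin))                # True -> regressor
print(isinstance(LogisticRegression(), ClassifierMixin))  # True -> classifier
print(isinstance(StandardScaler(), TransformerMixin))     # True -> transformer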
As the name suggests, predictors are models that make predictions. There are two main methods:

- fit(X, y): trains/fits the object to the feature matrix X and label vector y
- predict(X): makes predictions on the passed data set X

mine = pd.DataFrame(X)
mine.head()
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|---|
0 | 8.3252 | 41.0 | 6.984127 | 1.023810 | 322.0 | 2.555556 | 37.88 | -122.23 |
1 | 8.3014 | 21.0 | 6.238137 | 0.971880 | 2401.0 | 2.109842 | 37.86 | -122.22 |
2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.802260 | 37.85 | -122.24 |
3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |
from sklearn.linear_model import LinearRegression
# create model and train/fit
model = LinearRegression()
model.fit(X, y)
# predict label values on X
y_pred = model.predict(X)
print(y_pred)
print("shape of the prediction array: {}".format(y_pred.shape))
print("shape of the training set: {}".format(X.shape))
[4.13164983 3.97660644 3.67657094 ... 0.17125141 0.31910524 0.51580363]
shape of the prediction array: (20640,)
shape of the training set: (20640, 8)
Note, the output of predict(X) is a NumPy array of one dimension. The array has the same size as the number of rows of the data that was passed to the predict method.
Since we are using linear regression and our data has eight features, our model is

$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_8 x_8.$$
The coefficients are stored in the fitted model as attributes of the object. Scikit-learn adopts a convention where all attributes that are determined/calculated during fitting end in an underscore. The model coefficients and intercept are retrieved using the coef_ and intercept_ attributes, respectively.
print("β_0: {}".format(model.intercept_))
for i in range(8):
    print(f"β_{i+1}: {model.coef_[i]}")
β_0: -36.941920207184516
β_1: 0.4366932931343246
β_2: 0.009435778033238069
β_3: -0.10732204139090418
β_4: 0.6450656935198134
β_5: -3.976389421207444e-06
β_6: -0.00378654265497081
β_7: -0.42131437752714374
β_8: -0.43451375467477804
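Since the fitted model is just the linear equation above, we can reproduce predict(X) directly from intercept_ and coef_. The following is a quick sanity check using the model and X from the cells above.
# manual prediction: beta_0 + X @ beta
y_manual = model.intercept_ + X @ model.coef_
# agrees with model.predict(X) up to floating-point error
print(np.allclose(y_manual, model.predict(X)))  # True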
If we want to know how well the model performs making predictions on a data set, we can use the score(X, y) method. It works by calling predict(X) to produce predicted values and then comparing them with the true labels using an evaluation metric. The metric depends on whether the model is a regressor or a classifier: for regression it is the R² value, while for classification it is accuracy.
print("R^2: {:g}".format(model.score(X, y)))
R^2: 0.606233
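For a regressor, score(X, y) is equivalent to predicting on X and computing R² against the true labels with sklearn.metrics.r2_score; a quick check using the fitted model above:
from sklearn.metrics import r2_score
# score(X, y) predicts internally and compares against the true labels
print(np.isclose(model.score(X, y), r2_score(y, model.predict(X))))  # True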
We used a rather simple model, linear regression. What if we wanted to use a more complicated model? All we need to do is make an easy substitution; there is minimal code rewriting because the models share the same interface. Of course, different models have different hyperparameters, so we need to be careful when swapping out algorithms. Let's use a more complicated model and train it.
from sklearn.ensemble import GradientBoostingRegressor
# create model and train/fit
model = GradientBoostingRegressor()
model.fit(X, y)
# predict label values on X
y_pred = model.predict(X)
print(y_pred)
print("R^2: {:g}".format(model.score(X, y)))
[4.26432728 3.87864519 3.92074556 ... 0.63664692 0.74759279 0.7994969 ]
R^2: 0.803324
Transformers are models that process and transform a data set. These transformers are very useful because our data is rarely in a form that can be fed directly to a machine learning model for training and predicting. For example, many machine learning models work best when the features have similar scales. All transformers have the same interface:

- fit(X): trains/fits the object to the feature matrix X
- transform(X): applies the transformation on X using any parameters learned during fitting
- fit_transform(X): applies fit(X) and then transform(X)

Let's demonstrate transformers with StandardScaler, which scales each feature to have zero mean and unit variance. The transformed feature is equal to

$$x' = \frac{x - \mu}{\sigma},$$

where $\mu$ is the mean and $\sigma$ is the standard deviation of the feature. We'll use pandas to summarize the results of deploying the StandardScaler on the California housing data.
from sklearn.preprocessing import StandardScaler
# create and fit scaler
scaler = StandardScaler()
scaler.fit(X)
# scale data set
Xt = scaler.transform(X)
# create data frame with results
stats = np.vstack((X.mean(axis=0), X.var(axis=0), Xt.mean(axis=0), Xt.var(axis=0))).T
feature_names = data['feature_names']
columns = ['unscaled mean', 'unscaled variance', 'scaled mean', 'scaled variance']
df = pd.DataFrame(stats, index=feature_names, columns=columns)
df
 | unscaled mean | unscaled variance | scaled mean | scaled variance |
---|---|---|---|---|
MedInc | 3.870671 | 3.609148e+00 | 6.609700e-17 | 1.0 |
HouseAge | 28.639486 | 1.583886e+02 | 5.508083e-18 | 1.0 |
AveRooms | 5.429000 | 6.121236e+00 | 6.609700e-17 | 1.0 |
AveBedrms | 1.096675 | 2.245806e-01 | -1.060306e-16 | 1.0 |
Population | 1425.476744 | 1.282408e+06 | -1.101617e-17 | 1.0 |
AveOccup | 3.070655 | 1.078648e+02 | 3.442552e-18 | 1.0 |
Latitude | 35.631861 | 4.562072e+00 | -1.079584e-15 | 1.0 |
Longitude | -119.569704 | 4.013945e+00 | -8.526513e-15 | 1.0 |
The data frame shows how our features have wildly different scales; the average population is over 1,000 while the average number of rooms is slightly over 5. After scaling, each feature has zero mean and a variance of one.
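Under the hood, the scaler learns the per-feature mean and standard deviation during fit (stored in the mean_ and scale_ attributes) and applies the formula above during transform. A quick NumPy check using the X, Xt, and scaler objects from above:
# the learned statistics end in an underscore, following the fitted-attribute convention
print(np.allclose(scaler.mean_, X.mean(axis=0)))   # True
# reproducing the transformation manually
Xt_manual = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.allclose(Xt_manual, Xt))                  # True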
As our analysis and workflow become more complicated, we need a tool that helps with scaling up. For example, you may need to apply multiple transformations to your data before it is ready for a supervised machine learning model. You could apply the transformations explicitly, creating intermediate variables of the transformed data. Pipelines avoid the need to keep track of intermediate transformations and help scale our code for more complicated analyses. Pipelines are made with the Pipeline class. Essentially, a pipeline is an estimator object that holds a series of transformers followed by a final estimator.
For this example, we want to scale the features, generate polynomial features, and then train a linear regression model on the transformed data.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
# construct pipeline
scaler = StandardScaler() #transformer
poly_features = PolynomialFeatures(degree=2) #transformer
lin_reg = LinearRegression() #predictor
# the Pipeline accepts a list of (name, estimator) tuples holding all the steps in our workflow
pipe = Pipeline([
    ('scaler', scaler),       # index 0 of each tuple is the name of the step
    ('poly', poly_features),  # index 1 is the estimator object
    ('regressor', lin_reg)
])
# in a pipeline, all steps before the final estimator must be transformers
# the last step doesn't have to be a predictor; it can also be a transformer
The pipeline was created by passing a list of tuples representing all the steps in the workflow. Each tuple contains a string that refers to the name of the step and an estimator object. The steps of the pipeline are referred to using the name of the step. The named_steps attribute returns a dictionary where the keys are the names of the steps and the values are the estimators for the steps.
pipe.named_steps #access the estimator used in the pipeline
{'scaler': StandardScaler(),
'poly': PolynomialFeatures(),
'regressor': LinearRegression()}
Pipeline objects are estimators; the following lists the behavior when calling the standard methods:

- fit(X, y): calls fit_transform(X, y) sequentially on all transformers and then fits the last estimator with the transformed data set
- predict(X): transforms X sequentially with all transformers and predicts using the last estimator on the transformed data set
- transform(X): transforms X sequentially with all transformers; this only works if the final step is itself a transformer (or None)

For the above constructed pipeline, calling pipe.fit(X, y) results in the following process:
Xt = scaler.fit_transform(X)
Xt = poly.fit_transform(Xt)
lin_reg.fit(Xt, y)
When calling pipe.predict(X), the data set X flows through the transformers and is used to make predictions with the predictor in the last stage:
Xt = scaler.transform(X)
Xt = poly.transform(Xt)
y_pred = lin_reg.predict(Xt)
Because we have encapsulated the entire workflow in a Pipeline object, we avoid manually calling the fitting, transformation, and prediction steps. We could even initialize the estimator objects inside the pipeline to further reduce code volume.
# fit/train model and predict labels
pipe.fit(X, y)
y_pred = pipe.predict(X)
print(y_pred)
print("R^2: {}".format(pipe.score(X, y)))
[4.00298901 3.92349228 3.99012926 ... 0.83369975 0.88801566 0.97559649]
R^2: 0.6832976293317492
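Because the fitted estimators live inside the pipeline, their learned attributes are also accessible through named_steps. A small sketch using the pipe fitted above:
# the fitted linear regression at the end of the pipeline
reg = pipe.named_steps['regressor']
print(reg.coef_.shape)
# number of polynomial features generated by the 'poly' step (one coefficient per generated feature)
print(pipe.named_steps['poly'].n_output_features_)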
When you are working on a machine learning workflow, your data may require different transformation processes for different features. For example, the "raw" data set may contain numerical, categorical, and text data, and each of these types requires different processing/transformations. You can handle these situations using a special type of transformer called a ColumnTransformer. For example, maybe you want to use the StandardScaler on all the California housing features except Latitude and Longitude. In this case, you would select the columns to be scaled while letting the others "pass through" using the remainder= argument.
df.head()
 | unscaled mean | unscaled variance | scaled mean | scaled variance |
---|---|---|---|---|
MedInc | 3.870671 | 3.609148e+00 | 6.609700e-17 | 1.0 |
HouseAge | 28.639486 | 1.583886e+02 | 5.508083e-18 | 1.0 |
AveRooms | 5.429000 | 6.121236e+00 | 6.609700e-17 | 1.0 |
AveBedrms | 1.096675 | 2.245806e-01 | -1.060306e-16 | 1.0 |
Population | 1425.476744 | 1.282408e+06 | -1.101617e-17 | 1.0 |
from sklearn.compose import ColumnTransformer
X = pd.DataFrame(X, columns=data.feature_names)
X.head()
 | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
---|---|---|---|---|---|---|---|---|
0 | 8.3252 | 41.0 | 6.984127 | 1.023810 | 322.0 | 2.555556 | 37.88 | -122.23 |
1 | 8.3014 | 21.0 | 6.238137 | 0.971880 | 2401.0 | 2.109842 | 37.86 | -122.22 |
2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.802260 | 37.85 | -122.24 |
3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |
ct = ColumnTransformer([('scaler', StandardScaler(),['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup'])],
remainder='passthrough')
X_new = pd.DataFrame(ct.fit_transform(X), columns=X.columns)
X_new.head()
 | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
---|---|---|---|---|---|---|---|---|
0 | 2.344766 | 0.982143 | 0.628559 | -0.153758 | -0.974429 | -0.049597 | 37.88 | -122.23 |
1 | 2.332238 | -0.607019 | 0.327041 | -0.263336 | 0.861439 | -0.092512 | 37.86 | -122.22 |
2 | 1.782699 | 1.856182 | 1.155620 | -0.049016 | -0.820777 | -0.025843 | 37.85 | -122.24 |
3 | 0.932968 | 1.856182 | 0.156966 | -0.049833 | -0.766028 | -0.050329 | 37.85 | -122.25 |
4 | -0.012881 | 1.856182 | 0.344711 | -0.032906 | -0.759847 | -0.085616 | 37.85 | -122.25 |
from sklearn.compose import ColumnTransformer
col_transformer = ColumnTransformer(
remainder='passthrough',
transformers=[
('scaler', StandardScaler(), slice(0,6)) # first 6 columns
]
)
col_transformer.fit(X)
Xt = col_transformer.transform(X)
print('MedInc mean before transformation?', X.mean(axis=0)[0])
print('MedInc mean after transformation?', Xt.mean(axis=0)[0], '\n')
print('Longitude mean before transformation?', X.mean(axis=0)[-1])
print('Longitude mean after transformation?', Xt.mean(axis=0)[-1])
MedInc mean before transformation? 3.8706710029070246
MedInc mean after transformation? 6.609699867535816e-17
Longitude mean before transformation? -119.56970445736148
Longitude mean after transformation? -119.56970445736432
Column transformers also enable you to let some columns pass through while dropping others. For example, if I learned that the information in 'MedInc' had been corrupted and should be excluded from my model, I could rewrite my column transformer to drop 'MedInc', let 'Latitude' and 'Longitude' pass through, and scale all remaining features.
col_transformer = ColumnTransformer(
remainder='passthrough',
transformers=[
('remove', 'drop', 0),
('scaler', StandardScaler(), slice(1,6))
]
)
Xt = col_transformer.fit_transform(X)
print('Number of features in X:', X.shape[1])
print('Number of features Xt:', Xt.shape[1])
Number of features in X: 8
Number of features Xt: 7
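Since X is now a pandas DataFrame, the columns can also be selected by name rather than by position. The transformer below is an equivalent sketch under that assumption: it drops 'MedInc', scales the remaining non-coordinate features, and lets 'Latitude' and 'Longitude' pass through.
col_transformer_by_name = ColumnTransformer(
    remainder='passthrough',
    transformers=[
        ('remove', 'drop', ['MedInc']),
        ('scaler', StandardScaler(), ['HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup'])
    ]
)
Xt_by_name = col_transformer_by_name.fit_transform(X)
print('Number of features in Xt_by_name:', Xt_by_name.shape[1])  # 7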
A FeatureUnion is another tool for dealing with situations where your data requires different transformation processes for different features. Like ColumnTransformer, it processes features separately and combines the results into a single feature matrix. Unlike ColumnTransformer, it can handle more complex workflows where you need to use distinct transformers and estimators together before you can pass the complete feature matrix to a final estimator. After the separate processing steps, we need a convenient way to combine the results of the individual transformations; in scikit-learn, this is done with FeatureUnion.
While Pipeline objects arrange estimators in series, FeatureUnion objects arrange transformers in parallel. A FeatureUnion object combines the output of each of the transformers run in parallel into one output matrix. Using a combination of Pipeline and FeatureUnion objects, we can construct complicated machine learning workflows within a single scikit-learn estimator object.
To illustrate FeatureUnion, we will apply the PCA and SelectKBest transformers. The PCA (principal component analysis) transformer returns a new set of uncorrelated features based on the original features, while SelectKBest returns the k best features based on a passed criterion. In this example, the selector will return the 2 features with the largest correlation with the labels. When using PCA, the data needs to have zero mean, so we create a pipeline object that represents the required two-step process. We will have the PCA object return 4 uncorrelated features. The result of the union between PCA and SelectKBest will be a data set with 6 features.
from sklearn.feature_selection import f_regression, SelectKBest
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion
scaler = StandardScaler()
pca = PCA(n_components=4)
selector = SelectKBest(f_regression, k=2)
pca_pipe = Pipeline([('scaler', scaler), ('dim_red', pca)])
union = FeatureUnion([('pca_pipe', pca_pipe), ('selector', selector)])
pipe = Pipeline([('union', union), ('regressor', lin_reg)])
pipe.fit(X, y)
print("number of columns/features in the original data set: {}".format(X.shape[-1]))
print("number of columns/features in the new data set: {}".format(union.transform(X).shape[-1]))
print("R^2: {}".format(pipe.score(X, y)))
number of columns/features in the original data set: 8
number of columns/features in the new data set: 6
R^2: 0.5288130088767813
Copyright © 2022 Marconi Lab. All rights reserved.