Machine Learning Steps

from pandas import read_csv
from sklearn.model_selection import KFold, cross_val_score

# Common configuration used throughout the notebook
filepath = '/home/naji/Desktop/github-repos/machine-learning/nbs/data/'
pima = 'pima-indians-diabetes.csv'
housing = 'housing.csv'
filename = pima  # the CSV-loading recipes below read this dataset by default
seed = 7
pima_names = ['preg','plas','pres','skin','test','mass','pedi','age','class']

def load_pima_data():
    # load the Pima Indians dataset and split it into input (X) and output (Y) arrays
    dataframe = read_csv(filepath+pima, names=pima_names)
    array = dataframe.values
    X = array[:, 0:8]
    Y = array[:, 8]
    return X, Y

Analyze Data

Load Machine Learning Data

You must be able to load your data before you can start your machine learning project. The most common format for machine learning data is CSV files. There are a number of ways to load a CSV file in Python. In this lesson you will learn three ways that you can use to load your CSV data in Python:

  1. Load CSV Files with the Python Standard Library.
  2. Load CSV Files with NumPy.
  3. Load CSV Files with Pandas.

Considerations When Loading CSV Data

File Header
Does your data have a file header? If so, this can help in automatically assigning names to each column of data. If not, you may need to name your attributes manually. Either way, you should explicitly specify whether or not your CSV file has a file header when loading your data.

Comments
Does your data have comments? Comments in a CSV file are indicated by a hash (#) at the start of a line. If you have comments in your file, depending on the method used to load your data, you may need to indicate whether or not to expect comments and the character to expect to signify a comment line.

Delimiter
The standard delimiter that separates values in fields is the comma (,) character. Your file could use a different delimiter like tab or white space in which case you must specify it explicitly.

Quotes
Sometimes field values can have spaces. In these CSV files the values are often quoted. The default quote character is the double quotation mark ("). Other characters can be used, and you must specify the quote character used in your file.
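
These considerations (header, comments, delimiter and quotes) map directly onto arguments of the pandas read_csv() function. The sketch below is only illustrative; the file name and column names are hypothetical and simply show where each option goes.

from pandas import read_csv

# hypothetical file used for illustration only
example_file = 'example.csv'

# header=None plus names=... assigns column names manually (use header=0 if the file has a header row);
# comment='#' skips comment lines; sep sets the delimiter; quotechar sets the quote character
df = read_csv(example_file, header=None, names=['col1', 'col2', 'col3'],
              comment='#', sep=',', quotechar='"')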

Pima Indians Dataset

https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv

Load CSV Files with the Python Standard Library

The Python standard library provides the csv module and its reader() function that can be used to load CSV files. Once loaded, you can convert the CSV data to a NumPy array and use it for machine learning.

import csv
import numpy

# read all rows from the CSV file, then convert them to a NumPy array of floats
with open(filepath+filename, 'rt') as raw_data:
    reader = csv.reader(raw_data, delimiter=',', quoting=csv.QUOTE_NONE)
    x = list(reader)
x[:10]
data = numpy.array(x).astype('float')
print(data.shape)
data

Load CSV Files with NumPy

You can load your CSV data using NumPy and the numpy.loadtxt() function. This function assumes no header row and all data has the same format.

from numpy import loadtxt
with open(filepath + filename, 'rt') as raw_data:
    data = loadtxt(raw_data, delimiter=',')
print(data.shape)

This example can be modified to load the same dataset directly from a URL as follows:

from urllib.request import urlopen
url = 'https://goo.gl/bDdBiA'
raw_data = urlopen(url)
data = loadtxt(raw_data, delimiter=',')
print(data.shape)

Load CSV Files with Pandas

You can load your CSV data using Pandas and the pandas.read_csv() function. This function is very flexible and is perhaps my recommended approach for loading your machine learning data. The function returns a pandas.DataFrame that you can immediately start summarizing and plotting.

from pandas import read_csv
names = ['preg','plas','pres','skin','test','mass','pedi','age','class']
data = read_csv(filepath + filename, names=names)
data.shape
data

We can also modify this example to load CSV data directly from a URL.

url='https://goo.gl/bDdBiA'
data = read_csv(url, names=names)
data.shape

Generally I recommend that you load your data with Pandas in practice and all subsequent examples in this book will use this method.

Understand Your Data With Descriptive Statistics

You must understand your data in order to get the best results. In this chapter you will discover 7 recipes that you can use in Python to better understand your machine learning data. After reading this lesson you will know how to:

  1. Take a peek at your raw data.
  2. Review the dimensions of your dataset.
  3. Review the data types of attributes in your data.
  4. Summarize the distribution of instances across classes in your dataset.
  5. Summarize your data using descriptive statistics.
  6. Understand the relationships in your data using correlations.
  7. Review the skew of the distributions of each attribute.

Peek at Your Data

There is no substitute for looking at the raw data. Looking at the raw data can reveal insights that you cannot get any other way. It can also plant seeds that may later grow into ideas on how to better pre-process and handle the data for machine learning tasks.

from pandas import read_csv
names = ['preg','plas','pres','skin','test','mass','pedi','age','class']
data = read_csv(filepath + filename, names=names)
data.head(20)

Dimensions of Your Data

You must have a very good handle on how much data you have, both in terms of rows and columns.

  • Too many rows and algorithms may take too long to train. Too few and perhaps you do not have enough data to train the algorithms.
  • Too many features and some algorithms can be distracted or suffer poor performance due to the curse of dimensionality.

print(data.shape)

Data Type For Each Attribute

The type of each attribute is important. Strings may need to be converted to floating point values or integers to represent categorical or ordinal values. You can get an idea of the types of attributes by peeking at the raw data, as above.

data.dtypes

Descriptive Statistics

Descriptive statistics can give you great insight into the properties of each attribute. Often you can create more summaries than you have time to review. The describe() function on the Pandas DataFrame lists 8 statistical properties of each attribute:

  • Count.
  • Mean.
  • Standard Deviation.
  • Minimum Value.
  • 25th Percentile.
  • 50th Percentile (Median).
  • 75th Percentile.
  • Maximum Value.

from pandas import set_option
set_option('display.width', 100)
set_option('display.precision', 3)
data.describe()

Class Distribution (Classification Only)

On classification problems you need to know how balanced the class values are. Highly imbalanced problems (a lot more observations for one class than another) are common and may need special handling in the data preparation stage of your project.

class_counts = data.groupby('class').size()
print(class_counts)

Correlations Between Attributes

Correlation refers to the relationship between two variables and how they may or may not change together. The most common method for calculating correlation is Pearson’s Correlation Coefficient, which assumes a normal distribution of the attributes involved. A correlation of -1 or 1 shows a full negative or positive correlation respectively, whereas a value of 0 shows no correlation at all. Some machine learning algorithms like linear and logistic regression can suffer poor performance if there are highly correlated attributes in your dataset. As such, it is a good idea to review all of the pairwise correlations of the attributes in your dataset.

set_option('display.width', 100)
set_option('display.precision', 3)
data.corr(method='pearson')

Skew of Univariate Distributions

Skew refers to a distribution that is assumed to be Gaussian (normal or bell curve) but is shifted or squashed in one direction or another. Many machine learning algorithms assume a Gaussian distribution. Knowing that an attribute has a skew may allow you to perform data preparation to correct the skew and later improve the accuracy of your models.

data.skew()

The skew results show a positive (right) or negative (left) skew. Values closer to zero show less skew.
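
As a small illustrative sketch (not part of the original recipe), a log transform is one common way to reduce a strong positive skew in an attribute such as test; the only assumption here is that the column is non-negative.

from numpy import log1p
# compare the skew before and after a log1p transform (log1p handles zero values safely)
print(data['test'].skew())
print(log1p(data['test']).skew())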

Tips To Remember

This section gives you some tips to remember when reviewing your data using summary statistics.

  • Review the numbers. Generating the summary statistics is not enough. Take a moment to pause, read and really think about the numbers you are seeing.
  • Ask why. Review your numbers and ask a lot of questions. How and why are you seeing specific values? Think about how the numbers relate to the problem domain in general and the specific entities that observations relate to.
  • Write down ideas. Write down your observations and ideas. Keep a small text file or note pad and jot down all of the ideas for how variables may relate, for what numbers mean, and ideas for techniques to try later. The things you write down now while the data is fresh will be very valuable later when you are trying to think up new things to try.

Summary

In this chapter you discovered the importance of describing your dataset before you start work on your machine learning project. You discovered 7 different ways to summarize your dataset using Python and Pandas:

  • Peek At Your Data.
  • Dimensions of Your Data.
  • Data Types.
  • Class Distribution.
  • Data Summary.
  • Correlations.
  • Skewness.

Understand Your Data With Visualization

You must understand your data in order to get the best results from machine learning algorithms. The fastest way to learn more about your data is to use data visualization. In this chapter you will discover exactly how you can visualize your machine learning data in Python using Pandas.

Univariate Plots

In this section we will look at three techniques that you can use to understand each attribute of your dataset independently.

  • Histograms.
  • Density Plots.
  • Box and Whisker Plots.
Histograms

A fast way to get an idea of the distribution of each attribute is to look at histograms. Histograms group data into bins and provide you a count of the number of observations in each bin. From the shape of the bins you can quickly get a feeling for whether an attribute is Gaussian, skewed or even has an exponential distribution. It can also help you see possible outliers.

from matplotlib import pyplot
data.hist(figsize=[10,10])
pyplot.show()

We can see that perhaps the attributes age, pedi and test may have an exponential distribution. We can also see that perhaps the mass and pres and plas attributes may have a Gaussian or nearly Gaussian distribution. This is interesting because many machine learning techniques assume a Gaussian univariate distribution on the input variables.

Density Plots

Density plots are another way of getting a quick idea of the distribution of each attribute. The plots look like an abstracted histogram with a smooth curve drawn through the top of each bin, much like your eye tried to do with the histograms.

data.plot(kind='density', subplots=True, layout=(3,3), sharex=False)
pyplot.show()

Box and Whisker Plots

Another useful way to review the distribution of each attribute is to use Box and Whisker Plots or boxplots for short. Boxplots summarize the distribution of each attribute, drawing a line for the median (middle value) and a box around the 25th and 75th percentiles (the middle 50% of the data). The whiskers give an idea of the spread of the data and dots outside of the whiskers show candidate outlier values (values that lie more than 1.5 times the spread of the middle 50% of the data beyond the box).

data.plot(kind='box', subplots=True, layout=(3,3), sharex=False, sharey=False)
pyplot.show()

We can see that the spread of attributes is quite different. Some like age, test and skin appear quite skewed towards smaller values.

Multivariate Plots

This section provides examples of two plots that show the interactions between multiple variables in your dataset.

  • Correlation Matrix Plot.
  • Scatter Plot Matrix.

Correlation Matrix Plot

Correlation gives an indication of how related the changes are between two variables. If two variables change in the same direction they are positively correlated. If they change in opposite directions together (one goes up, one goes down), then they are negatively correlated. You can calculate the correlation between each pair of attributes. This is called a correlation matrix. You can then plot the correlation matrix and get an idea of which variables have a high correlation with each other. This is useful to know, because some machine learning algorithms like linear and logistic regression can have poor performance if there are highly correlated input variables in your data.

# plot correlation matrix

fig = pyplot.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(data.corr(), vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = numpy.arange(0,9,1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(names)
ax.set_yticklabels(names)
pyplot.show()

We can see that the matrix is symmetrical, i.e. the bottom left of the matrix is the same as the top right. This is useful as we can see two different views on the same data in one plot. We can also see that each variable is perfectly positively correlated with itself (as you would have expected) in the diagonal line from top left to bottom right.

The example is not generic in that it specifies the names for the attributes along the axes as well as the number of ticks. This recipe can be made more generic by removing these aspects as follows:

# plot correlation matrix
fig = pyplot.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(data.corr(), vmin=-1, vmax=1)
fig.colorbar(cax)
pyplot.show()

Generating the plot, you can see that it gives the same information, although it is a little harder to see which attributes are correlated by name. Use this generic plot as a first cut to understand the correlations in your dataset and customize it like the first example in order to read off more specific data if needed.

Scatter Plot Matrix

A scatter plot shows the relationship between two variables as dots in two dimensions, one axis for each attribute. You can create a scatter plot for each pair of attributes in your data. Drawing all these scatter plots together is called a scatter plot matrix. Scatter plots are useful for spotting structured relationships between variables, like whether you could summarize the relationship between two variables with a line. Attributes with structured relationships may also be correlated and good candidates for removal from your dataset.

from pandas.plotting import scatter_matrix
scatter_matrix(data)
pyplot.show()

Like the Correlation Matrix Plot above, the scatter plot matrix is symmetrical. This is useful to look at the pairwise relationships from different perspectives. Because there is little point in drawing a scatter plot of each variable with itself, the diagonal shows histograms of each attribute.

In this chapter you discovered a number of ways that you can better understand your machine learning data in Python using Pandas. Specifically, you learned how to plot your data using:

  • Histograms.
  • Density Plots.
  • Box and Whisker Plots.
  • Correlation Matrix Plot.
  • Scatter Plot Matrix.

Prepare Data

Many machine learning algorithms make assumptions about your data. It is often a very good idea to prepare your data in such a way to best expose the structure of the problem to the machine learning algorithms that you intend to use. In this chapter you will discover how to prepare your data for machine learning in Python using scikit-learn. After completing this lesson you will know how to:

  1. Rescale data.
  2. Standardize data.
  3. Normalize data.
  4. Binarize data.

Need For Data Pre-processing

You almost always need to pre-process your data. It is a required step. A difficulty is that different algorithms make different assumptions about your data and may require different transforms. Further, when you follow all of the rules and prepare your data, sometimes algorithms can deliver better results without pre-processing.

Generally, I would recommend creating many different views and transforms of your data, then exercise a handful of algorithms on each view of your dataset. This will help you to flush out which data transforms might be better at exposing the structure of your problem in general.

Data Transforms

In this lesson you will work through 4 different data pre-processing recipes for machine learning. The Pima Indian diabetes dataset is used in each recipe. Each recipe follows the same structure:

  • Load the dataset.
  • Split the dataset into the input and output variables for machine learning.
  • Apply a pre-processing transform to the input variables.
  • Summarize the data to show the change.

The scikit-learn library provides two standard idioms for transforming data. Each is useful in different circumstances. The transforms are calculated in such a way that they can be applied to your training data and any samples of data you may have in the future. The scikit-learn documentation has some information on how to use various different pre-processing methods:

  • Fit and Multiple Transform.
  • Combined Fit-And-Transform.

The Fit and Multiple Transform method is the preferred approach. You call the fit() function to prepare the parameters of the transform once on your data. Then later you can use the transform() function on the same data to prepare it for modeling and again on the test or validation dataset or new data that you may see in the future. The Combined Fit-And-Transform is a convenience that you can use for one-off tasks. This might be useful if you are interested in plotting or summarizing the transformed data. You can review the preprocessing API in the scikit-learn documentation.
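
A minimal sketch of the two idioms using MinMaxScaler; the X_train and X_test variable names are illustrative placeholders for a training and a test split.

from sklearn.preprocessing import MinMaxScaler

# Fit and Multiple Transform: learn the scaling parameters once, then reuse them
scaler = MinMaxScaler()
scaler.fit(X_train)                        # learn min/max from the training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)   # apply the same scaling to unseen data

# Combined Fit-And-Transform: convenient for one-off summaries or plots
X_scaled = MinMaxScaler().fit_transform(X_train)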

Rescale Data

When your data is comprised of attributes with varying scales, many machine learning algorithms can benefit from rescaling the attributes to all have the same scale. Often this is referred to as normalization and attributes are often rescaled into the range between 0 and 1. This is useful for optimization algorithms used in the core of machine learning algorithms like gradient descent. It is also useful for algorithms that weight inputs like regression and neural networks and algorithms that use distance measures like k-Nearest Neighbors. You can rescale your data with scikit-learn using the MinMaxScaler class.

from pandas import read_csv
from numpy import set_printoptions
from sklearn.preprocessing import MinMaxScaler
names = ['preg','plas','pres','skin','test','mass','pedi','age','class']
dataframe = read_csv(filepath+filename, names=names)
dataframe
array = dataframe.values
array
# separate array into input and output components

X = array[:, 0:8]
Y = array[:, 8]
scaler = MinMaxScaler(feature_range=(0,1))
rescaledX = scaler.fit_transform(X)
set_printoptions(precision=3)
rescaledX[0:5, :]

Standardize Data

Standardization is a useful technique to transform attributes with a Gaussian distribution and differing means and standard deviations to a standard Gaussian distribution with a mean of 0 and a standard deviation of 1. It is most suitable for techniques that assume a Gaussian distribution in the input variables and work better with rescaled data, such as linear regression, logistic regression and linear discriminant analysis. You can standardize data using scikit-learn with the StandardScaler class.

from sklearn.preprocessing import StandardScaler
from pandas import read_csv
from numpy import set_printoptions
names = ['preg','plas','pres','skin','test','mass','pedi','age','class']
dataframe = read_csv(filepath+filename, names=names)
array = dataframe.values
# separate array into input and output components
X = array[:, 0:8]
Y = array[:, 8]
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)
set_printoptions(precision=3)
rescaledX[0:5, :]

The values for each attribute now have a mean value of 0 and a standard deviation of 1.
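
You can confirm this with a quick check (not part of the original recipe) by summarizing the transformed array column by column.

# the column means should be approximately 0 and the standard deviations approximately 1
print(rescaledX.mean(axis=0))
print(rescaledX.std(axis=0))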

Normalize Data

Normalizing in scikit-learn refers to rescaling each observation (row) to have a length of 1 (called a unit norm or a vector with the length of 1 in linear algebra). This pre-processing method can be useful for sparse datasets (lots of zeros) with attributes of varying scales when using algorithms that weight input values such as neural networks and algorithms that use distance measures such as k-Nearest Neighbors. You can normalize data in Python with scikit-learn using the Normalizer class.

from sklearn.preprocessing import Normalizer
from pandas import read_csv
from numpy import set_printoptions
names = ['preg','plas','pres','skin','test','mass','pedi','age','class']
dataframe = read_csv(filepath+filename, names=names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]
scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)
set_printoptions(precision=3)
normalizedX[0:5, :]

Binarize Data (Make Binary)

You can transform your data using a binary threshold. All values above the threshold are marked 1 and all equal to or below are marked as 0. This is called binarizing your data or thresholding your data. It can be useful when you have probabilities that you want to make into crisp values. It is also useful in feature engineering when you want to add new features that indicate something meaningful. You can create new binary attributes in Python using scikit-learn with the Binarizer class.

from sklearn.preprocessing import Binarizer
from pandas import read_csv
from numpy import set_printoptions
names = ['preg','plas','pres','skin','test','mass','pedi','age','class']
dataframe = read_csv(filepath+filename, names=names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]
binarizer = Binarizer(threshold=0).fit(X)
binaryX = binarizer.transform(X)
set_printoptions(precision=3)
binaryX[0:5, :]

In this chapter you discovered how you can prepare your data for machine learning in Python using scikit-learn. You now have recipes to:

  • Rescale data.
  • Standardize data.
  • Normalize data.
  • Binarize data.

You now know how to transform your data to best expose the structure of your problem to the modeling algorithms. In the next lesson you will discover how to select the features of your data that are most relevant to making predictions.

Evaluate Algorithms

Evaluate the Performance of Machine Learning Algorithms with Resampling

You need to know how well your algorithms perform on unseen data. The best way to evaluate the performance of an algorithm would be to make predictions for new data to which you already know the answers. The second best way is to use clever techniques from statistics called resampling methods that allow you to make accurate estimates for how well your algorithm will perform on new data. In this chapter you will discover how you can estimate the accuracy of your machine learning algorithms using resampling methods in Python and scikit-learn on the Pima Indians dataset.

Evaluate Machine Learning Algorithms

Why can’t you prepare your machine learning algorithm on your training dataset and use predictions from this same dataset to evaluate performance? The simple answer is overfitting. Imagine an algorithm that remembers every observation it is shown during training. If you evaluated your machine learning algorithm on the same dataset used to train the algorithm, then an algorithm like this would have a perfect score on the training dataset. But the predictions it made on new data would be terrible. We must evaluate our machine learning algorithms on data that is not used to train the algorithm.

A model evaluation is an estimate that we can use to talk about how well we think the method may actually do in practice. It is not a guarantee of performance. Once we estimate the performance of our algorithm, we can then re-train the final algorithm on the entire training dataset and get it ready for operational use. Next up we are going to look at four different techniques that we can use to split up our training dataset and create useful estimates of performance for our machine learning algorithms:

  • Train and Test Sets.
  • k-fold Cross-Validation.
  • Leave One Out Cross-Validation.
  • Repeated Random Test-Train Splits.

Split into Train and Test Sets

The simplest method that we can use to evaluate the performance of a machine learning algorithm is to use different training and testing datasets. We can take our original dataset and split it into two parts. Train the algorithm on the first part, make predictions on the second part and evaluate the predictions against the expected results. The size of the split can depend on the size and specifics of your dataset, although it is common to use 67% of the data for training and the remaining 33% for testing.

This algorithm evaluation technique is very fast. It is ideal for large datasets (millions of records) where there is strong evidence that both splits of the data are representative of the underlying problem. Because of the speed, it is useful to use this approach when the algorithm you are investigating is slow to train. A downside of this technique is that it can have a high variance. This means that differences in the training and test dataset can result in meaningful differences in the estimate of accuracy. In the example below we split the Pima Indians dataset into 67%/33% splits for training and test and evaluate the accuracy of a Logistic Regression model.

from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
names = ['preg','plas','pres','skin','test','mass','pedi','age','class']
dataframe = read_csv(filepath+filename, names=names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]
test_size = 0.33
seed = 7
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
model = LogisticRegression(solver='liblinear')
model.fit(X_train, Y_train)
result = model.score(X_test, Y_test)
print(f'Accuracy: {result*100: 0.3f}%')

We can see that the estimated accuracy for the model was approximately 75%. Note that in addition to specifying the size of the split, we also specify the random seed. Because the split of the data is random, we want to ensure that the results are reproducible. By specifying the random seed we ensure that we get the same random numbers each time we run the code and in turn the same split of data. This is important if we want to compare this result to the estimated accuracy of another machine learning algorithm or the same algorithm with a different configuration. To ensure the comparison was apples-for-apples, we must ensure that they are trained and tested on exactly the same data.

K-fold Cross-Validation

Cross-validation is an approach that you can use to estimate the performance of a machine learning algorithm with less variance than a single train-test set split. It works by splitting the dataset into k-parts (e.g. k = 5 or k = 10). Each split of the data is called a fold. The algorithm is trained on k − 1 folds with one held back and tested on the held back fold. This is repeated so that each fold of the dataset is given a chance to be the held back test set. After running cross-validation you end up with k different performance scores that you can summarize using a mean and a standard deviation.

The result is a more reliable estimate of the performance of the algorithm on new data. It is more accurate because the algorithm is trained and evaluated multiple times on different data. The choice of k must allow the size of each test partition to be large enough to be a reasonable sample of the problem, whilst allowing enough repetitions of the train-test evaluation of the algorithm to provide a fair estimate of the algorithm's performance on unseen data. For modest sized datasets in the thousands or tens of thousands of records, k values of 3, 5 and 10 are common. In the example below we use 10-fold cross-validation.

from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
names = ['preg','plas','pres','skin','test','mass','pedi','age','class']
dataframe = read_csv(filepath+filename, names=names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]
kfold = KFold(n_splits=10, random_state=7, shuffle=True)
model = LogisticRegression(solver='liblinear')
results = cross_val_score(model, X, Y, cv=kfold)
print(f'Accuracy: {results.mean()*100 : .3f} ({results.std()*100 : .3f})')

You can see that we report both the mean and the standard deviation of the performance measure. When summarizing performance measures, it is a good practice to summarize the distribution of the measures, in this case assuming a Gaussian distribution of performance (a very reasonable assumption) and recording the mean and standard deviation.

Leave One Out Cross-Validation

You can configure cross-validation so that the size of the fold is 1 (k is set to the number of observations in your dataset). This variation of cross-validation is called leave-one-out cross-validation. The result is a large number of performance measures that can be summarized in an effort to give a more reasonable estimate of the accuracy of your model on unseen data. A downside is that it can be a computationally more expensive procedure than k-fold cross-validation.

from pandas import read_csv
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
names = ['preg','plas','pres','skin','test','mass','pedi','age','class']
dataframe = read_csv(filepath+filename, names=names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]
loocv = LeaveOneOut()
model = LogisticRegression(solver='liblinear')
results = cross_val_score(model, X, Y, cv=loocv)
print(f'Accuracy: {results.mean()*100: 0.3f}% ({results.std()*100: 0.3f})')

You can see in the standard deviation that the score has more variance than the k-fold cross-validation results described above.

Repeated Random Test-Train Splits

Another variation on k-fold cross-validation is to create a random split of the data like the train/test split described above, but repeat the process of splitting and evaluation of the algorithm multiple times, like cross-validation. This has the speed of using a train/test split and the reduction in variance in the estimated performance of k-fold cross-validation. You can also repeat the process many more times as needed to improve the accuracy. A downside is that repetitions may include much of the same data in the train or the test split from run to run, introducing redundancy into the evaluation. The example below splits the data into a 67%/33% train/test split and repeats the process 10 times.

from pandas import read_csv
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
names = ['preg','plas','pres','skin','test','mass','pedi','age','class']
dataframe = read_csv(filepath+filename, names=names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]
n_splits = 10
test_size = 0.33
seed = 7
kfold = ShuffleSplit(n_splits=n_splits, test_size=test_size, random_state=seed)
model = LogisticRegression(solver='liblinear')
results = cross_val_score(model, X, Y, cv=kfold)
print(f'Accuracy: {results.mean()*100 : 0.3f}% ({results.std()*100 : 0.3f})')

What Techniques to Use When

This section lists some tips to consider what resampling technique to use in different circumstances.

  • Generally k-fold cross-validation is the gold standard for evaluating the performance of a machine learning algorithm on unseen data with k set to 3, 5, or 10.
  • Using a train/test split is good for speed when using a slow algorithm and produces performance estimates with lower bias when using large datasets.
  • Techniques like leave-one-out cross-validation and repeated random splits can be useful intermediates when trying to balance variance in the estimated performance, model training speed and dataset size.

The best advice is to experiment and find a technique for your problem that is fast and produces reasonable estimates of performance that you can use to make decisions. If in doubt, use 10-fold cross-validation.

Machine Learning Algorithm Performance Metrics

The metrics that you choose to evaluate your machine learning algorithms are very important. Choice of metrics influences how the performance of machine learning algorithms is measured and compared. They influence how you weight the importance of different characteristics in the results and your ultimate choice of which algorithm to choose. In this chapter you will discover how to select and use different machine learning performance metrics in Python with scikit-learn.

Classification Metrics

Classification problems are perhaps the most common type of machine learning problem and as such there is a myriad of metrics that can be used to evaluate predictions for these problems. In this section we will review how to use the following metrics:

  • Classification Accuracy.
  • Logistic Loss.
  • Area Under ROC Curve.
  • Confusion Matrix.
  • Classification Report.

Classification Accuracy

Classification accuracy is the number of correct predictions made as a ratio of all predictions made. This is the most common evaluation metric for classification problems; it is also the most misused. It is really only suitable when there are an equal number of observations in each class (which is rarely the case) and when all predictions and prediction errors are equally important, which is often not the case. Below is an example of calculating classification accuracy.

from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
names = ['preg','plas','pres','skin','test','mass','pedi','age','class']
dataframe = read_csv(filepath+filename, names=names )
dataframe
array = dataframe.values
array
X = array[:, 0:8]
Y = array[:, 8]
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
model = LogisticRegression(solver='liblinear')
results = cross_val_score(model, X, Y, cv=kfold, scoring='accuracy')
print(f'Accuracy: {results.mean()*100:.3f}% ({results.std()*100:.3f})')

Logistic Loss

Logistic loss (or logloss) is a performance metric for evaluating the predictions of probabilities of membership to a given class. The scalar probability between 0 and 1 can be seen as a measure of confidence for a prediction by an algorithm. Predictions that are correct or incorrect are rewarded or punished proportionally to the confidence of the prediction.

from pandas import read_csv
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
names = ['preg','plas','pres','skin','test','mass','pedi','age','class']
dataframe = read_csv(filepath+filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:, 8]
cv = KFold(n_splits=10, random_state=7, shuffle=True)
model = LogisticRegression(solver='liblinear')
results = cross_val_score(model, X, Y, cv=cv, scoring='neg_log_loss')
print(f'Logloss: {results.mean():.3f} ({results.std():.3f})')

Smaller logloss is better with 0 representing a perfect logloss. As mentioned above, the measure is inverted (reported as a negative value) to be ascending when using the cross_val_score() function.
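
Because scoring='neg_log_loss' returns negated scores, you can recover the conventional positive logloss by flipping the sign, as in this small sketch reusing the results array from the recipe above:

# convert the negated scores back to an ordinary (positive) logloss
print(f'Logloss: {-results.mean():.3f} ({results.std():.3f})')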

Area Under ROC Curve

Area under ROC Curve (or ROC AUC for short) is a performance metric for binary classification problems. The AUC represents a model’s ability to discriminate between positive and negative classes. An area of 1.0 represents a model that made all predictions perfectly. An area of 0.5 represents a model that is as good as random. A ROC Curve is a plot of the true positive rate and the false positive rate for a given set of probability predictions at different thresholds used to map the probabilities to class labels. The area under the curve is then the approximate integral under the ROC Curve.

from pandas import read_csv
from sklearn.model_selection import KFold,cross_val_score
from sklearn.linear_model import LogisticRegression
names = ['preg','plas','pres','skin','test','mass','pedi','age','class']
dataframe = read_csv(filepath+filename, names=names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]
cv = KFold(n_splits=10, random_state=7, shuffle=True)
model = LogisticRegression(solver='liblinear')
results = cross_val_score(model, X, Y, cv=cv, scoring='roc_auc')
print(f'AUC: {results.mean(): .3f} ({results.std(): 0.3f})')

You can see the AUC is relatively close to 1 and greater than 0.5, suggesting some skill in the predictions.

Confusion Matrix

The confusion matrix is a handy presentation of the accuracy of a model with two or more classes. The table presents predictions on the x-axis and true outcomes on the y-axis. The cells of the table are the number of predictions made by a machine learning algorithm. For example, a machine learning algorithm can predict 0 or 1 and each prediction may actually have been a 0 or 1. Predictions for 0 that were actually 0 appear in the cell for prediction = 0 and actual = 0, whereas predictions for 0 that were actually 1 appear in the cell for prediction = 0 and actual = 1. And so on.

from pandas import read_csv
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
names = ['preg','plas','pres','skin','test','mass','pedi','age','class']
dataframe = read_csv(filepath+filename, names=names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:,8]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=7)
model = LogisticRegression(solver='liblinear')
model.fit(X_train, Y_train)
predicted_Y = model.predict(X_test)
matrix = confusion_matrix(Y_test, predicted_Y)
print(matrix)

Although the array is printed without headings, you can see that the majority of the predictions fall on the diagonal line of the matrix (which are correct predictions).
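
To make the matrix easier to read, you can wrap it in a DataFrame with labelled rows and columns; the label strings below are purely illustrative.

from pandas import DataFrame

# rows are the true classes, columns are the predicted classes (scikit-learn's convention)
labelled = DataFrame(matrix,
                     index=['actual 0', 'actual 1'],
                     columns=['predicted 0', 'predicted 1'])
print(labelled)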

Classification Report

The scikit-learn library provides a convenience report when working on classification problems to give you a quick idea of the accuracy of a model using a number of measures. The classification_report() function displays the precision, recall, F1-score and support for each class.

from pandas import read_csv
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
names = ['preg','plas','pres','skin','test','mass','pedi','age','class']
dataframe = read_csv(filepath+filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
model = LogisticRegression(solver='liblinear')
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=7)
model.fit(X_train, Y_train)
predicted_Y = model.predict(X_test)
report = classification_report(Y_test, predicted_Y)
print(report)

Regression Metrics

In this section we will review 3 of the most common metrics for evaluating predictions on regression machine learning problems:

  • Mean Absolute Error.
  • Mean Squared Error.
  • R² (R Squared).

Mean Absolute Error

The Mean Absolute Error (or MAE) is the average of the absolute differences between predictions and actual values. It gives an idea of how wrong the predictions were. The measure gives an idea of the magnitude of the error, but no idea of the direction (e.g. over or under predicting).

from pandas import read_csv
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
names = ['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO',
'B','LSTAT','MEDV']
dataframe = read_csv(filepath+housing, delim_whitespace=True, names=names)
dataframe
array = dataframe.values
X = array[:, 0:13]
Y = array[:, 13]
cv = KFold(n_splits=10, random_state=7, shuffle=True)
model = LinearRegression()
results = cross_val_score(model, X, Y, cv=cv, scoring='neg_mean_absolute_error')
print(f'MAE: {results.mean(): 0.3f} ({results.std(): .3f})')

Mean Squared Error

The Mean Squared Error (or MSE) is much like the mean absolute error in that it provides a gross idea of the magnitude of the error, but squaring each error penalizes larger errors more heavily. Taking the square root of the mean squared error converts the units back to the original units of the output variable and gives the Root Mean Squared Error (RMSE).

from pandas import read_csv
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression
names = ['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO',
'B','LSTAT','MEDV']
dataframe = read_csv(filepath+housing, delim_whitespace=True,names=names)
array = dataframe.values
X = array[:, 0:13]
Y = array[:, 13]
model = LinearRegression()
cv = KFold(n_splits=10, random_state=7, shuffle=True)
results = cross_val_score(model, X, Y, cv=cv, scoring='neg_mean_squared_error')
print(f'MSE: {results.mean(): 0.3f} ({results.std(): 0.3f})')

This metric too is inverted (reported as a negative value) so that larger scores are better. Remember to take the absolute value before taking the square root if you are interested in calculating the RMSE.
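
For example, the RMSE can be recovered from the negated scores like this (a small sketch reusing the results array from the recipe above):

from numpy import sqrt
# negate the mean score to get a positive MSE, then take the square root
rmse = sqrt(-results.mean())
print(f'RMSE: {rmse:0.3f}')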

R2 Metric

The R² (or R Squared) metric provides an indication of the goodness of fit of a set of predictions to the actual values. In statistical literature this measure is called the coefficient of determination. Values closer to 1 indicate a better fit, with 1 representing a perfect fit and 0 representing no fit (the score can even be negative for models that fit worse than simply predicting the mean).

from pandas import read_csv
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression
names = ['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO',
'B','LSTAT','MEDV']
dataframe = read_csv(filepath+housing, delim_whitespace=True, names=names)
array = dataframe.values
X = array[:, 0:13]
Y = array[:, 13]
cv = KFold(n_splits=10, shuffle=True, random_state=7)
model = LinearRegression()
results = cross_val_score(model, X, Y, cv=cv, scoring='r2')
print(f'R Squared: {results.mean():0.3f} ({results.std(): .3f})')

Spot-Check Classification Algorithms

Spot-checking is a way of discovering which algorithms perform well on your machine learning problem. You cannot know which algorithms are best suited to your problem beforehand. You must trial a number of methods and focus attention on those that prove themselves the most promising. In this chapter you will discover six machine learning algorithms that you can use when spot-checking your classification problem in Python with scikit-learn. After completing this lesson you will know:

  1. How to spot-check machine learning algorithms on a classification problem.
  2. How to spot-check two linear classification algorithms.
  3. How to spot-check four nonlinear classification algorithms.

Algorithm Spot-Checking

You cannot know which algorithm will work best on your dataset beforehand. You must use trial and error to discover a shortlist of algorithms that do well on your problem that you can then double down on and tune further. I call this process spot-checking. The question is not: What algorithm should I use on my dataset? Instead it is: What algorithms should I spot-check on my dataset? You can guess at what algorithms might do well on your dataset, and this can be a good starting point. I recommend trying a mixture of algorithms and see what is good at picking out the structure in your data. Below are some suggestions when spot-checking algorithms on your dataset:

  • Try a mixture of algorithm representations (e.g. instances and trees).
  • Try a mixture of learning algorithms (e.g. different algorithms for learning the same type of representation).
  • Try a mixture of modeling types (e.g. linear and nonlinear functions or parametric and nonparametric).

Let’s get specific. In the next section, we will look at algorithms that you can use to spot-check on your next classification machine learning project in Python.

Algorithms Overview

We are going to take a look at six classification algorithms that you can spot-check on your dataset. Starting with two linear machine learning algorithms:

  • Logistic Regression.
  • Linear Discriminant Analysis.

Then looking at four nonlinear machine learning algorithms:

  • k-Nearest Neighbors.
  • Naive Bayes.
  • Classification and Regression Trees.
  • Support Vector Machines.

Each recipe is demonstrated on the Pima Indians onset of Diabetes dataset. A test harness using 10-fold cross-validation is used to demonstrate how to spot-check each machine learning algorithm and mean accuracy measures are used to indicate algorithm performance. The recipes assume that you know about each machine learning algorithm and how to use them. We will not go into the API or parameterization of each algorithm.

Linear Machine Learning Algorithms

Logistic Regression

Logistic regression assumes a Gaussian distribution for the numeric input variables and can model binary classification problems.

from pandas import read_csv
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
names = ['preg','plas','pres','skin','test','mass','pedi','age','class']
dataframe = read_csv(filepath+pima, names=names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]
kfold = KFold(n_splits=10, random_state=7, shuffle=True)
model = LogisticRegression(solver='liblinear')
results = cross_val_score(model, X, Y, cv=kfold)
print(f'{results.mean(): .3f}')

Linear Discriminant Analysis

Linear Discriminant Analysis or LDA is a statistical technique for binary and multiclass classification. It too assumes a Gaussian distribution for the numerical input variables.

from pandas import read_csv
from sklearn.model_selection import KFold, cross_val_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
names = ['preg','plas','pres','skin','test','mass','pedi','age','class']
dataframe = read_csv(filepath+pima, names=names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:,8]
kfold = KFold(n_splits=10, random_state=7, shuffle=True)
model = LinearDiscriminantAnalysis()
results = cross_val_score(model , X ,Y, cv=kfold)
print(f'{results.mean()}')

Nonlinear Machine Learning Algorithms

k-Nearest Neighbors

The k-Nearest Neighbors algorithm (or KNN) uses a distance metric to find the k most similar instances in the training data for a new instance and takes the mean outcome of the neighbors as the prediction. You can construct a KNN model using the KNeighborsClassifier class.

from pandas import read_csv
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
names = ['preg','plas','pres','skin','test','mass','pedi','age','class']
dataframe = read_csv(filepath+pima, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
kfold = KFold(n_splits=10, random_state=7, shuffle=True)
model = KNeighborsClassifier()
results = cross_val_score(model, X, Y, cv=kfold)
print(f'{results.mean()}')

Naive Bayes

Naive Bayes calculates the probability of each class and the conditional probability of each class given each input value. These probabilities are estimated for new data and multiplied together, assuming that they are all independent (a simple or naive assumption). When working with real-valued data, a Gaussian distribution is assumed to easily estimate the probabilities for input variables using the Gaussian Probability Density Function.

from pandas import read_csv
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
names = ['preg','plas','pres','skin','test','mass','pedi','age','class']
dataframe = read_csv(filepath+pima, names=names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:,8]
kfold = KFold(n_splits=10, random_state=7, shuffle=True)
model = GaussianNB()
results = cross_val_score(model, X, Y, cv=kfold)
print(f'{results.mean()}')

Classification and Regression Trees

Classification and Regression Trees (CART or just decision trees) construct a binary tree from the training data. Split points are chosen greedily by evaluating each attribute and each value of each attribute in the training data in order to minimize a cost function (like the Gini index).

from pandas import read_csv
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier
names = ['preg','plas','pres','skin','test','mass','pedi','age','class']
dataframe = read_csv(filepath+pima, names=names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]
kfold = KFold(n_splits=10, random_state=7, shuffle=True)
model = DecisionTreeClassifier()
results = cross_val_score(model, X, Y, cv=kfold)
print(f'{results.mean(): .3f}')

Support Vector Machines

Support Vector Machines (or SVM) seek a line that best separates two classes. Those data instances that are closest to the line that best separates the classes are called support vectors and influence where the line is placed. SVM has been extended to support multiple classes. Of particular importance is the use of different kernel functions via the kernel parameter. A powerful Radial Basis Function is used by default.

from pandas import read_csv
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC
names = ['preg','plas','pres','skin','test','mass','pedi','age','class']
dataframe = read_csv(filepath+pima, names=names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]
kfold = KFold(n_splits=10, random_state=7, shuffle=True)
model = SVC()
results = cross_val_score(model, X, Y, cv=kfold)
print(f'{results.mean()}')
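
If you want to try something other than the default Radial Basis Function, the kernel can be set via the kernel parameter; for example (not part of the original recipe):

# a linear kernel instead of the default radial basis function ('rbf')
model = SVC(kernel='linear')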

Spot-Check Regression Algorithms

Algorithms Overview

In this lesson we are going to take a look at seven regression algorithms that you can spot-check on your dataset.

Starting with four linear machine learning algorithms:

  • Linear Regression.
  • Ridge Regression.
  • LASSO Linear Regression.
  • Elastic Net Regression.

Then looking at three nonlinear machine learning algorithms:

  • k-Nearest Neighbors.
  • Classification and Regression Trees.
  • Support Vector Machines.

Linear Machine Learning Algorithms

Linear Regression

Linear regression assumes that the input variables have a Gaussian distribution. It is also assumed that input variables are relevant to the output variable and that they are not highly correlated with each other (a problem called collinearity).

from pandas import read_csv
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression
names = ['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT','MEDV']
dataframe = read_csv(filepath+housing, names=names, delim_whitespace=True)
array = dataframe.values
X = array[:, 0:13]
Y = array[:, 13]
kfold = KFold(n_splits=10, random_state=7, shuffle=True)
model = LinearRegression()
results = cross_val_score(model, X, Y, cv=kfold, scoring='neg_mean_squared_error')
print(f'{results.mean()}')

Ridge Regression

Ridge regression is an extension of linear regression where the loss function is modified to minimize the complexity of the model measured as the sum squared value of the coefficient values (also called the L2-norm).

from pandas import read_csv
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import Ridge
names = ['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT','MEDV']
dataframe = read_csv(filepath+housing, names=names, delim_whitespace=True)
array = dataframe.values
X = array[:, 0:13]
Y = array[:,13]
kfold = KFold(n_splits=10, random_state=7, shuffle=True)
model = Ridge()
results = cross_val_score(model, X, Y, cv=kfold, scoring='neg_mean_squared_error')
print(f'{results.mean()}')

LASSO Regression

The Least Absolute Shrinkage and Selection Operator (or LASSO for short) is a modification of linear regression, like ridge regression, where the loss function is modified to minimize the complexity of the model measured as the sum absolute value of the coefficient values (also called the L1-norm).

from pandas import read_csv
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import Lasso
names = ['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT','MEDV']
dataframe = read_csv(filepath+housing, delim_whitespace=True, names=names)
array = dataframe.values
X = array[:, 0:13]
Y = array[:, 13]
kfold = KFold(n_splits=10, random_state=7, shuffle=True)
model = Lasso()
results = cross_val_score(model, X, Y, cv=kfold, scoring='neg_mean_squared_error')
print(f'{results.mean()}')

ElasticNet Regression

ElasticNet is a form of regularization regression that combines the properties of both Ridge Regression and LASSO regression. It seeks to minimize the complexity of the regression model (magnitude and number of regression coefficients) by penalizing the model using both the L2-norm (sum squared coefficient values) and the L1-norm (sum absolute coefficient values).

from pandas import read_csv
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import ElasticNet
names = ['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT','MEDV']
dataframe = read_csv(filepath+housing, names=names, delim_whitespace=True)
array = dataframe.values
X = array[:, 0:13]
Y = array[:, 13]
kfold = KFold(n_splits=10, random_state=7, shuffle=True)
model = ElasticNet()
results = cross_val_score(model, X, Y, cv=kfold, scoring='neg_mean_squared_error')
print(f'{results.mean()}')

Nonlinear Machine Learning Algorithms

K-Nearest Neighbors

The k-Nearest Neighbors algorithm (or KNN) locates the k most similar instances in the training dataset for a new data instance. From the k neighbors, the mean or median output variable is taken as the prediction. Of note is the distance metric used (the metric argument). The Minkowski distance is used by default, which is a generalization of both the Euclidean distance (used when all inputs have the same scale) and Manhattan distance (used when the scales of the input variables differ).

from pandas import read_csv
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
names = ['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT','MEDV']
dataframe = read_csv(filepath+housing, names=names, delim_whitespace=True)
array = dataframe.values
X = array[:, 0:13]
Y = array[: ,13]
kfold = KFold(n_splits=10, random_state=7, shuffle=True)
model = KNeighborsRegressor()
results = cross_val_score(model, X, Y, cv=kfold, scoring='neg_mean_squared_error')
print(f'{results.mean()}')
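
The distance metric mentioned above can be changed via the metric argument; for example (not part of the original recipe):

# use the Manhattan distance instead of the default Minkowski (Euclidean) distance
model = KNeighborsRegressor(n_neighbors=5, metric='manhattan')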

Classification and Regression Trees

Decision trees, or Classification and Regression Trees (CART) as they are known, use the training data to select the best points to split the data in order to minimize a cost metric. The default cost metric for regression decision trees is the mean squared error, specified in the criterion parameter.

from pandas import read_csv
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor
names = ['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT','MEDV']
dataframe = read_csv(filepath+housing, names=names, delim_whitespace=True)
array = dataframe.values
X = array[:, 0:13]
Y = array[:, 13]
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
model = DecisionTreeRegressor()
results = cross_val_score(model, X, Y, cv=kfold, scoring='neg_mean_squared_error')
print(f'{results.mean()}')

Support Vector Machines

Support Vector Machines (SVM) were developed for binary classification. The technique has been extended to the prediction of real-valued quantities and is called Support Vector Regression (SVR). Like the classification example, SVR is built upon the LIBSVM library.

from pandas import read_csv
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVR
names = ['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT','MEDV']
dataframe = read_csv(filepath+housing, names=names, delim_whitespace=True)
array = dataframe.values
X = array[:,0:13]
Y = array[:, 13]
kfold = KFold(n_splits=10, random_state=7, shuffle=True)
model = SVR(gamma='auto')
results = cross_val_score(model, X, Y, cv=kfold, scoring='neg_mean_squared_error')
print(f'{results.mean()}')

Compare Machine Learning Algorithms

It is important to compare the performance of multiple different machine learning algorithms consistently. In this chapter you will discover how you can create a test harness to compare multiple different machine learning algorithms in Python with scikit-learn. You can use this test harness as a template on your own machine learning problems and add more and different algorithms to compare. After completing this lesson you will know:

  1. How to formulate an experiment to directly compare machine learning algorithms.
  2. A reusable template for evaluating the performance of multiple algorithms on one dataset.
  3. How to report and visualize the results when comparing algorithm performance.

Choose The Best Machine Learning Model

When you work on a machine learning project, you often end up with multiple good models to choose from. Each model will have different performance characteristics. Using resampling methods like cross-validation, you can get an estimate for how accurate each model may be on unseen data. You need to be able to use these estimates to choose one or two best models from the suite of models that you have created.

When you have a new dataset, it is a good idea to visualize the data using different techniques in order to look at the data from different perspectives. The same idea applies to model selection. You should use a number of different ways of looking at the estimated accuracy of your machine learning algorithms in order to choose the one or two algorithms to finalize. A way to do this is to use visualization methods to show the average accuracy, variance and other properties of the distribution of model accuracies. In the next section you will discover exactly how you can do that in Python with scikit-learn.

Compare Machine Learning Algorithms Consistently

The key to a fair comparison of machine learning algorithms is ensuring that each algorithm is evaluated in the same way on the same data. You can achieve this by forcing each algorithm to be evaluated on a consistent test harness. In the example below six different classification algorithms are compared on a single dataset:

  • Logistic Regression.
  • Linear Discriminant Analysis.
  • k-Nearest Neighbors.
  • Classification and Regression Trees.
  • Naive Bayes.
  • Support Vector Machines.

from matplotlib import pyplot
from pandas import read_csv
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
names = ['preg','plas','pres','skin','test','mass','pedi','age','class']
dataframe = read_csv(filepath+pima, names=names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]
models = []
models.append(('LR', LogisticRegression(solver='liblinear')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
results = []
model_names = []
for name, model in models:
    kfold = KFold(n_splits=10, random_state=7, shuffle=True)
    cv_results = cross_val_score(model, X, Y, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    model_names.append(name)
    print(f'{name}: {cv_results.mean():.3f} ({cv_results.std():.3f})')
fig = pyplot.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
pyplot.boxplot(results)
ax.set_xticklabels(model_names)
pyplot.show()

The example also provides a box and whisker plot showing the spread of the accuracy scores across each cross-validation fold for each algorithm.

In this lesson you learned how to compare the performance of machine learning algorithms to each other. But what if you need to prepare your data as part of the comparison process? In a later lesson you will discover Pipelines in scikit-learn and how they overcome the common problem of data leakage when comparing machine learning algorithms.

Feature Selection For Machine Learning

The data features that you use to train your machine learning models have a huge influence on the performance you can achieve. Irrelevant or partially relevant features can negatively impact model performance. In this chapter you will discover automatic feature selection techniques that you can use to prepare your machine learning data in Python with scikit-learn.

After completing this lesson you will know how to use:

  1. Univariate Selection.
  2. Recursive Feature Elimination.
  3. Principal Component Analysis.
  4. Feature Importance.

Univariate Selection

Statistical tests can be used to select those features that have the strongest relationship with the output variable. The scikit-learn library provides the SelectKBest class, which can be used with a suite of different statistical tests to select a specific number of features. For example, the ANOVA F-value method is appropriate for numerical inputs and a categorical output, as is the case with the Pima dataset.

from sklearn.feature_selection import SelectKBest, f_classif
from numpy import set_printoptions
X, Y = load_pima_data()
X[0:5, :]
array([[6.000e+00, 1.480e+02, 7.200e+01, 3.500e+01, 0.000e+00, 3.360e+01,
        6.270e-01, 5.000e+01],
       [1.000e+00, 8.500e+01, 6.600e+01, 2.900e+01, 0.000e+00, 2.660e+01,
        3.510e-01, 3.100e+01],
       [8.000e+00, 1.830e+02, 6.400e+01, 0.000e+00, 0.000e+00, 2.330e+01,
        6.720e-01, 3.200e+01],
       [1.000e+00, 8.900e+01, 6.600e+01, 2.300e+01, 9.400e+01, 2.810e+01,
        1.670e-01, 2.100e+01],
       [0.000e+00, 1.370e+02, 4.000e+01, 3.500e+01, 1.680e+02, 4.310e+01,
        2.288e+00, 3.300e+01]])
# feature extraction
test = SelectKBest(score_func=f_classif, k=4)
fit = test.fit(X,Y)
# summarize scores
set_printoptions(precision=3)
fit.scores_
array([ 39.67 , 213.162,   3.257,   4.304,  13.281,  71.772,  23.871,
        46.141])
features = fit.transform(X)
features[0:5, :]
array([[  6. , 148. ,  33.6,  50. ],
       [  1. ,  85. ,  26.6,  31. ],
       [  8. , 183. ,  23.3,  32. ],
       [  1. ,  89. ,  28.1,  21. ],
       [  0. , 137. ,  43.1,  33. ]])

You can see the scores for each attribute and the 4 attributes chosen (those with the highest scores): specifically, the features with indexes 0 (preg), 1 (plas), 5 (mass), and 7 (age).

Recursive Feature Elimination

Recursive Feature Elimination (or RFE) works by recursively removing attributes and building a model on those that remain. It uses model accuracy to identify which attributes (and combinations of attributes) contribute most to predicting the target attribute.

from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
X, Y = load_pima_data()
model = LogisticRegression(solver='liblinear')
rfe = RFE(model, n_features_to_select=3)
fit = rfe.fit(X,Y)
print(f'Num Features: {fit.n_features_}')
Num Features: 3
print(f'Selected Features: {fit.support_}')
Selected Features: [ True False False False False  True  True False]
print(f'Feature Ranking: {fit.ranking_}')
Feature Ranking: [1 2 3 5 6 1 1 4]
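
Like SelectKBest above, the fitted RFE object can also transform the data down to just the selected features. A minimal sketch using the fit object from the cell above:

# reduce X to only the 3 features selected by RFE
selected = fit.transform(X)
print(selected.shape)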

Principal Component Analysis

Principal Component Analysis (or PCA) uses linear algebra to transform the dataset into a compressed form. Generally this is called a data reduction technique. A property of PCA is that you can choose the number of dimensions or principal components in the transformed result.

from sklearn.decomposition import PCA
X, Y = load_pima_data()
pca = PCA(n_components=3)
fit = pca.fit(X)
print(f'Explained Variance: {fit.explained_variance_ratio_}')
Explained Variance: [0.889 0.062 0.026]
print(fit.components_)
[[-2.022e-03  9.781e-02  1.609e-02  6.076e-02  9.931e-01  1.401e-02
   5.372e-04 -3.565e-03]
 [-2.265e-02 -9.722e-01 -1.419e-01  5.786e-02  9.463e-02 -4.697e-02
  -8.168e-04 -1.402e-01]
 [-2.246e-02  1.434e-01 -9.225e-01 -3.070e-01  2.098e-02 -1.324e-01
  -6.400e-04 -1.255e-01]]
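
The fitted PCA object can then project the original data onto the chosen components. A minimal sketch using the fit object from the cell above:

# project the 8 input features onto the 3 principal components
transformed = fit.transform(X)
print(transformed.shape)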

Feature Importance

Bagged decision trees like Random Forest and Extra Trees can be used to estimate the importance of features.

from sklearn.ensemble import ExtraTreesClassifier
X,Y = load_pima_data()
model = ExtraTreesClassifier(n_estimators=100)
model.fit(X,Y)
ExtraTreesClassifier()
print(model.feature_importances_)
[0.112 0.237 0.098 0.081 0.075 0.139 0.12  0.139]
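
The importances are reported in the same order as the input columns, so pairing them with the column names makes the output easier to read. A minimal sketch using the model fitted above:

# pair each importance score with its column name (the first 8 entries of pima_names are the inputs)
for name, score in zip(pima_names[:8], model.feature_importances_):
    print(f'{name}: {score:.3f}')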

Automate Machine Learning Workflows with Pipelines

There are standard workflows in a machine learning project that can be automated. In Python scikit-learn, Pipelines help to clearly define and automate these workflows. In this chapter you will discover Pipelines in scikit-learn and how you can automate common machine learning workflows. After completing this lesson you will know: 1. How to use pipelines to minimize data leakage. 2. How to construct a data preparation and modeling pipeline. 3. How to construct a feature extraction and modeling pipeline.

Automating Machine Learning Workflows

There are standard workflows in applied machine learning. Standard because they overcome common problems like data leakage in your test harness. Python scikit-learn provides a Pipeline utility to help automate machine learning workflows. Pipelines work by allowing for a linear sequence of data transforms to be chained together culminating in a modeling process that can be evaluated.

The goal is to ensure that all of the steps in the pipeline are constrained to the data available for the evaluation, such as the training dataset or each fold of the cross-validation procedure. You can learn more about Pipelines in scikit-learn by reading the Pipeline section of the user guide. You can also review the API documentation for the Pipeline and FeatureUnion classes and the pipeline module.

Data Preparation and Modeling Pipeline

An easy trap to fall into in applied machine learning is leaking data from your training dataset to your test dataset. To avoid this trap you need a robust test harness with strong separation of training and testing. This includes data preparation. Data preparation is one easy way to leak knowledge of the whole training dataset to the algorithm. For example, preparing your data using normalization or standardization on the entire training dataset before learning would not be a valid test because the training dataset would have been influenced by the scale of the data in the test set.

Pipelines help you prevent data leakage in your test harness by ensuring that data preparation like standardization is constrained to each fold of your cross-validation procedure. The example below demonstrates this important data preparation and model evaluation workflow on the Pima Indians onset of diabetes dataset. The pipeline is defined with two steps: 1. Standardize the data. 2. Learn a Linear Discriminant Analysis model.

# Create a pipeline that standardizes the data then creates a model
from pandas import read_csv
from sklearn.model_selection import KFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import Pipeline
dataframe = read_csv(filepath+pima, names=pima_names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]
kfold = KFold(n_splits=10, random_state=7, shuffle=True)
# create pipeline
pipeline_items = []
pipeline_items.append(('standardize', StandardScaler()))
pipeline_items.append(('lda', LinearDiscriminantAnalysis()))
pipeline = Pipeline(pipeline_items)
results = cross_val_score(pipeline, X, Y, cv=kfold)
print(f'{results.mean()}')

Notice how we create a Python list of steps that are provided to the Pipeline for processing the data. Also notice how the Pipeline itself is treated like an estimator and is evaluated in its entirety by the k-fold cross-validation procedure.
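
Because the Pipeline behaves like any other estimator, you can also fit it directly and use it to make predictions; the scaler and the LDA model are fitted together in a single call. A minimal sketch using the pipeline defined above:

# fit the whole pipeline (standardize, then LDA) and predict the first five rows
pipeline.fit(X, Y)
print(pipeline.predict(X[0:5, :]))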

Feature Extraction and Modeling Pipeline

Feature extraction is another procedure that is susceptible to data leakage. Like data preparation, feature extraction procedures must be restricted to the data in your training dataset. The sklearn.pipeline module provides a handy tool called FeatureUnion, which allows the results of multiple feature selection and extraction procedures to be combined into a larger dataset on which a model can be trained. Importantly, all the feature extraction and the feature union occurs within each fold of the cross-validation procedure. The example below demonstrates the pipeline defined with four steps:

  1. Feature Extraction with Principal Component Analysis (3 features).
  2. Feature Extraction with Statistical Selection (6 features).
  3. Feature Union.
  4. Learn a Logistic Regression Model.

from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.linear_model import LogisticRegression
# load data
dataframe = read_csv(filepath+pima, names=pima_names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]
# create feature union
features = []
features.append(('pca', PCA(n_components=3)))
features.append(('select_best', SelectKBest(k=6)))
feature_union = FeatureUnion(features)
# create pipeline
pipeline_items = []
pipeline_items.append(('feature_union', feature_union)) 
pipeline_items.append(('logistic', LogisticRegression(solver='liblinear')))
pipeline = Pipeline(pipeline_items)
# evaluate pipeline
kfold = KFold(n_splits=10, random_state=seed, shuffle=True)
results = cross_val_score(pipeline, X, Y, cv=kfold)
print(f'{results.mean()}')

Notice how the FeatureUnion is its own Pipeline that in turn is a single step in the final Pipeline used to feed Logistic Regression. This might get you thinking about how you can start embedding pipelines within pipelines.
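
For example, one branch of the FeatureUnion could itself be a small Pipeline that standardizes the data before extracting components. This is only an illustrative sketch, not part of the original example; it reuses X, Y, kfold, and the imports from the cell above, plus StandardScaler:

from sklearn.preprocessing import StandardScaler
# a pipeline embedded inside the feature union: standardize, then extract 3 principal components
scaled_pca = Pipeline([('scaler', StandardScaler()), ('pca', PCA(n_components=3))])
nested_union = FeatureUnion([('scaled_pca', scaled_pca), ('select_best', SelectKBest(k=6))])
nested_pipeline = Pipeline([('feature_union', nested_union),
                            ('logistic', LogisticRegression(solver='liblinear'))])
results = cross_val_score(nested_pipeline, X, Y, cv=kfold)
print(f'{results.mean()}')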

In this chapter you discovered the difficulties of data leakage in applied machine learning. You discovered the Pipeline utilities in Python scikit-learn and how they can be used to automate standard applied machine learning workflows. You learned how to use Pipelines in two important use cases:

  • Data preparation and modeling constrained to each fold of the cross-validation procedure.
  • Feature extraction and feature union constrained to each fold of the cross-validation procedure.

Improve Results

Improve Performance with Ensembles

Ensembles can give you a boost in accuracy on your dataset. In this chapter you will discover how you can create some of the most powerful types of ensembles in Python using scikit-learn. This lesson will step you through Boosting, Bagging and Majority Voting and show you how you can continue to ratchet up the accuracy of the models on your own datasets. After completing this lesson you will know:

  1. How to use bagging ensemble methods such as bagged decision trees, random forest and extra trees.
  2. How to use boosting ensemble methods such as AdaBoost and stochastic gradient boosting.
  3. How to use voting ensemble methods to combine the predictions from multiple algorithms.

The three most popular methods for combining the predictions from different models are:

  • Bagging. Building multiple models (typically of the same type) from different subsamples of the training dataset.
  • Boosting. Building multiple models (typically of the same type) each of which learns to fix the prediction errors of a prior model in the sequence of models.
  • Voting. Building multiple models (typically of differing types) and using simple statistics (like the mean) to combine their predictions.

This chapter assumes you are generally familiar with machine learning algorithms and ensemble methods and will not go into the details of how the algorithms work or their parameters.

Bagging Algorithms

Bootstrap Aggregation (or Bagging) involves taking multiple samples from your training dataset (with replacement) and training a model on each sample. The final output prediction is averaged across the predictions of all of the sub-models. The three bagging models covered in this section are:

  • Bagged Decision Trees.
  • Random Forest.
  • Extra Trees.

Bagged Decision Trees

Bagging performs best with algorithms that have high variance. A popular example is decision trees, often constructed without pruning. The example below uses the BaggingClassifier with the Classification and Regression Trees algorithm (DecisionTreeClassifier).

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
dataframe = read_csv(filepath+pima, names=pima_names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]
kfold = KFold(n_splits=10, random_state=seed, shuffle=True)
cart = DecisionTreeClassifier()
model = BaggingClassifier(base_estimator=cart, n_estimators=100, random_state=seed)
results = cross_val_score(model, X, Y, cv=kfold)
print(f'{results.mean()}')

Random Forest

Random Forest is an extension of bagged decision trees. Samples of the training dataset are taken with replacement, but the trees are constructed in a way that reduces the correlation between individual classifiers. Specifically, rather than greedily choosing the best split point in the construction of each tree, only a random subset of features is considered for each split.

from sklearn.ensemble import RandomForestClassifier
X, Y = load_pima_data()
model = RandomForestClassifier(n_estimators=100, max_features=3)
results = cross_val_score(model, X, Y, 
                          cv=KFold(n_splits=10, random_state=7, shuffle=True))
print(f'{results.mean()}')
0.7577922077922078

Extra Trees

Extra Trees are another modification of bagging in which random trees are constructed from samples of the training dataset. You can build an Extra Trees model for classification using the ExtraTreesClassifier class.

from sklearn.ensemble import ExtraTreesClassifier
X, Y = load_pima_data()
model = ExtraTreesClassifier(n_estimators=100, max_features=7)
results = cross_val_score(model, X, Y, 
                         cv= KFold(n_splits=10, random_state=7, shuffle=True))
print(f'{results.mean()}')
0.7656527682843473

Boosting Algorithms

Boosting ensemble algorithms create a sequence of models that attempt to correct the mistakes of the models before them in the sequence. Once created, the models make predictions which may be weighted by their demonstrated accuracy and the results are combined to create a final output prediction. The two most common boosting ensemble machine learning algorithms are:

  • AdaBoost.
  • Stochastic Gradient Boosting.

AdaBoost

AdaBoost was perhaps the first successful boosting ensemble algorithm. It generally works by weighting instances in the dataset by how easy or difficult they are to classify, allowing the algorithm to pay less or more attention to them in the construction of subsequent models.

from sklearn.ensemble import AdaBoostClassifier
X, Y = load_pima_data()
model = AdaBoostClassifier(n_estimators=30, random_state=seed)
results = cross_val_score(model, X, Y, 
                         cv=KFold(n_splits=10, random_state=seed, shuffle=True))
print(f'{results.mean()}')
0.7552802460697198

Stochastic Gradient Boosting

Stochastic Gradient Boosting (also called Gradient Boosting Machines) is one of the most sophisticated ensemble techniques. It is also proving to be perhaps one of the best techniques available for improving performance via ensembles.

from sklearn.ensemble import GradientBoostingClassifier
X, Y = load_pima_data()
model = GradientBoostingClassifier(n_estimators=100, random_state=seed)
results = cross_val_score(model, X, Y, 
                         cv=KFold(n_splits=10, random_state=seed, shuffle=True))
print(f'{results.mean()}')
0.7604921394395079

Voting Ensemble

Voting is one of the simplest ways of combining the predictions from multiple machine learning algorithms. It works by first creating two or more standalone models from your training dataset. A Voting Classifier can then be used to wrap your models and average the predictions of the sub-models when asked to make predictions for new data. The predictions of the sub-models can be weighted, but specifying the weights for classifiers manually or even heuristically is difficult. More advanced methods can learn how best to weight the predictions from sub-models; this is called stacking (stacked generalization), and recent versions of scikit-learn provide it via the StackingClassifier class.

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier
X, Y = load_pima_data()
models = []
models.append(('logistic', LogisticRegression(solver='liblinear')))
models.append(('cart', DecisionTreeClassifier()))
models.append(('svm', SVC(gamma='auto')))
ensemble = VotingClassifier(models)
results = cross_val_score(ensemble, X, Y, 
                         cv=KFold(n_splits=10, random_state=seed, shuffle=True))
print(f'{results.mean()}')
0.7461893369788107
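
If every sub-model can produce class probabilities, you can also use soft voting and optionally weight the sub-models. This is a hedged sketch, not part of the original example; the weights are arbitrary, and SVC needs probability=True to support soft voting:

# hypothetical variation: soft (probability-averaged) voting with arbitrary weights
soft_models = []
soft_models.append(('logistic', LogisticRegression(solver='liblinear')))
soft_models.append(('cart', DecisionTreeClassifier()))
soft_models.append(('svm', SVC(gamma='auto', probability=True)))
soft_ensemble = VotingClassifier(soft_models, voting='soft', weights=[2, 1, 2])
results = cross_val_score(soft_ensemble, X, Y, 
                          cv=KFold(n_splits=10, random_state=seed, shuffle=True))
print(f'{results.mean()}')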

Improve Performance with Algorithm Tuning

Machine learning models are parameterized so that their behavior can be tuned for a given problem. Models can have many parameters, and finding the best combination of parameters can be treated as a search problem. In this chapter you will discover how to tune the parameters of machine learning algorithms in Python using scikit-learn. After completing this lesson you will know: 1. The importance of algorithm parameter tuning to improve algorithm performance. 2. How to use a grid search algorithm tuning strategy. 3. How to use a random search algorithm tuning strategy.

Machine Learning Algorithm Parameters

Algorithm tuning is a final step in the process of applied machine learning before finalizing your model. It is sometimes called hyperparameter optimization where the algorithm parameters are referred to as hyperparameters, whereas the coefficients found by the machine learning algorithm itself are referred to as parameters. Optimization suggests the search-nature of the problem. Phrased as a search problem, you can use different search strategies to find a good and robust parameter or set of parameters for an algorithm on a given problem. Python scikit-learn provides two simple methods for algorithm parameter tuning:

  • Grid Search Parameter Tuning.
  • Random Search Parameter Tuning.

Grid Search Parameter Tuning

Grid search is an approach to parameter tuning that will methodically build and evaluate a model for each combination of algorithm parameters specified in a grid.

import numpy
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import GridSearchCV
X, Y = load_pima_data()
alphas = numpy.array([1, 0.1, 0.01, 0.001, 0.0001, 0])
param_grid = dict(alpha=alphas)
model = RidgeClassifier()
grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=3)
grid.fit(X, Y)
GridSearchCV(cv=3, estimator=RidgeClassifier(),
             param_grid={'alpha': array([1.e+00, 1.e-01, 1.e-02, 1.e-03, 1.e-04, 0.e+00])})
print(grid.best_score_)
0.7708333333333334
print(grid.best_estimator_.alpha)
1.0
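
GridSearchCV also records the score of every parameter combination it evaluated, which is useful for seeing how sensitive the model is to alpha. A minimal sketch using the grid object fitted above:

# show the mean cross-validation accuracy for every alpha in the grid
for params, score in zip(grid.cv_results_['params'], grid.cv_results_['mean_test_score']):
    print(f"{params['alpha']}: {score:.4f}")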

Random Search Parameter Tuning

Random search is an approach to parameter tuning that will sample algorithm parameters from a random distribution (e.g. uniform) for a fixed number of iterations. A model is constructed and evaluated for each combination of parameters chosen.

from scipy.stats import uniform
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import RandomizedSearchCV
X, Y = load_pima_data()
param_grid = {'alpha': uniform()}
model = RidgeClassifier()
rsearch = RandomizedSearchCV(estimator=model, 
                            param_distributions=param_grid,
                           n_iter=100, cv=3, random_state=7)
rsearch.fit(X, Y)
RandomizedSearchCV(cv=3, estimator=RidgeClassifier(), n_iter=100,
                   param_distributions={'alpha': <scipy.stats._distn_infrastructure.rv_continuous_frozen object at 0x7ff9ffd58100>},
                   random_state=7)
print(rsearch.best_score_)
0.7708333333333334
print(rsearch.best_estimator_.alpha)
0.07630828937395717

Algorithm parameter tuning is an important step for improving algorithm performance right before presenting results or preparing a system for production. In this chapter you discovered algorithm parameter tuning and two methods that you can use right now in Python and scikit-learn to improve your algorithm results:

  • Grid Search Parameter Tuning
  • Random Search Parameter Tuning

Present Results

Finding an accurate machine learning model is not the end of the project. In this chapter you will discover how to save and load your machine learning model in Python using scikit-learn. This allows you to save your model to file and load it later in order to make predictions. After completing this lesson you will know:

  1. The importance of serializing models for reuse.
  2. How to use pickle to serialize and deserialize machine learning models.
  3. How to use Joblib to serialize and deserialize machine learning models.

Finalize Your Model with pickle

Pickle is the standard way of serializing objects in Python. You can use the pickle operation to serialize your machine learning algorithms and save the serialized format to a file. Later you can load this file to deserialize your model and use it to make new predictions.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from pickle import dump, load
X, Y = load_pima_data()
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.33, random_state=seed)
model = LogisticRegression(solver='liblinear')
model.fit(X_train, Y_train)
LogisticRegression(solver='liblinear')
dump(model, open('models/model1.sav', 'wb'))
loaded_model = load(open('models/model1.sav', 'rb'))
result = loaded_model.score(X_test, Y_test)
print(result)
0.7559055118110236

Finalize Your Model with Joblib

The Joblib library is part of the SciPy ecosystem and provides utilities for pipelining Python jobs. It also provides utilities for saving and loading Python objects that make use of NumPy data structures efficiently. This can be useful for some machine learning algorithms that require a lot of parameters or store the entire dataset (e.g. k-Nearest Neighbors).

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from joblib import dump, load
X, Y = load_pima_data()
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=7)
model = LogisticRegression(solver='liblinear')
model.fit(X_train, Y_train)
LogisticRegression(solver='liblinear')
dump(model, 'models/model2.sav')
['models/model2.sav']
loaded_model = load('models/model2.sav')
result = loaded_model.score(X_test, Y_test)
print(result)
0.7559055118110236

Tips for Finalizing Your Model

This section lists some important considerations when finalizing your machine learning models.

  • Python Version. Take note of the Python version. You will almost certainly require the same major (and possibly minor) version of Python used to serialize the model when you later load and deserialize it.
  • Library Versions. The versions of all major libraries used in your machine learning project almost certainly need to be the same when deserializing a saved model. This is not limited to the versions of NumPy and scikit-learn.
  • Manual Serialization. You might like to manually output the parameters of your learned model so that you can use them directly in scikit-learn or another platform in the future. The techniques used internally by machine learning algorithms to make predictions are often much simpler than those used to learn the parameters, and can be easy to implement in custom code that you control (see the sketch after this section).

Take note of the version so that you can re-create the environment if for some reason you cannot reload your model on another machine or another platform at a later time.
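
As an illustration of the manual serialization idea above, a fitted logistic regression model is fully described by its coefficients and intercept, and its predictions can be reproduced with a few lines of custom code. This is only a sketch, assuming the model, X_test, and Y_test from the Joblib example above are still in memory:

import numpy
# extract the learned parameters of the binary logistic regression model
coef = model.coef_[0]
intercept = model.intercept_[0]
print(coef, intercept)
# reproduce the predictions manually: sigmoid of the linear combination, thresholded at 0.5
probabilities = 1.0 / (1.0 + numpy.exp(-(X_test.dot(coef) + intercept)))
manual_predictions = (probabilities >= 0.5).astype(int)
# fraction of manual predictions that match the held-out labels
print((manual_predictions == Y_test).mean())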

Summary

In this chapter you discovered how to persist your machine learning algorithms in Python with scikit-learn. You learned two techniques that you can use:

  • The pickle API for serializing standard Python objects.
  • The Joblib API for efficiently serializing Python objects with NumPy arrays.