Introduction
This book focuses on a specific sub-field of machine learning called predictive modeling. This is the sub-field of machine learning that is most useful in industry and the type of machine learning that the scikit-learn library in Python excels at facilitating. Unlike statistics, where models are built to understand data, predictive modeling is laser-focused on developing models that make the most accurate predictions possible, at the expense of explaining why those predictions are made. And unlike the broader field of machine learning, which could feasibly be used with data in any format, predictive modeling is primarily concerned with tabular data (e.g. tables of numbers, as in a spreadsheet).
A predictive modeling machine learning project can be broken down into six top-level tasks:

1. Define Problem: Investigate and characterize the problem in order to better understand the goals of the project.
2. Analyze Data: Use descriptive statistics and visualization to better understand the data you have available.
3. Prepare Data: Use data transforms in order to better expose the structure of the prediction problem to modeling algorithms.
4. Evaluate Algorithms: Design a test harness to evaluate a number of standard algorithms on the data and select the top few to investigate further.
5. Improve Results: Use algorithm tuning and ensemble methods to get the most out of well-performing algorithms on your data.
6. Present Results: Finalize the model, make predictions and present results.
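The six tasks above can be sketched end-to-end in a few lines of scikit-learn. This is only an illustrative outline, not the book's project template: the iris dataset, the two algorithms compared and the pipeline choices here are assumptions made for the sake of a runnable example.

```python
# Minimal sketch of the six project steps (illustrative choices throughout).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# 1. Define Problem: classify iris species from four flower measurements.
X, y = load_iris(return_X_y=True)

# 2. Analyze Data: inspect the shape and basic statistics of the features.
print(X.shape)            # (150, 4)
print(X.mean(axis=0))     # per-feature means

# 3. Prepare Data: standardize features inside a pipeline so the transform
#    is refit on each training fold, avoiding data leakage.
# 4. Evaluate Algorithms: compare two standard algorithms with 5-fold
#    cross-validation and note which performs better.
for name, model in [("LR", LogisticRegression(max_iter=1000)),
                    ("CART", DecisionTreeClassifier(random_state=7))]:
    pipe = Pipeline([("scale", StandardScaler()), ("model", model)])
    scores = cross_val_score(pipe, X, y, cv=5)
    print(name, scores.mean())

# 5. Improve Results / 6. Present Results: fit a final model on a training
#    split and report accuracy on held-out data.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)
final = Pipeline([("scale", StandardScaler()),
                  ("model", LogisticRegression(max_iter=1000))])
final.fit(X_train, y_train)
print("test accuracy:", final.score(X_test, y_test))
```

In a real project each numbered step would be expanded considerably; the point here is only that the whole sequence fits naturally into scikit-learn's estimator and pipeline API.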
You need to piece the recipes together into end-to-end projects. This will show you how to actually deliver a model or make predictions on new data using Python. This book uses small, well-understood machine learning datasets from the UCI Machine Learning Repository in both the lessons and the example projects. These datasets are available for free as CSV downloads, and they are excellent for practicing applied machine learning because:

* They are small, meaning they fit into memory and algorithms can model them in reasonable time.
* They are well behaved, meaning you often don't need to do a lot of feature engineering to get a good result.
* They are benchmarks, meaning that many people have used them before, so you can get ideas for good algorithms to try and the accuracy levels you should expect.
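Loading one of these CSV downloads takes only a few lines of pandas. The rows below are inlined sample values standing in for a real downloaded file, and the column names are illustrative; in practice you would pass the path or URL of the CSV you downloaded to `read_csv`.

```python
# Minimal sketch of loading a headerless CSV dataset with pandas.
# The inline rows are sample values standing in for a real CSV download.
import io

import pandas as pd

csv_text = """5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""

# These UCI files have no header row, so supply column names explicitly.
names = ["sepal-length", "sepal-width", "petal-length", "petal-width", "class"]
data = pd.read_csv(io.StringIO(csv_text), names=names)

print(data.shape)       # (3, 5)
print(data.describe())  # summary statistics for the numeric columns
```

Because the datasets are small, `shape` and `describe()` on the full DataFrame are cheap, which is exactly what makes them convenient for practice.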
In Part III you will work through three projects:

* Hello World Project (Iris flowers dataset): A quick pass through the project steps, without much tuning or optimizing, on a dataset widely used as the hello world of machine learning.
* Regression (Boston House Price dataset): Work through each step of the project process with a regression problem.
* Binary Classification (Sonar dataset): Work through each step of the project process using all of the methods on a binary classification problem.
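The regression project follows the same evaluation pattern as classification, just with a regression algorithm and a regression metric. As a stand-in for the Boston house price CSV, the sketch below uses a small synthetic dataset from `make_regression`; the sample sizes and the choice of linear regression are assumptions for illustration only.

```python
# Minimal sketch of evaluating a regression algorithm, using synthetic
# tabular data as a stand-in for a real house price dataset.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# 100 rows, 5 numeric features, one numeric target (illustrative sizes).
X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=7)

# Evaluate with 5-fold cross-validation. scikit-learn reports error metrics
# as negated scores so that larger is always better; flip the sign to read
# it as a mean squared error.
scores = cross_val_score(LinearRegression(), X, y, cv=5,
                         scoring="neg_mean_squared_error")
print("mean MSE:", -scores.mean())
```

Swapping in a different algorithm or metric changes only the estimator and the `scoring` string, which is why the same project process carries over between the regression and classification projects.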