Introduction
What is Probability?
Uncertainty involves making decisions with incomplete information, and this is the way we generally operate in the world. We typically describe uncertainty with everyday words like chance, luck, and risk. Probability is the field of mathematics that gives us the language and tools to quantify the uncertainty of events and reason in a principled manner.
Probability theory is the mathematics of uncertainty. Uncertainty refers to imperfect or incomplete information. The world is messy and imperfect, and we must make decisions and operate in the face of this uncertainty. For example, we often talk about luck, chance, odds, likelihood, and risk; these are the words we use to interpret and negotiate uncertainty in the world. When making inferences and reasoning in an uncertain world, we need principled, formal methods to express and solve problems. Probability provides the language and tools to handle this uncertainty.
The probability, or likelihood, of an event is also commonly referred to as the odds of the event or the chance of the event. These all generally refer to the same notion, although odds often has its own notation of wins to losses, written as w:l; e.g. 1:3 for 1 win to 3 losses, which corresponds to a 1/4 (25%) probability of a win.
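As a small illustration, the sketch below (using illustrative numbers and a hypothetical helper function) converts odds expressed as wins to losses into a probability:

```python
# Minimal sketch: converting odds expressed as wins to losses into a probability.
def odds_to_probability(wins, losses):
    # Probability of a win given odds of wins:losses.
    return wins / (wins + losses)

# Odds of 1:3 mean 1 win for every 3 losses, i.e. 1 win in 4 outcomes.
print(odds_to_probability(1, 3))  # 0.25
```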
Probability theory has three important concepts:

- Event (A): An outcome to which a probability is assigned.
- Sample Space (S): The set of possible outcomes or events.
- Probability Function (P): The function used to assign a probability to an event.
The likelihood of an event (A) being drawn from the sample space (S) is determined by the probability function (P). The shape or distribution of probabilities across all events in the sample space is called the probability distribution. Many domains have a familiar shape for this distribution, such as uniform if all events are equally likely, or Gaussian if the likelihood of the events forms a normal or bell shape.
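As a concrete illustration of these three concepts, the sketch below (a toy example of a fair six-sided die, with a hypothetical helper function) defines a sample space and a uniform probability function:

```python
# Toy example: a fair six-sided die.
# Sample space S: the set of possible outcomes.
S = {1, 2, 3, 4, 5, 6}

# Probability function P: assigns a probability to an event (a subset of S).
# Here the distribution is uniform, so every outcome is equally likely.
def P(event):
    return len(event & S) / len(S)

print(P({3}))        # a single outcome: 1/6, about 0.167
print(P({2, 4, 6}))  # the event "roll an even number": 0.5
```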
Two Schools of Probability
There are two main ways of interpreting or thinking about probability. Perhaps the simpler approach is to consider probability as the actual likelihood of an event, called frequentist probability. Another approach is to consider probability as a measure of how strongly we believe an event will occur, called Bayesian probability. It is not that one approach is correct and the other incorrect; rather, they are complementary, and both interpretations provide different and useful techniques.
Frequentist Probability
The frequentist approach to probability is objective. Events are observed and counted, and their frequencies provide the basis for directly calculating a probability, hence the name frequentist. Probability theory was originally developed to analyze the frequencies of events.
Methods from frequentist probability include p-values and confidence intervals used in statistical inference and maximum likelihood estimation for parameter estimation.
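As a rough sketch of the frequentist idea, the example below (a simulated coin flip, purely illustrative) estimates a probability by counting how often an event occurs over many trials:

```python
# Estimate a probability from observed frequencies (the frequentist view).
import random

random.seed(1)
trials = 10000
# Simulate coin flips and count how many come up heads.
heads = sum(1 for _ in range(trials) if random.random() < 0.5)

# The relative frequency approaches the true probability as trials increase.
print(heads / trials)  # approximately 0.5
```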
Bayesian Probability
The Bayesian approach to probability is subjective. Probabilities are assigned to events based on evidence and personal belief and are centered around Bayes’ theorem, hence the name Bayesian. This allows probabilities to be assigned to very infrequent events and events that have not been observed before, unlike frequentist probability.
One big advantage of the Bayesian interpretation is that it can be used to model our uncertainty about events that do not have long term frequencies.
Methods from Bayesian probability include Bayes factors and credible intervals for inference, and the Bayes estimator and maximum a posteriori estimation for parameter estimation.
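As a rough illustration of the Bayesian approach, the sketch below applies Bayes' Theorem with made-up numbers to update a prior belief after observing new evidence:

```python
# Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B), with illustrative numbers
# for a diagnostic-test style example.
p_condition = 0.01             # prior P(A): belief before seeing evidence
p_pos_given_condition = 0.95   # likelihood P(B|A): test sensitivity
p_pos_given_healthy = 0.05     # P(B|not A): false positive rate

# Total probability of the evidence, P(B).
p_pos = (p_pos_given_condition * p_condition
         + p_pos_given_healthy * (1 - p_condition))

# Posterior P(A|B): updated belief after a positive test.
p_condition_given_pos = p_pos_given_condition * p_condition / p_pos
print(p_condition_given_pos)  # about 0.16
```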
Uncertainty in Machine Learning
There are many sources of uncertainty in a machine learning project, including variance in the specific data values, the sample of data collected from the domain, and the imperfect nature of any models developed from such data.
Noise in data, incomplete coverage of the domain, and imperfect models provide the three main sources of uncertainty in machine learning.
Applied machine learning requires getting comfortable with uncertainty. Uncertainty means working with imperfect or incomplete information. For software engineers and developers, computers are deterministic: you write a program, and the computer does what you say. Algorithms are analyzed based on space or time complexity and can be chosen to optimize whichever is most important to the project, like execution speed or memory constraints.
There are three main sources of uncertainty in machine learning:
Noise in Observations
An observation from the domain is often referred to as an instance or an example and is one row of data. It is what was measured or what was collected; it is the data that describes the object or subject. It is the input to a model and the expected output.
Noise refers to variability in the observation. Variability could be natural, such as a larger or smaller flower than normal. It could also be an error, such as a slip when measuring or a typo when writing it down. This variability impacts not just the inputs or measurements but also the outputs; for example, an observation could have an incorrect class label. This means that although we have observations for the domain, we must expect some variability or randomness.
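As a purely illustrative sketch (with made-up values), an observed measurement can be thought of as the true quantity plus natural variability plus measurement error:

```python
# Illustrative sketch of noise in observations.
import random

random.seed(1)
true_petal_length = 4.5  # hypothetical true value for a flower, in cm

observations = []
for _ in range(5):
    natural_variation = random.gauss(0, 0.3)  # flowers naturally vary in size
    measurement_error = random.gauss(0, 0.1)  # slips when measuring or recording
    observations.append(true_petal_length + natural_variation + measurement_error)

print(observations)  # each observation differs somewhat from the true value
```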
Incomplete Coverage of the Domain
Observations from a domain used to train a model are a sample and incomplete by definition. In statistics, a random sample refers to a collection of observations chosen from the domain without systematic bias (e.g. uniformly random). Nevertheless, there will always be some limitation that will introduce bias. For example, we might choose to measure the size of randomly selected flowers in one garden. The flowers are randomly selected, but the scope is limited to one garden. Scope can be increased to gardens in one city, across a country, across a continent, and so on.
An appropriate level of variance and bias in the sample is required such that the sample is representative of the task or project for which the data or model will be used. We aim to collect or obtain a suitably representative random sample of observations to train and evaluate a machine learning model. Often, we have little control over the sampling process. Instead, we access a database or CSV file and the data we have is the data we must work with. In all cases, we will never have all of the observations. If we did, a predictive model would not be required. This means that there will always be some unobserved cases. There will be part of the problem domain for which we do not have coverage.
This is why we split a dataset into train and test sets or use resampling methods like k-fold cross-validation. We do this to handle the uncertainty in the representativeness of our dataset and estimate the performance of a modeling procedure on data not used in that procedure.
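As a rough sketch, assuming scikit-learn and NumPy are available, a train/test split and k-fold cross-validation on toy data might look like this:

```python
# Handle sampling uncertainty with a hold-out split and k-fold cross-validation.
import numpy as np
from sklearn.model_selection import KFold, train_test_split

X = np.arange(20).reshape(10, 2)  # toy feature matrix
y = np.arange(10)                 # toy targets

# Hold out a portion of the data to estimate performance on unseen cases.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# k-fold cross-validation repeats the split to reduce the variance of the estimate.
kfold = KFold(n_splits=5, shuffle=True, random_state=1)
for train_idx, test_idx in kfold.split(X):
    print(train_idx, test_idx)
```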
Imperfect Model of the Problem
A machine learning model will always have some error. This is often summarized as all models are wrong, or more completely in an aphorism by George Box: All models are wrong but some are useful.
This does not apply just to the model, the artifact, but to the whole procedure used to prepare it, including the choice and preparation of data, the choice of training hyperparameters, and the interpretation of model predictions. Model error could mean imperfect predictions, such as predicting a quantity in a regression problem that is quite different from what was expected, or predicting a class label that does not match what would be expected. This type of error in prediction is expected given the uncertainty we have just discussed about the data, both in terms of noise in the observations and incomplete coverage of the domain.
Another type of error is an error of omission. We leave out details or abstract them in order to generalize to new cases. This is achieved by selecting models that are simpler but more robust to the specifics of the data, as opposed to complex models that may be highly specialized to the training data. As such, we might and often do choose a model known to make errors on the training dataset with the expectation that the model will generalize better to new cases and have better overall performance.
How to Manage Uncertainty
In terms of noisy observations, probability and statistics help us to understand and quantify the expected value and variability of variables in our observations from the domain.
In terms of the incomplete coverage of the domain, probability helps to understand and quantify the expected distribution and density of observations in the domain.
In terms of model error, probability helps to understand and quantify the expected capability and variance in performance of our predictive models when applied to new data.
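As a small illustration, assuming NumPy is available, simple summary statistics quantify the expected value and variability of noisy observations:

```python
# Quantify expected value and variability with summary statistics.
import numpy as np

np.random.seed(1)
# Simulated noisy measurements drawn from a normal distribution.
observations = np.random.normal(loc=4.5, scale=0.3, size=100)

print(observations.mean())  # expected value of the variable, about 4.5
print(observations.var())   # variance: variability around that expectation
print(observations.std())   # standard deviation: spread in the original units
```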
But this is just the beginning, as probability provides the foundation for the iterative training of many machine learning models under a framework called maximum likelihood estimation, which is behind models such as linear regression, logistic regression, artificial neural networks, and much more. Probability also provides the basis for developing specific algorithms, such as Naive Bayes, as well as entire subfields of study in machine learning, such as graphical models like the Bayesian Belief Network.
Why Learn Probability for Machine Learning
- Class Membership Requires Predicting a Probability
- Some Algorithms Are Designed Using Probability
Examples include:

- Naive Bayes, which is constructed using Bayes' Theorem with some simplifying assumptions.
- Probabilistic Graphical Models (PGM), which are designed around Bayes' Theorem.
- Bayesian Belief Networks, or Bayes Nets, which are capable of capturing the conditional dependencies between variables.
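As a rough illustration, assuming scikit-learn is available, a Gaussian Naive Bayes model applies Bayes' Theorem with the simplifying assumption that input features are conditionally independent, and can report class membership as probabilities:

```python
# Naive Bayes in practice: predict class membership probabilities on toy data.
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[1.0, 2.1], [1.2, 1.9], [3.8, 4.0], [4.1, 3.9]])  # toy inputs
y = np.array([0, 0, 1, 1])                                      # toy class labels

model = GaussianNB()
model.fit(X, y)

# Predicted probability of each class for a new observation.
print(model.predict_proba([[1.1, 2.0]]))
```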
- Models Are Trained Using a Probabilistic Framework
Many machine learning models are trained using an iterative algorithm designed under a probabilistic framework. Perhaps the most common is the framework of maximum likelihood estimation, sometimes shortened to MLE. This is a framework for estimating model parameters (e.g. weights) given observed data. It is the framework that underlies the ordinary least squares estimate of a linear regression model. The expectation-maximization algorithm, or EM for short, is an approach to maximum likelihood estimation often used for unsupervised data clustering, e.g. estimating k means for k clusters, also known as the k-Means clustering algorithm.
For models that predict class membership, maximum likelihood estimation provides the framework for minimizing the difference or divergence between an observed and a predicted probability distribution. This is used in classification algorithms like logistic regression as well as deep learning neural networks. It is common to measure this difference in probability distributions during training using entropy, e.g. via cross-entropy. Entropy, differences between distributions measured via KL divergence, and cross-entropy all come from the field of information theory, which directly builds upon probability theory. For example, the information of an event is calculated directly as the negative log of its probability, and entropy is the expected information across all events.
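As a small sketch, the cross-entropy between an observed (one-hot) class distribution and a predicted probability distribution can be calculated directly from its definition:

```python
# Cross-entropy between an observed and a predicted probability distribution.
from math import log

def cross_entropy(observed, predicted):
    # H(P, Q) = -sum over classes of P(x) * log(Q(x))
    return -sum(p * log(q) for p, q in zip(observed, predicted))

observed = [1.0, 0.0, 0.0]     # the true class is the first class (one-hot)
predicted = [0.8, 0.15, 0.05]  # the model's predicted class probabilities

print(cross_entropy(observed, predicted))  # about 0.223; only the true class term contributes
```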
- Models Can Be Tuned With a Probabilistic Framework
- Probabilistic Measures Are Used to Evaluate Model Skill