Maximum Likelihood

Density estimation is the problem of estimating the probability distribution for a sample of observations from a problem domain. There are many techniques for solving density estimation, although a common framework used throughout the field of machine learning is maximum likelihood estimation. Maximum likelihood estimation involves defining a likelihood function for calculating the conditional probability of observing the data sample given a probability distribution and distribution parameters. This approach can be used to search a space of possible distributions and parameters.

This flexible probabilistic framework also provides the foundation for many machine learning algorithms, including important methods such as linear regression and logistic regression for predicting numeric values and class labels respectively, but also more generally for deep learning artificial neural networks.

In this tutorial, you will discover a gentle introduction to maximum likelihood estimation. After reading this tutorial, you will know:

* Maximum Likelihood Estimation is a probabilistic framework for solving the problem of density estimation.
* It involves maximizing a likelihood function in order to find the probability distribution and parameters that best explain the observed data.
* It provides a framework for predictive modeling in machine learning where finding model parameters can be framed as an optimization problem.

Maximum Likelihood Estimation

Problem of Probability Density Estimation

A common modeling problem involves how to estimate a joint probability distribution for a dataset. For example, consider a sample of observations X = (x_1, x_2, x_3, ..., x_n) from a domain, where each observation is drawn independently from the domain with the same probability distribution (so-called independent and identically distributed, i.i.d., or close to it). Density estimation involves selecting a probability distribution function and the parameters of that distribution that best explain the joint probability distribution of the observed data (X).

How do you choose the probability distribution function?
How do you choose the parameters for the probability distribution function?

This problem is made more challenging if the sample (X) drawn from the population is small and has noise, meaning that any evaluation of an estimated probability density function and its parameters will have some error. There are many techniques for solving this problem, although two common approaches are:

* Maximum a Posteriori (MAP), a Bayesian method.
* Maximum Likelihood Estimation (MLE), a frequentist method.

The main difference is that MLE assumes that all solutions are equally likely beforehand, whereas MAP allows prior information about the form of the solution to be harnessed. In this tutorial, we will take a closer look at the MLE method and its relationship to applied machine learning.
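As a rough illustration of this difference, the sketch below estimates the bias of a coin from 7 heads in 10 flips. The coin-flip data and the Beta(a, b) prior are assumptions chosen for illustration and do not come from the tutorial. With a uniform prior, Beta(1, 1), every value of the parameter is equally likely beforehand and the MAP estimate coincides with the MLE.

```python
def mle_estimate(heads, flips):
    """MLE of a Bernoulli parameter: the observed proportion of heads."""
    return heads / flips


def map_estimate(heads, flips, a, b):
    """MAP estimate under an assumed Beta(a, b) prior (mode of the posterior).

    Valid when the posterior Beta(heads + a, flips - heads + b) has both
    parameters greater than 1.
    """
    return (heads + a - 1) / (flips + a + b - 2)


heads, flips = 7, 10  # hypothetical data
print(f'MLE: {mle_estimate(heads, flips):.3f}')
print(f'MAP, uniform prior Beta(1, 1): {map_estimate(heads, flips, 1, 1):.3f}')
print(f'MAP, prior Beta(5, 5) favoring a fair coin: {map_estimate(heads, flips, 5, 5):.3f}')
```

The informative prior pulls the estimate toward 0.5, while the uniform prior leaves the data-driven estimate unchanged.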

Maximum Likelihood Estimation

One solution to probability density estimation is referred to as Maximum Likelihood Estimation, or MLE for short. Maximum Likelihood Estimation involves treating the problem as an optimization or search problem, where we seek a set of parameters that results in the best fit for the joint probability of the data sample (X). First, it involves defining a parameter called theta (θ) that defines both the choice of the probability density function and the parameters of that distribution. It may be a vector of numerical values whose values change smoothly and map to different probability distributions and their parameters. In Maximum Likelihood Estimation, we wish to maximize the probability of observing the data from the joint probability distribution given a specific probability distribution and its parameters, stated formally as:

P(X | θ)

This conditional probability is often stated using the semicolon (;) notation instead of the bar notation (|) because θ is not a random variable, but instead an unknown parameter. For example:

P(X; θ)

Or:

P(x_1, x_2, x_3, ..., x_n; θ)

This resulting conditional probability is referred to as the likelihood of observing the data given the model parameters and written using the notation L() to denote the likelihood function. For example:

L(X; θ)

The objective of Maximum Likelihood Estimation is to find the set of parameters (θ) that maximizes the likelihood function, i.e. results in the largest likelihood value.

max L(X; θ)
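The search for the maximizing θ can be sketched with a simple grid search. The example assumes a Bernoulli model and a hypothetical sample of 7 successes in 10 trials; the function name and the grid resolution are illustrative choices, not part of the tutorial.

```python
def bernoulli_likelihood(theta, successes, trials):
    """Likelihood of one specific i.i.d. sequence with the given number of successes."""
    return theta ** successes * (1 - theta) ** (trials - successes)


successes, trials = 7, 10  # hypothetical data

# Candidate parameter values 0.01, 0.02, ..., 0.99
candidates = [i / 100 for i in range(1, 100)]

# Keep the candidate with the largest likelihood value
best_theta = max(candidates, key=lambda t: bernoulli_likelihood(t, successes, trials))
print(f'theta maximizing the likelihood: {best_theta:.2f}')
```

On this grid the maximum falls at the observed proportion of successes, which is the known closed-form maximum likelihood estimate for a Bernoulli parameter.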

We can unpack the conditional probability calculated by the likelihood function. Given that the sample comprises n examples, we can frame this as the joint probability of the observed data samples x_1, x_2, x_3, ..., x_n in X given the probability distribution parameters (θ).

L(x_1, x_2, x_3, ..., x_n; θ)

Because the examples are assumed to be independent, the joint probability distribution can be restated as the product of the conditional probabilities of observing each example given the distribution parameters.

\prod_{i=1}^{n} P(x_i; θ)

Multiplying many small probabilities together can be numerically unstable in practice; therefore, it is common to restate this problem as the sum of the log conditional probabilities of observing each example given the model parameters.

\sum_{i=1}^{n} \log P(x_i; θ)

Here log is typically the natural logarithm (base e).

This product over many probabilities can be inconvenient […] it is prone to numerical underflow. To obtain a more convenient but equivalent optimization problem, we observe that taking the logarithm of the likelihood does not change its arg max but does conveniently transform a product into a sum
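The underflow problem can be seen directly in a small experiment. The sketch below assumes a synthetic sample of 2,000 draws from a standard normal distribution and evaluates the likelihood both as a product and as a sum of logs; the helper name gaussian_pdf is illustrative.

```python
import math
import random


def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std sigma at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


random.seed(1)
sample = [random.gauss(0.0, 1.0) for _ in range(2000)]  # synthetic data

# Direct product of densities: every factor is below 0.4, so the running
# product shrinks past the smallest representable float and becomes 0.0.
likelihood = 1.0
for x in sample:
    likelihood *= gaussian_pdf(x, 0.0, 1.0)

# Sum of log densities: the same information, but on a scale that stays finite.
log_likelihood = sum(math.log(gaussian_pdf(x, 0.0, 1.0)) for x in sample)

print(f'Product of probabilities: {likelihood}')
print(f'Sum of log probabilities: {log_likelihood:.2f}')
```

Once the product has collapsed to zero, different parameter settings can no longer be compared, whereas the log-likelihood remains usable.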

Given the frequent use of log in the likelihood function, it is commonly referred to as a log-likelihood function. It is common in optimization problems to prefer to minimize the cost function, rather than to maximize it. Therefore, the negative of the log-likelihood function is used, referred to generally as a Negative Log-Likelihood (NLL) function.

In software, we often phrase both as minimizing a cost function. Maximum likelihood thus becomes minimization of the negative log-likelihood (NLL) …
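A minimal sketch of the negative log-likelihood as a cost function, assuming a Gaussian model and a synthetic sample. It uses the standard closed-form maximum likelihood estimates for a Gaussian (the sample mean, and the standard deviation with a divisor of n) and checks that moving the parameters away from them increases the cost. The function name and the perturbation sizes are illustrative.

```python
import math
import random


def negative_log_likelihood(sample, mu, sigma):
    """NLL of an i.i.d. sample under a normal distribution N(mu, sigma^2)."""
    n = len(sample)
    squared_error = sum((x - mu) ** 2 for x in sample)
    return (n * math.log(sigma)
            + 0.5 * n * math.log(2.0 * math.pi)
            + squared_error / (2.0 * sigma ** 2))


random.seed(7)
sample = [random.gauss(50.0, 5.0) for _ in range(1000)]  # synthetic data

# Closed-form maximum likelihood estimates for a Gaussian: the sample mean and
# the standard deviation computed with a divisor of n (not n - 1).
mu_hat = sum(sample) / len(sample)
sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in sample) / len(sample))

best_nll = negative_log_likelihood(sample, mu_hat, sigma_hat)
print(f'mu={mu_hat:.3f}, sigma={sigma_hat:.3f}, NLL={best_nll:.3f}')

# Any nearby parameter setting should give a larger (worse) NLL.
perturbed_nlls = []
for d_mu, d_sigma in [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.5), (0.0, -0.5)]:
    nll = negative_log_likelihood(sample, mu_hat + d_mu, sigma_hat + d_sigma)
    perturbed_nlls.append(nll)
    print(f'mu={mu_hat + d_mu:.3f}, sigma={sigma_hat + d_sigma:.3f}, NLL={nll:.3f}')
```

For models without a closed-form solution, the same cost function would instead be handed to a numerical optimizer.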

Relationship to Machine Learning

This problem of density estimation is directly related to applied machine learning. We can frame the problem of fitting a machine learning model as the problem of probability density estimation. Specifically, the choice of model and model parameters is referred to as a modeling hypothesis h, and the problem involves finding h that best explains the data X.

P(X; h)

We can, therefore, find the modeling hypothesis that maximizes the likelihood function.

max L(X; h)

Or, more fully:

max \sum_{i=1}^{n} \log P(x_i; h)

This provides the basis for estimating the probability density of a dataset, typically used in unsupervised machine learning algorithms; for example:

* Clustering algorithms.

Using the expected log joint probability as a key quantity for learning in a probability model with hidden variables is better known in the context of the celebrated expectation maximization or EM algorithm.

The Maximum Likelihood Estimation framework is also a useful tool for supervised machine learning. This applies to data where we have input and output variables, where the output variable may be a numerical value or a class label in the case of regression and classification predictive modeling respectively. We can state this as the conditional probability of the output (y) given the input (X) under the modeling hypothesis (h).

max L(y|X; h)

Or, more fully:

max \sum_{i=1}^{n} \log P(y_i | x_i; h)

The maximum likelihood estimator can readily be generalized to the case where our goal is to estimate a conditional probability P(y | x; θ) in order to predict y given x. This is actually the most common situation because it forms the basis for most supervised learning.

This means that the same Maximum Likelihood Estimation framework that is generally used for density estimation can be used to find a supervised learning model and parameters. This provides the basis for foundational linear modeling techniques, such as:

* Linear Regression, for predicting a numerical value.
* Logistic Regression, for binary classification.

In the case of linear regression, the model is constrained to a line and involves finding a set of coefficients for the line that best fits the observed data. Fortunately, this problem can be solved analytically (e.g. directly using linear algebra). In the case of logistic regression, the model defines a line and involves finding a set of coefficients for the line that best separates the classes. This cannot be solved analytically and is often solved by searching the space of possible coefficient values using an efficient optimization algorithm such as the BFGS algorithm or variants.
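The analytical solution for linear regression can be sketched for the one-input case. The example assumes synthetic data generated from the line y = 2x + 1 with Gaussian noise, and uses the standard closed-form least squares estimates, which coincide with the maximum likelihood estimates when the noise is assumed to be Gaussian. All names and data here are illustrative.

```python
import random

random.seed(3)
xs = [i / 10.0 for i in range(100)]
ys = [2.0 * x + 1.0 + random.gauss(0.0, 0.5) for x in xs]  # synthetic data

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form estimates: slope = covariance(x, y) / variance(x),
# intercept chosen so the line passes through the point of means.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x


def sum_squared_error(m, c):
    """Sum of squared residuals for the line y = m * x + c."""
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))


print(f'slope={slope:.3f}, intercept={intercept:.3f}')
print(f'sum of squared errors: {sum_squared_error(slope, intercept):.3f}')
```

No iterative search is needed in this case; the coefficients follow directly from sums over the data.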

Both methods can also be solved less efficiently using a more general optimization algorithm such as stochastic gradient descent. In fact, most machine learning models can be framed under the maximum likelihood estimation framework, providing a useful and consistent way to approach predictive modeling as an optimization problem. An important benefit of the maximum likelihood estimator in machine learning is that as the size of the dataset increases, the quality of the estimator continues to improve.
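As a rough sketch of the optimization view, the example below fits a one-input logistic regression by minimizing the average negative log-likelihood. It uses plain full-batch gradient descent rather than stochastic gradient descent or BFGS, and the synthetic data, learning rate, and iteration count are assumptions chosen for illustration.

```python
import math
import random


def sigmoid(z):
    """Logistic function mapping a real value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def mean_nll(w, b, xs, ys):
    """Average negative log-likelihood (log loss) of a one-input logistic model."""
    eps = 1e-12  # guard against log(0)
    total = 0.0
    for x, y in zip(xs, ys):
        p = min(max(sigmoid(w * x + b), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(xs)


random.seed(5)
# Two overlapping classes (synthetic): class 0 around -1, class 1 around +1
xs = ([random.gauss(-1.0, 1.0) for _ in range(100)]
      + [random.gauss(1.0, 1.0) for _ in range(100)])
ys = [0] * 100 + [1] * 100

w, b = 0.0, 0.0
initial_nll = mean_nll(w, b, xs, ys)
learning_rate = 0.1

for _ in range(500):
    # Gradient of the average NLL with respect to the coefficient and intercept
    errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
    grad_w = sum(e * x for e, x in zip(errors, xs)) / len(xs)
    grad_b = sum(errors) / len(xs)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

final_nll = mean_nll(w, b, xs, ys)
print(f'w={w:.3f}, b={b:.3f}')
print(f'NLL before: {initial_nll:.4f}, after: {final_nll:.4f}')
```

With all coefficients at zero, the model predicts 0.5 for every example, so the starting cost equals log(2); each update then moves the coefficients toward values that explain the labels better.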

In this tutorial, you discovered a gentle introduction to maximum likelihood estimation. Specifically, you learned:

* Maximum Likelihood Estimation is a probabilistic framework for solving the problem of density estimation.
* It involves maximizing a likelihood function in order to find the probability distribution and parameters that best explain the observed data.
* It provides a framework for predictive modeling in machine learning where finding model parameters can be framed as an optimization problem.

Linear Regression With Maximum Likelihood Estimation

Linear regression is a classical model for predicting a numerical quantity. The parameters of a linear regression model can be estimated using a least squares procedure or by a maximum likelihood estimation procedure. Maximum likelihood estimation is a probabilistic framework for automatically finding the probability distribution and parameters that best describe the observed data. Supervised learning can be framed as a conditional probability problem, and maximum likelihood estimation can be used to fit the parameters of a model that best summarizes the conditional probability distribution, so-called conditional maximum likelihood estimation. A linear regression model can be fit under this framework and can be shown to derive an identical solution to a least squares approach.

In this tutorial, you will discover linear regression with maximum likelihood estimation. After reading this tutorial, you will know:

* Linear regression is a model for predicting a numerical quantity and maximum likelihood estimation is a probabilistic framework for estimating model parameters.
* Coefficients of a linear regression model can be estimated using a negative log-likelihood function from maximum likelihood estimation.
* The negative log-likelihood function can be used to derive the least squares solution to linear regression.

Linear Regression as Maximum Likelihood

Least Squares and Maximum Likelihood

Logistic Regression With Maximum Likelihood Estimation

Logistic Regression and Log-Odds

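from math import exp, log

# Start from a probability of success
prob = 0.8
print(f'Probability is: {prob}')

# Convert the probability to odds
odds = prob / (1 - prob)
print(f'Odds is: {odds:.1f}')

# Convert the odds back to a probability
prob = odds / (odds + 1)
print(f'Probability is: {prob}')

# Convert the odds to log-odds (the logit)
logodds = log(odds)
print(f'Log-Odds is: {logodds:.2f}')

# Convert the log-odds back to a probability with the logistic function
prob = 1 / (1 + exp(-logodds))
print(f'Probability is: {prob}')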

Logistic Regression as Maximum Likelihood

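def likelihood(y, yhat):
    """Bernoulli likelihood of label y (0 or 1) given yhat, the predicted probability of class 1."""
    return yhat * y + (1 - yhat) * (1 - y)

# Class 1, high predicted probability of class 1: large likelihood
y, yhat = 1, 0.9
print(f'y={y}, yhat={yhat}, likelihood: {likelihood(y, yhat):.3f}')

# Class 1, low predicted probability of class 1: small likelihood
y, yhat = 1, 0.1
print(f'y={y}, yhat={yhat}, likelihood: {likelihood(y, yhat):.3f}')

# Class 0, low predicted probability of class 1: large likelihood
y, yhat = 0, 0.1
print(f'y={y}, yhat={yhat}, likelihood: {likelihood(y, yhat):.3f}')

# Class 0, high predicted probability of class 1: small likelihood
y, yhat = 0, 0.9
print(f'y={y}, yhat={yhat}, likelihood: {likelihood(y, yhat):.3f}')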

Expectation Maximization (EM Algorithm)

Expectation-Maximization Algorithm

Gaussian Mixture Model and the EM Algorithm

Example of Gaussian Mixture Model

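from numpy import hstack
from numpy.random import normal
from sklearn.mixture import GaussianMixture

# Generate a sample from two Gaussian processes:
# 3,000 points centered on 20 and 7,000 points centered on 40
X1 = normal(loc=20, scale=5, size=3000)
X2 = normal(loc=40, scale=5, size=7000)

# Combine the samples and reshape into a single column,
# since scikit-learn expects shape (n_samples, n_features)
X = hstack((X1, X2))
X = X.reshape(len(X), 1)

# Fit a two-component Gaussian mixture model with random initialization
model = GaussianMixture(n_components=2, init_params='random')
model.fit(X)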

GaussianMixture(init_params='random', n_components=2)


yhat = model.predict(X)

print(yhat[:100])

[0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]

print(yhat[-100:])

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
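To make the fitting procedure behind the example above more concrete, here is a minimal from-scratch sketch of expectation-maximization for a two-component, one-dimensional Gaussian mixture. It is not how scikit-learn implements GaussianMixture; the smaller synthetic sample, the crude initialization, and the fixed number of iterations are all assumptions made for illustration.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std sigma at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


random.seed(1)
# Synthetic sample: 300 points centered on 20 and 700 points centered on 40
data = ([random.gauss(20.0, 5.0) for _ in range(300)]
        + [random.gauss(40.0, 5.0) for _ in range(700)])

# Crude starting guess (an assumption of this sketch): means at the extremes
# of the data, a shared spread, and equal mixing weights.
mean_all = sum(data) / len(data)
spread = math.sqrt(sum((x - mean_all) ** 2 for x in data) / len(data))
mus = [min(data), max(data)]
sigmas = [spread, spread]
weights = [0.5, 0.5]

for _ in range(100):
    # E-step: responsibility of each component for each point
    resp = []
    for x in data:
        joint = [weights[k] * normal_pdf(x, mus[k], sigmas[k]) for k in range(2)]
        total = sum(joint)
        resp.append([j / total for j in joint])

    # M-step: re-estimate weights, means, and standard deviations
    for k in range(2):
        nk = sum(r[k] for r in resp)
        weights[k] = nk / len(data)
        mus[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
        var = sum(r[k] * (x - mus[k]) ** 2 for r, x in zip(resp, data)) / nk
        sigmas[k] = math.sqrt(var)

for k in range(2):
    print(f'component {k}: weight={weights[k]:.3f}, mean={mus[k]:.3f}, std={sigmas[k]:.3f}')
```

Each pass alternates between softly assigning points to components and updating the component parameters from those assignments, which is the same expectation and maximization cycle the library model runs internally in a more robust form.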
