Machine Learning (or Data Science) is an interdisciplinary field that merges ideas from domains such as information theory, signal processing, computer science, cognitive science, and philosophy, to name a few. At its core, however, it relies primarily on the principles of Statistics and Probability Theory, along with several other areas of mathematics. While Machine Learning extends beyond these fields with its own methodologies and applications, in its simplest form it can be viewed as a mathematical process that draws on the foundations of Statistics and Probability Theory to learn from data and make predictions by finding patterns in the data. Statisticians often refer to this process of finding a pattern in a dataset as ‘curve fitting’, since the goal is to find the ‘best fit’ curve or line through a series of data points under some constraints. Within the framework of statistics and probability theory, finding the ‘best fit’ curve is generally framed as ‘function approximation’, a mathematical optimization process whose goal is to find the best function from a specified class of functions.
In this series I will explore some of the core ideas of Machine Learning, keeping the explanations as simple and jargon-free as possible while still preserving the integrity of the fundamental ideas and the underlying mathematics. The first topic in this series is estimation, and maximum likelihood estimation in particular, a fundamental concept at the heart of Machine Learning. It’s a simple and elegant idea, often hidden under the heavy abstraction of the more exotic concepts of Machine Learning, yet it provides a solid foundation for many advanced techniques in the field.
Estimation and Maximum Likelihood Estimation: Demonstration, Visualization, Math (in Python from scratch)
Role of Estimation in Machine Learning and Data Science
Machine learning algorithms, especially the supervised kind, often entail a form of curve fitting or function approximation, where the goal is to find the function that best captures the underlying pattern or relationship in a given dataset. Which function counts as 'the best' is determined by a process called estimation, where we quantify how well the function fits the data and choose the function parameters that fit the dataset best. Maximum Likelihood Estimation (MLE) is one of the core estimation methods; it provides a systematic approach to estimating the parameters through analytical or numerical optimization. Any mathematical optimization process that adjusts parameter values so that the statistical model 'best fits' the dataset does so according to some criterion, which is either maximized or minimized. For MLE, the criterion is to find the parameters that 'maximize the likelihood function'.
But what does it mean to find the parameters that 'maximize the likelihood function'?
In Data Science and Machine Learning, we use a statistical model and its underlying mathematical framework to represent a real-world random process. The behavior of this statistical model usually depends on a number of parameters that can vary. For example, if we assume that a given dataset follows a normal (Gaussian) distribution, then the model's behavior is described by two parameters: the mean and the variance. Different values of the mean and variance give us different ‘Gaussian distributions’ (we will look at the Gaussian formula and its parameters when we get to the math part).
Now let's look at the concept of likelihood. Given a specific dataset and a specific model, the ‘likelihood’ of the parameters measures how ‘likely’ or how ‘probable’ it is to observe our data under the selected model with those parameter values. When we perform Maximum Likelihood Estimation (MLE), our goal is to find the parameter values that make the observed data as probable as possible. We do this by adjusting the parameters until we reach the point where the likelihood of the observed data is maximized. For the Gaussian model, this means finding the mean and variance values under which the observed data is most probable. The maximization involves techniques from calculus and, for more complex models, numerical optimization (we will look more into this when we get to the math part).
A clear understanding of estimation, and of Maximum Likelihood Estimation in particular, is crucial for grasping the foundations of machine learning, since these principles tell us how models learn from data, how they make predictions, and ultimately how well they perform. In turn, this helps us make better choices about model selection, training, and evaluation.
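To make the idea of likelihood concrete, here is a minimal Python sketch (standard library only) that scores a few candidate (mean, variance) pairs against a handful of measurements. The measurements, the candidate values, and the helper name gaussian_log_likelihood are made up for illustration and are not part of the Iris analysis. The sketch works with the logarithm of the likelihood, which peaks at the same parameter values as the likelihood itself.

```python
import math


def gaussian_log_likelihood(data, mu, var):
    """Log-likelihood of (mu, var) for `data` under a Gaussian model,
    assuming the observations are independent."""
    n = len(data)
    squared_error = sum((x - mu) ** 2 for x in data)
    return -0.5 * n * math.log(2 * math.pi * var) - squared_error / (2 * var)


# Made-up measurements for illustration only (not real Iris values).
sample = [1.4, 1.5, 1.3, 1.6, 1.4, 1.5, 1.7, 1.4]

# Candidate (mean, variance) pairs: three narrow models and one very wide one.
candidates = [(1.0, 0.02), (1.5, 0.02), (2.0, 0.02), (1.5, 0.5)]

scores = {params: gaussian_log_likelihood(sample, *params) for params in candidates}
for (mu, var), score in scores.items():
    print(f"mu={mu:.2f}, var={var:.2f} -> log-likelihood = {score:.2f}")

# The candidate under which the observed data is most probable.
best = max(scores, key=scores.get)
print("most likely candidate:", best)
```

Among these candidates, the pair (1.5, 0.02) scores highest: its mean sits in the middle of the measurements and its variance roughly matches their spread. MLE does the same thing, except that it searches over all possible parameter values instead of a short list.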
Demonstration of Maximum Likelihood Estimation on IRIS dataset
In this demonstration we’re going to work with the Iris flower dataset, a popular choice in machine learning and statistics for illustrating fundamental concepts like Maximum Likelihood Estimation, because it is simple yet has insightful characteristics. The dataset consists of 50 samples from each of three species of Iris flowers (Iris Setosa, Iris Virginica, and Iris Versicolor), 150 samples in total. Four features were measured from each sample: the lengths and the widths of the sepals and petals.
Our primary objective here is to explore the concept of Maximum Likelihood Estimation (MLE) and, in the process, demonstrate a systematic approach to estimating the parameters of a statistical model. So, for simplicity and ease of visualization, we’re going to focus on just one feature and limit our analysis to only one species of Iris flower to establish a foundational understanding of the concepts.
From visual observation we will first identify an attribute of a particular species that follows a Gaussian distribution reasonably well, since this distribution is well understood and has known properties, as we discussed earlier. We also know that the maximum likelihood estimates of the parameters of a Gaussian distribution have a direct, “closed-form” (i.e. analytical) solution, making it an appropriate choice.
For the demonstration we will first draw a pair plot of the ‘IRIS’ dataset to visually explore the relationships between variables as well as the distribution of each variable.
iris pairplot
In the pair plot, each diagonal panel shows the distribution of a single variable (drawn separately for each of the three species), while the remaining panels show the relationships between pairs of variables (i.e. pairs of attributes, again for each of the three species). These plots give us some insight into which variables roughly follow a Gaussian distribution and could be good candidates for estimation using MLE. For example, we can see that the Versicolor and Virginica data overlap to varying degrees across pretty much all attributes, which makes any individual attribute less distinguishing for those two species (some values of an attribute could belong to either class and are difficult to separate). For petal length and petal width, by contrast, the distribution of Setosa is clearly separated in both cases. However, for petal width the distribution of Setosa looks less Gaussian than it does for petal length. So, from visual observation, we may decide to focus our modeling on the petal length of Setosa for demonstrative purposes and proceed to use MLE to estimate the parameters.
P.S. The assumptions made here, such as assuming a normal distribution, are for the purpose of simplifying our example. Real-world data often deviates from idealized distributions, and the choice of model is often made based on both the data and the given problem.
maximum likelihood estimation in Gaussian
Above we computed the Maximum Likelihood Estimates (MLE), which are the values of the mean and variance of the Gaussian. These parameter values make the observed data as probable as possible under the assumed Gaussian distribution model. The graph shows the histogram of the petal lengths of ‘Setosa’ and the underlying distribution. The solid line overlaid on the histogram is our Gaussian model (PDF), whose calculated parameter values (mean and variance) are shown at the top. So the graph effectively shows the “fit” of our Gaussian model to the given dataset.
This demonstrates the application of MLE to real-world data to estimate the parameters of a statistical model. By fitting the Gaussian distribution to the observed data, we come up with a probabilistic model for petal lengths of Iris ‘Setosa’. This model could then be used to generate new petal length data that follows the same distribution or to classify new observations as ‘Setosa’ or ‘not Setosa’ based on their petal length.
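The two uses mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the values of mu_hat and var_hat are placeholders standing in for the fitted parameters (they are not the results reported in the figure), and the density threshold in looks_like_setosa is a hypothetical cut-off chosen for the example, not a tuned decision rule.

```python
import math
import random


def gaussian_pdf(x, mu, var):
    """Gaussian probability density at x for mean `mu` and variance `var`."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


# Placeholder parameters standing in for the fitted values (assumed for this sketch).
mu_hat, var_hat = 1.46, 0.03
sigma_hat = math.sqrt(var_hat)

# 1) Generate new petal lengths that follow the fitted distribution.
rng = random.Random(42)  # fixed seed so the sketch is repeatable
new_lengths = [rng.gauss(mu_hat, sigma_hat) for _ in range(5)]
print("simulated petal lengths:", [round(v, 2) for v in new_lengths])


# 2) Flag an observation as 'Setosa-like' when its density under the model
#    is above a hypothetical threshold.
def looks_like_setosa(x, threshold=0.1):
    return gaussian_pdf(x, mu_hat, var_hat) >= threshold


for length in (1.5, 4.5):
    print(f"petal length {length}: Setosa-like = {looks_like_setosa(length)}")
```

A petal length close to the fitted mean has a high density under the model and is flagged as ‘Setosa’-like, while a length far from the mean has a density close to zero and is not.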
Next we will look deeper into that process along with the underlying math, and explain the math in simple English to get a clear idea of the inner workings of MLE. First let's see the expression for the Gaussian probability density function (PDF), which is defined as:

$$ f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) $$

Gaussian formula (probability density function)
The function describes the probability density of a single observation x, given the parameters μ (mean) and σ² (variance). For the ‘likelihood function’ of our dataset, however, we want the probability of a whole set of observations (x_1, x_2, …, x_n) given the same parameters (i.e. μ and σ²), which is written as L(μ, σ² | x_1, x_2, …, x_n). This likelihood function has to calculate the joint probability density of observing all the data points in our dataset under a specific Gaussian distribution. Assuming the observations are independent, the function looks like:

$$ L(\mu, \sigma^2 \mid x_1, x_2, \dots, x_n) = \prod_{i=1}^{n} f(x_i \mid \mu, \sigma^2) $$
The interpretation of this function is quite simple. It evaluates the Gaussian PDF formula f(x | μ, σ²), which we saw above, at each data point, and then multiplies the resulting values together (the Greek capital pi symbol denotes a product over the index i, which here ranges from 1 to n). We can also write this expression out in a more elaborate but simple product-of-terms form, in which case the function looks like:

$$ L(\mu, \sigma^2 \mid x_1, x_2, \dots, x_n) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_1-\mu)^2}{2\sigma^2}\right) \times \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_2-\mu)^2}{2\sigma^2}\right) \times \cdots \times \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_n-\mu)^2}{2\sigma^2}\right) $$
In practice, however, we usually work with the negative log of the likelihood function. The likelihood is a product of many values, which can make the final result extremely small and cause underflow in a computer program, which happens when a number is smaller than the smallest value the computer can represent. Taking the log transforms the ‘product of likelihoods’ into a ‘sum of log-likelihoods’, making the function numerically more tractable, and negating it turns the maximization into a minimization. Because the logarithm is monotonically increasing, the parameters that minimize the negative log-likelihood are the same ones that maximize the likelihood. For reference, the log transformation can in some cases also help convert a non-convex optimization problem into a convex one, in turn simplifying the optimization process, but understanding the details of that is not crucial for our MLE demonstration. After some simplification, the final form of the negative log-likelihood function for the Gaussian looks like:

$$ -\ln L(\mu, \sigma^2 \mid x_1, x_2, \dots, x_n) = \frac{n}{2} \ln\left(2\pi\sigma^2\right) + \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 $$
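The underflow problem described above is easy to reproduce. The following sketch uses only the Python standard library and synthetic data drawn from a standard Gaussian (an assumption made purely for the demonstration). It multiplies 2,000 density values directly and then computes the negative log-likelihood as a sum instead.

```python
import math
import random


def gaussian_pdf(x, mu, var):
    """Gaussian probability density at x."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def gaussian_neg_log_pdf(x, mu, var):
    """Negative log of the Gaussian density, computed without calling exp."""
    return 0.5 * math.log(2 * math.pi * var) + (x - mu) ** 2 / (2 * var)


# Synthetic data for the demonstration (not the Iris measurements).
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(2000)]

# Naive likelihood: a product of 2,000 values that are each below 0.4.
likelihood = 1.0
for x in data:
    likelihood *= gaussian_pdf(x, 0.0, 1.0)

# Negative log-likelihood: a sum of moderately sized terms.
neg_log_likelihood = sum(gaussian_neg_log_pdf(x, 0.0, 1.0) for x in data)

print("product of densities:   ", likelihood)
print("negative log-likelihood:", round(neg_log_likelihood, 2))
```

The product collapses to exactly 0.0 because the true value is far smaller than the smallest positive number a 64-bit float can hold, so every set of parameters would look equally bad. The sum of negative logs stays a perfectly ordinary number that an optimizer can work with.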
Taking the partial derivatives of the negative log-likelihood function with respect to μ and σ², setting them to zero (the derivative of a function is zero at its minimum or maximum), and solving the resulting equations gives us the maximum likelihood estimates (μ_MLE, σ²_MLE) for a Gaussian distribution. These turn out to be the sample mean and the sample variance with divisor n:

$$ \mu_{MLE} = \frac{1}{n} \sum_{i=1}^{n} x_i \qquad \sigma^2_{MLE} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu_{MLE})^2 $$

It may not be obvious, but in this closed-form solution we are implicitly optimizing a loss function. What we are really doing here is minimizing the ‘negative log likelihood’ function (i.e. the loss function), or in other words maximizing the ‘likelihood’ function, with the optimization carried out analytically (i.e. by computing partial derivatives and setting them to zero).
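The closed-form estimates and the loss-minimization view can be checked together in a short sketch. It uses only the Python standard library; the data is synthetic (drawn from a Gaussian with an assumed mean of 1.5 and standard deviation of 0.2) and the function names are chosen for this example.

```python
import math
import random


def negative_log_likelihood(data, mu, var):
    """Gaussian negative log-likelihood of (mu, var) for independent observations."""
    n = len(data)
    squared_error = sum((x - mu) ** 2 for x in data)
    return 0.5 * n * math.log(2 * math.pi * var) + squared_error / (2 * var)


def gaussian_mle(data):
    """Closed-form maximum likelihood estimates of the mean and variance."""
    n = len(data)
    mu = sum(data) / n
    var = sum((x - mu) ** 2 for x in data) / n  # divides by n, not n - 1
    return mu, var


# Synthetic stand-in data (assumed parameters, not the Iris measurements).
rng = random.Random(1)
data = [rng.gauss(1.5, 0.2) for _ in range(50)]

mu_mle, var_mle = gaussian_mle(data)
best_nll = negative_log_likelihood(data, mu_mle, var_mle)
print(f"mu_MLE={mu_mle:.4f}, var_MLE={var_mle:.4f}, loss={best_nll:.4f}")

# Nudging either parameter away from the closed-form estimate raises the loss.
perturbed_losses = []
for d_mu, var_scale in [(0.05, 1.0), (-0.05, 1.0), (0.0, 1.2), (0.0, 0.8)]:
    loss = negative_log_likelihood(data, mu_mle + d_mu, var_mle * var_scale)
    perturbed_losses.append(loss)
    print(f"mu shift {d_mu:+.2f}, variance x{var_scale}: loss={loss:.4f}")
```

Every perturbed setting produces a larger loss than the closed-form estimates, which is what we expect if those estimates sit at the minimum of the negative log-likelihood.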
The process of finding the maximum likelihood estimate (MLE) can also be interpreted as an attempt to minimize the KL (Kullback-Leibler) divergence. KL divergence measures the discrepancy between the true probability distribution and the estimated probability distribution. In our demonstration of MLE, which mirrors real-world machine learning scenarios, we typically do not know the true population distribution that our data comes from; we only have samples from that distribution. Because of this, we cannot directly compute the KL divergence between our model and the true distribution. Nevertheless, when we find the parameters that maximize the likelihood of our data, we are also finding the parameters that make our model as ‘close’ as possible, in the KL sense, to the distribution of the observed samples, which stands in for the true distribution. This means that the parameter values that maximize the likelihood would also, approximately, minimize the KL divergence to the true distribution, if we could compute it.
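In a simulation we can do what is impossible with real data: pick the true distribution ourselves and measure the KL divergence directly. The sketch below is a minimal illustration under that assumption. It uses the standard closed-form expression for the KL divergence between two univariate Gaussians; the true parameters (mean 1.5, variance 0.04), the sample sizes, and the function names are all chosen for the example.

```python
import math
import random


def kl_gaussians(mu_p, var_p, mu_q, var_q):
    """KL(p || q) for two univariate Gaussians p and q (closed form)."""
    return 0.5 * (math.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)


def gaussian_mle(data):
    """Closed-form maximum likelihood estimates of the mean and variance."""
    n = len(data)
    mu = sum(data) / n
    var = sum((x - mu) ** 2 for x in data) / n
    return mu, var


# The 'true' distribution is known here only because we simulate the data.
true_mu, true_var = 1.5, 0.04
rng = random.Random(7)

kl_to_fit = {}
for n in (20, 200, 2000):
    sample = [rng.gauss(true_mu, math.sqrt(true_var)) for _ in range(n)]
    mu_hat, var_hat = gaussian_mle(sample)
    kl_to_fit[n] = kl_gaussians(true_mu, true_var, mu_hat, var_hat)
    print(f"n={n:5d}: KL(true || MLE fit) = {kl_to_fit[n]:.6f}")

# For comparison: a model with a deliberately wrong mean.
kl_to_wrong = kl_gaussians(true_mu, true_var, 2.0, 0.04)
print(f"KL(true || model with mean 2.0) = {kl_to_wrong:.6f}")
```

The fitted models end up far closer to the true distribution than the deliberately misplaced one, even though the fitting procedure never saw the true parameters and only maximized the likelihood of the samples.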
While a closed-form solution like this one is usually the more computationally efficient route, not all models or distributions used in real-world applications have closed-form solutions for their MLEs. Therefore, we often resort to iterative numerical optimization methods like Gradient Descent, a technique frequently used in both machine learning and statistics to find the parameters step by step. For numerical optimization, we must explicitly define a loss function (such as the negative log-likelihood mentioned above) and choose an optimization algorithm to minimize it. A deep understanding of various loss functions and optimization methods is crucial for building a solid foundation in machine learning and data science.
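As a rough illustration of the numerical route, here is a minimal gradient descent sketch that minimizes the average Gaussian negative log-likelihood and compares the result with the closed-form estimates. Several choices here are assumptions made for the example rather than a recommended recipe: the data is synthetic, the variance is optimized through its logarithm so that it stays positive, and the learning rate, step count, and starting point are picked by hand.

```python
import math
import random


def fit_gaussian_gd(data, lr=0.1, steps=3000):
    """Minimize the average Gaussian negative log-likelihood by gradient descent.

    Parameters are mu and log_var = ln(variance); the log keeps the variance positive.
    """
    n = len(data)
    mu, log_var = 0.0, 0.0  # arbitrary starting point: mean 0, variance 1
    for _ in range(steps):
        var = math.exp(log_var)
        mean_err = sum(x - mu for x in data) / n
        mean_sq_err = sum((x - mu) ** 2 for x in data) / n
        # Gradients of: 0.5*ln(2*pi) + 0.5*log_var + mean_sq_err / (2*var)
        grad_mu = -mean_err / var
        grad_log_var = 0.5 * (1.0 - mean_sq_err / var)
        mu -= lr * grad_mu
        log_var -= lr * grad_log_var
    return mu, math.exp(log_var)


# Synthetic data with an assumed true mean of 5 and standard deviation of 2.
rng = random.Random(0)
data = [rng.gauss(5.0, 2.0) for _ in range(200)]

mu_gd, var_gd = fit_gaussian_gd(data)

# Closed-form estimates for comparison.
mu_cf = sum(data) / len(data)
var_cf = sum((x - mu_cf) ** 2 for x in data) / len(data)

print(f"gradient descent: mu={mu_gd:.4f}, var={var_gd:.4f}")
print(f"closed form:      mu={mu_cf:.4f}, var={var_cf:.4f}")
```

For the Gaussian the iterative result only reproduces what the formulas already give us, so there is no practical reason to use it here. The point is that the same loop still works for models where no closed-form solution exists, as long as the gradient of the loss can be computed.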