You’ve probably heard the buzzwords “Model Training”, “Machine Learning”, “Model Learning”, or “AI Model” quite often — whether in tech discussions, product demos, or data science talks.
However, when it comes to explaining what actually happens during this “training” process — in plain English or even in technical terms — most people are left guessing. Is the model memorizing data? Is it adjusting something inside? What exactly is it learning?
In this blog, let’s peel back the layers and understand what truly happens when a model is trained — step by step. We’ll start from a simple analogy and then gradually move into the math behind the learning process. The goal is to make the idea of “model training” not just familiar, but intuitively clear.
To understand the model learning process in a simple, non-technical way, imagine a child learning to throw a basketball into a hoop.
Initially, the child doesn’t know how much force to use. On the first try, the ball falls too short or goes too far. Depending on the outcome, the child adjusts slightly and tries again. After a few attempts, the child improves and starts hitting the target consistently.
That’s exactly how a machine learning model gets trained — it starts with random guesses, measures how wrong it was, adjusts itself, and improves over many repetitions. It learns not because someone told it what’s right, but by learning from its own mistakes.
To train any model — whether for image classification, prediction, or generative AI — data must be represented numerically (as integers, decimals, or vectors). We’ll skip the mathematical details of converting data to numeric format; in this blog we’ll work with an example whose data is already numerical.
A loss function is like a report card for a machine learning model. It tells the model how well or how poorly it performed on the training data by comparing its predictions with the actual answers. In simple terms, the loss function calculates the difference between what the model predicted and what it should have predicted. The bigger the difference, the higher the loss — meaning the model is doing poorly.
The whole idea of model training is to minimize the loss — that is, to reduce the gap between what the model predicted and what it should have predicted with every iteration.
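As a concrete sketch in plain Python (no ML library needed), here is the mean-squared-error loss we’ll use later in this post:

```python
def mse_loss(y_true, y_pred):
    """Mean squared error: average of squared prediction errors."""
    n = len(y_true)
    return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / n

# A model that predicts 0 everywhere is badly wrong here, so the loss is large
print(mse_loss([2, 4, 6], [0, 0, 0]))  # ≈ 18.67
```

The larger the gap between actual and predicted values, the larger this number — and training is the process of driving it down.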
An optimizer is the part of the training process that helps the model to learn from its mistakes. Once the loss function tells the model how wrong it was, the optimizer decides how to adjust the model’s internal parameters (like weights and biases) to reduce that error in the next round.
Think of it like the model’s coach or guide — after every attempt, it reviews the model’s performance (using the loss value) and gives it small, calculated corrections to move it closer to the right answer. Technically, an optimizer updates the model’s parameters ensuring that with every step, the model’s predictions improve.
Popular optimizers include Gradient Descent, Adam, RMSProp, and SGD.
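A minimal sketch of the simplest of these, vanilla gradient descent: the update is just “move each parameter a small step against its gradient” (`lr` here is the learning rate, introduced in the next section):

```python
def gradient_descent_step(param, grad, lr=0.1):
    # Move the parameter in the opposite direction of its gradient,
    # scaled by the learning rate
    return param - lr * grad
```

For example, `gradient_descent_step(0.0, -8.0)` nudges a parameter from 0 to 0.8. Adam and RMSProp are refinements of this same idea with adaptive step sizes.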
An optimization step is the actual moment when the model updates its internal parameters (like weights and biases) based on what it learned from the loss function. The optimization step applies the gradients calculated from the loss to make the model slightly better than before.
You can think of it as the model taking one step forward in the right direction toward minimizing the loss.
Over many such steps (iterations or epochs), the model gradually “learns” the best parameter values.
The learning rate, often denoted by the Greek letter η (eta), controls how big each optimization step should be. It’s a small numerical value that determines how quickly or slowly the model updates its parameters.
In simple terms —
The learning rate is like the step size the model takes while learning.
A good learning rate ensures the model moves steadily toward lower loss without jumping past the goal.
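To see why the step size matters, here is a toy experiment (using an assumed one-parameter loss L(w) = (w − 2)², not the blog’s regression example): a small learning rate walks steadily toward the minimum at w = 2, while an oversized one overshoots and diverges.

```python
def step(w, lr):
    grad = 2 * (w - 2)      # gradient of L(w) = (w - 2)^2
    return w - lr * grad

w_small, w_large = 0.0, 0.0
for _ in range(10):
    w_small = step(w_small, lr=0.1)  # steady progress toward w = 2
    w_large = step(w_large, lr=1.1)  # jumps past the goal; error grows

print(round(w_small, 3))  # close to 2
print(round(w_large, 3))  # far from 2 and getting worse each step
```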
Mathematically, each parameter θ is updated as:

θ_new = θ_old − η · (∂Loss/∂θ)
When we call `model.fit()` in Python, we are asking the model to learn patterns that map inputs to outputs. Behind this simple command lies a mathematical cycle of prediction, error measurement, and gradual improvement.
To begin any model training process, we need historical records that hold the input and output values required for training.
Let’s consider the data points below as our historical records, where x is the input and y is the output for our model training. Each row records the value of y that was observed for a given x.
| x | y |
|---|---|
| 1 | 2 |
| 2 | 4 |
| 3 | 6 |
Business problem: Build a model that predicts y for any given x, based on the historical data.
Making predictions in the world of model training is referred to as the “forward pass”: the model takes the true inputs from our historical record samples and computes its current predictions, which can then be compared against the true outputs.
Since we are considering linear regression as our example, we use the simple model equation:

ŷ = w·x + b

where w (the weight) and b (the bias) are the parameters the model will learn.
We start with initial model parameters w = 0 and b = 0. Running the forward pass for the three (x, y) pairs in our historical records gives:
| x | y (Actual) | ŷ (Predicted) |
|---|---|---|
| 1 | 2 | 0 |
| 2 | 4 | 0 |
| 3 | 6 | 0 |
The model predicts nothing correctly yet — it hasn’t learned.
Let’s break down one prediction for better understanding.
For example, consider the pair (x, y) = (1, 2), where x = 1 is the input to the model equation and y = 2 is the expected output.
Our parameters w and b are both 0, so substituting into ŷ = w·x + b gives ŷ = 0·1 + 0 = 0.
The same happens for the other (x, y) pairs, which is why every predicted value is 0.
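In code, the forward pass for this example is one line of arithmetic (a plain-Python sketch):

```python
def forward(x, w, b):
    # Linear model: predicted y (y_hat) = w * x + b
    return w * x + b

w, b = 0.0, 0.0                                  # untrained parameters
preds = [forward(x, w, b) for x in [1, 2, 3]]
print(preds)  # → [0.0, 0.0, 0.0]
```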
We measure how wrong the predictions are using Mean Squared Error (MSE), which is given by:

MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)²

where n is the number of training samples.
Substituting the numbers:

MSE = ((2 − 0)² + (4 − 0)² + (6 − 0)²) / 3 = (4 + 16 + 36) / 3 = 56/3 ≈ 18.67
The model now knows how badly it is doing, but not how to improve. That’s where gradients come in.
To improve, the model must figure out how changing each parameter (w, b) affects the loss.
This is done using gradients — the partial derivatives of the loss with respect to each parameter:

∂L/∂w = −(2/n) Σᵢ xᵢ(yᵢ − ŷᵢ)
∂L/∂b = −(2/n) Σᵢ (yᵢ − ŷᵢ)

At our current state (w = 0, b = 0):

∂L/∂w = −(2/3)(1·2 + 2·4 + 3·6) = −56/3 ≈ −18.67
∂L/∂b = −(2/3)(2 + 4 + 6) = −8

Both gradients are negative, which tells the model to increase w and b to reduce the loss.
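Sketched in plain Python, the two partial derivatives for this specific linear model (written directly from differentiating the MSE with respect to w and b):

```python
def gradients(xs, ys, w, b):
    # Partial derivatives of MSE loss w.r.t. w and b for y_hat = w*x + b
    n = len(xs)
    preds = [w * x + b for x in xs]
    dw = (2 / n) * sum((p - y) * x for p, y, x in zip(preds, ys, xs))
    db = (2 / n) * sum(p - y for p, y in zip(preds, ys))
    return dw, db

dw, db = gradients([1, 2, 3], [2, 4, 6], w=0.0, b=0.0)
print(round(dw, 2), round(db, 2))  # → -18.67 -8.0
```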
Now comes the optimization step — where we update parameters in the opposite direction of the gradient, scaled by the learning rate (η).
Let’s take η = 0.1.
We update each parameter using the learning rate (η):

w ← w − η · ∂L/∂w = 0 − 0.1 × (−18.67) = 1.867
b ← b − η · ∂L/∂b = 0 − 0.1 × (−8) = 0.8

After iteration 1: w = 1.867, b = 0.8
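The same update, written out in plain Python with the gradient values from the worked example:

```python
lr = 0.1                  # learning rate (eta)
w, b = 0.0, 0.0           # parameters before the update
dw, db = -56 / 3, -8.0    # gradients from the previous step (≈ -18.67 and -8)

w = w - lr * dw           # 0 - 0.1 * (-18.67) = 1.867
b = b - lr * db           # 0 - 0.1 * (-8)     = 0.8
print(round(w, 3), round(b, 3))  # → 1.867 0.8
```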
Training doesn’t stop after one update.
We repeat the process (forward pass → loss → gradient → update) for several epochs, each time bringing the model closer to the true pattern.
Let’s perform one more iteration to see the progression.
| x | y | ŷ (Predicted) |
|---|---|---|
| 1 | 2 | 2.667 |
| 2 | 4 | 4.534 |
| 3 | 6 | 6.401 |
Loss after iteration 2:

MSE = ((2 − 2.667)² + (4 − 4.534)² + (6 − 6.401)²) / 3 ≈ (0.445 + 0.285 + 0.161) / 3 ≈ 0.30

Loss dropped from 18.67 → 0.30 in just one iteration!

| Iteration | w | b | Loss (at start of iteration) |
|---|---|---|---|
| 1 | 1.867 | 0.800 | 18.67 |
| 2 | 1.671 | 0.693 | 0.30 |

If we finish the training process after two iterations, the model for the given data would be represented by the equation: ŷ = 1.671x + 0.693
Here, the values 1.671 and 0.693 are the learned parameters (weight and bias) that the model has adjusted during training to best fit the data.
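Putting the whole cycle together — the plain-Python loop below (no ML library) repeats forward pass → gradients → update, and after two iterations lands on w ≈ 1.671, b ≈ 0.693:

```python
def train(xs, ys, lr=0.1, epochs=2):
    w, b = 0.0, 0.0                     # start from untrained parameters
    for _ in range(epochs):
        n = len(xs)
        preds = [w * x + b for x in xs]                                  # forward pass
        dw = (2 / n) * sum((p - y) * x for p, y, x in zip(preds, ys, xs))  # gradients
        db = (2 / n) * sum(p - y for p, y in zip(preds, ys))
        w, b = w - lr * dw, b - lr * db                                  # optimization step
    return w, b

w, b = train([1, 2, 3], [2, 4, 6])
print(round(w, 3), round(b, 3))  # → 1.671 0.693
```

Increase `epochs` and w and b continue drifting toward the true pattern underlying the data (y = 2x, i.e. w → 2, b → 0).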
The animation attached shows how the regression line gradually adjusts during training for 10 iterations. With each iteration, the model updates its weight (w) and bias (b) to better fit the data points — moving closer to the true relationship between x and y.
And that’s how a simple mathematical routine turns into a “learning” machine or what we proudly call today a “Machine Learning Model.”