This blog post explains the Linear Regression algorithm, one way to carry out Data Modeling (the fourth step in the CRISP-DM model).
CRISP-DM: The Cross Industry Standard Process for Data Mining provides a structured approach to planning a data mining project. The model is an idealized sequence of the following phases:
- Business Understanding
- Data Understanding
- Data Preparation
- Data Modeling
- Model Evaluation
- Model Deployment
Data Modeling uses machine learning algorithms, in which the machine learns from data, much as humans learn from experience.
Machine Learning models are classified into two categories:
- Supervised learning methods: The historical data comes with labels. Regression and Classification algorithms fall under this category.
- Unsupervised learning methods: No pre-defined labels are assigned to the historical data. Clustering algorithms fall under this category.
For example, predicting a company’s revenue from historical data is a regression problem, and classifying whether a person is likely to default on a loan is a classification problem.
How does regression work?
Let’s consider an example: a company could predict its sales based on the money it puts into advertising.
Previous data on advertising spending and actual sales:
| Advertising expenditure (in thousands) | Sales (in lakhs) |
| --- | --- |
| 20 | 11 |
| 30 | 23 |
| 11 | 6 |
| 14 | 7 |
| 45 | 44.4 |
You would like to know what your sales would be if you spent an amount X on advertising.
Domain expertise always helps in judging prediction results: the company’s advertising team can give a rough idea of how a change in advertising expenditure affects sales. To estimate how much sales a given expenditure would generate, and to check whether a relationship between advertising expenditure and sales exists at all, you can use a regression algorithm to build a model and make predictions.
Let’s try to plot a graph of Advertising Expenditure versus Sales
Independent Variable: The variable on the X-axis, which is used to make the prediction (here, advertising expenditure).
Dependent Variable: The variable on the Y-axis, which we want to predict (here, sales).
The equation of a straight line is y = mx + c, where m is the slope of the line and c is the intercept.
What is the significance of m and c in the equation of a straight line?
‘m’ is the slope: it gives the change in Y for a one-unit change in X, so its sign and size describe the direction and steepness of the relationship.
‘c’ in the above example is the predicted amount of Sales when no money is spent on Advertising, that is, when X = 0.
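To make the roles of m and c concrete, here is a minimal Python sketch. The function name `predict` and the values m = 1.5 and c = 2.0 are hypothetical, chosen only for illustration; they are not fitted to the table above.

```python
def predict(x, m, c):
    """Return the value of the straight line y = m*x + c at x."""
    return m * x + c

# Hypothetical slope and intercept, for illustration only.
m, c = 1.5, 2.0

# At X = 0 the line returns the intercept c.
at_zero = predict(0, m, c)

# Moving X by one unit changes the prediction by the slope m.
step = predict(21, m, c) - predict(20, m, c)

print(at_zero, step)
```

Whatever values are chosen, the prediction at X = 0 is the intercept, and the difference between predictions one unit apart is the slope.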
Best Fit line: The line that best fits the scatter plot. What does best fit mean, and how do we determine whether a line is the best fit or not?
Residual: Residuals are used to find the best fit line. Every data point has a residual, which is the difference between the actual value and the predicted value (the value on the line at the same X). Let’s denote this by E (error).
E = Actual – Predicted (for every data point)
Minimize the total squared error, i.e. minimize $e_1^2 + e_2^2 + \dots + e_n^2$.
This is also called the Residual Sum of Squares (RSS). So, choose the values of m and c that minimize RSS.
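As a rough illustration, RSS can be computed for any candidate line on the data from the table above. The helper name `rss` and the two candidate lines are assumptions made for this sketch; the lines are picked by eye, not fitted.

```python
# Data from the table above: advertising expenditure and sales.
advertising = [20, 30, 11, 14, 45]
sales = [11, 23, 6, 7, 44.4]

def rss(m, c, xs, ys):
    """Residual Sum of Squares of the line y = m*x + c on the data."""
    residuals = [y - (m * x + c) for x, y in zip(xs, ys)]
    return sum(e ** 2 for e in residuals)

# Two hypothetical candidate lines, chosen by eye rather than fitted.
rss_a = rss(1.0, -5.0, advertising, sales)
rss_b = rss(0.5, 0.0, advertising, sales)

# The line with the smaller RSS fits the data better.
print(rss_a, rss_b)
```

Comparing candidates this way shows what "best fit" means: among all possible lines, the best fit is the one with the smallest RSS.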
Let’s write E in terms of m and c.
$e_i = y_i - y_{\text{pred},i}$, where $y_i$ is the actual value
$e_i = y_i - m x_i - c$
In Machine Learning models, a cost function is defined for a problem and then it is either minimized or maximized according to the requirement. In the regression described above, the cost function is the Residual Sum of Squares.
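Combining the residual expression with the definition of RSS, the cost function can be written explicitly as a function of m and c. This is only a restatement of the definitions above in LaTeX notation:

$$\mathrm{RSS}(m, c) = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left( y_i - m x_i - c \right)^2$$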
How to minimize a cost function?
- Differentiate the cost function with respect to ‘m’ and ‘c’ and set the derivatives equal to zero.
- Gradient Descent: start with some values of ‘m’ and ‘c’ and then iteratively move to better ‘m’ and ‘c’ to minimize the cost function.
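Both approaches can be sketched in Python on the data from the table above. For a straight line, setting the derivatives of RSS to zero gives the standard closed-form result: m is the sum of (x - x_mean)(y - y_mean) divided by the sum of (x - x_mean)^2, and c = y_mean - m * x_mean. For gradient descent, the starting point, learning rate, and number of steps below are assumptions tuned by hand for this small data set, and the loop minimizes RSS divided by n, which has the same minimizer as RSS.

```python
advertising = [20, 30, 11, 14, 45]
sales = [11, 23, 6, 7, 44.4]
n = len(advertising)

# Option 1: closed form obtained by setting the derivatives to zero.
x_mean = sum(advertising) / n
y_mean = sum(sales) / n
s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(advertising, sales))
s_xx = sum((x - x_mean) ** 2 for x in advertising)
m_closed = s_xy / s_xx
c_closed = y_mean - m_closed * x_mean

# Option 2: gradient descent on RSS / n.
# Start values, learning rate and step count are hand-picked assumptions.
m_gd, c_gd = 0.0, 0.0
learning_rate = 0.001
for _ in range(100_000):
    errors = [y - (m_gd * x + c_gd) for x, y in zip(advertising, sales)]
    grad_m = -2 / n * sum(e * x for e, x in zip(errors, advertising))
    grad_c = -2 / n * sum(errors)
    m_gd -= learning_rate * grad_m
    c_gd -= learning_rate * grad_c

print(round(m_closed, 3), round(c_closed, 3))
print(round(m_gd, 3), round(c_gd, 3))
```

On this data both approaches should agree on a slope of roughly 1.15 and an intercept of roughly -9.39. A larger learning rate can make the loop diverge on unscaled data like this, which is why features are often rescaled before gradient descent is used.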
RSS is an absolute quantity: if the units in the data set change, the value of RSS changes too, so RSS alone cannot tell you how good a fit is. To get a relative measure, RSS is compared against another quantity, TSS, the Total Sum of Squares.
How to calculate TSS?
TSS is the sum of the squared differences between each value of the target variable (y) and the mean of all its values.
$\text{TSS} = (Y_1 - Y_{mean})^2 + (Y_2 - Y_{mean})^2 + \dots + (Y_n - Y_{mean})^2$
TSS is the RSS of a horizontal line whose intercept (‘c’ in y = mx + c) equals Ymean and whose slope is zero; this line does not include any influence of the independent variable. It is a very basic model, and therefore any model that is built using the independent variable should be better than this basic model.
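A short Python sketch, using the sales column from the table above, shows that TSS is the same number as the RSS of this mean-only baseline line. The variable names are chosen for this sketch only.

```python
advertising = [20, 30, 11, 14, 45]
sales = [11, 23, 6, 7, 44.4]

y_mean = sum(sales) / len(sales)

# TSS: squared distance of each sales value from the mean of sales.
tss = sum((y - y_mean) ** 2 for y in sales)

# RSS of the baseline line y = 0*x + y_mean, which predicts the mean
# of sales for every advertising expenditure.
baseline_rss = sum((y - (0 * x + y_mean)) ** 2
                   for x, y in zip(advertising, sales))

print(tss, baseline_rss)
```

Because the baseline ignores advertising entirely, TSS acts as the reference error that a useful model has to beat.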
RSS/TSS is a normalized quantity.
$R^2 = 1 - \text{RSS}/\text{TSS}$
The higher the value of $R^2$, the better the model fits the data.
Let’s say the value of $R^2$ is 0.87; it means that the model explains 87% of the variance in the target variable.
If the predicted line passes through every data point exactly, then the difference between actual and predicted is 0 for every point, which means that RSS is 0 and hence $R^2$ is 1.
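The calculation can be sketched in Python on the data from the table above. The helper name `r_squared` is an assumption for this sketch, and the slope and intercept are obtained with the standard least-squares formulas for a straight line (m = sum of (x - x_mean)(y - y_mean) divided by sum of (x - x_mean)^2, c = y_mean - m * x_mean).

```python
advertising = [20, 30, 11, 14, 45]
sales = [11, 23, 6, 7, 44.4]

def r_squared(m, c, xs, ys):
    """R^2 = 1 - RSS/TSS for the line y = m*x + c."""
    y_mean = sum(ys) / len(ys)
    rss = sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))
    tss = sum((y - y_mean) ** 2 for y in ys)
    return 1 - rss / tss

# Least-squares slope and intercept for this data.
n = len(advertising)
x_mean, y_mean = sum(advertising) / n, sum(sales) / n
m = (sum((x - x_mean) * (y - y_mean) for x, y in zip(advertising, sales))
     / sum((x - x_mean) ** 2 for x in advertising))
c = y_mean - m * x_mean

r2_fit = r_squared(m, c, advertising, sales)
r2_baseline = r_squared(0.0, y_mean, advertising, sales)  # mean-only line

print(r2_fit, r2_baseline)
```

The mean-only line scores 0 because its RSS equals TSS, while the fitted line should score close to 1 on this small data set, since the five points lie near a straight line.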
The next post will cover using Linear Regression in Python.