Regression: The Crystal Ball of Machine Learning!
Despite the regressive ring of the word, regression is one of the most widely used techniques in Machine Learning. Regression algorithms are among the first that learners encounter, and they are widely used for prediction.
Let’s find out what’s progressive about regression.
What is regression?
The Free Dictionary defines regression as “A technique for predicting the value of a dependent variable as a function of one or more independent variables in the presence of random error.”
There are a few key concepts we need to get acquainted with, some of which I covered in my previous blogs. If you haven’t read them yet, I highly recommend reading them first:
- The concept of dependent and independent variables in this blog
- The concept of hypothesis and prediction in this blog
That leaves us with understanding “random error”. Let’s understand it through linear equations.
What is a linear equation?
This is a simple linear equation:
y = β + mx
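Symbol | Meaning |
y | The dependent variable |
x | The independent variable |
β | A constant value (the intercept: the value of y when x = 0) |
m | The slope (the change in y for each unit change in x) |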
Let’s assume β = 1 and m = 2, and plot the equation (y = 1 + 2x) on a chart. As the image below shows, it is a straight line.
Source: https://www.mathsisfun.com
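For readers who prefer code to charts, here is a minimal Python sketch of the same line. Only β = 1 and m = 2 come from the text; the function name `line` and the sample x values are chosen purely for illustration.

```python
# y = beta + m*x with beta = 1 and m = 2, the values assumed in the text.
def line(x, beta=1, m=2):
    return beta + m * x

xs = [0, 1, 2, 3, 4]          # illustrative x values
ys = [line(x) for x in xs]    # the matching y values

# Consecutive differences: on a straight line, each unit step in x
# changes y by the same amount, namely the slope m.
steps = [later - earlier for earlier, later in zip(ys, ys[1:])]

print(ys)     # [1, 3, 5, 7, 9]
print(steps)  # [2, 2, 2, 2]
```

The constant step of 2 is the slope m, and the first value, 1, is the constant β.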
Here’s a typical growth chart for babies.
Source: www.babycenter.com
Can this data/chart be explained using a linear equation? In other words, could this data fit into an equation like y = β + mx?
x (months) | y (weight) | Inference |
At birth | 6 pounds | The birth weight is 6 pounds. β in this case = 6 |
1 month | 9 pounds | For simplicity, let’s say the babies put on 3 pounds for every passing month. The slope (m) in this case will be 3 |
2 months | 12 pounds | |
The linear equation in this case is y = β + mx, that is: Weight = 6 + (3 × month)
The linear equation above is not accurate. It is far too simplistic, and doesn’t explain the data or the relationship between age and weight accurately.
In real life, not everything fits a straight line. But can it be made to fit? If we were to draw the straight line that gives the “best fit”, it would look something like this:
Oh dear, the line doesn’t pass through every data point. We cannot ask the babies to eat or drink more or less to fit along the straight line. Instead, we do the next best thing: “swalpa adjust maadi” (Kannada for “please adjust a little”).
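A quick way to see how simplistic the equation is: extend it beyond the first two months. The sketch below uses the same simplified numbers as the table (6 pounds at birth, 3 pounds a month); the function name `predicted_weight` is only for illustration.

```python
# Weight = 6 + (3 * month), the simplified equation from the table.
def predicted_weight(month, birth_weight=6, gain_per_month=3):
    return birth_weight + gain_per_month * month

# The equation reproduces the three rows of the table...
for month in (0, 1, 2):
    print(month, predicted_weight(month))   # 0 6, then 1 9, then 2 12

# ...but it keeps adding 3 pounds a month forever.
print(predicted_weight(12))   # 42
print(predicted_weight(24))   # 78
```

A 42-pound one-year-old is well above what growth charts typically show, so a straight line with these numbers can only be a rough approximation over a short range.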
These deviations from the line give rise to the random errors. Random errors are the unexplained variations in the data; their estimates, measured against a fitted line, are known as residuals. Our linear equation in this case might look like:
y = β + mx + ε, where ε is the random error
In essence, (β + mx) is the explained variation/relation, and ε is the unexplained variation/relation. A variety of statistical techniques are available to estimate β and m from the data. Ordinary Least Squares is a common one: it picks the line that minimizes the sum of the squared residuals.
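To make the least-squares idea concrete, here is a minimal sketch in plain Python. The data points are hypothetical, invented for illustration and not read off the growth chart. The helper name `fit_ols` is also an assumption. It uses the standard closed-form solution for a single independent variable.

```python
# Hypothetical data, invented for illustration: age in months, weight in pounds.
months  = [0, 1, 2, 3, 4, 5, 6]
weights = [6.0, 9.0, 11.5, 13.5, 15.0, 16.5, 17.5]

def fit_ols(xs, ys):
    """Return (beta, m) for the least-squares line y = beta + m*x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    m = sxy / sxx
    # Intercept: makes the line pass through the point (x_mean, y_mean).
    beta = y_mean - m * x_mean
    return beta, m

beta, m = fit_ols(months, weights)

# Residuals: observed weight minus the weight the fitted line predicts.
residuals = [y - (beta + m * x) for x, y in zip(months, weights)]

print(f"beta = {beta:.2f}, m = {m:.2f}")
print([round(e, 2) for e in residuals])
```

On these invented numbers the fit comes out at roughly β ≈ 7.04 and m ≈ 1.89. Some residuals are positive and some negative, and they sum to practically zero, which is a property of a least-squares line with a constant term.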
Often, in your area of business, you would hear people say, “I’m 70% confident that this will work”. Don’t breathe fire at the poor statistician; it’s the data talking. The smaller the unexplained variation, the better the model explains the data, and the more confidence we can place in it.
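One common way to put a number on “explained versus unexplained” is R² (the coefficient of determination, which the reading list at the end covers). It is a measure of fit, not literally the “70% confident” figure from the meeting room. The sketch below is an illustration under assumptions: the observations are hypothetical, and the second candidate line is picked by eye rather than fitted.

```python
# Hypothetical observations, invented for illustration (months, pounds).
months  = [0, 1, 2, 3, 4, 5, 6]
weights = [6.0, 9.0, 11.5, 13.5, 15.0, 16.5, 17.5]

def r_squared(xs, ys, beta, m):
    """Share of the variation in ys explained by the line y = beta + m*x."""
    y_mean = sum(ys) / len(ys)
    # Unexplained variation: squared gaps between the data and the line.
    ss_residual = sum((y - (beta + m * x)) ** 2 for x, y in zip(xs, ys))
    # Total variation: squared gaps between the data and its own mean.
    ss_total = sum((y - y_mean) ** 2 for y in ys)
    return 1 - ss_residual / ss_total

# Two candidate lines: the simplified equation Weight = 6 + 3*month, and a
# flatter line whose coefficients are picked by eye (an assumption).
r2_simple  = r_squared(months, weights, beta=6, m=3)
r2_flatter = r_squared(months, weights, beta=7, m=1.9)

print(f"Weight = 6 + 3.0*month -> R^2 = {r2_simple:.2f}")
print(f"Weight = 7 + 1.9*month -> R^2 = {r2_flatter:.2f}")
```

On this made-up data the flatter line leaves far less variation unexplained (R² of about 0.97 against about 0.28), so it is the line we would trust more.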
There are many factors that affect the unexplained variation:
- Quantity and quality of data (samples of 100 thousand and 100 million records can produce different results)
- Domain knowledge (the growth patterns of baby girls and boys are different, genetics plays a part in size, etc.)
- Changing conditions (say, a new formula feed is now helping babies grow twice as fast). Analytics is therefore not a one-off activity, but a constant feed-back and feed-forward program
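The first factor, quantity of data, can be illustrated with a small simulation. This is a sketch under made-up assumptions: the “true” line Weight = 6 + 3 × month, the noise level of 2 pounds, and the sample sizes of 10 and 1,000 are all invented for the demonstration, not taken from real growth data.

```python
import random
import statistics

def fitted_slope(xs, ys):
    """Least-squares slope of y on x."""
    x_mean = statistics.fmean(xs)
    y_mean = statistics.fmean(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return sxy / sxx

def simulate_slope(n, rng, beta=6.0, m=3.0, noise_sd=2.0):
    """Draw n noisy points around y = beta + m*x and refit the slope."""
    xs = [rng.uniform(0, 12) for _ in range(n)]
    ys = [beta + m * x + rng.gauss(0, noise_sd) for x in xs]
    return fitted_slope(xs, ys)

rng = random.Random(42)  # fixed seed so the run is repeatable

# Refit the slope on 200 small samples and on 200 larger samples.
small = [simulate_slope(10, rng) for _ in range(200)]
large = [simulate_slope(1000, rng) for _ in range(200)]

# Spread of the slope estimates across the repeated samples.
spread_small = statistics.pstdev(small)
spread_large = statistics.pstdev(large)
print(f"n=10:   slope estimates vary by about {spread_small:.3f}")
print(f"n=1000: slope estimates vary by about {spread_large:.3f}")
```

The underlying relation is identical in both cases. What changes is how much the fitted slope wobbles from one sample to the next, and the larger samples pin it down far more tightly.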
The types of regression techniques
Going back to regression, its objective is to test a hypothesis and to measure how strong the hypothesized relationship is. In the example of the babies, the questions would be: “Is there a relation between the weight and the age of the baby? If so, what is the relation, and how does age affect growth?”
There are many regression techniques available, and the commonly used ones are Linear regression and Logistic regression.
Further reading & understanding of the concepts
If you are interested in getting deeper into the topic, these are a few useful resources to start off with.
Simple Linear Regression | |
Intro to Linear Regression | https://www.youtube.com/watch?v=owI7zxCqNY0 |
Ordinary Least Squares | https://www.youtube.com/watch?v=0T0z8d0_aY4 |
Interpreting R2 | https://www.youtube.com/watch?v=9L7r3Uc4fGU |
Interpreting results | https://www.youtube.com/watch?v=54ewnkdWU6w |
Hypothesis Tests | https://www.youtube.com/watch?v=4WrfHUWDi7c |
Multiple Linear Regression | |
Intro to Multiple Linear Regression | https://www.youtube.com/watch?v=dQNpSa-bq4M https://www.youtube.com/watch?v=px72eCYPuvc |
Interpreting results | https://www.youtube.com/watch?v=wPJ1_Z8b0wk |
Linear regression models are commonly used for solving prediction problems, and logistic regression models for solving classification problems. Let’s get to know more about logistic regression in the next blog.