Regression: The Crystal Ball of Machine Learning!

Despite the regressive ring of the word, regression is one of the most widely used techniques in machine learning. Regression algorithms are among the first to be learnt, and they are widely used for prediction.

Let’s find out what’s progressive about regression.

What is regression?

The Free Dictionary defines regression as “a technique for predicting the value of a dependent variable as a function of one or more independent variables in the presence of random error”.

There are a few key concepts we need to get acquainted with, some of which I covered in my previous blogs. If you haven’t read them yet, I highly recommend reading them first:

  • The concept of dependent and independent variables in this blog
  • The concept of hypothesis and prediction in this blog 

That leaves us with understanding “random error”. Let’s understand it through linear equations.

What is a linear equation?

This is a simple linear equation:

y = β + mx

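y: the dependent variable
x: the independent variable
β: a constant value (the intercept, i.e. the value of y when x = 0)
m: the slope (the change in y for each unit increase in x)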

Let’s assume β = 1 and m = 2, and plot the equation (y = 1 + 2x) on a chart. As you can see from the image below, it is a straight line.

Source: https://www.mathsisfun.com
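As a rough sketch of the same idea in code, the snippet below evaluates y = 1 + 2x at a few points. The function name `line` and the chosen x values are only illustrative assumptions; the point is that y grows by the same amount (the slope) for every unit step in x.

```python
def line(x, beta=1.0, m=2.0):
    """Return y = beta + m * x for the example line y = 1 + 2x."""
    return beta + m * x

# A handful of x values chosen purely for illustration.
xs = [0, 1, 2, 3, 4]
ys = [line(x) for x in xs]

# Difference between consecutive y values: constant for a straight line.
steps = [later - earlier for earlier, later in zip(ys, ys[1:])]

for x, y in zip(xs, ys):
    print(f"x = {x}, y = {y}")
print("step between consecutive points:", steps)
```

Every step equals the slope m = 2, which is exactly what makes the plot a straight line.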

Here’s a typical growth chart for babies.

Source: www.babycenter.com

Can this data/chart be explained using a linear equation? In other words, could this data fit into an equation like y = β + mx?

x (months) | y (weight) | Inference
At birth (0) | 6 pounds | The birth weight is 6 pounds, so β in this case = 6
1 month | 9 pounds | For simplicity, let’s say babies put on 3 pounds with every passing month, so the slope (m) in this case is 3
2 months | 12 pounds |

The linear equation in this case is: y = β + mx, that is, Weight = 6 + (3 × month)
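Here is a minimal sketch of that equation as code, assuming the simplified numbers from the table (6 pounds at birth, 3 pounds gained per month). The names `BIRTH_WEIGHT_LB`, `GAIN_PER_MONTH_LB` and `predicted_weight` are made up for this illustration.

```python
BIRTH_WEIGHT_LB = 6.0    # beta: weight at month 0, taken from the table
GAIN_PER_MONTH_LB = 3.0  # m: assumed constant monthly gain, taken from the table

def predicted_weight(month):
    """Weight = 6 + (3 * month), the simplified straight-line model."""
    return BIRTH_WEIGHT_LB + GAIN_PER_MONTH_LB * month

# The first three months reproduce the table; the later ones extrapolate.
for month in (0, 1, 2, 12, 24):
    print(f"month {month:>2}: {predicted_weight(month):.0f} pounds")
```

At 24 months the line predicts 78 pounds, far more than any real toddler weighs, which is one quick way to see how far a naive straight line can drift from the data.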

The linear equation above is not accurate. It’s way too simplistic, and doesn’t explain the data & their relationship accurately.

In real life, not everything fits a straight line. But can it be made to fit one? If we were to draw the straight line that gives the “best fit”, it would look something like this:

Oh dear, the line doesn’t pass through every data point. We cannot ask the babies to eat or drink more or less just to fit along the straight line. Instead, we do the next best thing: “swalpa adjust maadi” (Kannada for “adjust a little”).

These deviations from the line give rise to the random errors. Random errors, whose estimates from a fitted line are called residuals, are the unexplained variations in the data. Our linear equation in this case might look like:

y = β + mx + ε , where ε is the random error

In essence, (β + mx) is the explained variation/relation, and ε is the unexplained variation/relation. There are a variety of statistical techniques for estimating the line and, with it, the random error. Ordinary Least Squares, which picks the β and m that minimise the sum of the squared errors, is a common one.
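To make Ordinary Least Squares a little more concrete, here is a small sketch of the textbook closed-form solution for a single independent variable. The `months` and `weights` values are invented purely for illustration (they are not taken from the growth chart), and `fit_ols` is a hypothetical helper name.

```python
def fit_ols(xs, ys):
    """Return (beta, m) minimising the sum of squared residuals for y = beta + m*x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    m = sxy / sxx
    # The fitted line always passes through the point (x_mean, y_mean).
    beta = y_mean - m * x_mean
    return beta, m

# Hypothetical observations: age in months, weight in pounds (made-up numbers).
months = [0, 1, 2, 3, 4, 5, 6]
weights = [6.0, 9.5, 11.0, 13.5, 14.0, 16.5, 17.0]

beta, m = fit_ols(months, weights)

# Residuals: the part of each observation the fitted line does not explain.
residuals = [y - (beta + m * x) for x, y in zip(months, weights)]

print(f"beta = {beta:.3f}, m = {m:.3f}")
print("residuals:", [round(r, 3) for r in residuals])
```

Some residuals are positive and some negative, and with an intercept in the model they sum to zero (up to rounding): the line sits “in the middle” of the points rather than passing through each of them.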

Often, in your area of business, you would hear people say, “I’m 70% confident that this will work”. Don’t breathe fire at the poor statistician; it’s the data talking. The smaller the unexplained variation, the higher the confidence that the model explains the data.
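One common way to put a number on “how much of the variation is explained” is R² (the coefficient of determination, which also appears in the further reading at the end of this post). The sketch below is only an illustration: the observations are made up, the line y = 7 + 1.8x is an assumed fit, and `r_squared` is a hypothetical helper. Note that R² measures how well a line fits the data; it is not literally the “70% confident” figure from a business meeting.

```python
def r_squared(observed, predicted):
    """Return 1 - (unexplained variation / total variation)."""
    mean_obs = sum(observed) / len(observed)
    # Unexplained variation: squared gaps between the data and the line.
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    # Total variation: squared gaps between the data and their own mean.
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical observations: age in months, weight in pounds (made-up numbers).
months = [0, 1, 2, 3, 4, 5, 6]
weights = [6.0, 9.5, 11.0, 13.5, 14.0, 16.5, 17.0]

# Predictions from an assumed line, y = 7 + 1.8x, chosen for illustration.
predictions = [7.0 + 1.8 * x for x in months]

r2 = r_squared(weights, predictions)
print(f"R^2 = {r2:.3f}")
```

A value close to 1 means little variation is left unexplained; a value close to 0 means the line explains hardly more than simply guessing the average weight.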

There are many factors that affect the unexplained variation:

  • Quantity and quality of data (sample sizes of 100 thousand and 100 million will produce different results)
  • Domain knowledge (the growth patterns of baby girls and boys are different, genetics play a part in size, etc.)
  • Changing conditions (a new formula feed is now helping babies grow twice as fast); analytics is therefore not a one-off activity, but a constant feed-back and feed-forward program

The types of regression techniques

Going back to regression: its objective is to validate a hypothesis and to measure how strong the hypothesised relationship is. In the example of the babies, the hypothesis would be “Is there a relation between the weight and the age of a baby? If so, what is the relation, and how does age affect growth?”.

There are many regression techniques available; the most commonly used ones are linear regression and logistic regression.

Further reading & understanding of the concepts

If you are interested in getting deeper into the topic, these are a few useful resources to start off with.

Simple Linear Regression

  • Intro to Linear Regression: https://www.youtube.com/watch?v=owI7zxCqNY0
  • Ordinary Least Squares: https://www.youtube.com/watch?v=0T0z8d0_aY4
  • Interpreting R²: https://www.youtube.com/watch?v=9L7r3Uc4fGU
  • Interpreting results: https://www.youtube.com/watch?v=54ewnkdWU6w
  • Hypothesis Tests: https://www.youtube.com/watch?v=4WrfHUWDi7c

Multiple Linear Regression

  • Intro to Multiple Linear Regression: https://www.youtube.com/watch?v=dQNpSa-bq4M and https://www.youtube.com/watch?v=px72eCYPuvc
  • Interpreting results: https://www.youtube.com/watch?v=wPJ1_Z8b0wk

Linear regression models are commonly used for solving prediction problems, and logistic regression models for solving classification problems. Let’s get to know more about logistic regression in the next blog.