In my previous blog, I briefly touched upon the concept of a hypothesis. Let’s look at it in detail in this one. In Machine Learning, hypothesis testing is an important concept, especially in regression modelling.
What is a hypothesis?
A business often makes a claim, and then works towards finding out whether it is right. For example, if a business wants to grow its revenue, and “thinks or believes” that the additional spend on marketing is going to deliver that growth, a claim (or hypothesis) has been made.
Hypothesis: More spend on marketing = revenue growth
Using data (or opinions or gut-feel or all put together), the business then works towards validating the claim. In some cases, revenue projections are made based on available data or what-if scenarios. In other cases, the business runs pilot programs to measure the effectiveness of the marketing spend and to check whether it does deliver the results.
With Machine Learning gaining popularity, predictive models are used to establish the business case. In this example, the modelling will test whether a positive relationship between marketing spend and revenue growth exists. This is called hypothesis testing.
What is hypothesis testing?
The basic premise of hypothesis testing is to check the VALIDITY of the NULL & the ALTERNATE HYPOTHESIS. Let’s take it one at a time.
In the business example above, our objective is to validate the relationship between marketing spend and revenue.
When we start the hypothesis testing, we set up two competing hypotheses:
- NULL HYPOTHESIS = There is NO relationship between marketing spend and revenue
- ALTERNATE HYPOTHESIS = There exists a relationship between marketing spend and revenue
In hypothesis testing, we start with the belief that the NULL hypothesis is true i.e., “there is NO relationship between marketing spend and revenue”. At the end of the hypothesis testing, using statistical methods, the NULL hypothesis is either REJECTED or NOT REJECTED. When the NULL hypothesis is rejected, the data is taken as support for the ALTERNATE hypothesis. When it is not rejected, it only means the data did not provide enough evidence against it, not that it has been proven true.
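For readers who prefer symbols, here is one way the two hypotheses could be written. It assumes a simple linear regression of revenue on marketing spend, which is my choice for illustration; the symbols \beta_0 (intercept), \beta_1 (slope) and \varepsilon (noise) are not part of the business example itself.

```latex
\begin{aligned}
\text{revenue} &= \beta_0 + \beta_1 \cdot \text{marketing spend} + \varepsilon \\
H_0 &: \beta_1 = 0 \quad \text{(NULL: no relationship)} \\
H_1 &: \beta_1 \neq 0 \quad \text{(ALTERNATE: a relationship exists)}
\end{aligned}
```

If the business only cares about a positive relationship, the ALTERNATE hypothesis could instead be written as \beta_1 > 0, which makes it a one-sided test.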
The framework for hypothesis testing
How does one go about hypothesis testing? Well, as with everything else, it starts with an idea. A framework from idea to reality looks like:
Describe the Idea/problem: It starts with describing the problem or the need. In our example, it starts with “Our business needs to grow”. From this, many hypotheses will emerge. One such hypothesis is “Increase the spend on marketing” – which the business decided to test out. Our objective is to find whether a “positive” relation exists between marketing spend, and revenue. By increasing the spend on marketing, would the revenue grow?
Define the NULL & ALT Hypothesis: The next step is to define the NULL & ALT hypothesis. As a standard practice, the NULL hypothesis assumes there is NO relationship between the variables (marketing spend is the input variable, and revenue is the output variable). In our example, it will be:
NULL Hypothesis: There is NO relationship between marketing spend and revenue
ALT Hypothesis: There exists a relationship between marketing spend and revenue
Note: A “thumb-rule” is that the NULL hypothesis is assumed to be true. Testing starts with that premise, and the objective is either to reject the NULL hypothesis or to fail to reject it.
Define the ACC & REJ criteria: At what point do we reject the NULL hypothesis? This is determined by the significance level (denoted by the symbol α), which is chosen before looking at the data. Its value depends on the context.
For example, if the business sets a cut-off of 10%, then α = 0.1. What it means is that the business is willing to accept no more than a 10% chance of concluding that marketing spend and revenue are related when they are in fact un-related.
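To get a feel for what α = 0.1 means, here is a small simulation sketch. It generates many data sets in which spend and revenue are unrelated by construction, tests each one, and counts how often the NULL hypothesis gets rejected anyway. The data is random noise, not real business data, and the test used (Fisher's z approximation for a correlation) is simply my pick for a short example. The helper names `pearson_r`, `approx_p_value` and `false_positive_rate` are made up for this sketch.

```python
import math
import random
from statistics import NormalDist


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(xs, ys):
    """Two-sided p-value for 'no correlation', using Fisher's z approximation."""
    n = len(xs)
    z = math.atanh(pearson_r(xs, ys)) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(alpha, trials=2000, n=30, seed=0):
    """Share of trials that reject the NULL even though it is true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        # Spend and revenue are drawn independently, so the NULL is true.
        spend = [rng.gauss(0, 1) for _ in range(n)]
        revenue = [rng.gauss(0, 1) for _ in range(n)]
        if approx_p_value(spend, revenue) < alpha:
            rejections += 1
    return rejections / trials


rate = false_positive_rate(alpha=0.1)
print(f"Share of false alarms at alpha = 0.1: {rate:.3f}")
```

The printed share should land close to 0.1. In other words, α is the rate of false alarms we agree to live with, not the chance that the two variables are unrelated.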
Identify the testing methods: There are a variety of testing methods applied, depending on the context. The most common ones are the Z-Test, T-Test, and Chi-Square Tests. The objective of these testing methods is to check the validity of the NULL hypothesis, based on the evidence from the data sets/observations.
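Calculate the probability: The probability value (or the p-value) is the statistical evidence obtained from the data set via the test. The p-value is the probability of observing a test statistic at least as extreme as the one computed from the data, assuming the NULL hypothesis is true. The smaller the p-value, the stronger the evidence against the NULL hypothesis.
Decide: Finally, the p-value is compared with the significance level (α). If the p-value is less than α, the NULL hypothesis is rejected; otherwise, it is not rejected. Based on this, a final decision is taken on whether to spend on marketing to improve the revenues.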
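One way to see where a p-value comes from is a permutation test, sketched below. This is not one of the tests named above; I am using it because it needs nothing beyond the Python standard library. The idea is to shuffle the revenue figures many times, which breaks any link with spend, and count how often the shuffled data shows a correlation at least as strong as the real one. The spend and revenue numbers are invented for illustration, and `pearson_r` and `permutation_p_value` are hypothetical helper names.

```python
import math
import random


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def permutation_p_value(xs, ys, n_permutations=5000, seed=0):
    """Two-sided permutation p-value for the NULL 'xs and ys are unrelated'."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # simulate a world where the NULL is true
        if abs(pearson_r(xs, shuffled)) >= observed:
            at_least_as_extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Invented monthly figures, for illustration only.
spend = [10, 12, 15, 18, 20, 22, 25, 28, 30, 35]
revenue = [104, 101, 115, 112, 124, 119, 131, 128, 140, 145]

p_value = permutation_p_value(spend, revenue)
print(f"p-value: {p_value:.4f}")
```

Because the invented revenue rises almost in step with spend, very few shuffles match the observed correlation, so the p-value comes out small.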
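The statistical part of the decision boils down to a single comparison, sketched here. The function name `decide` and the returned strings are my own; the rule itself is the standard one of rejecting the NULL hypothesis only when the p-value is below α.

```python
def decide(p_value, alpha):
    """Return the test decision for a p-value at significance level alpha."""
    if not 0 <= p_value <= 1:
        raise ValueError("p_value must lie between 0 and 1")
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if p_value < alpha:
        return "reject NULL"
    return "fail to reject NULL"


print(decide(0.03, 0.1))  # p-value below alpha
print(decide(0.23, 0.1))  # p-value above alpha
```

Note that a p-value above α leads to "fail to reject", which is weaker than "accept": the data simply did not provide enough evidence against the NULL hypothesis.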
The big picture
Let’s bring it all together and observe what happens.
|OBJECTIVE||“Our business needs to grow”|
|CHOSEN HYPOTHESIS||“Increase spend on marketing, to increase revenue”|
|NULL HYPOTHESIS||There is NO relation between marketing spend and revenue|
|ALT HYPOTHESIS||There EXISTS a relation between marketing spend and revenue|
|SIGNIFICANCE VALUE (α) (to reject null hypothesis)||0.1, i.e., no more than a 10% chance of declaring a relation when marketing spend and revenue are in fact un-related|
|p-VALUE (from the tests)||0.23, i.e., if marketing spend and revenue were truly un-related, data showing a relation at least this strong would still turn up 23% of the time|
|REJECTION CRITERIA FOR NULL HYPOTHESIS||p-value < α|
|OBSERVED VALUE||p-value (23%) > α (10%)|
|INFERENCE||The observed relation could plausibly be down to chance. Note that the p-value is NOT the probability that the two are unrelated, so it does not mean they are related 77% of the time|
|FINAL DECISION||NULL HYP = Not rejected; ALT HYP = Not supported by this data|
|BUSINESS DECISION||The data does not give enough evidence that marketing spend improves revenue. The business may gather more data or run a pilot before committing to the spend|
Further reading & understanding of the concepts
If you are interested in getting deeper into the topic, these are a few useful resources to start off with.
|Introduction to hypothesis||https://www.youtube.com/watch?v=VK-rnA3-41c|
|When to use Z-test & T-test||https://www.youtube.com/watch?v=YsalXF5POtY|
Thank you for reading this far, and for encouraging me to continue to write. In the next blog, let’s get an understanding of regression models, a commonly used Machine Learning method.