Many times, we are overwhelmed by the size of a problem and left confounded. A common and effective problem-solving strategy is to “break it down” into smaller components. Breaking a problem down lets us manage small chunks instead of one monolithic monster. Decision trees employ a similar technique to solve prediction problems.
What is a decision tree?
Decision trees are supervised learning algorithms that employ the “divide & conquer” strategy to solve prediction problems. The tree-like structures that these algorithms use to predict outcomes give them their name. The tree begins with the “root node”, which represents the complete data set (all observations). Various strategies are then used to split the root (parent) node into branches (also referred to as child or internal nodes).
A sample decision tree for deciding whether to accept a new job offer:
In the example above, the output is discrete (accept or decline the offer). Decision trees with discrete output are used for solving classification problems and are called classification trees. When the output is continuous, the decision trees are referred to as regression trees.
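To make the difference concrete, here is a minimal Python sketch of how a single leaf could turn the observations it holds into a prediction. It assumes a common convention that this post does not spell out: a classification leaf predicts the most frequent class, and a regression leaf predicts the average value. The function names and the sample values are made up for illustration.

```python
from collections import Counter
from statistics import mean


def classification_leaf(labels):
    """Predict the most common class among the observations in a leaf (assumed convention)."""
    return Counter(labels).most_common(1)[0][0]


def regression_leaf(values):
    """Predict the average target value among the observations in a leaf (assumed convention)."""
    return mean(values)


# Discrete output: the leaf answers with a class label.
print(classification_leaf(["accept", "decline", "accept"]))

# Continuous output: the leaf answers with a number (hypothetical salaries).
print(regression_leaf([52000, 48000, 50000]))
```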
Generating decision trees
- Root node: Start with the root (parent) node, which represents the entire data set (100% of the available observations)
- Splitting the nodes: Based on certain criteria (the splitting criteria), the root node is split into two or more child (internal) nodes
- Stopping criteria: The nodes keep splitting further until the stopping criteria are met
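The three steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the one way to grow a tree: the splitting criterion is Gini impurity, splits are binary, and the stopping criteria are "the node is pure", "no split improves impurity", or "max_depth reached". None of these choices come from this post, and the toy data (salary in thousands, commute in hours) is invented.

```python
from collections import Counter


def gini(labels):
    """Gini impurity (assumed splitting criterion): 0.0 when all labels agree."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return the (feature, threshold) with the lowest weighted impurity, or None."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score:  # keep only splits that reduce impurity
                best, best_score = (f, t), score
    return best


def build_tree(rows, labels, depth=0, max_depth=3):
    """Start at the root, split, and stop when a stopping criterion is met."""
    split = None if depth >= max_depth else best_split(rows, labels)
    if split is None:  # stopping criteria met: this node becomes a leaf
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
        "right": build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth),
    }


def predict(tree, row):
    """Walk from the root to a leaf and return the leaf's label."""
    while "leaf" not in tree:
        tree = tree["left"] if row[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["leaf"]


# Invented toy data: (salary in thousands, commute in hours)
rows = [(60, 0.5), (55, 0.75), (45, 0.5), (70, 1.5), (40, 2.0), (65, 0.25), (75, 0.5)]
labels = ["accept", "accept", "decline", "decline", "decline", "accept", "accept"]

tree = build_tree(rows, labels)
print(tree)
print(predict(tree, (80, 2.0)))
```

The root call sees all of the observations; each recursive call sees only the subset that reached that node, which is the "divide & conquer" idea in code.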
The final set of nodes, once the stopping criteria are met, are called terminal (or leaf) nodes. These nodes are used for generating business rules. From the above example, one of the business rules to predict job offer acceptance is:
- If “Salary > $50,000” and “Commute < 1 hour” and “Coffee = Free”, then “accept the job”
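A business rule like this maps directly onto code. Here is a small sketch in Python; the function and parameter names are my own, and treating every other combination as "decline" is an assumption, since only this one rule is listed here.

```python
def accept_offer(salary, commute_hours, free_coffee):
    """Apply the single rule: Salary > $50,000, Commute < 1 hour, Coffee = Free."""
    # Assumption: anything that fails the rule is treated as "decline".
    if salary > 50_000 and commute_hours < 1 and free_coffee:
        return "accept"
    return "decline"


print(accept_offer(salary=60_000, commute_hours=0.5, free_coffee=True))
print(accept_offer(salary=60_000, commute_hours=1.5, free_coffee=True))
```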
Random Forests – An ensemble method
Now that you have an idea about (decision) trees, let’s get into (random) forests. First, ensemble methods.
In my previous blog post, I wrote about how the variance of the output depends on the quantity and quality of the data available. In certain situations, therefore, it isn’t wise to rely on the predictions of a single model alone. This is where ensemble methods come in.
An ensemble method generates several models using different sampling strategies and combines them to produce the result. Each classification model is given a weight, and the final prediction is decided by the (weighted) majority.
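As a rough sketch of the combining step, the Python snippet below adds up the weight behind each predicted class and picks the class with the largest total. The function name and the weights are hypothetical; how weights are actually assigned depends on the ensemble method.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Return the class whose supporting models carry the largest total weight."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


model_predictions = ["accept", "decline", "accept"]

# Equal weights: a plain majority vote.
print(weighted_vote(model_predictions, [1, 1, 1]))

# Hypothetical unequal weights: one heavily weighted model can outvote two others.
print(weighted_vote(model_predictions, [0.2, 0.7, 0.3]))
```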
Random forest is a popular ensemble method. In a random forest, several trees (hence the name forest) are developed using different sampling strategies, and the result is obtained from their combined weighted output. A frequently used sampling strategy is bootstrap aggregating (bagging).
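Here is a minimal Python sketch of bootstrap aggregating. Each model is trained on a sample drawn with replacement from the original data, and the models then vote with equal weight. To keep it short, each "tree" is a one-split stump that minimizes training errors; that is a simplification I am assuming for illustration, not how a full random forest grows its trees. All names and the toy data are invented.

```python
import random
from collections import Counter


def majority(labels):
    """Most common label in a list."""
    return Counter(labels).most_common(1)[0][0]


def bootstrap_sample(rows, labels, rng):
    """Draw len(rows) observations with replacement (the 'bootstrap' step)."""
    idx = [rng.randrange(len(rows)) for _ in rows]
    return [rows[i] for i in idx], [labels[i] for i in idx]


def fit_stump(rows, labels):
    """One-split tree (simplification): fewest training errors wins."""
    default = majority(labels)
    best_errors, best_split = len(labels) + 1, None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            left_label, right_label = majority(left), majority(right)
            errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
            if errors < best_errors:
                best_errors, best_split = errors, (f, t, left_label, right_label)
    if best_split is None:  # the sample had no usable split
        return lambda row: default
    f, t, left_label, right_label = best_split
    return lambda row: left_label if row[f] <= t else right_label


def fit_bagged_stumps(rows, labels, n_trees=25, seed=0):
    """Train one stump per bootstrap sample."""
    rng = random.Random(seed)
    return [fit_stump(*bootstrap_sample(rows, labels, rng)) for _ in range(n_trees)]


def forest_predict(forest, row):
    """Aggregate: every stump votes with equal weight."""
    return majority([tree(row) for tree in forest])


# Invented toy data: salary in thousands
rows = [(30,), (35,), (40,), (60,), (65,), (70,)]
labels = ["decline", "decline", "decline", "accept", "accept", "accept"]

forest = fit_bagged_stumps(rows, labels)
print(forest_predict(forest, (80,)))
print(forest_predict(forest, (20,)))
```

Because every stump sees a slightly different sample, individual stumps can disagree; the vote smooths out those differences, which is the reason for not relying on a single model.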
We are nearing the completion of this series of blog posts on Analytics Translator, Machine Learning, and Analytics. In the next post, which will be the final one, I will cover commonly used tools for Machine Learning and Analytics.