
Tree Models

Chocky _18

Navigating through simple decision trees, bootstrapping/bagging and finally leading up to random forest models.




Tree-based models use a series of if-then rules to generate predictions from one or more decision trees. All tree-based models can be used for either regression (predicting numerical values) or classification (predicting categorical values). We’ll explore three types of tree-based models:

  1. Decision tree models, which are the foundation of all tree-based models.

  2. Random forest models, an “ensemble” method which builds many decision trees in parallel.

  3. Gradient boosting models, an “ensemble” method which builds many decision trees sequentially.


Decision Tree Models:

A decision tree is a powerful supervised learning tool that recursively splits your data into separate “islands” (via feature splits) in order to decrease the overall weighted loss of the fit on the training set. For classification, the standard choice is a modal (majority-class) prediction at each leaf with the Gini index as the loss; for regression, it is the mean prediction at each leaf with L2 (squared-error) loss. It is worth noting that a decision tree can in principle fit any model within each split during the tree-building procedure, e.g. linear regression, logistic regression, or a small neural network; this more generalized approach is known as a model tree, and it lets you build decision trees out of any model of your choice.
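As a concrete illustration of the two standard variants described above, here is a minimal sketch assuming a reasonably recent scikit-learn and its bundled toy datasets; the datasets and parameter values are only for demonstration:

```python
from sklearn.datasets import load_diabetes, load_iris
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification: splits are chosen to minimize weighted Gini impurity,
# and each leaf predicts the modal (most common) class.
X_cls, y_cls = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
clf.fit(X_cls, y_cls)
print(clf.predict(X_cls[:5]))   # predicted classes for the first five rows

# Regression: splits are chosen to minimize weighted squared (L2) error,
# and each leaf predicts the mean target value.
X_reg, y_reg = load_diabetes(return_X_y=True)
reg = DecisionTreeRegressor(criterion="squared_error", max_depth=3, random_state=0)
reg.fit(X_reg, y_reg)
print(reg.predict(X_reg[:5]))   # predicted means for the first five rows
```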



How Do We Actually Create These Decision Tree Models?

There are essentially two key components to building a decision tree model: determining which features to split on and then deciding when to stop splitting.

When determining which feature to split on, the goal is to select the feature that will produce the most homogeneous resulting subsets. The simplest and most commonly used way to do this is to minimize entropy, a measure of the randomness within a dataset, and maximize information gain, the reduction in entropy that results from splitting on a given feature.

We’ll split on the feature that results in the highest information gain, and then recompute entropy and information gain for the resulting subsets. The partitioning process continues until no further useful separation can be made, i.e. the model tries to reach a state where each leaf node is as pure as possible. When making predictions, a new data point walks down the sequence of decision nodes to arrive at a prediction.
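To make the entropy and information gain calculations concrete, here is a rough sketch in plain NumPy; the function names and the toy split are our own illustration, not part of any particular library:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    """Reduction in entropy from splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted_child_entropy = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted_child_entropy

# Toy example: a split that perfectly separates the two classes.
parent = np.array([0, 0, 0, 1, 1, 1])
left, right = np.array([0, 0, 0]), np.array([1, 1, 1])
print(information_gain(parent, left, right))  # 1.0 bit: entropy drops from 1.0 to 0.0
```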

Strengths:

  • They are intuitive and easy to understand, even for people from non-analytical backgrounds.

  • Decision Trees are a non-parametric method which does not require that the data set follow a normal distribution.

  • They are tolerant of data quality issues and outliers, requiring less data preparation (e.g. scaling and normalization) prior to implementation. Further, they work well with both categorical and continuous variables.

  • They can be used during the data exploration phase to quickly identify significant variables.

Challenges:

  • Decision trees are prone to overfitting, which occurs when the model is fit too closely to the training data (see the bias-variance tradeoff): it learns the granular details and noise in the training data to the extent that this impairs its ability to make predictions on new data. An overly complex model risks predicting poorly on data it hasn’t seen before.

  • Decision trees also suffer from high variance. If the data set is small, the result can be very different depending on how the training and testing samples are split. (A sketch of how a tree can be constrained against both issues follows below.)
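One common way to mitigate both challenges is to constrain the tree, for example by limiting its depth or applying cost-complexity pruning. A minimal sketch with scikit-learn follows; the specific values for max_depth and ccp_alpha are illustrative assumptions, not recommendations:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# A fully grown tree tends to memorize the training data (high variance),
# while a constrained tree trades a little bias for more stable predictions.
unconstrained = DecisionTreeClassifier(random_state=0)
constrained = DecisionTreeClassifier(max_depth=4, ccp_alpha=0.01, random_state=0)

# Cross-validation scores on held-out folds reveal the gap between
# an overfit tree and a regularized one.
print(cross_val_score(unconstrained, X, y, cv=5).mean())
print(cross_val_score(constrained, X, y, cv=5).mean())
```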

Bagging (Bootstrap Aggregation):

Bagging is an ensemble technique used to reduce the variance of predictions by combining the results of multiple decision tree models, each trained on a different bootstrap sample (a random sample drawn with replacement) of the same data set. This is particularly useful when the data set is limited in size. The predictions of all the models are combined using the mean, median, or mode, depending on the problem.
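A minimal bagging sketch with scikit-learn, assuming the default base learner (a decision tree); the dataset and number of estimators are only illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# By default the base learner is a decision tree; each of the 100 trees is fit
# on a bootstrap sample (drawn with replacement) of the training rows, and the
# final prediction is a majority vote across trees.
bagger = BaggingClassifier(n_estimators=100, bootstrap=True, random_state=0)
bagger.fit(X_train, y_train)
print(bagger.score(X_test, y_test))  # accuracy on held-out data
```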




Boosting:

Boosting is another type of ensemble learning that combines weak learners to achieve improved model performance. Weak learners are simple models that predict relatively poorly on their own. The idea of boosting is to train models sequentially, each one trying to fit better than the one before. One approach, known as Adaptive Boosting (AdaBoost), modifies the weights of the data points based on the previous outcome: for each subsequent model, correctly classified data points are given a lower weight and misclassified data points are given a higher weight. The higher weights push the next model to learn the specifics of those data points. In the end, all models contribute to making predictions.
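A minimal AdaBoost sketch with scikit-learn, where the default weak learner is a depth-1 decision tree (a “stump”); the dataset and hyperparameters are only illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Stumps are trained one after another; each round re-weights the training
# points so that previously misclassified ones get more attention.
booster = AdaBoostClassifier(n_estimators=200, learning_rate=0.5, random_state=0)
booster.fit(X_train, y_train)
print(booster.score(X_test, y_test))  # accuracy of the combined ensemble
```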






Random Forest:

Random Forest is an ensemble learning method that builds multiple tree models using different subsets of features, with or without bootstrap sampling of the rows. It can handle high-dimensional data sets with many variables efficiently, since only a subset of features is considered when building each tree. The intuition behind limiting the number of features per tree is to reduce the correlation between the trees, which arises when a few strong predictors are consistently chosen by the decision nodes; averaging highly correlated models does not meaningfully reduce the variance of the results. Random Forest is popular because it is versatile and can be trained quickly with high accuracy.
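A minimal random forest sketch with scikit-learn; the dataset and hyperparameters are illustrative assumptions, not recommendations:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=300,      # number of trees built independently
    max_features="sqrt",   # random feature subset per split, to decorrelate trees
    bootstrap=True,        # each tree sees a bootstrap sample of the rows
    random_state=0,
)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))

# Feature importances can support the "data exploration" use mentioned earlier.
print(forest.feature_importances_[:5])
```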



Summary:

In this article we reviewed some broad-stroke terminology and techniques used to improve tree-based models. Tree-based models are popular because of their intuitive nature.





