The accuracy of a predictive model can be improved in two ways: by investing in feature engineering or by applying boosting algorithms directly. There are multiple boosting algorithms, such as Gradient Boosting, XGBoost, AdaBoost, and Gentle Boost. Each algorithm has its own underlying mathematics, and they behave slightly differently when applied. Two terms come up frequently when working with these algorithms: bagging and boosting.
Bagging: An approach where you take random samples of the data, build a learner on each sample, and take a simple mean of the learners' outputs to obtain the bagged prediction.
Boosting: Boosting is similar, but the samples are selected more intelligently: each successive round gives more and more weight to the observations that are hard to classify.
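As a rough illustration of the bagging idea above, here is a minimal sketch in plain Python. It assumes bootstrap samples drawn with replacement and a simple least-squares line as the base learner. The synthetic data and the names fit_line and bagged_predictor are hypothetical, chosen only for this example.

```python
import random

def fit_line(points):
    """Least-squares fit of y = slope * x + intercept to (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    if var_x == 0:  # degenerate sample: fall back to a constant model
        return lambda x: mean_y
    slope = sum((x - mean_x) * (y - mean_y) for x, y in points) / var_x
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def bagged_predictor(points, n_models=25, seed=0):
    """Fit one model per bootstrap sample and average their predictions."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = rng.choices(points, k=len(points))  # random sample with replacement
        models.append(fit_line(sample))
    return lambda x: sum(m(x) for m in models) / len(models)

# Synthetic data (an assumption for illustration): y = 2x + 1 plus Gaussian noise.
noise = random.Random(42)
data = [(i / 10, 2 * (i / 10) + 1 + noise.gauss(0, 0.5)) for i in range(50)]

predict = bagged_predictor(data)
bagged_at_2 = predict(2.0)  # should land near the noise-free value 2 * 2 + 1
print(bagged_at_2)
```

Every sample here is drawn with equal probability. Boosting differs in that later learners concentrate on the observations that earlier learners got wrong.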
Below are the default algorithms used in predictive models generated in ThingWorx Analytics:
- Decision Tree
- Gradient Boost
- Linear Regression
- Neural Net
- Random Forest
- Logistic Regression
Gradient boosting is a machine learning technique for regression and classification problems that produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion, as other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function.
Let’s begin with an easy example:
Assume you are given an existing model M to improve. You observe that the model currently has an accuracy of 80% (by any metric). How do you improve it further?
One simple way is to build an entirely different model using a new set of input variables and to try better ensemble learners. Instead, we suggest a much simpler way. It goes like this: Y = M(x) + error
What if we see that the error is not white noise but has some correlation with the outcome (Y)? Then we can develop a model on this error term, like: error = G(x) + error2
The accuracy will probably improve to a higher number, say 84%. Let’s take another step and regress against error2: error2 = H(x) + error3
Now we combine all these together: Y = M(x) + G(x) + H(x) + error3
This will probably have an accuracy even higher than 84%. What if we can also find optimal weights for each of the three learners? Y = alpha * M(x) + beta * G(x) + gamma * H(x) + error4
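The chain of equations above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: M, G and H are each a one-split regression tree (a stump), the data is a synthetic curve y = x^2, and the helper names fit_stump and mse are hypothetical. The weighting step with alpha, beta and gamma is not shown.

```python
def fit_stump(xs, targets):
    """Fit a one-split regression stump that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # dropping the largest x keeps both sides non-empty
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def mse(errors):
    """Mean squared value of a list of errors."""
    return sum(e * e for e in errors) / len(errors)

# Synthetic data (an assumption for illustration).
xs = [i / 10 for i in range(50)]
ys = [x ** 2 for x in xs]

M = fit_stump(xs, ys)                            # Y = M(x) + error
error = [y - M(x) for x, y in zip(xs, ys)]
G = fit_stump(xs, error)                         # error = G(x) + error2
error2 = [e - G(x) for x, e in zip(xs, error)]
H = fit_stump(xs, error2)                        # error2 = H(x) + error3
error3 = [e - H(x) for x, e in zip(xs, error2)]

def combined(x):
    """Y is approximated by M(x) + G(x) + H(x)."""
    return M(x) + G(x) + H(x)

# The remaining error shrinks at each stage on this data.
print(mse(error), mse(error2), mse(error3))
```

Each new model is trained only on what the previous models left unexplained, which is why the remaining error on the training data shrinks from one stage to the next.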
How Gradient Boosting Works:
1. Loss Function: The loss function used depends on the type of problem being solved. It must be differentiable, but many standard loss functions are supported and you can define your own. A benefit of the gradient boosting framework is that a new boosting algorithm does not have to be derived for each loss function you may want to use; the framework is generic enough that any differentiable loss function can be used.
2. Weak Learner: Decision trees are used as the weak learner in gradient boosting. Specifically, regression trees are used: they output real values at their leaves, so the outputs of successive trees can be added together to “correct” the residuals in the predictions. Trees are constructed in a greedy manner, choosing the best split points based on purity scores such as Gini, or to minimize the loss.
3. Additive Model: Trees are added one at a time, and existing trees in the model are not changed. A gradient descent procedure is used to minimize the loss when adding trees. Instead of a set of parameters to update, we have weak learner sub-models, or more specifically decision trees. After calculating the loss, to perform the gradient descent procedure, we must add a tree to the model that reduces the loss.
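The three elements above (loss function, weak learner, additive model) can be put together in a short sketch. It is a simplified illustration, not a production implementation: the weak learner is a one-split stump on a single feature, each stump is fit to the negative gradient of the loss at the current predictions, and a fixed step size replaces a per-leaf line search. The names fit_stump, boost, squared_error_grad and log_loss_grad are hypothetical.

```python
import math

def fit_stump(xs, targets):
    """Fit a one-split regression stump that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep both sides non-empty
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, negative_gradient, n_trees=20, learning_rate=0.5):
    """Add stumps one at a time; each is fit to the negative gradient of the loss."""
    trees = []
    scores = [0.0] * len(xs)  # current model output F(x) for every training row
    for _ in range(n_trees):
        pseudo_residuals = [negative_gradient(y, f) for y, f in zip(ys, scores)]
        tree = fit_stump(xs, pseudo_residuals)
        trees.append(tree)  # earlier trees are never modified
        scores = [f + learning_rate * tree(x) for x, f in zip(xs, scores)]
    return lambda x: learning_rate * sum(t(x) for t in trees)

def squared_error_grad(y, f):
    """Negative gradient of 0.5 * (y - F)^2 with respect to F: the residual."""
    return y - f

def log_loss_grad(y, f):
    """Negative gradient of the log loss, with F as log-odds and y in {0, 1}."""
    return y - 1.0 / (1.0 + math.exp(-f))

xs = [i / 10 for i in range(40)]

# Regression: same boosting loop, squared-error loss.
regression_model = boost(xs, [math.sin(x) for x in xs], squared_error_grad)

# Classification: same boosting loop, only the loss gradient changes.
labels = [1 if x > 2.0 else 0 for x in xs]
classifier_score = boost(xs, labels, log_loss_grad)  # positive score means class 1
```

Switching from regression to classification changes only the gradient function passed to boost, which is the practical meaning of "any differentiable loss function can be used".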
Improvements to Basic Gradient Boosting:
1. Tree Constraints: It is important that the weak learners have skill but remain weak. Below are some constraints that can be imposed on the construction of decision trees:
- Number of trees: Generally, the model is very slow to overfit as more trees are added. The advice is to keep adding trees until no further improvement is observed.
- Tree depth: Deeper trees are more complex, and shorter trees are preferred. Generally, better results are seen with 4-8 levels.
- Number of nodes or number of leaves: Like depth, this constrains the size of the tree, but the tree is not forced into a symmetrical structure if other constraints are used.
- Number of observations per split: Imposes a minimum on the amount of training data at a node before a split can be considered.
- Minimum improvement to loss: A constraint on the improvement that any split added to a tree must achieve.
2. Weighted Updates: The predictions of the trees are added together sequentially. The contribution of each tree to this sum can be weighted to slow down the learning by the algorithm. This weighting is called shrinkage or a learning rate: “Each update is simply scaled by the value of the ‘learning rate parameter’ v.”
3. Stochastic Gradient Boosting: At each iteration, a subsample of the training data is drawn at random (without replacement) from the full training data set. The randomly selected subsample is then used, instead of the full sample, to fit the base learner.
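4. Penalized Gradient Boosting: In addition to constraining tree structure, a regularization term (for example, an L1 or L2 penalty on the leaf weights) can be added to the objective. The additional regularization term helps to smooth the final learned weights to avoid overfitting. Intuitively, the regularized objective tends to select a model employing simple and predictive functions.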
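Several of the improvements listed above can be combined in one small sketch. It makes a number of simplifying assumptions: squared-error loss, one-split stumps (so tree depth is fixed at 1), a min_samples_split check as the only per-tree constraint, and an L2 penalty on leaf weights as the regularization term. The minimum-improvement constraint is not implemented, and all function and parameter names are hypothetical.

```python
import random

def fit_leaf(values, l2_penalty):
    """Leaf weight w minimizing sum((t - w)^2) + l2_penalty * w^2, and that minimum."""
    w = sum(values) / (len(values) + l2_penalty)  # the penalty pulls w toward zero
    objective = sum((t - w) ** 2 for t in values) + l2_penalty * w * w
    return w, objective

def fit_stump(xs, targets, min_samples_split=2, l2_penalty=0.0):
    """One-split stump with a minimum node size and penalized leaf weights."""
    best = None
    if len(xs) >= min_samples_split:  # tree constraint: enough rows to consider a split
        for threshold in sorted(set(xs))[:-1]:
            left = [t for x, t in zip(xs, targets) if x <= threshold]
            right = [t for x, t in zip(xs, targets) if x > threshold]
            left_w, left_obj = fit_leaf(left, l2_penalty)
            right_w, right_obj = fit_leaf(right, l2_penalty)
            if best is None or left_obj + right_obj < best[0]:
                best = (left_obj + right_obj, threshold, left_w, right_w)
    if best is None:  # no admissible split: a single leaf
        single_w, _ = fit_leaf(targets, l2_penalty)
        return lambda x: single_w
    _, threshold, left_w, right_w = best
    return lambda x: left_w if x <= threshold else right_w

def boost(xs, ys, n_trees=50, learning_rate=0.1, subsample=0.5,
          min_samples_split=2, l2_penalty=1.0, seed=0):
    """Squared-error boosting with shrinkage, row subsampling and an L2 penalty."""
    rng = random.Random(seed)
    rows = list(range(len(xs)))
    sample_size = min(len(xs), max(2, int(subsample * len(xs))))
    trees = []
    scores = [0.0] * len(xs)
    for _ in range(n_trees):
        chosen = rng.sample(rows, sample_size)  # drawn without replacement
        sample_xs = [xs[i] for i in chosen]
        sample_residuals = [ys[i] - scores[i] for i in chosen]
        tree = fit_stump(sample_xs, sample_residuals, min_samples_split, l2_penalty)
        trees.append(tree)
        # Weighted update: each tree's contribution is scaled by the learning rate.
        scores = [f + learning_rate * tree(x) for x, f in zip(xs, scores)]
    return lambda x: learning_rate * sum(t(x) for t in trees)

# Synthetic data (an assumption for illustration).
xs = [i / 10 for i in range(60)]
ys = [x ** 2 for x in xs]
model = boost(xs, ys)

initial_mse = sum(y * y for y in ys) / len(ys)  # error of the all-zero starting model
final_mse = sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
print(initial_mse, final_mse)
```

A smaller learning_rate usually has to be paired with a larger n_trees, since each tree then contributes less to the sum. In this sketch, raising l2_penalty or lowering subsample makes each individual tree fit the training data less closely.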