Intro to XGBoost

Deep Gan Team
3 min read · Oct 25, 2022

XGBoost, short for eXtreme Gradient Boosting, is a go-to library for large-scale structured datasets. It implements scalable tree boosting algorithms. Example applications include anomaly detection, ads click-through prediction, recommendations, and fraud detection. The structured input data often comes from a SQL-style database, so XGBoost fits naturally into existing architectures. Its interfaces follow typical model-building APIs, and bindings exist for many common languages, such as Python, R, C++, Julia, and Java. XGBoost runs fast and can train in parallel and on distributed systems. Performance-wise, XGBoost also tends to outperform other options. This combination of speed and performance has made XGBoost an extremely popular choice of machine learning technique for tabular data. In this blog post, we will walk through the structure of XGBoost as well as some usage examples.

Decision Trees

The building blocks of gradient boosted trees are decision trees. A decision tree is a tree structure that can be used for both classification and regression problems. In a binary decision tree, each internal node applies a criterion to decide whether an input moves to the left or right subtree. During training, the algorithm chooses, at each node, the split of the training data that best partitions the inputs (for example, by maximizing information gain or minimizing impurity).

Figure 1: An example of a simple decision tree that classifies animals by species. The value at each leaf node is the final output of the tree. Using simple features such as whether an animal has wings or meows, the decision tree can distinguish between a squirrel, a cat, and a bird.
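
As a quick illustration, a toy tree like the one in Figure 1 can be built with scikit-learn's DecisionTreeClassifier; the feature encoding and animals below simply mirror the figure and are not a real dataset:

```python
# A minimal sketch (with made-up feature values) of the kind of tree shown in Figure 1.
from sklearn.tree import DecisionTreeClassifier

# Features per animal: [has_wings, meows]
X = [
    [1, 0],  # bird
    [0, 1],  # cat
    [0, 0],  # squirrel
]
y = ["bird", "cat", "squirrel"]

# Fit a small binary decision tree on the toy data.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# An animal with no wings that meows is classified as a cat.
print(tree.predict([[0, 1]]))  # ['cat']
```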

One problem with a single decision tree is that it can overfit the training data. One way to overcome this is to create an ensemble of decision trees. Both the gradient boosting and random forest algorithms are ensemble methods that utilize decision trees, and in both cases the ensemble typically performs better than any single decision tree.

Gradient Boosting Trees

Boosting is an ensemble method in which a set of weak classifiers is combined in an intelligent way to create a strong classifier. In the case of gradient boosting, the weak classifiers used by the algorithm are a sequence of decision trees.

Importantly, decision trees are added to the ensemble one at a time, and each new tree is fit to the residual errors of the current ensemble. Early trees therefore capture the largest corrections, while later trees make progressively smaller ones. In this way, each subsequent decision tree reduces the residual error of the combined model.

These decision trees can be created in a variety of ways. Some important parameters to consider when applying the gradient boosting algorithm include the depth of each tree, the number of trees, and the learning rate that scales each tree's contribution, as sketched below.
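
To make the residual-fitting idea concrete, here is a minimal from-scratch sketch of gradient boosting for regression with a squared-error loss, using scikit-learn decision trees as the weak learners; the synthetic data and hyperparameters are made up for the example:

```python
# A from-scratch sketch of gradient boosting for regression with squared-error loss.
# Each new tree is fit to the residuals of the current ensemble.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

n_trees, learning_rate, max_depth = 50, 0.1, 2
prediction = np.full_like(y, y.mean())  # start from a constant model
trees = []

for _ in range(n_trees):
    residuals = y - prediction  # negative gradient of the squared-error loss
    tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)  # shrink each tree's contribution
    trees.append(tree)

print("training MSE:", np.mean((y - prediction) ** 2))
```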

Gradient Boosting vs Random Forest

Another important algorithm that utilizes decision trees is the random forest. Unlike gradient boosting, which adds trees one at a time, a random forest builds multiple independent trees and then aggregates their outputs (by averaging or voting) to produce the final result.
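
The contrast is easy to see in code. Below is an illustrative side-by-side on synthetic data using scikit-learn's implementations; the dataset and hyperparameters are arbitrary choices for demonstration, not a benchmark:

```python
# Random forest (independent trees, averaged) vs. gradient boosting (sequential trees).
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=500)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Random forest: many independent trees, aggregated together.
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Gradient boosting: trees added sequentially, each correcting the previous ones.
boosted = GradientBoostingRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

print("random forest R^2:    ", forest.score(X_te, y_te))
print("gradient boosting R^2:", boosted.score(X_te, y_te))
```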

Coding Examples (Boston Housing Prices, Eye Movements)

We now apply XGBoost to example regression and classification problems. For regression, we use the Boston housing prices dataset, which was still available at the time of writing (it should be acknowledged that this dataset has known issues and should not be used to develop real models). For classification, we use the eye movements dataset from Salojärvi et al. (2005). In our Colab notebook, we show how to apply XGBoost to create a regression model and a classification model using the scikit-learn API. While there are many more parameters to tune, a practitioner should pay particular attention to the following (a sketch using these parameters follows the list):

  1. Choice of objective function
  2. Learning rate
  3. Max depth of tree
  4. Number of estimators
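
Here is a minimal sketch of how these four parameters map onto XGBoost's scikit-learn-style estimator. The synthetic data and parameter values are illustrative assumptions, not the exact code or tuned settings from our notebook:

```python
# A minimal sketch of XGBoost's scikit-learn-style API highlighting the four
# parameters listed above, on synthetic regression data.
import numpy as np
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = 2 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=1000)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = XGBRegressor(
    objective="reg:squarederror",  # 1. choice of objective function
    learning_rate=0.1,             # 2. learning rate
    max_depth=4,                   # 3. max depth of each tree
    n_estimators=200,              # 4. number of estimators (trees)
)
model.fit(X_tr, y_tr)
print("test R^2:", model.score(X_te, y_te))
```

For classification, XGBClassifier accepts the same parameters, with objectives such as binary:logistic or multi:softmax instead.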

Conclusion

In this blog post, we've learned about the practical use of XGBoost as a machine learning technique for tabular data. XGBoost is a library that implements the gradient boosted tree algorithm, which builds an effective ensemble model from decision trees. Boosting is an ensemble method that combines multiple weak models into a stronger model. Finally, we learned how to apply XGBoost to example problems in our Colab notebook.
