
XGBoost

Python notebook: https://github.com/daviskregers/data-science-recap/blob/main/17-xgboost.ipynb

  • Stands for eXtreme Gradient Boosting
  • Remember boosting is an ensemble method
    • Each new tree is built to correct the mis-classifications made by the previous trees
  • It performs remarkably well in practice
    • routinely wins Kaggle competitions
    • Easy to use
    • Fast
    • A good choice for an algorithm to start with

Features of XGBoost

  • Regularized boosting (prevents overfitting)
  • Can handle missing values automatically
  • Parallel processing
  • Can cross-validate at each iteration
    • Enables early stopping, finding the optimal number of iterations (see the sketch after this list).
  • Incremental training
  • Can plug in your own optimization objectives
  • Tree pruning
    • Prunes trees after growing them, generally resulting in deeper but better-optimized trees
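
A minimal sketch of the early-stopping point above, using XGBoost's built-in cross-validation. The random data, parameter values, and round counts are illustrative assumptions, not values from the notebook:

```python
import numpy as np
import xgboost as xgb

# Hypothetical data: 500 samples, 10 features, binary labels
X = np.random.rand(500, 10)
y = np.random.randint(0, 2, 500)
dtrain = xgb.DMatrix(X, label=y)

params = {"objective": "binary:logistic", "eta": 0.1, "max_depth": 4}

# Cross-validates at every boosting iteration and stops once the held-out
# metric hasn't improved for 10 rounds, revealing the optimal iteration count.
cv_results = xgb.cv(
    params,
    dtrain,
    num_boost_round=200,
    nfold=5,
    metrics="logloss",
    early_stopping_rounds=10,
)
print(len(cv_results))  # number of boosting rounds actually kept
```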

Using XGBoost

  • pip install xgboost
  • Also CLI, C++, R, Julia, JVM interfaces
  • It's not made just for scikit-learn, so it has its own interface
    • Uses DMatrix structure to hold features & labels
      • Can create this easily from a numpy array though
    • All parameters passed via a dictionary
  • Call train, then predict (see the sketch below).
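
A minimal sketch of that workflow with the native interface; the Iris dataset and the parameter values are illustrative choices, not prescriptions:

```python
import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# DMatrix holds features and labels; building one from numpy arrays is easy
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# All parameters are passed via a dictionary
params = {
    "objective": "multi:softmax",  # predict class labels directly
    "num_class": 3,
    "max_depth": 4,
    "eta": 0.3,
}

model = xgb.train(params, dtrain, num_boost_round=50)
predictions = model.predict(dtest)     # class labels, thanks to multi:softmax
print((predictions == y_test).mean())  # accuracy on the held-out split
```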

XGBoost hyperparameters

  • Booster
    • gbtree or gblinear
  • Objective (e.g. multi:softmax, multi:softprob)
  • Eta (the learning rate; scales the weight adjustments made on each boosting step)
  • max_depth (depth of the tree)
  • min_child_weight (can control overfitting, but too high will underfit)
  • ... many others (a sample parameter dictionary is sketched below)
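
A sketch of a parameter dictionary covering the hyperparameters above; the concrete values are assumptions chosen to illustrate the knobs, not tuned recommendations:

```python
params = {
    "booster": "gbtree",            # tree booster (gblinear fits linear models instead)
    "objective": "multi:softprob",  # emit class probabilities instead of labels
    "num_class": 3,                 # required by the multi-class objectives
    "eta": 0.1,                     # learning rate: shrinks each new tree's contribution
    "max_depth": 6,                 # deeper trees fit more, but risk overfitting
    "min_child_weight": 1,          # raise to control overfitting; too high underfits
}
# Plugs straight into xgb.train(params, dtrain, num_boost_round=...)
```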

In practical terms, that's almost all you need to know for ML, at least for simple classification or regression problems.