
17.11.20

Catboost for big data


For structured, heterogeneous data, gradient boosting is the way to go.

For all of the hoo-ha about deep learning, the most widely used machine learning algorithm is either logistic regression or gradient boosted decision trees. Gradient boosting is a method whereby you iteratively fit simple models (typically shallow trees), with each new model fit to the errors of the ensemble built so far. It tends to produce good predictions on medium to large datasets.
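
To make the idea concrete, here is a minimal sketch of gradient boosting for regression, using scikit-learn's DecisionTreeRegressor as the shallow base model. The function names and parameters are illustrative, not taken from any particular library.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gradient_boosting(X, y, n_rounds=100, learning_rate=0.1, max_depth=3):
    y = np.asarray(y, dtype=float)
    base = float(y.mean())                      # start the ensemble from the mean
    prediction = np.full(len(y), base)
    trees = []
    for _ in range(n_rounds):
        residuals = y - prediction              # errors of the ensemble so far
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)                  # shallow tree fit to those errors
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return base, trees

def predict_gradient_boosting(base, trees, X, learning_rate=0.1):
    prediction = np.full(X.shape[0], base)
    for tree in trees:
        prediction += learning_rate * tree.predict(X)
    return prediction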

This paper reviews Catboost which, alongside Xgboost and LightGBM, is one of the most popular gradient boosting implementations. It is particularly well suited to categorical data (hence the name) and less suited to homogeneous numeric data such as images. The paper compares the implementations and describes applications in fields such as psychology, transport and chemistry.
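
As an illustration of why Catboost suits categorical data (not an example from the paper), categorical columns can be passed to it as raw strings via cat_features, with no one-hot encoding. The toy column names below are made up.

import pandas as pd
from catboost import CatBoostClassifier, Pool

# Hypothetical tabular data: 'city' and 'device' are categorical columns.
df = pd.DataFrame({
    "city":      ["london", "leeds", "london", "york", "leeds", "york"],
    "device":    ["mobile", "desktop", "mobile", "mobile", "desktop", "desktop"],
    "visits":    [3, 10, 1, 7, 4, 2],
    "converted": [0, 1, 0, 1, 1, 0],
})

X = df.drop(columns="converted")
y = df["converted"]

# Categorical columns are named directly; Catboost handles the encoding internally.
train_pool = Pool(X, y, cat_features=["city", "device"])

model = CatBoostClassifier(iterations=200, depth=4, learning_rate=0.1, verbose=False)
model.fit(train_pool)
print(model.predict_proba(Pool(X, cat_features=["city", "device"])))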


