Pure Go implementation of the prediction part of GBRT (Gradient Boosting Regression Trees) models from popular frameworks

leaves


Introduction

leaves is a library implementing prediction code for GBRT (Gradient Boosting Regression Trees) models in pure Go. The goal of the project is to make it possible to use models from popular GBRT frameworks in Go programs without C API bindings.

NOTE: Before the 1.0.0 release, the API is subject to change.

Features

  • General Features:
    • support parallel predictions for batches (see the sketch after this list)
    • support sigmoid, softmax transformation functions
    • support getting leaf indices of decision trees
  • Support LightGBM (repo) models:
    • read models from text format and from JSON format
    • support gbdt, rf (random forest) and dart models
    • support multiclass predictions
    • additional optimizations for categorical features (for example, the one-hot decision rule)
    • additional optimizations that exploit prediction-only usage
  • Support XGBoost (repo) models:
    • read models from binary format
    • support gbtree, gblinear, dart models
    • support multiclass predictions
    • support missing values (NaN)
  • Support scikit-learn (repo) tree models (experimental support):
    • read models from pickle format (protocol 0)
    • support sklearn.ensemble.GradientBoostingClassifier
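
Parallel batch predictions are exposed through the PredictDense method. Below is a minimal sketch of scoring a row-major batch on 4 threads; the PredictDense and NOutputGroups signatures are assumed as documented in the godoc:

package main

import (
	"fmt"

	"github.com/dmitryikh/leaves"
)

func main() {
	model, err := leaves.LGEnsembleFromFile("lightgbm_model.txt", true)
	if err != nil {
		panic(err)
	}

	// Two objects with three features each, stored row-major.
	vals := []float64{
		1.0, 2.0, 3.0,
		4.0, 5.0, 6.0,
	}
	nrows, ncols := 2, 3
	predictions := make([]float64, nrows*model.NOutputGroups())

	// The last argument sets the number of threads used for the batch.
	if err := model.PredictDense(vals, nrows, ncols, predictions, 0, 4); err != nil {
		panic(err)
	}
	fmt.Println(predictions)
}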

Usage examples

In order to start, go get this repository:

go get github.com/dmitryikh/leaves

Minimal example:

package main

import (
	"fmt"

	"github.com/dmitryikh/leaves"
)

func main() {
	// 1. Read model
	useTransformation := true
	model, err := leaves.LGEnsembleFromFile("lightgbm_model.txt", useTransformation)
	if err != nil {
		panic(err)
	}

	// 2. Do predictions!
	fvals := []float64{1.0, 2.0, 3.0}
	p := model.PredictSingle(fvals, 0)
	fmt.Printf("Prediction for %v: %f\n", fvals, p)
}

In order to use an XGBoost model, just change leaves.LGEnsembleFromFile to leaves.XGEnsembleFromFile.
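
For example, here is a minimal sketch of the XGBoost case that also exercises missing-value support; the file name xgboost_model.bin is illustrative, and math.NaN() marks an absent feature:

package main

import (
	"fmt"
	"math"

	"github.com/dmitryikh/leaves"
)

func main() {
	model, err := leaves.XGEnsembleFromFile("xgboost_model.bin", true)
	if err != nil {
		panic(err)
	}

	// XGBoost models support missing values: pass math.NaN()
	// in place of an absent feature.
	fvals := []float64{1.0, math.NaN(), 3.0}
	p := model.PredictSingle(fvals, 0)
	fmt.Printf("Prediction for %v: %f\n", fvals, p)
}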

Documentation

Documentation is hosted on godoc (link). The documentation contains detailed usage examples and a full API reference. Some additional information about usage examples can be found in leaves_test.go.

Compatibility

Most leaves features are tested for compatibility with both older and upcoming versions of GBRT libraries. In compatibility.md one can find a detailed report on leaves' correctness against different versions of external GBRT libraries.

Some additional information on new features and backward compatibility can be found in NOTES.md.

Benchmark

Below are comparisons of prediction speed on batches (~1000 objects per API call). Hardware: MacBook Pro (15-inch, 2017), 2.9 GHz Intel Core i7, 16 GB 2133 MHz LPDDR3. C API implementations were called from Python bindings, but the large batch size should make the overhead of the Python bindings negligible. leaves benchmarks were run by means of the Go test framework: go test -bench. See benchmark for more details on measurements. See testdata/README.md for data preparation pipelines.

Single thread:

Test Case             Features  Trees  Batch size  C API  leaves
LightGBM MS LTR       137       500    1000        49ms   51ms
LightGBM Higgs        28        500    1000        50ms   50ms
LightGBM KDD Cup 99*  41        1200   1000        70ms   85ms
XGBoost Higgs         28        500    1000        44ms   50ms

4 threads:

Test Case             Features  Trees  Batch size  C API  leaves
LightGBM MS LTR       137       500    1000        14ms   14ms
LightGBM Higgs        28        500    1000        14ms   14ms
LightGBM KDD Cup 99*  41        1200   1000        19ms   24ms
XGBoost Higgs         28        500    1000        ?      14ms

(?) - currently I'm unable to utilize multithreading for XGBoost predictions by means of the Python bindings

(*) - KDD Cup 99 problem involves continuous and categorical features simultaneously

Limitations

  • LightGBM models:
    • limited support of transformation functions (only sigmoid and softmax are supported)
  • XGBoost models:
    • limited support of transformation functions (only sigmoid and softmax are supported)
    • predictions may diverge slightly from the C API because of floating-point conversions and comparison tolerances
  • scikit-learn tree models:
    • no support for transformation functions; output scores are raw scores (as from GradientBoostingClassifier.decision_function — see the sketch after this list)
    • only pickle protocol 0 is supported
    • predictions may diverge slightly from sklearn because of floating-point conversions and comparison tolerances
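
When a model is loaded without its transformation, probabilities can be recovered by applying the sigmoid manually. Below is a minimal sketch using the util helper that also appears in the comments below (the github.com/dmitryikh/leaves/util import path is assumed); it is shown with a LightGBM binary-classification model, and the same recipe applies to raw sklearn scores:

package main

import (
	"fmt"

	"github.com/dmitryikh/leaves"
	"github.com/dmitryikh/leaves/util"
)

func main() {
	// useTransformation = false: the model outputs raw scores.
	model, err := leaves.LGEnsembleFromFile("lightgbm_model.txt", false)
	if err != nil {
		panic(err)
	}

	raw := []float64{model.PredictSingle([]float64{1.0, 2.0, 3.0}, 0)}
	util.SigmoidFloat64SliceInplace(raw) // raw score -> probability
	fmt.Printf("probability: %f\n", raw[0])
}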

Contacts

If you are interested in the project or have questions, please contact me by email: khdmitryi at gmail.com

Owner
Dmitry Khominich
Software & Machine Learning Engineer
Comments
  • Support outputting leaf indices for all the `predict*` functions

    Thanks to the popular paper https://research.fb.com/wp-content/uploads/2016/11/practical-lessons-from-predicting-clicks-on-ads-at-facebook.pdf

    Many people use GBDT to extract features from a dataset instead of predicting results directly. The extracted features are the leaf indices from each estimator's decision path.

    In LightGBM this is achieved by setting pred_leaf: https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.Booster.html#lightgbm.Booster.predict

    I am adding the same support in this pull request. Basically, the user can set this parameter and then the predict function will return the leaf indices.

    The test data are generated by LightGBM to make sure that the indices are generated in the same way.

  • Skips transformation for regression for LightGBM models

    This patch fixes the behaviour of the LGEnsembleFromFile function: if a regression model is loaded, the loadTransformation parameter is automatically set to false.

  • Understanding the output of Predict

    Hi,

    I'm not sure I fully understand the output of the Predict() methods.

    I have a fully trained model with 9 classes and 100 estimators. I then run:

    predictions := make([]float64, 9)
    err = model.Predict(values, 100, predictions)
    util.SigmoidFloat64SliceInplace(predictions)
    log.Infof("Prediction for %v:\n %v", values, predictions)
    

    That yields:

    Prediction for [110 0 12 0 2 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]: 
    [0.2276 0.1822 0.2664 0.0594 0.0682 0.9859 0.1283 0.6349 0.0706]
    

    I understand those are the probabilities for EACH of the 9 classes being the right one. However, how am I able to get the actual value of the class? In Python, if I do y_pred = model.predict(values), it correctly shows the expected class values; e.g. my class values look like 1242, 1152, 1552, 6662, etc. How can I map the prediction output from above to the class values? I haven't provided any specific ordering of them to the model.
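
    leaves returns one score per class in class-index order, so mapping back to the original labels must mirror the encoding used at training time (the sklearn wrappers, for instance, sort the distinct labels). A minimal sketch of the argmax step, assuming classValues is that sorted label list:

    // argmaxClass maps multiclass scores back to the original class labels.
    // classValues must reproduce the label encoding used at training time;
    // a sorted label list, e.g. []int{1152, 1242, 1552, 6662 /* ... */},
    // is assumed here.
    func argmaxClass(predictions []float64, classValues []int) int {
        best := 0
        for i, p := range predictions {
            if p > predictions[best] {
                best = i
            }
        }
        return classValues[best]
    }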

  • Return explicit error for predictSingle.

    PredictSingle returns 0.0 when there is an error, which can be confusing unless you read the code.

    Maybe it's a good idea to return explicit errors instead, in order to be friendlier to the client code.
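
    A sketch of such a wrapper, built on Predict (which already returns an error); the NOutputGroups accessor and the github.com/dmitryikh/leaves import are assumed:

    // predictSingleChecked is a PredictSingle variant that surfaces errors
    // explicitly instead of silently returning 0.0.
    func predictSingleChecked(model *leaves.Ensemble, fvals []float64, nEstimators int) (float64, error) {
        predictions := make([]float64, model.NOutputGroups())
        if err := model.Predict(fvals, nEstimators, predictions); err != nil {
            return 0.0, err
        }
        return predictions[0], nil
    }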

  • xgboost consistency failed

    I built an XGBoost model in Python and ran it on a test dataset. But when I use leaves to load the model and predict, the results are inconsistent with the Python results.

    I also tested a LightGBM model with the same dataset, and those results are consistent.

  • Allow v3 as well

    AFAIK v3 "just" contains additional values for debugging purposes. So at least for the time being it should be possible to allow v3 as well IMHO.

  • xgEnsemble prediction results are different from xgboost in python

    I train and test data with XGBoost in Python, then use leaves in the production environment. More details:

    In the Python XGBoost test, the data structure I set up with pd.DataFrame is [0:value1, 1:v2, 2:v3, ..., n:v(n+1)], where value1 is an int and v2, ..., v(n+1) are float64. Column 0 is the prediction target. This setup reproduces the test results.

    With this structure: [feature1:v2, f2:v3, ..., f(n):v(n+1)], the results do NOT match the test results.

    In Go, using leaves' XGEnsembleFromFile -> model.PredictCSR(), the results also do NOT match the test results.

    I have tried for over 5 hours to find the cause, e.g. by adding {0:0} as the first feature group, but I can't figure it out. What's wrong with my test data?
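
    A likely culprit here is feature ordering: leaves addresses features purely by index, so the feature vector passed from Go must follow exactly the column order used at training time, with the label column excluded. A sketch of building such a vector (featureOrder is a hypothetical list of the training column names; the math import is assumed):

    // buildFeatures assembles a dense feature vector in training column order.
    // featureOrder is hypothetical: the exact column names, in the exact order,
    // used at training time (label column excluded).
    func buildFeatures(row map[string]float64, featureOrder []string) []float64 {
        fvals := make([]float64, len(featureOrder))
        for i, name := range featureOrder {
            v, ok := row[name]
            if !ok {
                v = math.NaN() // XGBoost treats NaN as a missing value
            }
            fvals[i] = v
        }
        return fvals
    }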

  • Lightgbm dart support

    #25

    It seems that LightGBM DART works out of the box thanks to the generality of the LightGBM model format.

    Here I added documentation and tests for LightGBM DART support.

  • adds ability to support multi:softprob for xgboost

    Allows use of multi:softprob for XGBoost by defining a Softprob transformation and making the appropriate registration in transformation.go. Lastly, adds a transformation loading check for "multi:softprob" when loading an XGBoost model.

  • Compatibility tests

    This PR adds scripts to perform compatibility tests: running leaves against models built with different versions of third-party libraries (LightGBM and XGBoost for now).

    See compatibility.md for results

  • Unexpected objective field: 'lambdarank'

    leaves.LGEnsembleFromFile() fails when loading an objective=lambdarank model (LightGBM).

    Error message: unexpected objective field: 'lambdarank', model:

    tree
    version=v2
    num_class=1
    num_tree_per_iteration=1
    label_index=0
    max_feature_idx=24
    objective=lambdarank
    feature_names=t quality freshness navboost pctr video_type lctr_1_3 lctr_4_7 lctr_8_30 sctr_1_3 sctr_4_7 sctr_8_30 ctr_1_3 ctr_4_7 ctr_8_30 loglclick_1_3 logclick_1_3 logsclick_1_3 lctr_ins ctr_ins sctr_ins loglclick_ins logclick_ins logsclick_ins instant_navboost
    feature_infos=[0:1.3200000524520874] [3.3299998904112726e-05:1] [0.36787900328636169:1] [0.36790001392364502:0.9999966025352478] [0:0.98189848661422729] [1:200] [0:10.87989330291748] [0:9.3969650268554688] [0:11.486390113830566] [0:4.2822332382202148] [0:3.7750816345214844] [0:2.8636219501495361] [0:8.7641057968139648] [0:7.1885638236999512] [0:10.401005744934082] [0:6.4371075630187988] [0:6.4220900535583496] [0:5.9216046333312988] [0:5.5910482406616211] [0:3.3260509967803955] [0:1.3753839731216431] [0:5.3813371658325195] [0:5.3396997451782227] [0:4.9291071891784668] [0.36790001392364502:0.9999929666519165]
    tree_sizes=1308 911 993 1073 1235 1154 992 1316 1234 1151 997 1163 1234 1077 1090 1244 1237 1152 1400 1228 1246 1310 1240 1072 1327 1068 1242 1081 1312 1082 1162 1000 1330 1310 1408 1253 1165 1328 1082 1004 1172 1328 1161 1081 1151 1323 1325 1321 1410 1166 1073 1403 996 1242 991 1336 1232 1250 995 1309
    
  • Is there any way to support tweedie regression models?

    I have a model trained with tweedie regression in LightGBM. With leaves I got a panic:

    panic: unexpected objective field: 'tweedie'
    

    It works perfectly in Python's LightGBM.

    lgbmodel_ww.txt

  • Support the use of sklearn pipelines with prediction model

    • I found this super handy. It would be great if we could not only predict with a trained model but also use a sklearn pipeline, including the transformation steps, before the actual prediction.
  • Question: support for objective:quantile

    I have a model trained with quantile regression in LightGBM. I get an error that this is not a valid option for objective when I use my model. Is there a workaround to get it working?

  • Support for newer versions of XGBoost

    Something has changed in the XGBoost model binary format. The highest version I've managed to make leaves work with is 1.0. Starting from 1.1+, I keep getting "panic: unexpected EOF". Is support for newer versions planned? Moreover, they've started saving models in JSON format, and it looks like they're going to deprecate binaries altogether.
