Skip to content

Latest commit

 

History

History
190 lines (149 loc) · 6.12 KB

README.md

File metadata and controls

190 lines (149 loc) · 6.12 KB

trafo (version 0.1.0) is a tiny random forestish library. Most likely this isn't what you are looking fore, but free to copy/fork/use or have fun finding bugs.

Features and Limitations

  • Tiny: Less than 3 kSLOC. The compiled library, libtrafo.so is < 50K.
  • Trees are trained in parallel using OpenMP.
  • Features are only sorted once. Book keeping maintains this property throughout the tree constructions.
  • Nodes are split by Gini impurity as the only option.
  • Supports integer labels and floating point features.
  • Does not impute missing features.
  • Very little functionality besides the basics. See the trafo_cli.c for one way to add k-fold cross validation on top of the library. It should be very simple implement the feature permutation method to estimate feature importance on top of this library as well.
  • The command line interface trafo_cli can be used to test the library on tsv-formated data. The tsv parser is very limited.
  • Paramerers include: - The number of trees. - Fraction of samples per tree. - Number of features per tree.
  • Internally, features are floats with double precision and labels are uint32. In 99% of all application it would probably be better with the combination of single precision and uint16. That is on the todo list.
  • Only smoke tested and no support offered.

Basic Library Usage

Train a classifier on labelled data and save it to disk:

#include <trafo.h>

// Basic configuration
trafo_conf C = {0};
C.n_sample = n_sample;
C.n_feature = n_feature;
C.F_row_major = F;       // Input: Features
C.label = L;             // Input: Labels
C.n_tree = conf->n_tree;

// Fitting / Training
trafo * T = trafo_fit(C);

// Use it ...

// Save to disk
trafo_save(T, "classifier.trafo");

trafo_free(T); // And done

Load a classifier and apply it to some data

#include <trafo.h>

trafo * T = trafo_load("classifier.trafo");
uint32_t * class = trafo_predict(T, features, NULL, n_features);
trafo_free(T);

see trafo.h for the full API. For more examples, look in trafo_cli.c.

Performance hints

Benchmarks should, among other things, provide averages over multiple runs. There are only results from single runs reported here. Test system: 4-core AMD Ryzen 3 PRO 3300U.

The random forest inplementation in scikit-learn is denoted skl in the tables below. The memory usage metric are not directly comparable since the values for trafo includes the whole cli interface. For skl it is just the delta value, i.e. difference in RSS memory before and after the call to the .fit method.

This is measured with a procedure like this:

mem0 = get_peak_memory()
clf = RandomForestClassifier(...)
...
clf = clf.fit(X, Y)
mem1 = get_peak_memory()
delta_rss = mem1-mem0

See test/run_on_test_data.sh for the full code.

Datasets:

Name Samples Features Classes
iris 150 5 3
digits 1797 64 10
wine 178 13 3
breast_cancer 569 30 2
diabetes 442 10 347
rand 100000 100 2

A single tree

The scikit-learn package was configured by:

clf = RandomForestClassifier(n_estimators=1)
clf.n_jobs=-1
clf.bootstrap = False
clf.max_features=X.shape[1]
clf.min_samples_split=2

giving these settings:

{'bootstrap': False, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': 10, 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'monotonic_cst': None, 'n_estimators': 1, 'n_jobs': -1, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}

Results for tree construction:

bin dataset time (s) RSS (kb)
trafo iris 0.002 2436
trafo digits 0.020 6152
trafo wine 0.006 2456
trafo breast_cancer 0.003 2928
trafo diabetes 0.016 2648
trafo rand 3.256 323660
skl iris 0.015 1612
skl digits 0.036 2192
skl wine 0.016 1428
skl breast_cancer 0.016 1428
skl diabetes 0.027 3712
sk1 rand 13.96 48344

In all cases the input data is correctly classified.

A forest with 100 trees

For this test, skl was run by:

clf = RandomForestClassifier(n_estimators=100)
clf.n_jobs=-1
clf.min_samples_split=2
bin dataset time (s) RSS (kb)
trafo iris 0.045 2548
trafo digits 0.094 6940
trafo wine 0.004 2755
trafo breast_cancer 0.015 3416
trafo diabetes 0.121 3344
trafo rand 6.97 283088
skl iris 0.186 2208
skl digits 0.225 9412
skl wine 0.198 2512
skl breast_cancer 0.192 2340
skl diabetes 0.224 98548
sk1 rand 31.80 283560

The skl memory usage stand out on the diabetes dataset, due to the high number of classes?

Installation

Use cmake with the CMakeLists.txt file, something like this should do:

mkdir build
cd build
cmake ..
sudo make install

Then just add -ltrafo to the linker flags of your project.

To do

  • Feature importance estimation.
  • Single precision features/uint16 labels option for reduced memory usage.

See also