
Enh/tree (performance optimised) #310

Merged: 1 commit merged into scikit-learn:master on Oct 6, 2011

Conversation

@bdholt1 (Member) commented Aug 11, 2011

As @satra commented in #288 (comment), here is a separate PR for a decision tree. This version is significantly faster than the alternatives (Orange and milk). Furthermore, Orange and scikits.learn support both multiclass classification and regression, while milk only supports classification.

We would now welcome any comments with the aim of gaining acceptance for a merge into master.

Performance and scores

$ python bench_tree.py madelon
Tree benchmarks

Loading data ...
Done, 2000 samples with 500 features loaded into memory
scikits.learn (initial): mean 84.23, std 0.62
Score: 0.76

scikits.learn (now): mean 0.65, std 0.00
Score: 0.76

milk: mean 115.31, std 1.57
Score: 0.75

Orange: mean 25.82, std 0.02
Score: 0.50

$ python bench_tree.py arcene
Tree benchmarks

Loading data ...
Done, 100 samples with 10000 features loaded into memory
scikits.learn (initial): mean 40.95, std 0.44
Score: 0.60

scikits.learn (now): mean 0.20, std 0.00
Score: 0.60

milk: mean 71.00, std 0.60
Score: 0.60

Orange: mean 10.78, std 0.20
Score: 0.51

TODO before merge

  • increase test coverage to over 95%
  • finish the documentation (fix broken example and plot links, add practical usage tips)
  • demonstrate how to use a graphviz output in an example
  • include a static graphviz output for the iris and boston datasets in the documentation
  • add feature_names to GraphvizExporter
  • extract the graphviz exporter code out of the tree classes (use visitor pattern), assign node numbers (not mem addresses)
  • s/dimension/feature/g
  • add a test for the pickling of a fitted tree (see the sketch after this list)
  • cythonise prediction
  • explain in the documentation and in the docstrings how these classes relate to ID3, C4.5 and CART
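
For the pickling bullet above, a minimal sketch of what such a test could look like (an illustration only: it assumes the period import path scikits.learn.tree and the usual fit/predict API; the toy data is made up):

import pickle
import numpy as np
from scikits.learn.tree import DecisionTreeClassifier

X = np.array([[0., 0.], [1., 1.], [0., 1.], [1., 0.]])
y = np.array([0, 1, 1, 0])

clf = DecisionTreeClassifier().fit(X, y)
clf2 = pickle.loads(pickle.dumps(clf))  # round-trip the fitted tree through pickle
assert np.all(clf.predict(X) == clf2.predict(X))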

Future enhancements

  • ability to provide instance weights (for boosting DTs)
  • support a loss matrix (ala R's rpart)
  • support multivariate regression (ala R's mvpart)
  • support Randomized Trees

@pprett (Member) commented Aug 11, 2011

Very nice - can't wait to look at that in more detail! thanks @bdholt1, @ndawe, and @satra!

@mblondel (Member) commented

Thanks a lot for the hard work. I think @glouppe will be interested in this branch.

"num labels is %s and num features is %s "
% (len(labels), len(features)))

sample_dims = np.array(xrange(n_dims))
@pprett (Member) commented on the diff above:

this should be:

sample_dims = np.arange(n_dims)

Timing differences:

In [2]: %timeit np.array(xrange(m))
1000 loops, best of 3: 1.6 ms per loop

In [3]: %timeit np.arange(m)
100000 loops, best of 3: 12 us per loop

@bdholt1 (Member, Author) replied:

Fixed. Thanks @pprett

@bdholt1 (Member, Author) commented Aug 12, 2011

While I've been looking through neighbors.py, I've noticed a lot of

def fit(self, X, y, **params):
    ...
    self._set_params(**params)  # update estimator parameters passed in via fit

to support use within a pipeline. Should tree.py also support this?

@larsmans (Member) commented

We were just about to deprecate that.

@ogrisel (Member) commented Aug 13, 2011

Thanks for this work and taking the time to profile and benchmark it against competing implementations. This looks very promising. I will try to secure some time in the coming days to give this PR a deeper look.

@ogrisel (Member) commented Aug 13, 2011

I'm seeing the following broken doctest:

----------------------------------------------------------------------
File "/home/ogrisel/coding/scikit-learn/scikits/learn/tree/tree.py", line 474, in scikits.learn.tree.tree.DecisionTreeRegressor
Failed example:
    for train_index, test_index in kf:
        clf = DecisionTreeRegressor()
        clf = clf.fit(data.data[train_index], data.target[train_index])
        print np.mean(np.power(clf.predict(data.data[test_index]) - data.target[test_index], 2))
Expected:
    19.2264679543
    41.2959435867
Got:
    20.2065464252
    44.4226946421

>>  raise self.failureException(self.format_failure(<StringIO.StringIO instance at 0x58dd128>.getvalue()))

You should probably use # doctest: +ELLIPSIS and ... markers to replace the lowest decimals of the expected output, so as to avoid the test breaking when switching from one architecture to another.

@ogrisel (Member) commented Aug 13, 2011

To fix the previous breakage you should also pass random_state=0 to the constructor.
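
Concretely, the fixed doctest might look like this (a sketch only: the KFold setup comes from the surrounding doctest, and the expected digits shown are the original ones, which may shift once the seed is fixed):

>>> for train_index, test_index in kf:
...     clf = DecisionTreeRegressor(random_state=0)
...     clf = clf.fit(data.data[train_index], data.target[train_index])
...     print np.mean(np.power(clf.predict(data.data[test_index]) - data.target[test_index], 2))  # doctest: +ELLIPSIS
19.2...
41.2...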

@ogrisel (Member) commented Aug 14, 2011

I am about to do a first pass at code style cleanup. Expect a pull request soon.

@bdholt1 (Member, Author) commented Aug 15, 2011

Thanks Olivier! I've just added some more functionality (bringing this into line with the standard labels accepted for binary classification). You will probably need to merge.

@ogrisel (Member) commented Aug 16, 2011

Hi, I did not have time to come up with a new PR during my short plane trip as I had first intended. I'll let you know when it's ready.

@bdholt1 (Member, Author) commented Aug 19, 2011

@pprett: The script I used to do the benchmarking is at https://github.com/bdholt1/ml-benchmarks/blob/enh%2Ftree/benchmarks/bench_tree.py

The training sets are madelon and arcene (using the ml-benchmarks framework).

> I started to port _find_best_split to Cython. A quick %timeit clf.fit(X, y) on the madelon training set turned up 21 s per loop vs. 99 s per loop for the pure Python + NumPy implementation.

That's great! This implementation might even reach the same ballpark speed as Orange, but with a higher score!

@bdholt1 (Member, Author) commented Aug 23, 2011

Following @pprett's lead, I moved _find_best_split() to Cython. A few key insights:

  1. Treat classification and regression separately
  2. Sort the features and labels, and then work on the sorted arrays (sketched below).
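
To illustrate point 2, here is a simplified sketch (my own illustration, not the PR's Cython code) of how presorting one feature column exposes all candidate thresholds in a single pass:

import numpy as np

def sorted_split_candidates(x, y):
    # sort one feature column together with the labels
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    # valid thresholds lie between consecutive distinct sorted values
    distinct = xs[:-1] != xs[1:]
    thresholds = (xs[:-1][distinct] + xs[1:][distinct]) / 2.0
    return xs, ys, thresholds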

A significant speed increase (arcene):

Done, 100 samples with 10000 features loaded into memory
scikits.learn: mean 4.06, std 0.02
Score: 0.57

milk: mean 68.21, std 0.31
Score: 0.25

Orange: mean 10.26, std 0.02
Score: 0.51

All the tests still pass. This is a 10× speed increase, making it the fastest of the three implementations, while remaining easy to read and modify.

@bdholt1 (Member, Author) commented Aug 23, 2011

thanks @satra

@satra (Member) commented Aug 23, 2011

nicely done.

@pprett (Member) commented Aug 24, 2011

@bdholt1 yesterday evening I played around with the second enhancement I told you about (updating the pm counts [or means in regression] vs. computing them from scratch). It turns out this yields a massive performance improvement (about an order of magnitude or even more):

Madelon:
time: 0:00:00.776095 (that's 0.7 sec)
score: 0.741666666667

Arcene:
time: 0:00:00.125336 (that's 0.1 sec)
score: 0.58

The code is here:
https://github.com/pprett/scikit-learn/tree/bdholt1-enh/tree

When you look at _tree.pyx you see new extension types at the top (ClassificationCriterion and RegressionCriterion). They implement the update logic: label counts (aka pm) for the left and right branches in classification, and label means for the left and right branches in regression. The concrete subclasses (Gini, Entropy, and MSE) implement an eval method that computes the value of the splitting criterion.
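
Schematically, in plain Python (my own illustration of the idea, not the actual extension types), the update/eval split looks like this:

import numpy as np

class GiniLikeCriterion(object):
    """Toy analogue of ClassificationCriterion + Gini: moving one sorted
    sample from the right branch to the left updates the label counts in
    O(1) instead of recomputing them from scratch."""

    def __init__(self, y_sorted, n_classes):
        self.y = y_sorted
        self.left = np.zeros(n_classes)
        self.right = np.bincount(y_sorted, minlength=n_classes).astype(np.float64)

    def update(self, i):
        # move sample i (in sorted order) from the right branch to the left
        self.left[self.y[i]] += 1
        self.right[self.y[i]] -= 1

    def eval(self):
        # weighted Gini impurity of the current left/right partition
        n = self.left.sum() + self.right.sum()
        score = 0.0
        for counts in (self.left, self.right):
            n_branch = counts.sum()
            if n_branch > 0:
                score += (n_branch / n) * (1.0 - ((counts / n_branch) ** 2).sum())
        return score

Scanning the sorted samples once, calling update and then eval at each candidate threshold, reproduces the single-pass search without recomputing the counts.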

@bdholt1 (Member, Author) commented Aug 24, 2011

+1! That's a significant improvement! I suspect it was this very idea that led to the gain (though clearly not as large) in my previous attempt. The latest version is also quite a bit more readable. Given this major gain, I think it warrants some serious attention, so I'll try to work this into my implementation or just use yours straight.

@pprett (Member) commented Aug 24, 2011

What strikes me is the bad performance of the tree on the Arcene dataset compared to RBF SVMs (score about 0.73). Furthermore, I benchmarked it on the covertype dataset (~500k examples, 54 features): its score is similar to that of a linear SVM (a non-linear classifier should do much better than that!)...

@bdholt1 (Member, Author) commented Aug 24, 2011

I'm not too surprised about the performance. Technically a decision tree is more of a non-parametric classifier than a non-linear one: it works by greedily partitioning the feature space perpendicular to the feature axes, which is quite different from casting the data into a higher-dimensional space and finding a separating hyperplane. The good news is that Orange's performance is in the same ballpark, and I've worked through the algorithm quite thoroughly, so I'm confident it's doing it correctly. Hopefully we will see a significant performance gain when it comes to using these trees in a forest; forests will compete with the best classifiers out there.

@mblondel (Member) commented

Where can I retrieve the Arcene dataset?

@pprett (Member) commented Aug 24, 2011

http://archive.ics.uci.edu/ml/datasets/Arcene


@bdholt1 (Member, Author) commented Aug 24, 2011

I use ml-benchmarks (which comes with the Arcene and Madelon UCI datasets). The benchmarking script I use for this code is https://github.com/bdholt1/ml-benchmarks/blob/enh%2Ftree/benchmarks/bench_tree.py, which I hope will be accepted (scikit-learn/ml-benchmarks#6) when this PR is accepted.

@pprett (Member) commented Aug 24, 2011

@bdholt1: does the error of a non-terminal node have to be smaller than the error of the parent node? I cannot find anything on this in ESLII (page 308, 1st paragraph discusses something related). But if I turn the check off, accuracy on Arcene goes up to 0.7 (from 0.58).

@bdholt1 (Member, Author) commented Aug 24, 2011

@pprett Well, it depends how you define error. In the way it is used in your code (which I'm pilfering as we speak), it's the relative error (i.e. the amount of entropy in that node). What you would expect is that the overall entropy decreases as we go down the tree. In my code, I first compute the error at the node when all samples are on the right, and then test against that.

@bdholt1 (Member, Author) commented Aug 24, 2011

@pprett: is it any slower if you turn it off?

@pprett (Member) commented Aug 24, 2011

A bit (a factor of 2):

Madelon: score 0.761666666667, time 0:00:01.571801
Arcene: score 0.7, time 0:00:00.285888

btw: if we set min_split to 5 (AFAIK recommended in ESLII) then the performance goes up too:

Madelon: score 0.8, time 0:00:00.719846
Arcene: score 0.72, time 0:00:00.234298
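
(For reference, a hypothetical one-liner, assuming the constructor parameter in this PR is called min_split:

clf = DecisionTreeClassifier(min_split=5)  # only split nodes holding >= 5 samples

In later scikit-learn releases this parameter is known as min_samples_split.)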

@bdholt1 (Member, Author) commented Aug 25, 2011

@pprett: I've taken your code, done a bit of work to make it easier to understand/modify, and made sure it passes the unit tests (although the regression error on boston is still higher than with the current version on this branch).

min_split = 1
Madelon:
recursive partition called 231 times
scikits.learn: mean 1.46, std 0.00
Score: 0.75

Arcene:
recursive partition called 7 times
scikits.learn: mean 0.27, std 0.00
Score: 0.60

min_split = 5
Madelon:
recursive partition called 95 times
scikits.learn: mean 0.87, std 0.00
Score: 0.78

Arcene:
recursive partition called 7 times
scikits.learn: mean 0.27, std 0.00
Score: 0.60

I've now got three versions (min_split = 1); each cell is time / score:

            pprett_no_mask    pprett_mask      bdholt1_current
Madelon     1.92 s / 0.79     1.45 s / 0.75    4.59 s / 0.79
Arcene      0.59 s / 0.57     0.27 s / 0.60    1.64 s / 0.57

Since they all pass the unit tests (and I've gone through the internal workings fairly thoroughly), I'm going to focus on the version that uses sample masks and get that into the branch.

@pprett (Member) commented Aug 25, 2011

Great - thanks for your work!

btw: for the arcene results with 0.7 acc I turned off the error check (by simply passing np.inf to the recursive split instead of the error).

@bdholt1 (Member, Author) commented Aug 25, 2011

These results are with np.inf passed in, so I'm not sure how you got such high values. Does the version in your branch pass the unit tests? I've done quite a bit of debugging on the splitting algorithm, which I'm putting in a branch for you now: https://github.com/bdholt1/scikit-learn/blob/enh/tree2/scikits/learn/tree/

@pprett (Member) commented Oct 4, 2011

@bdholt1: thanks!

BTW: does anyone know how one usually grows a tree with exactly J terminal nodes (i.e. leaves)? In ESLII they use such trees for boosting, but they don't describe exactly how they are grown. Does one always expand the node with the lowest error, or the one with the highest decrease in error?

@bdholt1 (Member, Author) commented Oct 4, 2011

@pprett: my guess is that they use a pruning method after the tree has been grown to full size. That way, you know how many leaves there are and you can prune until you only have J of them.

@ogrisel (Member) commented Oct 4, 2011

Ok +1 for merge.

@ogrisel (Member) commented Oct 4, 2011

Actually the doc has a couple of links in markdown syntax that need to be translated to reST syntax. Also, the complexity of a balanced tree is stated as O(n_samples · log(n_samples)). Is this true? There is no dependency on n_features?

@bdholt1 (Member, Author) commented Oct 4, 2011

> Also the complexity of a balanced tree is stated as O(n_samples · log(n_samples)). Is this true? There is no dependency on n_features?

This piece is the introduction, where the basic idea of the tree is explained. Yes, there is a dependency on n_features, and a few sentences later in the doc make that clear.

If after reading that section again you think it should be re-worded, I'd be happy to do so.

@ogrisel (Member) commented Oct 4, 2011

Yeah, so the complexity should be stated as O(n_samples · n_features · log(n_samples)) in the first sentence of that section.
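
A one-line justification (my own reasoning, not a quote from the docs): a balanced tree has O(log(n_samples)) levels, and the split search at each level scans every sample once per feature, so in LaTeX terms:

T = \underbrace{\log(n_\mathrm{samples})}_{\text{levels}} \times \underbrace{n_\mathrm{samples} \cdot n_\mathrm{features}}_{\text{split search per level}} = O\left(n_\mathrm{features} \cdot n_\mathrm{samples} \cdot \log(n_\mathrm{samples})\right)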

@glouppe (Contributor) commented Oct 5, 2011

I am +1 for this branch to be merged. Much work has already been done and we have now reached a pretty decent implementation of decision trees, both fairly efficient and well documented.

@ogrisel (Member) commented Oct 5, 2011

@bdholt1 I will send you a PR with fixes for the doc links, a merge from current master, and some pep8.

Commit: Merge ogrisel's doc and pep8 commits
@larsmans (Member) commented Oct 5, 2011

Shouldn't the classes be called CARTClassifier and CARTRegressor, because we might want to add other d-tree learners?

@pprett (Member) commented Oct 5, 2011

I think we have to be careful because CART is a trademark.

@ogrisel (Member) commented Oct 5, 2011

I prefer to keep it DecisionTreeClassifier as it's more explicit (no acronym). We can always re-alias the generic class name the day we want to implement other types of D-trees.

@bdholt1 (Member, Author) commented Oct 5, 2011

I am +0. CARTClassifier and CARTRegressor are shorter names, which I like, but realistically we'll only need to change if someone decides to implement a C45Classifier and C45Regressor (I highly doubt anyone would want an ID3Classifier, and C5.0 is closed source). So I think we can defer the decision until someone codes up a C4.5 learner.

@larsmans (Member) commented Oct 5, 2011

If CART is a trademark, then maybe we shouldn't use it. I'm no expert on trademark law, though; would it be illegal to use that name, or just to advertise with it?

@ogrisel (Member) commented Oct 6, 2011

I don't want to deal with trademark issues, I don't like acronyms in class names, and the current implementation seems generic enough, hence I like the generic name. So +1 for deferring class renaming discussions until the day we actually have an alternative implementation that makes sense to contribute to the scikit.

Any further comment before the merge?

@GaelVaroquaux (Member) commented

> Any further comment before the merge?

I'd like to review and merge tonight. I can dedicate a long sleepless evening to this.

@bdholt1 (Member, Author) commented Oct 6, 2011

I'd appreciate it; unfortunately I won't be online tonight, though. Gael, perhaps it would be best if you copy the branch and submit a PR, or, if you're happy, merge into master and then make your own changes.


@GaelVaroquaux (Member) commented

> I'd appreciate it, unfortunately I won't be online tonight though. Gael, perhaps it would be best if you copy the branch and submit a PR, or if you're happy then merge into master and then make your own changes.

I was planning to do the merge myself tonight, and to make any changes I think are needed (I am thinking of minor details like wording in the docs).

@bdholt1 (Member, Author) commented Oct 6, 2011

Great!

@GaelVaroquaux (Member) commented

Midnight local time. Just finished my other duties (namely reformatting and compressing the scikit-learn article to be able to resubmit it), and moving on to this poule. Sorry, I am a bit late. What a restless life full of excitement :)

@GaelVaroquaux (Member) commented

At first glance, the test time of the scikit on my box went from 80 s before the merge to 102 s after the merge. :( That's a 20% increase for the decision trees alone. If this is confirmed, we will need to address this testing time.

@GaelVaroquaux (Member) commented

> At first glance, the test time of the scikit on my box went from 80 s before the merge to 102 s after the merge.

Noise! I hadn't done the merge, just a checkout, and thus didn't benefit from some optimizations in master. When writing the first mail, I had a feeling that something was wrong...

GaelVaroquaux merged commit 0f3d93f into scikit-learn:master on Oct 6, 2011
@GaelVaroquaux (Member) commented

I have merged this pull request. I haven't really changed much: I am too tired to think, and besides, I had already had a close look before. IMHO, there are no low-hanging fruits. I still don't like the pointer arithmetic, but I guess I will have to live with it.

@ogrisel (Member) commented Oct 6, 2011

\o/ Congrats to everybody involved; this is a great contribution to the project. I find it really amazing that so many people dived in to review, test, refactor and optimize this code.

@yang commented Oct 6, 2011

This is awesome, since it opens the door to random forests. Any comparisons with R's randomForest? (Or am I jumping the gun here?)

@ogrisel (Member) commented Oct 6, 2011

Nope, but if you would like to script one, for instance using RPy2, please feel free to submit a pull request to https://github.com/scikit-learn/ml-benchmarks
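
A minimal sketch of what such a benchmark could look like via rpy2 (illustrative only: it assumes R plus its randomForest package are installed and that rpy2's numpy conversion is available; the data here is random):

import numpy as np
import rpy2.robjects as ro
from rpy2.robjects.packages import importr
from rpy2.robjects import numpy2ri

numpy2ri.activate()  # let rpy2 convert numpy arrays to R matrices
randomForest = importr('randomForest')

X = np.random.rand(200, 10)
y = ro.FactorVector((np.random.rand(200) > 0.5).astype(int).astype(str))
model = randomForest.randomForest(X, y, ntree=100)  # calls R's randomForest(x, y, ntree)
print(model)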

@satra (Member) commented Oct 7, 2011

@bdholt1, @pprett, @glouppe: thank you very much for taking what was a crude initial stab at this and turning it into a nicely optimized piece of code. Great job!

@scikits.learn community: for showing what nicely distributed coding can pull off.

@yang: random forests and other ensemble techniques are next in the queue to build on top of this.

@glouppe (Contributor) commented Oct 7, 2011

Good job to everyone who contributed!

@TimSC commented Oct 7, 2011

Thanks everyone! I am already using this implementation in my research (and I am keen to try random forests too!).
