
Enh/tree (performance optimised) #310

Merged: 1 commit merged into scikit-learn:master on Oct 6, 2011

Conversation

@bdholt1 (Member) commented Aug 11, 2011

As @satra commented in #288 (comment), here is a separate PR for a decision tree. This version is significantly faster than the alternatives (Orange and milk). Furthermore, Orange and scikits.learn support both multiclass classification and regression, while milk only supports classification.

We would now welcome any comments with the aim of gaining acceptance for a merge into master.

Performance and scores

$ python bench_tree.py madelon
Tree benchmarks

Loading data ...
Done, 2000 samples with 500 features loaded into memory
scikits.learn (initial): mean 84.23, std 0.62
Score: 0.76

scikits.learn (now): mean 0.65, std 0.00
Score: 0.76

milk: mean 115.31, std 1.57
Score: 0.75

Orange: mean 25.82, std 0.02
Score: 0.50

$ python bench_tree.py arcene
Tree benchmarks

Loading data ...
Done, 100 samples with 10000 features loaded into memory
scikits.learn (initial): mean 40.95, std 0.44
Score: 0.60

scikits.learn (now): mean 0.20, std 0.00
Score: 0.60

milk: mean 71.00, std 0.60
Score: 0.60

Orange: mean 10.78, std 0.20
Score: 0.51

TODO before merge

  • increase test coverage to over 95%
  • finish the documentation (fix broken example and plot links, add practical usage tips)
  • demonstrate how to use a graphviz output in an example
  • include a static graphviz output for the iris and boston datasets in the documentation
  • add feature_names to GraphvizExporter
  • extract the graphviz exporter code out of the tree classes (use visitor pattern), assign node numbers (not mem addresses)
  • s/dimension/feature/g
  • add a test for the pickling of a fitted tree (see the sketch after this list)
  • cythonise prediction
  • explain in the documentation and in the docstrings how these classes relate to ID3, C4.5 and CART
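
For the pickling bullet above, a minimal sketch of what such a test could look like (an illustration only: it assumes the period import path scikits.learn.tree and the usual fit/predict API; the toy data is made up):

import pickle
import numpy as np
from scikits.learn.tree import DecisionTreeClassifier

X = np.array([[0., 0.], [1., 1.], [0., 1.], [1., 0.]])
y = np.array([0, 1, 1, 0])

clf = DecisionTreeClassifier().fit(X, y)
clf2 = pickle.loads(pickle.dumps(clf))  # round-trip the fitted tree through pickle
assert np.all(clf.predict(X) == clf2.predict(X))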

Future enhancements

  • ability to provide instance weights (for boosting DTs)
  • support a loss matrix (ala R's rpart)
  • support multivariate regression (ala R's mvpart)
  • support Randomized Trees

@pprett (Member) commented Aug 11, 2011

Very nice - can't wait to look at that in more detail! thanks @bdholt1, @ndawe, and @satra!

@mblondel (Member) commented

Thanks a lot for the hard work. I think @glouppe will be interested in this branch.

"num labels is %s and num features is %s "
% (len(labels), len(features)))

sample_dims = np.array(xrange(n_dims))
@pprett (Member) commented on the diff above:

this should be:

sample_dims = np.arange(n_dims)

Timing differences:

In [2]: %timeit np.array(xrange(m))
1000 loops, best of 3: 1.6 ms per loop

In [3]: %timeit np.arange(m)
100000 loops, best of 3: 12 us per loop

@bdholt1 (Member, Author) replied:

Fixed. Thanks @pprett

@bdholt1 (Member, Author) commented Aug 12, 2011

While I've been looking through neighbors.py, I've noticed a lot of

def fit(self, X, y, **params):
    ...
    self._set_params(**params)  # update estimator parameters passed in via fit

to support use within a pipeline. Should tree.py also support this?

@larsmans (Member) commented

We were just about to deprecate that.

@ogrisel (Member) commented Aug 13, 2011

Thanks for this work and taking the time to profile and benchmark it against competing implementations. This looks very promising. I will try to secure some time in the coming days to give this PR a deeper look.

@ogrisel (Member) commented Aug 13, 2011

I'm seeing the following broken doctest:

----------------------------------------------------------------------
File "/home/ogrisel/coding/scikit-learn/scikits/learn/tree/tree.py", line 474, in scikits.learn.tree.tree.DecisionTreeRegressor
Failed example:
    for train_index, test_index in kf:
        clf = DecisionTreeRegressor()
        clf = clf.fit(data.data[train_index], data.target[train_index])
        print np.mean(np.power(clf.predict(data.data[test_index]) - data.target[test_index], 2))
Expected:
    19.2264679543
    41.2959435867
Got:
    20.2065464252
    44.4226946421

>>  raise self.failureException(self.format_failure(<StringIO.StringIO instance at 0x58dd128>.getvalue()))

You should probably use # doctest: +ELLIPSIS and ... markers to replace the lowest decimals of the expected output, so as to avoid the test breaking when switching from one architecture to another.

@ogrisel (Member) commented Aug 13, 2011

To fix the previous breakage you should also pass random_state=0 to the constructor.
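
Concretely, the fixed doctest might look like this (a sketch only: the KFold setup comes from the surrounding doctest, and the expected digits shown are the original ones, which may shift once the seed is fixed):

>>> for train_index, test_index in kf:
...     clf = DecisionTreeRegressor(random_state=0)
...     clf = clf.fit(data.data[train_index], data.target[train_index])
...     print np.mean(np.power(clf.predict(data.data[test_index]) - data.target[test_index], 2))  # doctest: +ELLIPSIS
19.2...
41.2...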

@ogrisel (Member) commented Aug 14, 2011

I am about to do a first pass at code style cleanup. Expect a pull request soon.

@bdholt1 (Member, Author) commented Aug 15, 2011

Thanks Olivier! I've just added some more functionality (bringing this into line with the standard labels accepted for binary classification). You will probably need to merge.

@ogrisel (Member) commented Aug 16, 2011

Hi, I did not have time to come up with a new PR during my short plane trip as I had first intended. I'll let you know when it's ready.

@bdholt1 (Member, Author) commented Aug 19, 2011

@pprett: The script I used to do the benchmarking is at https://github.com/bdholt1/ml-benchmarks/blob/enh%2Ftree/benchmarks/bench_tree.py

The training sets are madelon and arcene (using the ml-benchmarks framework).

> I started to port _find_best_split to Cython. A quick %timeit clf.fit(X, y) on the madelon training set turned up 21 s per loop vs. 99 s per loop for the pure Python + NumPy implementation.

That's great! This implementation might even reach the same ballpark speed as Orange, but with a higher score!

@bdholt1 (Member, Author) commented Aug 23, 2011

Following @pprett's lead, I moved _find_best_split() to Cython. A few key insights:

  1. Treat classification and regression separately
  2. Sort the features and labels, and then work on the sorted arrays (sketched below).
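
To illustrate point 2, here is a simplified sketch (my own illustration, not the PR's Cython code) of how presorting one feature column exposes all candidate thresholds in a single pass:

import numpy as np

def sorted_split_candidates(x, y):
    # sort one feature column together with the labels
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    # valid thresholds lie between consecutive distinct sorted values
    distinct = xs[:-1] != xs[1:]
    thresholds = (xs[:-1][distinct] + xs[1:][distinct]) / 2.0
    return xs, ys, thresholds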

A significant speed increase (arcene):

Done, 100 samples with 10000 features loaded into memory
scikits.learn: mean 4.06, std 0.02
Score: 0.57

milk: mean 68.21, std 0.31
Score: 0.25

Orange: mean 10.26, std 0.02
Score: 0.51

All the tests still pass. This is a 10× speed increase, making it the fastest of the three implementations, while remaining easy to read and modify.

@bdholt1 (Member, Author) commented Aug 23, 2011

thanks @satra

@satra (Member) commented Aug 23, 2011

nicely done.

@pprett (Member) commented Aug 24, 2011

@bdholt1 yesterday evening I played around with the second enhancement I told you about (updating the pm counts [or means in regression] vs. computing them from scratch). It turns out this yields a massive performance improvement (about an order of magnitude or even more):

Madelon:
time: 0:00:00.776095 (that's 0.7 sec)
score: 0.741666666667

Arcene:
time: 0:00:00.125336 (that's 0.1 sec)
score: 0.58

The code is here:
https://github.com/pprett/scikit-learn/tree/bdholt1-enh/tree

When you look at _tree.pyx you see new extension types at the top (ClassificationCriterion and RegressionCriterion). They implement the update logic: label counts (aka pm) for the left and right branches in classification, and label means for the left and right branches in regression. The concrete subclasses (Gini, Entropy, and MSE) implement an eval method that computes the value of the splitting criterion.
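
Schematically, in plain Python (my own illustration of the idea, not the actual extension types), the update/eval split looks like this:

import numpy as np

class GiniLikeCriterion(object):
    """Toy analogue of ClassificationCriterion + Gini: moving one sorted
    sample from the right branch to the left updates the label counts in
    O(1) instead of recomputing them from scratch."""

    def __init__(self, y_sorted, n_classes):
        self.y = y_sorted
        self.left = np.zeros(n_classes)
        self.right = np.bincount(y_sorted, minlength=n_classes).astype(np.float64)

    def update(self, i):
        # move sample i (in sorted order) from the right branch to the left
        self.left[self.y[i]] += 1
        self.right[self.y[i]] -= 1

    def eval(self):
        # weighted Gini impurity of the current left/right partition
        n = self.left.sum() + self.right.sum()
        score = 0.0
        for counts in (self.left, self.right):
            n_branch = counts.sum()
            if n_branch > 0:
                score += (n_branch / n) * (1.0 - ((counts / n_branch) ** 2).sum())
        return score

Scanning the sorted samples once, calling update and then eval at each candidate threshold, reproduces the single-pass search without recomputing the counts.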

@bdholt1 (Member, Author) commented Aug 24, 2011

+1! That's a significant improvement! I suspect it was this very idea that led to the gain (though clearly not as large) in my previous attempt. The latest version is also quite a bit more readable. Given this major gain, I think it warrants some serious attention, so I'll try to work this into my implementation or just use yours straight.

@pprett (Member) commented Aug 24, 2011

What strikes me is the bad performance of the tree on the Arcene dataset compared to RBF SVMs (score about 0.73). Furthermore, I benchmarked it on the covertype dataset (~500k examples, 54 features): its score is similar to that of a linear SVM (a non-linear classifier should do much better than that!)...

@bdholt1 (Member, Author) commented Aug 24, 2011

I'm not too surprised about the performance. Technically a decision tree is more of a non-parametric classifier than a non-linear one: it works by greedily partitioning the feature space perpendicular to the feature axes, which is quite different from casting the data into a higher-dimensional space and finding a separating hyperplane. The good news is that Orange's performance is in the same ballpark, and I've worked through the algorithm quite thoroughly, so I'm confident it's doing it correctly. Hopefully we will see a significant performance gain when it comes to using these trees in a forest; forests will compete with the best classifiers out there.

@mblondel (Member) commented

Where can I retrieve the Arcene dataset?

@pprett (Member) commented Aug 24, 2011

http://archive.ics.uci.edu/ml/datasets/Arcene


@bdholt1 (Member, Author) commented Aug 24, 2011

I use ml-benchmarks (which comes with the Arcene and Madelon UCI datasets). The benchmarking script I use for this code is https://github.com/bdholt1/ml-benchmarks/blob/enh%2Ftree/benchmarks/bench_tree.py, which I hope will be accepted (scikit-learn/ml-benchmarks#6) when this PR is accepted.

@pprett (Member) commented Aug 24, 2011

@bdholt1: does the error of a non-terminal node have to be smaller than the error of the parent node? I cannot find anything on this in ESLII (page 308, 1st paragraph discusses something related). But if I turn the check off, accuracy on Arcene goes up to 0.7 (from 0.58).

@bdholt1 (Member, Author) commented Aug 24, 2011

@pprett Well, it depends how you define error. In the way it is used in your code (which I'm pilfering as we speak), it's the relative error (i.e. the amount of entropy in that node). What you would expect is that the overall entropy decreases as we go down the tree. In my code, I first compute the error at the node when all samples are on the right, and then test against that.

@bdholt1 (Member, Author) commented Aug 24, 2011

@pprett: is it any slower if you turn it off?

@pprett (Member) commented Aug 24, 2011

A bit (a factor of 2):

Madelon: score 0.761666666667, time 0:00:01.571801
Arcene: score 0.7, time 0:00:00.285888

btw: if we set min_split to 5 (AFAIK recommended in ESLII) then the performance goes up too:

Madelon: score 0.8, time 0:00:00.719846
Arcene: score 0.72, time 0:00:00.234298
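
(For reference, a hypothetical one-liner, assuming the constructor parameter in this PR is called min_split:

clf = DecisionTreeClassifier(min_split=5)  # only split nodes holding >= 5 samples

In later scikit-learn releases this parameter is known as min_samples_split.)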

@bdholt1 (Member, Author) commented Aug 25, 2011

@pprett: I've taken your code, done a bit of work to make it easier to understand/modify, and made sure it passes the unit tests (although the regression error on boston is still higher than with the current version on this branch).

min_split = 1
Madelon:
recursive partition called 231 times
scikits.learn: mean 1.46, std 0.00
Score: 0.75

Arcene:
recursive partition called 7 times
scikits.learn: mean 0.27, std 0.00
Score: 0.60

min_split = 5
Madelon:
recursive partition called 95 times
scikits.learn: mean 0.87, std 0.00
Score: 0.78

Arcene:
recursive partition called 7 times
scikits.learn: mean 0.27, std 0.00
Score: 0.60

I've now got three versions (min_split = 1); each cell is time / score:

            pprett_no_mask    pprett_mask      bdholt1_current
Madelon     1.92 s / 0.79     1.45 s / 0.75    4.59 s / 0.79
Arcene      0.59 s / 0.57     0.27 s / 0.60    1.64 s / 0.57

Since they all pass the unit tests (and I've gone through the internal workings fairly thoroughly), I'm going to focus on the version that uses sample masks and get that into the branch.

@pprett (Member) commented Aug 25, 2011

Great - thanks for your work!

btw: for the arcene results with 0.7 acc I turned off the error check (by simply passing np.inf to the recursive split instead of the error).

@bdholt1 (Member, Author) commented Aug 25, 2011

These results are with np.inf passed in, so I'm not sure how you got such high values. Does the version in your branch pass the unit tests? I've done quite a bit of debugging on the splitting algorithm, which I'm putting in a branch for you now: https://github.com/bdholt1/scikit-learn/blob/enh/tree2/scikits/learn/tree/

@pprett (Member) commented Oct 4, 2011

@bdholt1: thanks!

BTW: does anyone know how one usually grows a tree with exactly J terminal nodes (i.e. leaves)? In ESLII they use such trees for boosting, but they don't describe exactly how they are grown. Does one always expand the node with the lowest error, or the one with the highest decrease in error?

@bdholt1 (Member, Author) commented Oct 4, 2011

@pprett: my guess is that they use a pruning method after the tree has been grown to full size. That way, you know how many leaves there are and you can prune until you only have J of them.

@ogrisel (Member) commented Oct 4, 2011

Ok +1 for merge.

@ogrisel (Member) commented Oct 4, 2011

Actually the doc has a couple of links in markdown syntax that need to be translated to reST syntax. Also, the complexity of a balanced tree is stated as O(n_samples · log(n_samples)). Is this true? There is no dependency on n_features?

@bdholt1 (Member, Author) commented Oct 4, 2011

> Also the complexity of a balanced tree is stated as O(n_samples · log(n_samples)). Is this true? There is no dependency on n_features?

This piece is the introduction, where the basic idea of the tree is explained. Yes, there is a dependency on n_features, and a few sentences later in the doc make that clear.

If after reading that section again you think it should be re-worded, I'd be happy to do so.

@ogrisel (Member) commented Oct 4, 2011

Yeah, so the complexity should be stated as O(n_samples · n_features · log(n_samples)) in the first sentence of that section.
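
A one-line justification (my own reasoning, not a quote from the docs): a balanced tree has O(log(n_samples)) levels, and the split search at each level scans every sample once per feature, so in LaTeX terms:

T = \underbrace{\log(n_\mathrm{samples})}_{\text{levels}} \times \underbrace{n_\mathrm{samples} \cdot n_\mathrm{features}}_{\text{split search per level}} = O\left(n_\mathrm{features} \cdot n_\mathrm{samples} \cdot \log(n_\mathrm{samples})\right)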

@glouppe (Contributor) commented Oct 5, 2011

I am +1 for this branch to be merged. Much work has already been done and we have now reached a pretty decent implementation of decision trees, both fairly efficient and well documented.

@ogrisel (Member) commented Oct 5, 2011

@bdholt1 I will send you a PR with fixes for the doc links, a merge from current master, and some pep8.

Commit: Merge ogrisel's doc and pep8 commits
@larsmans (Member) commented Oct 5, 2011

Shouldn't the classes be called CARTClassifier and CARTRegressor, because we might want to add other d-tree learners?

@pprett (Member) commented Oct 5, 2011

I think we have to be careful because CART is a trademark.

@ogrisel (Member) commented Oct 5, 2011

I prefer to keep it DecisionTreeClassifier as it's more explicit (no acronym). We can always re-alias the generic class name the day we want to implement other types of D-trees.

@bdholt1 (Member, Author) commented Oct 5, 2011

I am +0. CARTClassifier and CARTRegressor are shorter names, which I like, but realistically we'll only need to change if someone decides to implement a C45Classifier and C45Regressor (I highly doubt anyone would want an ID3Classifier, and C5.0 is closed source). So I think we can defer the decision until someone codes up a C4.5 learner.

@larsmans (Member) commented Oct 5, 2011

If CART is a trademark, then maybe we shouldn't use it. I'm no expert on trademark law, though; would it be illegal to use that name, or just to advertise with it?

@ogrisel (Member) commented Oct 6, 2011

I don't want to deal with trademark issues, I don't like acronyms in class names, and the current implementation seems generic enough, hence I like the generic name. So +1 for deferring class renaming discussions until the day we actually have an alternative implementation that makes sense to contribute to the scikit.

Any further comment before the merge?

@GaelVaroquaux (Member) commented

> Any further comment before the merge?

I'd like to review and merge tonight. I can dedicate a long sleepless evening to this.

@bdholt1 (Member, Author) commented Oct 6, 2011

I'd appreciate it; unfortunately I won't be online tonight, though. Gael, perhaps it would be best if you copy the branch and submit a PR, or, if you're happy, merge into master and then make your own changes.


@GaelVaroquaux (Member) commented

> I'd appreciate it, unfortunately I won't be online tonight though. Gael, perhaps it would be best if you copy the branch and submit a PR, or if you're happy then merge into master and then make your own changes.

I was planning to do the merge myself tonight, and to make any changes I think are needed (I am thinking of minor details like wording in the docs).

@bdholt1 (Member, Author) commented Oct 6, 2011

Great!

@GaelVaroquaux (Member) commented

Midnight local time. Just finished my other duties (namely reformatting and compressing the scikit-learn article to be able to resubmit it), and moving on to this poule. Sorry, I am a bit late. What a restless life full of excitement :)

@GaelVaroquaux (Member) commented

At first glance, the test time of the scikit on my box went from 80 s before the merge to 102 s after the merge. :( That's a 20% increase for the decision trees alone. If this is confirmed, we will need to address this testing time.

@GaelVaroquaux (Member) commented

> At first glance, the test time of the scikit on my box went from 80 s before the merge to 102 s after the merge.

Noise! I hadn't done the merge, just a checkout, and thus didn't benefit from some optimizations in master. When writing the first mail, I had a feeling that something was wrong...

GaelVaroquaux merged commit 0f3d93f into scikit-learn:master on Oct 6, 2011
@GaelVaroquaux (Member) commented

I have merged this pull request. I haven't really changed much: I am too tired to think, and besides, I had already had a close look before. IMHO, there are no low-hanging fruits. I still don't like the pointer arithmetic, but I guess I will have to live with it.

@ogrisel (Member) commented Oct 6, 2011

\o/ Congrats to everybody involved; this is a great contribution to the project. I find it really amazing that so many people dived in to review, test, refactor and optimize this code.

@yang commented Oct 6, 2011

This is awesome, since it opens the door to random forests. Any comparisons with R's randomForest? (Or am I jumping the gun here?)

@ogrisel (Member) commented Oct 6, 2011

Nope, but if you would like to script one, for instance using RPy2, please feel free to submit a pull request to https://github.com/scikit-learn/ml-benchmarks
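
A minimal sketch of what such a benchmark could look like via rpy2 (illustrative only: it assumes R plus its randomForest package are installed and that rpy2's numpy conversion is available; the data here is random):

import numpy as np
import rpy2.robjects as ro
from rpy2.robjects.packages import importr
from rpy2.robjects import numpy2ri

numpy2ri.activate()  # let rpy2 convert numpy arrays to R matrices
randomForest = importr('randomForest')

X = np.random.rand(200, 10)
y = ro.FactorVector((np.random.rand(200) > 0.5).astype(int).astype(str))
model = randomForest.randomForest(X, y, ntree=100)  # calls R's randomForest(x, y, ntree)
print(model)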

@satra (Member) commented Oct 7, 2011

@bdholt1, @pprett, @glouppe: thank you very much for taking what was a crude initial stab at this and turning it into a nicely optimized piece of code. Great job!

@scikits.learn community: for showing what nicely distributed coding can pull off.

@yang: random forests and other ensemble techniques are next in the queue to build on top of this.

@glouppe (Contributor) commented Oct 7, 2011

Good job to everyone who contributed!

@TimSC commented Oct 7, 2011

Thanks everyone! I am already using this implementation in my research (and I am keen to try random forests too!).
