-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Suggestion: keep track of utility of assigning all units each treatment outside of "level_one_learning" #146
Comments
Hi @beniaminogreen, thanks a lot, this looks very cool and interesting! I don't have time to look closely until a bit later next month, but in the meantime, could I please ask if you get the same result here if you plug in |
Hi @beniaminogreen, thanks for looking into this; your addition seems very promising. In addition to what @erikcs requests above, could you also add
I think we should definitely incorporate changes which offer gains of this magnitude. |
Hi @erikcs and @kanodiaayush, thanks for getting back to me so quickly! Please find below a screenshot of the tests @erikcs asked me to run. Not sure how to read these statistics, but the outputs look similar to the ones in the script you sent. I should also add that I have been unit-testing the treatment assignements produced by parallel_policy_tree against those produced by the policy_tree function (see here), to make sure that the two methods always produce the same treatment assignments. The tests show that the two methods coincide for tree depths up to 3 (beyond which I don't test, but I assume the assignments should still become identical). @kanodiaayush, please find below a comparison of I couldn't get to writing the algorithm in pseudocode this weekend, but I'll be able to do it in the next few days, and I'll leave it here as soon as I do. Best, |
Thanks Ben, that looks great. This is very impressive work. I’m trying to understand where the gains are coming from, could you please run that benchmark plot for dense Xj, ie, line11: X = matrix(rnorm(n*p), n, p)? |
Sorry for getting you to run this, but I only have my phone at the moment. Could you please run that same script but overlay a 3rd line with runtime for just a depth 0 policytree? Thanks ! |
Sure, here it is: This isn't what I expected to see, so I ran another flamegraph to try and diagnose where the policytree algorithm is spending time when creating a depth-0 tree for a data set of 50K units. It seems that for depth-0 trees in datasets of this size, most of the time is spent creating the sorted sets. Once that is done, it looks like the remaining search does not take much time at all. Maybe the differences in speed are coming from a difference in the time it takes to create the initial data structures? This might make sense - the first flamegraph that I ran was for a dataset with ~1000 observations, if I remember correctly. |
Thanks, yeah that is why I wanted to see the depth 0 timing, had a hunch a big speed difference arose from the sorted sets being created faster in Rust (I think algorithmically your level 1 learning should have the same amortized runtime as policytree’s, the amount of work it does should be the same) What data structures are you using to create the sorted sets in Rust? I think (at least part of) your incredible speed improvement may be a system-level improvement in that particular data structure 🤔 |
Definitely looks to be the case - good catch! I am using a BTreeMap to create the sorted sets (which are converted to vectors once they have been populated with all the observations). For each axis, I create an empty BTreeMap with X values as keys, vectors of observation indices as entries. So a dataset that looks like this:
Would be stored as the following key/value pairs:
I populate the B-TreeMap by iterating over each observation, and either adding a key-value pair if the key has not been seen before, or appending the value of the vector to the value associated with the key if the key has been seen before. In the example above, it would look like:
Some of the speed improvement may also have to do with the fact that I am storing indexes of observations in my trees, while I believe you are storing entire observations, which would make the trees slower to populate / clone. |
Great, thanks! Could you please run similar benchmarks for a depth 2 tree on three X-configurations: {all binary, a handful of discrete values, dense (i.e Also, if possible, (I don't have my laptop now so can't do this) could you repeat this + the depth 1 timings from above for a modified policytree where you: replace L18 here to Thanks again! |
Thanks! I'd like a larger number of observations to see what's going on, could you do up to n = 100k, at least for the top plot? It could take a few hours, but I guess you have an external server? (also p = 2 and d = 2 is fine). Thanks again! |
I've been running these on my laptop, but I can put them on my server. I don't know how long they will take, but I'll set them off and report back when they have completed. The only server I can run benchmarks on has 4 cores, so it could be a few days. |
Actually, you could leave out policytree from the benchmarks, we have roughly an idea of how long time that takes if you include an R file with benchmark details. What would be interesting to see is the Rust policytree (1 core) timing as a function of n up to 100k or more for the different X settings. How many categories are there in the 250k depth 2 tree you mention in your first post? |
Sounds good - I've set the code running again with just the Rust implementation going. Will likely still take a few days. The 250k observation run was based on dataset which actually had 45 columns, not 55 as I reported. The dimensions and number of unique values in each column are below, but the columns with 141 and 173 breaks were not included. If they are included, the training time goes up to about 50 seconds on our 40-core server. This time can still see a lot of improvement, as I parallelize by having each core only search over trees that start by splitting on a specific variable. This means that cores assigned variables with fewer break points finish much earlier than others, and have to sit around waiting for the stragglers to finish. My bet is that a better parallelization scheme could bring this time down to ~25 seconds. |
These came back a lot quicker than I anticipated! The binary and categorical runs both finished in about 15ms each for a dataset of n=100,000, while the dense dataset took about an hour and 5 mins to run. The benchmarking code I used can be found here |
Interesting, policytree takes around 15 sec for the two first ones and 11 minutes for the last. |
That is interesting. This definitely reflects my focus on the categorical / binary case in coding my version. I bundle all the observations with the same value of a predictor together so that they can be quickly added / removed from a node as a group. If no observations share the same value of a predictor, then all the bundling code will have a significant speed cost. My guess is that this explains the discrepancy in performance between the categorical and dense cases. In the amortized sorting operation, I imagine I could limit some of the speed losses by assessing if a variable is dense or not, and then deciding on an appropriate representation. That might allow me to glean the advantages of grouping, while not loosing performance in the dense case. |
Yes, we should consider handling repeated Xj in policytree, your package makes it clear it's very worthwhile. I'm not sure when I will get time to have a closer look though. And I looked at your Rust code, for the sorted sets it reminds of an optional splitting rule GRF used to have. For N unique samples you are essentially maintaining a NxP array of the global sort order. When moving samples L/R you can update indexes in O(1) time since you have an N-length index array, but creating a copy/new one takes O(N) time, where N is the global number of samples, in policytree creating a new one takes O(n) time where n is the number of samples in the recursive call, and insertion takes O(logn). Which of these terms dominate will depend on the number of distinct values (and will likely have to be large, e.g. num.samples=5k is too small to notice these differences). If you experiment in bechmarking (varying "denseness" and N) I think you will find a ~threshold where the the two runtimes cross. For "full" dense Xj your runtime when doubling num.samples is roughly 8 times larger (last plot from 50k to 100k), for policytree that same ratio will be ~4 instead of 8. |
Thanks for spelling that out for me - I thought that the array approach would always dominate the sorted sets implementation in policytree, but now I see that isn't the case. This will definitely be something for me to think about if I keep working on my package . Hopefully there will be a way for me to get the best of both worlds - the speed that arrays give for small / medium sizes of data with the scaling that your sorted sets implementation has for large datasets. |
Yes, you can probably work out empirical N/n heuristics to switch between implementations during runtime if you really want to keep tinkering! (In these things there are practically always areas you can squeeze out more perf by identifying special cases). One small thing you could try is make your How about we add a link to your package in the README highlighting it for people who have large data with few distinct values? |
Thanks for the suggestion - I've just made a branch with the hash set approach you describe, and I'll benchmark it to see which is faster. I went with the array approach originally so that the active array would be stored in a contiguous memory block, but I doubt that will offset the speed from the large array copy each time I want to split along a new axis. I'll benchmark both ways and pick the quicker one. I'll also look into the SIMD approach, but I don't think Rust has particularly mature support for SIMD stuff from what I've been able to google. An inclusion in the README sounds fantastic to me, thanks for the shout out! I can provide a little table like the one below for my algorithm if you'd like. |
Great, maybe you could put some timings on your GH README for reference? (considering I guess you'll keep making changes you'll always have the latest timings in your repo, we don't have bandwidth to make any big changes to policytree in the near future). |
Sounds good to me - I'll run them in the next few days, then put them in the README. Should also add some more documentation for how the parallel_policy_tree function is used. Thanks for your help, |
a) When things are getting more polished, we could also add an option to policytree that would dispatch to your package depending on input type? I.e. something like:
b) On algorithmic details, your ObservationBundles are the right idea, but I can imagine you got from the timings you ran in this thread that this loop is something you'd want to avoid. We could help you out if you want to improve this even further? You could probably write up what you did and send it to JOSS as a software paper - Ayush and I can help you along the way. If you are applying for PhD programs maybe that would give you a boost! So, for an optimal version of BundledObservations and SortedSets, I caught up with @kanodiaayush: an asymptotically unbeatable version of policytree for duplicated x-values would zap the O(N) term by maintaining an index data structure which tells you which ObservationBundle each sample belong to for each dimension. Then you could use this to move bundles left to right like policytree, and iterate over only the non-empty bundles in the recursive calls. A short pseudo example could be
(this will probably require some back and forth on details). |
Both those suggestions sound great to me - I would love to keep optimizing the package, and if there's a paper at the end of the process, that would be fantastic! Once things get more stable / developed, I can definitely try and construct a heuristic to decide which is more efficient for a given solution. It would be slightly complicated by the fact that Great suggestion about the modification to the sorted sets. I had thought of doing something similar earlier, but I shelved it because it seemed like a lot of effort, and I had not made the connection that it would make the runtime better asymptotically. I'll start working on it this evening / over the weekend - it would be super to achieve a better asymptotic scaling. G I tried implementing your suggestion to keep track of the alive units using hash sets see here but it didn't seem to be quicker for small dataset sizes. I'll schedule a more exhaustive test in the next few days, but it might be that the memory-contiguity of the arrays offsets the cost of the large copies for datasets of most sizes. |
Sounds good! Yes, we should add multithreading to policytree then as well, else it looks would look weird. Some practical considerations to keep in mind is guaranteeing same output for same input. Some settings may have many optimal trees and policytree returns the one that came first while exploring. Users will be confused if threading causes a different optimal tree to be returned. Similar considerations for very many trees with ~almost numerically the same reward. If Rust / C++ compiler internal numerics cause differences that will break user expectations. An implicit contract we should always try to stick with is: if I write a paper using policytree version a.b.c, then that paper should replicate exactly if I use policytree version a.x.z on an arbitrary computer in the future. |
I think that right now, because of how I implement the parallelization, I am guaranteeing same output for same input, but I'll keep an eye out as I make changes to ensure that this is the case. I parallelize the search by assigning each thread to search for the best tree that starts by splitting on a single variable, and then select the best from these. any ties are resolved by looking at the initial variable the tree splits on. It will be important to make sure I maintain this pattern in future versions. I also agree that Rust vs C++ numerics could be an issue, but I don't think that there is much I could do to fix these if they do arise. Empirically, I am not sure that they are much of a concern however, as I test Finally, I have experimented implementing the tweaks you have suggested, and the results seem mixed (see below) I'm not sure if this is an issue with my implementation (looking for places to speed this up this afternoon), or if this is an accurate representation of how the new strategy will perform. Part of what I think we are seeing is that the new strategy involves copying n sets, where n is the number of breaks / cut points in the dataset. This means that it would perform the best when the datasets are sparse, but this is also where the existing implimentation is the quickest, so there isn't much room for improvement. I'll keep plugging away and see if the performance in the sparse case can be improved at all. I think the best thing to do for the dense case might simply be to copy the existing algorithm into Rust (or to just have users use the existing package). |
a) sounds good! Different compilers on different OS's sometimes produce different result. Also, different OS users may face different installation hassles, I tried installing the package on stanford's linux HPC and it failed with a Rust linking error. Maybe this is something your GH actions could try and catch with a larger build matrix, + give additional assurance tests pass across different OS's and dependency versions (am not up to date on Rust, but a big reason policytree/grf is stable is that they do not depend on a gazillion other libraries). b1) great work, but what we meant with "asymptotically optimal" would be to do exactly what policytree (algo1) does, but with "sample_n" replaced by "ObservationBundle_{split_val_n}". Translating those BST insert/erase operations to work on ObservationBundles would require use of "SampleToBundle". That doesn't look like what's being done here? (It would likely be a quite big tinkering project.) b2) by "asymptotically optimal" we mean: consider algo A (yours) and B (ours). Both solve the same problem, but algo A has runtime O(A) and algo B runtime O(B). Both depend on N = the total number of samples, and n_j = the number of distinct values of feature j. As you saw empirically, A is faster than B until you reach a point P where N and n_j gets sufficiently large. What we mean is that: the suggestion above to modify A to A' will likely bump up the point P (to P') at which A will be faster than B (not that it will make it best for the dense case, that's B). Note 1: What P and P' look like, and if it's worth the hassle, I don't know. Note 2: A' might very well be slower than A for the sparsest cases! |
PS: I think your package in its current from already seem very useful. The algo suggestion above could be more of a longer term project if you are interested in keeping working on your package. A more actionable near term thing we could do is add the following option to policytree: policy_tree(
X,
Gamma,
depth = 2,
split.step = 1,
min.node.size = 1,
solver = c("policytree", "parallel_policy_tree"),
num.threads = NULL,
verbose = TRUE
) where if installed, solver = "parallel_policy_tree" would just call directly into a fit method in that package (which bypasses input checks, we should sync and stick to an API and ensure future compatibility then). That would be easier than a "auto" solver option, at least for now, and you could maintain some benchmarks on your package github that could help inform the user (we also sent you a GH invite if you feel any info would be useful on policytree's page). Ideally we should add multithreading to policytree too, as having a num.threads arugment that only works with one option looks strange (am not sure when I will get around to that though). On that note, maybe a more informative name could be something like "sparsepolicytree" instead of "parallel_policy_tree"? |
Gotcha, I see what you mean about "asymptotically optimal," and I think I understand what you mean about implementing the same strategy as policytree but with observation bundles. I did something similar to this in the original prototype of the package, and I think that it might still be archived in the git history so I'll try and pull it out / bring it back to life for some benchmarking. Historically, the issue that I had with that strategy was that I had to copy p sorted sets each recursive call, and this ended up being more expensive than copying the vector of active units. I'll try and see if there's a way to get the best of both worlds in the next few days. In the nearer term, I do like the idea of creating a fit method with a standard API so policytree can call out to sparsepolicytree (I do agree that's a better name). What would be the next steps here, and do you have any preferences for how the internal API should look? I'll add a split.step option to sparse_policy_tree so it has all the features of the other solver, then perhaps we can work on integrating the two. |
For now I think it would already be helpful if just Then internals + package integration via a solver argument is something we could come back to later. My plate is too full for this atm, so I can't give you a timeline except the vague academic glacial pace answer of sometime next year : P. There are some other policytree projects in the tinkering phase related bandit problem estimation I'd want to finish first (have no idea when that will be done). Also as I mentioned, we should do multithreading as well. Conceptually the implementation is simple: we should execute each {left, right} recursive call asynchronously in a thread pool (with a fixed number of threads). Getting that to work involves fighting with the C++ compiler and is not something I have time for now. It could however be a different threading strategy for you to try out in Rust, if you didn't already try that. |
Hi All,
Thanks for writing such a fantastic and well-documented package.
I've been working using the package the past few weeks, and I think I might have come across a potential speed improvement which could let users build even deeper trees than before.
Specifically, instead of creating an NxPxD array of cumulative treatment effects in the
level_one_learning function
, couldn't the algorithm keep tabs of the utility from assigning each unit to every treatment (a D-length vector) and use this to calculate the utility + best action on each side of a potential split?I've used a flamegraph to try and diagnose where the algorithm is spending time (see below) and it seems like a lot of the algorithm's time is spent allocating and de-allocating these arrays. I'm not a C++ developer, so I can't be sure that I am reading the code / charts correctly, but a decent amount of time also seems to be to be spent maintaining the binary trees, which like they could be converted to vectors after they have been sorted, as it looks like they are only being accessed by index after creation (but I'm less sure about this).
I don't know C++ so I can't file a PR to tweak this, but I do know Rust, so I've quickly put together both of these changes in an Rust-backed R package here). The package seems to perform well when compared to the existing search algorithm as the number of data points increases (see plot below), and also also parallelizes the search to run across multiple cores which increases the speed by another order of magnitude.
(I originally called the package "exhaustivetree," but I changed the name shortly after making the graph.)
As you can see, the change in scaling from these tweaks are quite dramatic, and they make it feasible to run quick policytree searches on larger datasets with a more covariates than before — I was delighted to find out that I can now run a search for a depth-2 tree on my a dataset with 250,000 observations and 55 (categorical) covariates in 15 seconds, which is a large improvement over the current policytree time (which took too long for me to record).
I hope you'll be able to consider these changes / look into them more, as it seems like they offer a large increase in speed. I'd be very happy to answer any questions you have, or to run any tests you would like on the code I have written to verify that the speedups are indeed due to the tweaks I made, rather than any Rust / C++ difference.
Best,
Ben
The text was updated successfully, but these errors were encountered: