insert_point is slow #28

Open · mdbartos opened this issue Nov 17, 2018 · 11 comments

@mdbartos
Running a line profiler on insert_point shows the top time consumers:

%lprun -f tree.insert_point tree.insert_point(inlier, index='inlier')
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   ...
   302        13       6400.0    492.3     69.6              bbox = self.get_bbox(node)
   303        13       2214.0    170.3     24.1              cut_dimension, cut = self._insert_point_cut(point, bbox)
   ...

Within _insert_point_cut:

%lprun -f tree._insert_point_cut tree.insert_point(inlier, index='inlier')
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   ...
   520        13        267.0     20.5     10.1          bbox_hat = np.empty(bbox.shape)
   521                                                   # Update the bounding box based on the internal point
   522        13        363.0     27.9     13.8          bbox_hat[0, :] = np.minimum(bbox[0, :], point)
   523        13        808.0     62.2     30.7          bbox_hat[1, :] = np.maximum(bbox[1, :], point)
   524        13         85.0      6.5      3.2          b_span = bbox_hat[1, :] - bbox_hat[0, :]
   525        13        309.0     23.8     11.7          b_range = b_span.sum()
   526        13        213.0     16.4      8.1          r = np.random.uniform(0, b_range)
   527        13        246.0     18.9      9.3          span_sum = np.cumsum(b_span)
   ...
@mdbartos

Storing the bounding box for each branch could cut roughly 70% of the time expended.
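For reference, a minimal sketch of the caching idea (the attribute names node.b for the cached bbox and node.u for the parent pointer are assumptions about the node class, not confirmed internals):

import numpy as np

def _tighten_bbox_upwards(node, point):
    # Walk from the inserted leaf's branch up to the root, widening each
    # cached bbox in place to cover the new point. This replaces the
    # O(n_leaves) get_bbox recomputation with an O(tree_depth) update.
    while node is not None:
        node.b[0, :] = np.minimum(node.b[0, :], point)    # lower bounds
        node.b[-1, :] = np.maximum(node.b[-1, :], point)  # upper bounds
        node = node.u                                     # parent (assumed attr)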

@mdbartos

After storing the bboxes, the time breakdown for insert_point is:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   297         1        118.0    118.0      4.3          duplicate = self.find_duplicate(point, tolerance=tolerance)
   ...
   309        13       1936.0    148.9     70.9              cut_dimension, cut = self._insert_point_cut(point, bbox)
   ...
   323        12        138.0     11.5      5.1                  if point[node.q] <= node.p:
   ...
   347         1        168.0    168.0      6.2          self._tighten_bbox_upwards(branch)

And within _insert_point_cut:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   573        13         51.0      3.9      2.9          bbox_hat = np.empty(bbox.shape)
   574                                                   # Update the bounding box based on the internal point
   575        13        320.0     24.6     17.9          bbox_hat[0, :] = np.minimum(bbox[0, :], point)
   576        13        510.0     39.2     28.6          bbox_hat[-1, :] = np.maximum(bbox[-1, :], point)
   577        13        101.0      7.8      5.7          b_span = bbox_hat[-1, :] - bbox_hat[0, :]
   578        13        215.0     16.5     12.1          b_range = b_span.sum()
   579        13        149.0     11.5      8.4          r = np.random.uniform(0, b_range)
   580        13        170.0     13.1      9.5          span_sum = np.cumsum(b_span)
   581        13         19.0      1.5      1.1          cut_dimension = np.inf
   582        19         50.0      2.6      2.8          for j in range(len(span_sum)):
   583        19         37.0      1.9      2.1              if span_sum[j] >= r:
   584        13         13.0      1.0      0.7                  cut_dimension = j
   585        13         12.0      0.9      0.7                  break
   586        13         91.0      7.0      5.1          if not np.isfinite(cut_dimension):
   587                                                       raise ValueError("Cut dimension is not finite.")

@atthom commented Apr 10, 2019

Hello,

This algorithm performs really well, but insert_point is really slow.
Is there any update on this bottleneck, and would it be possible to create a batch insert that could potentially run faster?

@mdbartos commented Apr 11, 2019

If using batch mode is an option, it can improve speeds by quite a bit. For the taxicab dataset, changing from streaming to batch mode brought the total computation time from ~20 minutes down to 20-30 seconds. See this example: https://klabum.github.io/rrcf/taxi.html
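For reference, a minimal example of batch construction (rrcf.RCTree accepts the full point set in its constructor):

import numpy as np
import rrcf

X = np.random.randn(1000, 2)
# Batch mode: build the whole tree in one pass,
# instead of calling insert_point once per point.
tree = rrcf.RCTree(X)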

As far as speeding up insert_point goes, I'm not really sure. The major time consumers are numpy's built-in maximum and minimum functions, which are already pretty optimized.

Because trees are independent, benefits could be gained through parallelization. My knowledge of parallelization in python is pretty limited though.

@atthom commented Apr 11, 2019

I've seen this example. Does it matter if not all of the points are in each tree of the forest? I'm currently inserting around 50,000 points per tree (it takes around 200 seconds). Maybe that's a bit too much? :)

I was thinking of a batch_insert_point that takes an np.ndarray (l x d).
With this, you could compute the minimum and maximum with axis=0, so the min and max are computed a single time for all of the points. Likewise, find_duplicate could be run once, and maybe the other checks inside insert_point could also be done a single time; see the sketch below.
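Something like this minimal sketch (batch_insert_point itself is hypothetical; this only shows the vectorized bbox computation):

import numpy as np

points = np.random.randn(50000, 2)  # l x d batch of points
# One vectorized pass over the whole batch, instead of one
# np.minimum/np.maximum call per inserted point:
batch_min = points.min(axis=0)
batch_max = points.max(axis=0)
bbox = np.vstack([batch_min, batch_max])  # same 2 x d layout as bbox_hat above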

@atthom commented Apr 11, 2019

Parallelization in Python is incredibly easy. I've already done it myself, so I can write it down:

from pathos.multiprocessing import ProcessingPool
from multiprocessing import cpu_count

import numpy as np
import rrcf

    def _insert_multiproc(self, features, n_jobs: int = -1):
        # Clamp n_jobs to the number of available CPUs.
        nb_cpu = cpu_count()
        if n_jobs >= nb_cpu or n_jobs == -1:
            n_jobs = nb_cpu

        # Each worker process builds and scores one tree independently.
        pool = ProcessingPool(nodes=n_jobs)
        all_distances = pool.map(RCTree._cmp_detect, [features] * self.nb_trees)
        # Average codisp across trees, then threshold by the contamination rate.
        distance = np.mean(np.array(all_distances), axis=0)
        thresh = np.quantile(distance, 1 - self.contamination)
        return distance, thresh

    @staticmethod
    def _cmp_detect(features):
        # Build one tree by streaming in all points, then score every leaf.
        tree = rrcf.RCTree()
        for i in range(features.shape[0]):
            tree.insert_point(features[i], index=i)
        return [tree.codisp(leaf) for leaf in tree.leaves]

@mdbartos

Hi @atthom,

Thanks for taking a look at this.

The points don't need to be in every single tree as long as you make sure you're averaging codisp properly. Ultimately, you're generating an estimator for codisp. As a side note, it would be interesting to look at the properties of the codisp estimator as a function of the number of trees, tree size, etc.
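For example, a minimal sketch of averaging codisp only over the trees that contain each point (forest, points, and index-keyed leaves are assumed context here, not code from this thread):

import numpy as np

n = points.shape[0]
avg_codisp = np.zeros(n)
counts = np.zeros(n)
for tree in forest:
    for i in tree.leaves:               # leaf keys are point indices
        avg_codisp[i] += tree.codisp(i)
        counts[i] += 1
avg_codisp /= np.maximum(counts, 1)     # divide by how many trees saw each point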

Have you tested how much of a speedup you're getting from parallelizing? I briefly experimented with using the multiprocessing module but wasn't able to get a significant speedup.

For batch insertion, you would need to show that the implementation works out mathematically: in other words, that the new tree is drawn from the distribution T' (S' U {p}): https://klabum.github.io/rrcf/insert-and-delete.html

It would probably require a completely different insert_point algorithm, because the current one assumes that points are added sequentially (which changes the bounding boxes/tree structure each time a point is inserted).

@atthom commented Apr 12, 2019

I tested the parallelization. It's the perfect use case for parallelization because the trees are independent: each processor can compute its trees separately.
With N processors and K trees in the forest, the compute time scaled roughly as K/N, probably because the codisp computation is also inside the multiprocessed function.

Now I understand why it's not easy to implement inserting multiple points at once!
I'll have to read a bit more about the bounding boxes and the data structure if I want a faster insert_point :)

@mdbartos

Sounds good. If you want to contribute any parallelization code, feel free to submit a pull request.

@hokiegeek2

(Quoting @mdbartos above: "If using batch mode is an option, it can improve speeds by quite a bit. [...] Because trees are independent, benefits could be gained through parallelization.")

I wonder if Dask would make sense here?
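For what it's worth, a minimal sketch of how dask.delayed might parallelize per-tree work (tree_codisp and the sample/forest sizes are illustrative assumptions):

import dask
import numpy as np
import rrcf

def tree_codisp(X, tree_size):
    # Build one tree on a random sample of the data and score its leaves.
    ix = np.random.choice(X.shape[0], size=tree_size, replace=False)
    tree = rrcf.RCTree(X[ix], index_labels=ix)
    return {i: tree.codisp(i) for i in ix}

X = np.random.randn(10000, 2)
# One delayed task per tree; trees are independent, so this parallelizes cleanly.
tasks = [dask.delayed(tree_codisp)(X, 256) for _ in range(100)]
results = dask.compute(*tasks, scheduler="processes")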

@mdbartos

Definitely worth looking into 👍
