insert_point is slow #28

Open · mdbartos opened this issue Nov 17, 2018 · 11 comments

@mdbartos
Running a line profiler on insert_point shows the top time consumers:

%lprun -f tree.insert_point tree.insert_point(inlier, index='inlier')
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   ...
   302        13       6400.0    492.3     69.6              bbox = self.get_bbox(node)
   303        13       2214.0    170.3     24.1              cut_dimension, cut = self._insert_point_cut(point, bbox)
   ...

Within _insert_point_cut:

%lprun -f tree._insert_point_cut tree.insert_point(inlier, index='inlier')
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   ...
   520        13        267.0     20.5     10.1          bbox_hat = np.empty(bbox.shape)
   521                                                   # Update the bounding box based on the internal point
   522        13        363.0     27.9     13.8          bbox_hat[0, :] = np.minimum(bbox[0, :], point)
   523        13        808.0     62.2     30.7          bbox_hat[1, :] = np.maximum(bbox[1, :], point)
   524        13         85.0      6.5      3.2          b_span = bbox_hat[1, :] - bbox_hat[0, :]
   525        13        309.0     23.8     11.7          b_range = b_span.sum()
   526        13        213.0     16.4      8.1          r = np.random.uniform(0, b_range)
   527        13        246.0     18.9      9.3          span_sum = np.cumsum(b_span)
   ...
@mdbartos

Storing the bounding box for each branch could cut roughly 70% of the time expended.
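For reference, a minimal sketch of the caching idea (the attribute names node.b for the cached bbox and node.u for the parent pointer are assumptions about the node class, not confirmed internals):

import numpy as np

def _tighten_bbox_upwards(node, point):
    # Walk from the inserted leaf's branch up to the root, widening each
    # cached bbox in place to cover the new point. This replaces the
    # O(n_leaves) get_bbox recomputation with an O(tree_depth) update.
    while node is not None:
        node.b[0, :] = np.minimum(node.b[0, :], point)    # lower bounds
        node.b[-1, :] = np.maximum(node.b[-1, :], point)  # upper bounds
        node = node.u                                     # parent (assumed attr)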

@mdbartos

After storing the bboxes, the time breakdown for insert_point is:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   297         1        118.0    118.0      4.3          duplicate = self.find_duplicate(point, tolerance=tolerance)
   ...
   309        13       1936.0    148.9     70.9              cut_dimension, cut = self._insert_point_cut(point, bbox)
   ...
   323        12        138.0     11.5      5.1                  if point[node.q] <= node.p:
   ...
   347         1        168.0    168.0      6.2          self._tighten_bbox_upwards(branch)

And within _insert_point_cut:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   573        13         51.0      3.9      2.9          bbox_hat = np.empty(bbox.shape)
   574                                                   # Update the bounding box based on the internal point
   575        13        320.0     24.6     17.9          bbox_hat[0, :] = np.minimum(bbox[0, :], point)
   576        13        510.0     39.2     28.6          bbox_hat[-1, :] = np.maximum(bbox[-1, :], point)
   577        13        101.0      7.8      5.7          b_span = bbox_hat[-1, :] - bbox_hat[0, :]
   578        13        215.0     16.5     12.1          b_range = b_span.sum()
   579        13        149.0     11.5      8.4          r = np.random.uniform(0, b_range)
   580        13        170.0     13.1      9.5          span_sum = np.cumsum(b_span)
   581        13         19.0      1.5      1.1          cut_dimension = np.inf
   582        19         50.0      2.6      2.8          for j in range(len(span_sum)):
   583        19         37.0      1.9      2.1              if span_sum[j] >= r:
   584        13         13.0      1.0      0.7                  cut_dimension = j
   585        13         12.0      0.9      0.7                  break
   586        13         91.0      7.0      5.1          if not np.isfinite(cut_dimension):
   587                                                       raise ValueError("Cut dimension is not finite.")

@atthom commented Apr 10, 2019

Hello,

This algorithm performs really well, but insert_point is really slow.
Is there any update on this bottleneck, and would it be possible to create a batch insert that could potentially run faster?

@mdbartos commented Apr 11, 2019

If using batch mode is an option, it can improve speeds by quite a bit. For the taxicab dataset, changing from streaming to batch mode brought the total computation time from ~20 minutes down to 20-30 seconds. See this example: https://klabum.github.io/rrcf/taxi.html
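For reference, a minimal example of batch construction (rrcf.RCTree accepts the full point set in its constructor):

import numpy as np
import rrcf

X = np.random.randn(1000, 2)
# Batch mode: build the whole tree in one pass,
# instead of calling insert_point once per point.
tree = rrcf.RCTree(X)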

As far as speeding up insert_point goes, I'm not really sure. The major time consumers are numpy's built-in maximum and minimum functions, which are already pretty optimized.

Because trees are independent, benefits could be gained through parallelization. My knowledge of parallelization in python is pretty limited though.

@atthom commented Apr 11, 2019

I've seen this example. Does it matter if not all of the points are in each tree of the forest? I'm currently inserting around 50,000 points per tree (it takes around 200 seconds). Maybe that's a bit too much? :)

I was thinking of a batch_insert_point that takes an np.ndarray (l x d).
With this, you could compute the minimum and maximum with axis=0, so the min and max are computed a single time for all of the points. Likewise, find_duplicate could be run once, and maybe the other checks inside insert_point could also be done a single time; see the sketch below.
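Something like this minimal sketch (batch_insert_point itself is hypothetical; this only shows the vectorized bbox computation):

import numpy as np

points = np.random.randn(50000, 2)  # l x d batch of points
# One vectorized pass over the whole batch, instead of one
# np.minimum/np.maximum call per inserted point:
batch_min = points.min(axis=0)
batch_max = points.max(axis=0)
bbox = np.vstack([batch_min, batch_max])  # same 2 x d layout as bbox_hat above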

@atthom commented Apr 11, 2019

Parallelization in Python is incredibly easy. I've already done it myself, so I can write it down:

from pathos.multiprocessing import ProcessingPool
from multiprocessing import cpu_count

import numpy as np
import rrcf

    def _insert_multiproc(self, features, n_jobs: int = -1):
        # Clamp n_jobs to the number of available CPUs.
        nb_cpu = cpu_count()
        if n_jobs >= nb_cpu or n_jobs == -1:
            n_jobs = nb_cpu

        # Each worker process builds and scores one tree independently.
        pool = ProcessingPool(nodes=n_jobs)
        all_distances = pool.map(RCTree._cmp_detect, [features] * self.nb_trees)
        # Average codisp across trees, then threshold by the contamination rate.
        distance = np.mean(np.array(all_distances), axis=0)
        thresh = np.quantile(distance, 1 - self.contamination)
        return distance, thresh

    @staticmethod
    def _cmp_detect(features):
        # Build one tree by streaming in all points, then score every leaf.
        tree = rrcf.RCTree()
        for i in range(features.shape[0]):
            tree.insert_point(features[i], index=i)
        return [tree.codisp(leaf) for leaf in tree.leaves]

@mdbartos

Hi @atthom,

Thanks for taking a look at this.

The points don't need to be in every single tree as long as you make sure you're averaging codisp properly. Ultimately, you're generating an estimator for codisp. As a side note, it would be interesting to look at the properties of the codisp estimator as a function of the number of trees, tree size, etc.
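For example, a minimal sketch of averaging codisp only over the trees that contain each point (forest, points, and index-keyed leaves are assumed context here, not code from this thread):

import numpy as np

n = points.shape[0]
avg_codisp = np.zeros(n)
counts = np.zeros(n)
for tree in forest:
    for i in tree.leaves:               # leaf keys are point indices
        avg_codisp[i] += tree.codisp(i)
        counts[i] += 1
avg_codisp /= np.maximum(counts, 1)     # divide by how many trees saw each point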

Have you tested how much of a speedup you're getting from parallelizing? I briefly experimented with using the multiprocessing module but wasn't able to get a significant speedup.

For batch insertion, you would need to show that the implementation works out mathematically: in other words, that the new tree is drawn from the distribution T' (S' U {p}): https://klabum.github.io/rrcf/insert-and-delete.html

It would probably require a completely different insert_point algorithm, because the current one assumes that points are added sequentially (which changes the bounding boxes/tree structure each time a point is inserted).

@atthom commented Apr 12, 2019

I tested the parallelization. It's the perfect use case for parallelization because the trees are independent: each processor can compute its trees separately.
With N processors and K trees in the forest, the compute time scaled roughly as K/N, probably because the codisp computation is also inside the multiprocessed function.

Now I understand why it's not easy to implement inserting multiple points at once!
I'll have to read a bit more about the bounding boxes and the data structure if I want a faster insert_point :)

@mdbartos

Sounds good. If you want to contribute any parallelization code, feel free to submit a pull request.

@hokiegeek2

(Quoting @mdbartos above: "If using batch mode is an option, it can improve speeds by quite a bit. [...] Because trees are independent, benefits could be gained through parallelization.")

I wonder if Dask would make sense here?
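For what it's worth, a minimal sketch of how dask.delayed might parallelize per-tree work (tree_codisp and the sample/forest sizes are illustrative assumptions):

import dask
import numpy as np
import rrcf

def tree_codisp(X, tree_size):
    # Build one tree on a random sample of the data and score its leaves.
    ix = np.random.choice(X.shape[0], size=tree_size, replace=False)
    tree = rrcf.RCTree(X[ix], index_labels=ix)
    return {i: tree.codisp(i) for i in ix}

X = np.random.randn(10000, 2)
# One delayed task per tree; trees are independent, so this parallelizes cleanly.
tasks = [dask.delayed(tree_codisp)(X, 256) for _ in range(100)]
results = dask.compute(*tasks, scheduler="processes")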

@mdbartos

Definitely worth looking into 👍
