Zarr sink Improvements #1713

Open · wants to merge 4 commits into master
Conversation

@annehaley (Collaborator) commented Nov 1, 2024

Resolves #1674. Resolves #1698.
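
For reference, a minimal sketch of the kind of usage the linked issues describe, using only the sink calls that appear later in this thread (large_image_source_zarr.new(), addTile with axis keywords, and write); the axis names and sizes are illustrative, not taken from the PR itself:

import numpy as np
import large_image_source_zarr

sink = large_image_source_zarr.new()
# Ordinary tile placement with x/y keywords and a samples dimension.
sink.addTile(np.zeros((256, 256, 3), dtype=np.uint8), x=0, y=0)
# Adding a tile that references an axis the sink has not seen before
# (a hypothetical "z" here) should now create that axis instead of failing.
sink.addTile(np.zeros((256, 256, 3), dtype=np.uint8), x=0, y=0, z=2)
sink.write('/tmp/sweep_example.zarr.zip')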

@annehaley marked this pull request as ready for review November 5, 2024 18:29
@annehaley requested a review from manthey November 5, 2024 18:30
@manthey (Member) commented Nov 7, 2024

I have a multiprocess code path that behaves poorly.

In the main process, create the sink and add a tile.
Create a subprocess. It doesn't know that there are already tiles present, so the new_axes list is non-empty, and something uses a lot of memory as a tile is added.

I think the solution is that if new_axes (https://github.com/girder/large_image/pull/1713/files#diff-c79d7356628b5b310aaeb154520b413576727848e175b84f16af2b0eaadb98deR735) is not empty, we set the updateMetadata flag.
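
A rough sketch of that check, for illustration only; apart from new_axes and the updateMetadata flag named above, the function and variable names are hypothetical and not taken from the sink's actual code:

def referencesNewAxes(knownAxes, tileKwargs):
    # Axes in the incoming addTile call that this process has not seen yet.
    return [axis for axis in tileKwargs if axis not in knownAxes]

# Inside an addTile-like call:
new_axes = referencesNewAxes(['x', 'y', 's'], {'x': 0, 'y': 0, 'z': 2})
if new_axes:
    # Another process may already have written tiles to the shared store, so
    # force a metadata refresh instead of assuming the dataset is empty.
    updateMetadata = True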

My test case was to use the examples/algorithm_progression.py example with --multiprocessing and to hack in a sink.addTile call with a 1-pixel record at the maximum sweep locations. That is,

diff --git a/examples/algorithm_progression.py b/examples/algorithm_progression.py
index 52271cfb..2a70c0dc 100755
--- a/examples/algorithm_progression.py
+++ b/examples/algorithm_progression.py
@@ -42,7 +42,7 @@ class SweepAlgorithm:
 
         self.combos = list(itertools.product(*[p['range'] for p in input_params.values()]))
 
-    def getOverallSink(self):
+    def getOverallSink(self, maxValues=None):
         msg = 'Not implemented'
         raise Exception(msg)
 
@@ -118,7 +118,10 @@ class SweepAlgorithm:
 
     def run(self):
         starttime = time.time()
-        sink = self.getOverallSink()
+        source = large_image.open(self.input_filename)
+        maxValues = {'x': source.sizeX, 'y': source.sizeY, 's': source.metadata['bandCount']}
+        maxValues.update({p['axis']: len(p['range']) for p in self.param_order.values()})
+        sink = self.getOverallSink(maxValues)
 
         print(f'Beginning {len(self.combos)} runs on {self.max_workers} workers...')
         num_done = 0
@@ -149,7 +152,7 @@ class SweepAlgorithm:
 
 
 class SweepAlgorithmMulti(SweepAlgorithm):
-    def getOverallSink(self):
+    def getOverallSink(self, maxValues=None):
         os.makedirs(os.path.splitext(self.output_filename)[0], exist_ok=True)
         algorithm_name = self.algorithm.__name__.replace('_', ' ').title()
         self.yaml_dict = {
@@ -270,10 +273,14 @@ class SweepAlgorithmMultiZarr(SweepAlgorithmMulti):
 
 
 class SweepAlgorithmZarr(SweepAlgorithm):
-    def getOverallSink(self):
+    def getOverallSink(self, maxValues=None):
         import large_image_source_zarr
 
-        return large_image_source_zarr.new()
+        sink = large_image_source_zarr.new()
+        if maxValues:
+            sink.addTile(np.zeros((1, 1, maxValues.get('s', 1))),
+                         **{k: v - 1 for k, v in maxValues.items() if k != 's'})
+        return sink
 
     def writeOverallSink(self, sink):
         sink.write(self.output_filename, lossy=self.lossy)

and then run python examples/algorithm_progression.py ppc --param=hue_value,hue,0,1,100,open --param=hue_width,width,0.10,0.25,4 -w 36 --multiprocessing build/tox/externaldata/sample_Easy1.png /tmp/sweep.zarr.zip

@manthey (Member) commented Nov 7, 2024

Hmm... If we did that, then maybe we'd also need to call self._validateZarr() in the _initNew method if we didn't create the file, but then things fail if we aren't setting data before doing the multiprocessing fork.
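
Very roughly, and only as a sketch of the idea (the guard condition and attribute names here are assumptions; only _initNew and _validateZarr come from the comment above):

class SinkSketch:
    # Hypothetical stand-in for the zarr sink, just to illustrate the guard.
    def __init__(self, createdFile):
        self._createdFile = createdFile

    def _validateZarr(self):
        # In the real sink this would re-read axes/frames from the store.
        pass

    def _initNew(self):
        # Only refresh metadata when we attached to a store some other process
        # created; though, as noted above, this still fails if no data was
        # written before the multiprocessing fork.
        if not self._createdFile:
            self._validateZarr()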

@annehaley (Collaborator, Author) commented Dec 2, 2024

I took some time to observe the behavior of your example use case on multiple Python versions and to plot the output from memory-profiler to compare them. The resulting plot is shown below, with timestep (sampled every 0.1 seconds) along X and memory usage along Y. The longest blue line is the main process; all other lines are subprocesses.

For all of these Python versions, the only spike in memory I experienced was in the main process after the multiprocessing stage, during the conversion step of the write function. Interestingly, 3.11 and 3.12 have an extra subprocess that hangs around and occupies a negligible but non-zero amount of memory after the multiprocessing stage (shown as a flat orange line).

[memory-profiler plot: per-process memory usage over time for each Python version tested]

In my experience, 3.10 performed the best and 3.11 performed the worst. Am I remembering correctly that you said you tried this with 3.11? If my 3.11 graph does not reflect what you experienced, can you send me a full description of your environment so I can do my best to replicate it? Otherwise, I'm not sure we can do much to further improve memory usage; this may just be a matter of Python and other library versions.

EDIT: Here's the command I ran in each environment:

mprof run --multiprocess examples/algorithm_progression.py ppc --param=hue_value,hue,0,1,100,open --param=hue_width,width,0.10,0.25,4 --multiprocessing  /home/anne/data/large_image/Easy1.png /home/anne/data/large_image/generated/sweep.zarr.zip
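
If it helps with replicating, here is a small sketch of one way to plot the mprofile_*.dat files that mprof run writes. It assumes the usual format of MEM lines for the main process and CHLD lines for subprocesses, which may differ between memory-profiler versions; mprof plot produces a similar figure out of the box.

import glob
from collections import defaultdict

import matplotlib.pyplot as plt

# Assumed format of mprof's .dat output: "MEM <MiB> <timestamp>" lines for the
# main process and "CHLD <idx> <MiB> <timestamp>" lines for subprocesses when
# --multiprocess is used; adjust the parsing if your version writes differently.
for path in glob.glob('mprofile_*.dat'):
    series = defaultdict(list)
    with open(path) as fh:
        for line in fh:
            parts = line.split()
            if not parts:
                continue
            if parts[0] == 'MEM':
                series['main'].append((float(parts[2]), float(parts[1])))
            elif parts[0] == 'CHLD':
                series['child ' + parts[1]].append((float(parts[3]), float(parts[2])))
    if not series:
        continue
    start = min(t for points in series.values() for t, _ in points)
    for label, points in sorted(series.items()):
        points.sort()
        plt.plot([t - start for t, _ in points], [m for _, m in points], label=label)

plt.xlabel('time (s)')
plt.ylabel('memory (MiB)')
plt.legend(fontsize='small')
plt.savefig('memory_comparison.png')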

Successfully merging this pull request may close these issues:

Set frame values in multiple processes
Modify Zarr sink addTile to allow creation of new axes