
Batch Aggregation #785

Merged
merged 38 commits into batch-aggregation-staging
Jun 10, 2024

Conversation


@zwolf zwolf commented May 28, 2024

This PR contains the following:

  • Celery backbone for background job processing
  • New routes for starting a new run and checking the status of running tasks
  • BatchAggregation Celery task
  • BatchAggregation lib that handles logistics
  • Associated specs

This is ready for (re-)review. I've included some questions in the BatchAggregation spec file; answers to them could improve the specs. This PR depends on zooniverse/panoptes#4303, which provides the API to save run data on a Panoptes resource. Looking for feedback on the whole pipeline, which looks like this:

Request sent to Panoptes --> Panoptes sends run_aggregation request --> celery job starts --> exports downloaded & processed --> extraction, reduction --> create csv files, zip them --> upload data to storage --> send request back to Panoptes containing run UUID.
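The stage order above can be sketched as a linear pipeline. All names here are hypothetical stand-ins for illustration, not the actual BatchAggregator API:

```python
# Minimal sketch of the pipeline order described above; each stage is a
# hypothetical callable supplied by the caller, not the real implementation.
def run_pipeline(download, extract, reduce_, package, upload, notify):
    exports = download()                      # exports downloaded
    extracts = extract(exports)               # extraction
    reductions = reduce_(extracts)            # reduction
    archive = package(extracts, reductions)   # create csv files, zip them
    url = upload(archive)                     # upload data to storage
    return notify(url)                        # report back to Panoptes with run UUID
```

In the real code this runs inside a Celery task, so each request returns immediately and the work proceeds in the background.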

Merging #783 was an accident, as I had intended to keep that branch-of-a-branch separate. I created a new batch-aggregation-staging branch to merge into, so that I can add a deployment template and deploy directly from it for testing. cc @lcjohnso on the new PR.

@zwolf zwolf requested a review from CKrawczyk May 28, 2024 23:42

@zwolf zwolf changed the base branch from master to batch-aggregation-staging May 29, 2024 00:48

@CKrawczyk CKrawczyk left a comment


This all looks good. I have left some feedback and comments throughout the code. One thing: make sure os.path.join is used for all file paths to ensure compatibility across operating systems (I know that we know the server OS we are running this on, but the tests might be run on a Windows computer by someone downloading the package).
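The suggestion above can be sketched with `os.path.join`; the path components here are illustrative:

```python
import os

# os.path.join picks the separator for the current OS, so the same code
# works on a Linux server and on a Windows machine running the tests.
output_path = os.path.join('tmp', 'aggregation')                    # illustrative path
reductions_file = os.path.join(output_path, '1234_reductions.csv')  # illustrative name
```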

Let me know if you need more clarification on the testing questions and I can dig into the specifics a bit more. When I am struggling with tests I also try to check I have not entered into "over testing" the code. I want to make sure the main goal of the code is tested, but not necessarily the specific implementation to get that result (i.e. if I refactor the code in the future to do the same thing in a different way, the test should still pass).


for reducer in reducer_list:
# This is an override. The workflow_reducer_config method returns a config object
# that is incompatible with the batch_utils batch_reduce method
Collaborator

If it helps the defaults are all taken from this dictionary: https://github.com/zooniverse/aggregation-for-caesar/blob/master/panoptes_aggregation/workflow_config.py#L4-L41

It can be used as a basis for batch_standard_reducers when we want to extend beyond the initial tests.

Member Author

I used those as a starting point. The goal was to have multiple reducers run in certain cases (like question & question consensus). The batch_standard_reducers dict can be moved wherever makes sense later.

This override is because the output of workflow_config.workflow_reducer_config currently causes an error when passed to batch_utils.parse_reducer_config. Possibly a bug, or just a discrepancy between the function and yaml outputs. Since batch_utils is now used to process csvs, making changes to batch_utils felt out of scope for this PR.

}

for task_type, extract_df in extracted_data.items():
extract_df.to_csv(f'{ba.output_path}/{ba.workflow_id}_{task_type}.csv')
Collaborator

Minor point, it might be worth using os.path.join when creating file paths to keep these functions as OS independent as possible.

Member Author

It is done~~

# that is incompatible with the batch_utils batch_reduce method
reducer_config = {'reducer_config': {reducer: {}}}
reduced_data[reducer] = batch_utils.batch_reduce(extract_df, reducer_config)
filename = f'{ba.output_path}/{ba.workflow_id}_reductions.csv'
Collaborator

If multiple reducers are in reducer_list will this keep overwriting the same file?

Member Author

I'm using mode=a in the to_csv call to append to the file (if it exists). The ask here was for a single, concatenated reductions file.
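A sketch of that append pattern (column values are illustrative). Guarding `header` on each write is one way to avoid repeating the header row per reducer, which plain `mode='a'` otherwise produces:

```python
import os
import tempfile
import pandas as pd

# Two hypothetical reducer outputs sharing the standard reduction columns.
question = pd.DataFrame({'subject_id': [1], 'reducer': ['question'], 'data': ['{}']})
consensus = pd.DataFrame({'subject_id': [2], 'reducer': ['consensus'], 'data': ['{}']})

path = os.path.join(tempfile.mkdtemp(), 'reductions.csv')
for df in (question, consensus):
    # mode='a' appends; write the header only when the file doesn't exist yet
    df.to_csv(path, mode='a', index=False, header=not os.path.exists(path))
```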

Collaborator

What happens when reducers have different column names (I don't know the Pandas default behaviour)? Does it do an outer join?

Member Author

It just concats at the end right now, so it'll have a new header row and then the reduction rows. However, reductions all come out with subject_id,workflow_id,task,reducer,data and should just glue together, no different columns expected. This is the current behavior (as opposed to one file per reducer).

Collaborator

Oh, I see, the batch reducer does not flatten out the data column the same way reduce_panoptes_csv does. This should not be an issue here.


def upload_files(self):
self.connect_blob_storage()
reductions_file = f'{self.output_path}/{self.workflow_id}_reductions.csv'
Collaborator

See the comment above about multiple reducers/reducer files.

r = http.request('GET', url, preload_content=False)
with open(filepath, 'wb') as out:
while True:
data = r.read(65536)
Collaborator

Out of curiosity, what does 65536 refer to in this case?

Member Author

This is the chunk size (64 KiB in this case) in which the data is read. Exports can be big, and I'm avoiding reading the whole file into memory at once.
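The streaming loop can be sketched against any file-like object (the snippet in question reads an urllib3 response opened with `preload_content=False`; the helper name below is illustrative):

```python
CHUNK_SIZE = 65536  # 64 KiB per read

def save_stream(stream, out, chunk_size=CHUNK_SIZE):
    """Copy `stream` to `out` chunk by chunk, never holding the whole
    payload in memory at once."""
    while True:
        data = stream.read(chunk_size)
        if not data:  # read() returns b'' at end of stream
            break
        out.write(data)
```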

self.assertEqual(mock_reducer.call_count, 2)

# The reducer's call list includes subsequent calls to to_csv, but the args are methods called on the mock
# rather than use the set values i.e. "<MagicMock name='BatchAggregator().output_path' id='140281634764400'>"
Collaborator

You might be able to put a mock directly on the to_csv function from the pandas to see if that allows you to intercept the calls.

From https://stackoverflow.com/a/65593458/1052418

You could mock out the entire DataFrame class using mock.patch("pandas.DataFrame", ...).

or more specifically mock.patch("pandas.DataFrame.to_csv", ...)
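A sketch of that suggestion, patching `to_csv` at the class level so calls from any DataFrame are intercepted (filename is illustrative):

```python
from unittest import mock
import pandas as pd

df = pd.DataFrame({'subject_id': [1], 'data': ['{}']})

# Patch the method on the class; every DataFrame's to_csv hits the mock
# while the patch is active, so nothing is written to disk.
with mock.patch('pandas.DataFrame.to_csv') as mock_to_csv:
    df.to_csv('reductions.csv')  # intercepted, the file is never created

mock_to_csv.assert_called_once_with('reductions.csv')
```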

Collaborator

Also, if mocking too far up is making some tests difficult to write, you can split this into multiple tests that place the mocks at different levels for checking different things.

Member Author

Mostly this was confusion about why the to_csv was in the same call stack but as a different mock. As with the other issue, seems like it was just a different mock of the same thing. With all the other spec changes, I just removed this block and I'm satisfied if you are.


# Why do these mocked methods called in __init__ not get counted as called?
# They are def getting called as the attributes are set
# mock_uuidgen.assert_called_once()
Collaborator

Isn't it enough to test that ba.id is not None to ensure the method has run? (i.e. test for the desired result rather than the specific way of getting that result).
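That is, the test can assert on the attribute the constructor is supposed to set. The class below is a hypothetical stand-in for BatchAggregator, not the real code:

```python
import uuid

class AggregatorSketch:
    """Hypothetical stand-in: __init__ sets an id the way the real
    constructor does via its uuid helper."""
    def __init__(self):
        self.id = uuid.uuid4().hex

ba = AggregatorSketch()
# Assert the desired result rather than the specific call that produced it:
assert ba.id is not None
```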

Member Author

It probably is, yeah. That'll work for UUID and I'll just remove the client check, but I must be mocking something wrong to see ba.id despite the assert failing.

# Why do these mocked methods called in __init__ not get counted as called?
# They are def getting called as the attributes are set
# mock_uuidgen.assert_called_once()
# mock_client.assert_called_once()
Collaborator

Again you can just check mock_client.assert_called_once() to check that the method ran, no need for a check on the specific method.

Member Author

Same with this. The way I had it, mock_client.assert_called_once() fails with AssertionError: Expected 'connect' to have been called once. Called 0 times. I mocked that, but still expected that method to have been called to make that assert pass. ¯\_(ツ)_/¯

# ])

# How do I test the specific instance of BatchAggregator rather than the mocked class?
# mock_aggregator.upload_files.assert_called_once()
Collaborator

You can write a test that rather than mock the entire class just mocks the method you wish to test

@patch("panoptes_aggregation.batch_aggregation.BatchAggregator.upload_files")
@patch(...)

Member Author

I'm already mocking the whole BatchAggregator, but I think my issue is that it's mocking a class method when what I really want to do is set a mock on a method of the specific instance of BatchAggregator that the run_aggregation method instantiates. Mocking BatchAggregator.upload_files like above also doesn't work. I put one PDB breakpoint in the method itself and one in the spec:

> /usr/src/aggregation/panoptes_aggregation/batch_aggregation.py(55)run_aggregation()
-> ba.update_panoptes()
(Pdb) ba
<MagicMock name='BatchAggregator()' id='139923834173280'>
(Pdb) ba.upload_files
<MagicMock name='BatchAggregator().upload_files' id='139923833924432'>
(Pdb) ba.upload_files()
<MagicMock name='BatchAggregator().upload_files()' id='139923833994928'>
(Pdb) continue
>> PDB continue (IO-capturing resumed)
>> PDB set_trace (IO-capturing turned off) 
/usr/src/aggregation/panoptes_aggregation/tests/batch_aggregation/test_batch_aggregation.py(45)test_run_aggregation()
-> mock_uploader.assert_called_once()
(Pdb) mock_uploader
<MagicMock name='upload_files' id='139923834625088'>
(Pdb) mock_uploader()
<MagicMock name='upload_files()' id='139923834173328'>

Maybe because the thing I'm mocking isn't actually exactly the thing I'm calling?

Collaborator

That is likely it. I will have to put some more thought into the tests for this. Most of the code is tested, which is good enough for me. I am happy to loop back to these at a later time rather than have it block this feature.

Member Author

Figured it out! You can use mocked_class_instance = mocked_class.return_value to get access to an "instance" of the class, then set mocks on instance methods. Updated a few tests with this syntax.
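A minimal sketch of that pattern with `unittest.mock` (function and method names are illustrative):

```python
from unittest import mock

def run_aggregation(aggregator_cls):
    """Code under test: instantiates the class, then calls an instance method."""
    ba = aggregator_cls()
    ba.upload_files()

mock_cls = mock.MagicMock()
mock_instance = mock_cls.return_value  # the "instance" every call to the class returns

run_aggregation(mock_cls)

# The call landed on the instance's method, not on the class attribute:
mock_cls.upload_files.assert_not_called()
mock_instance.upload_files.assert_called_once()
```

This works because calling a MagicMock always returns its `return_value`, so mocks set on that object are the ones the code under test actually hits.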



zwolf commented Jun 4, 2024

@CKrawczyk Thanks for all the feedback!! I implemented everything I could and commented, lmk if the changes/new stuff are good 2 go!

I plan on merging this PR into the batch-aggregation-staging branch, then creating a PR off of that branch to add a new deploy template for a staging instance that deploys from that branch alone. I'll test everything (adding redis and volumes, celery, API, etc) without affecting the existing app.

@zwolf zwolf merged commit f48cb40 into batch-aggregation-staging Jun 10, 2024
8 checks passed
zwolf added a commit that referenced this pull request Jul 2, 2024
* Batch Aggregation (#785)

* Add Celery and a test route

* Add new dependencies

* Test task tests

* Docker updates

* Scripts folder

* Setup deploy to test env

* Link redis container via docker

* Modify test task

* Add redis service to test workflow

* Hook up services

* Fix test arguments

* flake8

* newline

* rename and refactor

* Taking a swing at extraction

* oops

* update .gitignore

* Remove deploy files

* Update .gitignore

* Clean up test tests

* Add router tests

* Extremely placeholder BA lib tests

* Only override local import

* First few batch agg specs

* Updates to BatchAggregation & tests

* less flake8y

* Add final POST message to Panoptes

* Flake

* flake

* Pull etag before atempting update

* Remove unnecessary mocks

* Assert result set, not method called

* clean up spec mocks

* Add permissions checking, fix some specs, refactor Panoptes update

* Flake

* Use os.path for platform independence

* Undeleting deploy template

* Batch aggregation staging deploy (#786)

* Add logging statements

* Update celery task namespace

* Add staging deployment template

* Clean up new resource names

* Build to a single docker image

* Rename deployment & use Panoptes staging in staging deploy

* Fix secret name

* Sringify ID in comparison to value returned from Panoptes

* Update test

* Fix mock data type

* Use client's admin mode

* Fix a couple filepaths

* Use UUID as tmpdir path

* Finish run if Panoptes is unupdateable

* When the update panoptes resource doesn't exist but the call is successful

* Use jsonify to set mimetype

* cast inputs to ints just in case

* Enable public access to new containers

* Deploy staging with action

* hound?

* test fixes

* new hound

* Use correct k8s secret

* Use tag deployment for production (#788)

* Use tag deployment for production

* Add batchagg resources to prod template