Clean up transaction management for file_complete handler #930

BenGalewsky · 2024-11-25T21:21:53Z

Problem

The TransformerFileComplete resource handler is the most critical code in the entire stack. It responds to each file in the dataset being transformed and is responsible for updating the number of files processed, whether they succeeded or failed. These two counters are how we determine whether the transform is complete. The endpoint is hit repeatedly by all of the running transformers, so database transaction handling is very important to avoid missing files.

The current implementation uses implicit transactions and doesn't explicitly manage locks or flushing to the DB. This may allow files to be lost during big transform requests.

Approach

Use the DB session to explicitly manage transactions
Update record_file_complete to read the request with the with_for_update flag set, which locks the record in the DB
The increments to the file counters are handled in the same transaction
To make the file more readable, the retry call arguments are captured in a single new decorator, file_complete_ops_retry

With this decorator, I ran into a problem with the unit tests: importing the module caused the current_app.logger expression to be evaluated, which raised RuntimeError: Working outside of application context. I worked around this in the decorator by only accessing that logger when we are inside the Flask app.
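
A minimal sketch of that workaround, not the code in this PR: the logger is resolved when the wrapped function is called rather than when the module is imported, so importing it in a unit test never touches current_app. The name file_complete_ops_retry is reused only for illustration; the attempts and delay parameters and the _resolve_logger helper are assumptions. has_app_context is Flask's own check, and the sketch falls back to a plain stdlib logger if Flask is not installed.

```python
import functools
import logging
import time


def _resolve_logger():
    """Hypothetical helper: Flask app logger inside an app context, else a plain logger."""
    try:
        from flask import current_app, has_app_context
    except ImportError:  # Flask not installed: fall back to stdlib logging
        return logging.getLogger(__name__)
    if has_app_context():
        return current_app.logger
    return logging.getLogger(__name__)


def file_complete_ops_retry(attempts=3, delay=0.1):
    """Illustrative retry decorator: re-run the function up to `attempts` times."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    # The logger is looked up here, at call time, not at import time.
                    _resolve_logger().warning(
                        "%s failed (attempt %d/%d): %s",
                        func.__name__, attempt, attempts, exc,
                    )
                    if attempt == attempts:
                        raise  # out of attempts: let the caller see the error
                    time.sleep(delay)
        return wrapper
    return decorate
```

Because nothing Flask-related runs at decoration time, a test module can import and decorate functions without an application context.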

servicex_app/servicex_app/resources/internal/transformer_file_complete.py

ponyisi · 2024-12-05T23:37:05Z

servicex_app/servicex_app/resources/internal/transformer_file_complete.py

+            else:
+                transform_req.files_failed += 1
+
+        session.flush()  # Flush the changes to the database


What is the state of the DB if the transaction attempt fails three times? Do we again go out of sync?

What isolation level are we working at? (Are we certainly taking a row lock, or could the transaction fail because it's been updated somewhere else?)

BenGalewsky · 2024-12-09

There are three operations. One of them is on the table of transform results. Each file complete has its own row, so there is no contention there.

The part that causes us the most concern is the number of files processed column in the TransformTable. That's the one I put the heaviest level of locking on.

The other interesting one is declaring that the last file has been processed, so that we can call the transform complete and shut down the transformers. I would argue it's not catastrophic if that happens twice (it is a big deal when it never happens!).

My biggest concern is deadlock, which is why I made the update of the TransformResult table a separate transaction from the update of files remaining. So I think there is no chance of deadlock.
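
The pattern described here (one short transaction for the per-file result row, then a second transaction that locks the request before the read-increment-write) can be sketched with the standard library alone. This is a stand-in, not the PR's SQLAlchemy code: SQLite has no SELECT ... FOR UPDATE, so BEGIN IMMEDIATE is used to take the write lock before the counter is read, and it locks the whole database rather than a single row. The table layout and function names are assumptions.

```python
import sqlite3
from contextlib import contextmanager

SCHEMA = """
CREATE TABLE requests (
    request_id TEXT PRIMARY KEY,
    files_completed INTEGER NOT NULL DEFAULT 0,
    files_failed INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE transform_results (
    request_id TEXT NOT NULL,
    file_path TEXT NOT NULL,
    success INTEGER NOT NULL
);
"""


def connect(db_path):
    # isolation_level=None disables implicit BEGINs: every transaction is explicit.
    return sqlite3.connect(db_path, timeout=30, isolation_level=None)


def init_db(db_path, request_id):
    conn = connect(db_path)
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO requests (request_id) VALUES (?)", (request_id,))
    conn.close()


@contextmanager
def transaction(conn):
    # Take the write lock up front (stand-in for with_for_update's row lock).
    conn.execute("BEGIN IMMEDIATE")
    try:
        yield
        conn.execute("COMMIT")
    except BaseException:
        if conn.in_transaction:
            conn.execute("ROLLBACK")
        raise


def record_file_complete(db_path, request_id, file_path, success):
    conn = connect(db_path)
    try:
        # Transaction 1: the per-file result row.
        with transaction(conn):
            conn.execute(
                "INSERT INTO transform_results VALUES (?, ?, ?)",
                (request_id, file_path, int(success)),
            )
        # Transaction 2: lock, read, increment and write the counters together.
        with transaction(conn):
            completed, failed = conn.execute(
                "SELECT files_completed, files_failed FROM requests"
                " WHERE request_id = ?",
                (request_id,),
            ).fetchone()
            if success:
                completed += 1
            else:
                failed += 1
            conn.execute(
                "UPDATE requests SET files_completed = ?, files_failed = ?"
                " WHERE request_id = ?",
                (completed, failed, request_id),
            )
    finally:
        conn.close()
```

Since each transaction is committed before the next one starts, a caller never holds the lock from the first while waiting on the second, which is the property the separate transactions are meant to provide.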

That said, good point about how to handle failures of each transaction/function call in this resource. The problem is, I don't think we have any good rollback options.

The update to files_complete fails: we are in big trouble. There is no way to recover from this, and the transform will never quit.

The update to transform_result fails: I don't think this will have much impact. As Illija points out, I'm not sure who uses this table, since we just scan the bucket for files to return in the client.

The status update fails: this is also a big deal, but I'm not sure what we could do about it.

I suppose I should put some try/except blocks in this code to manage this but, as I say, I don't know what to do if we catch an exception (i.e. a function still fails after three tries).
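
One possible shape for those try/except blocks, offered only as a sketch and not as something decided in this thread: run each operation in its own try/except, log a failure at a severity matching the impact described above, and return the names of the failed steps so the caller can decide what to do. The run_steps helper, the step names, and the severity mapping are all assumptions.

```python
import logging

logger = logging.getLogger("file_complete_sketch")  # hypothetical logger name


def run_steps(steps):
    """Run (name, log_level, callable) steps independently.

    Each callable is assumed to have already exhausted its own retries, so an
    exception here means the step has definitively failed.
    Returns the names of the steps that failed.
    """
    failed = []
    for name, level, step in steps:
        try:
            step()
        except Exception:
            logger.log(level, "%s failed after retries", name, exc_info=True)
            failed.append(name)
    return failed


def example_usage():
    def update_counters():
        pass  # placeholder for the locked counter update

    def save_transform_result():
        raise RuntimeError("simulated DB error")

    def update_status():
        pass  # placeholder for the completion check

    return run_steps([
        ("files_complete update", logging.CRITICAL, update_counters),
        ("transform_result update", logging.WARNING, save_transform_result),
        ("status update", logging.ERROR, update_status),
    ])
```

This does not solve the recovery problem; it only makes sure a failure in the low-impact step cannot hide or prevent the high-impact ones, and that each failure is visible in the logs.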

BenGalewsky marked this pull request as draft November 25, 2024 21:22

Base automatically changed from delete_fixes to develop November 26, 2024 04:33

BenGalewsky force-pushed the file_complete_transaction branch from 70a6702 to c2fb610 Compare December 4, 2024 13:50

BenGalewsky requested a review from ponyisi December 4, 2024 19:09

BenGalewsky marked this pull request as ready for review December 4, 2024 19:09

ponyisi reviewed Dec 5, 2024


BenGalewsky added 2 commits December 10, 2024 16:18

Clean up transaction management for file_complete handler

1b6e369

Remove unused methods on TransformRequest

4a4ca9a

BenGalewsky force-pushed the file_complete_transaction branch from c2fb610 to 4a4ca9a Compare December 10, 2024 22:18

BenGalewsky requested a review from ponyisi December 16, 2024 16:38
