Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Execute item failure callback on bulk request error #627

Conversation

kellen-miller
Copy link
Contributor

@kellen-miller kellen-miller commented Oct 2, 2024

Description

Execute all bulk item failure callbacks (if they exist) in worker's buffer when a bulk request fails.

Issues Resolved

Fixes #626

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Copy link

codecov bot commented Oct 2, 2024

Codecov Report

Attention: Patch coverage is 80.00000% with 2 lines in your changes missing coverage. Please review.

Project coverage is 68.06%. Comparing base (06a6dc8) to head (3cee55e).
Report is 71 commits behind head on main.

Files with missing lines Patch % Lines
opensearchutil/bulk_indexer.go 80.00% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main     #627       +/-   ##
===========================================
+ Coverage   57.29%   68.06%   +10.76%     
===========================================
  Files         315      376       +61     
  Lines        9823     8867      -956     
===========================================
+ Hits         5628     6035      +407     
+ Misses       2902     1552     -1350     
+ Partials     1293     1280       -13     
Flag Coverage Δ
integration 60.32% <0.00%> (+9.48%) ⬆️
unit 15.48% <80.00%> (+2.63%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
opensearchutil/bulk_indexer.go 77.73% <80.00%> (+3.92%) ⬆️

... and 273 files with indirect coverage changes

@kellen-miller kellen-miller force-pushed the bugfix/execute-item-fail-callbacks-bulk-error branch 2 times, most recently from 8918410 to 77450d3 Compare October 2, 2024 16:41
@kellen-miller
Copy link
Contributor Author

kellen-miller commented Oct 2, 2024

I ordered the execution of the item OnFailure callbacks before the execution of the bulk indexer's OnError callback, but am open to feedback on whether that order should be reversed.

My thought process was if the OnError callback is executed first, a user may have some logic to shutdown/panic their application, which would result in the item OnFailure callbacks to not be executed inadvertently.

@kellen-miller kellen-miller force-pushed the bugfix/execute-item-fail-callbacks-bulk-error branch from 77450d3 to f416719 Compare October 2, 2024 16:51
Copy link
Member

@dblock dblock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

Let's take this opportunity and add a few more tests to cover the edge cases (some may exist)?

  • fails only some items and makes sure that failedItems is correct
  • checks that OnError is called every time, and only once, if there's at least one failure

Copy link
Member

@dblock dblock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually looking at the code I see https://github.com/opensearch-project/opensearch-go/pull/627/files#diff-84320628b2df1ee872c513a0e5a15df089c6116a5c2b6e8d4d577102e0bcb6c6R526 that calls onFailure below if the response is > 201.

Something seems fishy, it's now either called twice, or it was correct before and there's another reason why it wasn't called.

@kellen-miller
Copy link
Contributor Author

kellen-miller commented Oct 2, 2024

Correct that OnFailure below if the response is > 201 is called when the bulk request is successful and an individual item in the request fails, say OpenSearch had an indexing failure with the item.

But if the whole bulk request errors, for example if OpenSearch goes down while the request is in flight, or the context deadlines, the flush function returns without executing the item failure callbacks and the workers buffer is cleared. So those items are lost and their callbacks are never executed.

@dblock
Copy link
Member

dblock commented Oct 2, 2024

Correct that OnFailure below if the response is > 201 is called when the bulk request is successful and an individual item in the request fails, say OpenSearch had an indexing failure with the item.

But if the whole bulk request errors, for example if OpenSearch goes down while the request is in flight, or the context deadlines, the flush function returns without executing the item failure callbacks and the workers buffer is cleared. So those items are lost and their callbacks are never executed.

Understood. And just to double-check, we don't call those callbacks twice in the successful response but some failed items case? Add a test for this?

@kellen-miller
Copy link
Contributor Author

kellen-miller commented Oct 2, 2024

Understood. And just to double-check, we don't call those callbacks twice in the successful response but some failed items case? Add a test for this?

Yep the failure callback will only be executed once. If the response is successful (the error here is nil), the code in this PR is never hit.

My unit test I added has a check to make sure that the number of times the failure callback is executed is what is expected i.e. nothing is called twice.
https://github.com/kellen-miller/opensearch-go/blob/bugfix/execute-item-fail-callbacks-bulk-error/opensearchutil/bulk_indexer_internal_test.go#L833

@dblock
Copy link
Member

dblock commented Oct 2, 2024

Understood. And just to double-check, we don't call those callbacks twice in the successful response but some failed items case? Add a test for this?

Yep the failure callback will only be executed once. If the response is successful (the error here is nil), the code in this PR is never hit.

My unit test I added has a check to make sure that the number of times the failure callback is executed is what is expected i.e. nothing is called twice. https://github.com/kellen-miller/opensearch-go/blob/bugfix/execute-item-fail-callbacks-bulk-error/opensearchutil/bulk_indexer_internal_test.go#L833

I think there's still a test missing for partial failure on some of the items, you want to check that if you have 5 items and 2 failures the callback is only called twice, for example?

@kellen-miller
Copy link
Contributor Author

kellen-miller commented Oct 2, 2024

I think there's still a test missing for partial failure on some of the items, you want to check that if you have 5 items and 2 failures the callback is only called twice, for example?

I think this unit test earlier in the file already covers that?
https://github.com/kellen-miller/opensearch-go/blob/bugfix/execute-item-fail-callbacks-bulk-error/opensearchutil/bulk_indexer_internal_test.go#L277

Specifically this check here in the unit test:
https://github.com/kellen-miller/opensearch-go/blob/bugfix/execute-item-fail-callbacks-bulk-error/opensearchutil/bulk_indexer_internal_test.go#L391-L398

I might be misunderstanding though, I'm happy to add more if that's not the use case you're referring too.

@dblock
Copy link
Member

dblock commented Oct 2, 2024

I might be misunderstanding though, I'm happy to add more if that's not the use case you're referring too.

First, thank you for hanging in here with me ;)

The old code did not call item.OnFailure(ctx, item, info, err) in this scenario at all, but did increment the number of fails (and then that was tested). So those tests with only some item callbacks also should check how many times item.OnFailure(ctx, item, info, err) has been invoked and that it was invoked with all the right items.

Without trying, some examples where I think all tests pass but the code is broken: call item.OnFailure(ctx, **the first item every time**, info, err). Or, with first 2/5 items failing call onFailure 5 times.

@kellen-miller
Copy link
Contributor Author

The old code did not call item.OnFailure(ctx, item, info, err) in this scenario at all, but did increment the number of fails (and then that was tested). So those tests with only some item callbacks also should check how many times item.OnFailure(ctx, item, info, err) has been invoked and that it was invoked with all the right items.

Without trying, some examples where I think all tests pass but the code is broken: call item.OnFailure(ctx, the first item every time, info, err). Or, with first 2/5 items failing call onFailure 5 times.

Ah gotcha gotcha, sorry for some much back and forth 😄. Let me update the other tests to check for this.

@dblock
Copy link
Member

dblock commented Oct 2, 2024

Ah gotcha gotcha, sorry for some much back and forth 😄. Let me update the other tests to check for this.

No stress! I enjoy this type of conversations because I am super OCD about tests :)

@kellen-miller kellen-miller force-pushed the bugfix/execute-item-fail-callbacks-bulk-error branch 2 times, most recently from 915f706 to b9a352c Compare October 3, 2024 18:03
@kellen-miller kellen-miller force-pushed the bugfix/execute-item-fail-callbacks-bulk-error branch from b9a352c to 3cee55e Compare October 3, 2024 18:14
@kellen-miller
Copy link
Contributor Author

kellen-miller commented Oct 3, 2024

No stress! I enjoy this type of conversations because I am super apache/lucene#711 (comment) :)

As you should be!

I've update the test to maintain a map of item key -> count called of what items have been passed into the OnFailure callback. Validate the count is always 1 for all keys and all expected keys have been called.

Let me know if there's any other test updates needed and if I addressed the test case you were referring to.

Copy link
Member

@dblock dblock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👏

@dblock dblock merged commit 9267b75 into opensearch-project:main Oct 3, 2024
56 checks passed
@kellen-miller kellen-miller changed the title Execute item failure callback execution on bulk request error Execute item failure callback on bulk request error Oct 3, 2024
@kellen-miller
Copy link
Contributor Author

kellen-miller commented Oct 4, 2024

@dblock sorry to bother again! Is there any roadmap or release cadence to this library? Wondering when the process to cut a release with my fix will happen.

I can always target @main in the mean time, but I like to have tagged version.

@dblock
Copy link
Member

dblock commented Oct 4, 2024

Open a "release v. Next" issue and we'll take care of it quickly.

@kellen-miller kellen-miller mentioned this pull request Oct 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Item OnFailure Callbacks Not Executed When Bulk Request Fails
2 participants