
Operation dispatcher err handling #125

Merged: 8 commits from operation-dispatcher-err-handling into main, Nov 6, 2024

Conversation

@didierofrivia (Member) commented Oct 29, 2024:

Fixes #123

Also fixes the following e2e tests:

This PR introduces a refactor that, besides fixing issue #123, makes the code more consistent regarding the flow of the Operation state controlled by the OperationHandler. It aims to handle all raised errors and report them back to the Filter (or any other caller).
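
For orientation, a minimal sketch of the error type the dispatcher reports back. The failure_mode and status fields appear in the diffs below; the FailureMode enum shape and derive/visibility details are assumptions:

use proxy_wasm::types::Status;

// Mirrors the `failureMode` field of the service config.
pub enum FailureMode {
    Deny,
    Allow,
}

// The error the OperationDispatcher surfaces to the Filter (or any caller);
// `failure_mode` tells the caller whether to deny (500) or let the request through.
pub struct OperationError {
    pub status: Status,            // e.g. Status::NotFound for a missing cluster
    pub failure_mode: FailureMode,
}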

@didierofrivia didierofrivia changed the base branch from main to e2e-missing-cluster October 29, 2024 11:17
@eguzki (Contributor) left a comment:

A few comments dropped below.

@@ -114,15 +127,15 @@ impl Context for Filter {

Contributor:

I think we should check status_code. When the gRPC endpoint is unreachable, this is flagged in the status_code.

Contributor:

Does it make sense to check the status_code from the on_grpc_call_response input param?

Contributor:

If not, the error will come from trying to parse the gRPC response, which may be empty because no GrpcMessageResponse is being serialized.

@didierofrivia (Member Author) commented Oct 31, 2024:

It might be useful; however, I'm a bit confused about what the valid values of status_code are. Since there are no docs and its signature is just u32, the examples are confusing: would it be a GrpcStatusCode or a Status? It's also unclear in this particular example how they treat it. What do you reckon? What meaningful check can we do to avoid parsing and return early?

Contributor:

From the tests I have run, on a successful gRPC request/response exchange, status_code is 0, whereas when the service is unreachable, status_code is 14.

That's all I know for now.

Let's merge this and you can enhance it in a follow-up PR.
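
For reference, a partial sketch of the early status_code check discussed in this thread, assuming the proxy-wasm Rust SDK's on_grpc_call_response hook on the Filter context; the logging and early return are illustrative, not what was merged:

impl Context for Filter {
    fn on_grpc_call_response(&mut self, token_id: u32, status_code: u32, resp_size: usize) {
        // status_code carries the gRPC status: 0 is OK, 14 is UNAVAILABLE
        // (service unreachable), matching the values observed above.
        if status_code != 0 {
            log::warn!("gRPC call {} failed with status {}", token_id, status_code);
            // No GrpcMessageResponse was serialized, so skip parsing the
            // (likely empty) response body and return early.
            return;
        }
        // ... parse the `resp_size` bytes of response body only on success ...
    }
}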

Resolved review threads:
src/operation_dispatcher.rs (resolved)
src/filter/http_context.rs (outdated, resolved)
src/filter/http_context.rs (outdated, resolved)
src/operation_dispatcher.rs (outdated, resolved)
@eguzki (Contributor) commented Oct 29, 2024:

This is still not working for me. I ran the following configuration:

{
  "services": {
    "limitadorA": {
      "type": "ratelimit",
      "endpoint": "limitador",
      "failureMode": "deny"
    },
    "limitadorB": {
      "type": "ratelimit",
      "endpoint": "unknown",
      "failureMode": "deny"
    }
  },
  "actionSets": [
    {
      "name": "basic",
      "routeRuleConditions": {
        "hostnames": [
          "*.example.com"
        ]
      },
      "actions": [
        {
          "service": "limitadorA",
          "scope": "basic",
          "data": [
            {
              "expression": {
                "key": "a",
                "value": "1"
              }
            }
          ]
        },
        {
          "service": "limitadorB",
          "scope": "basic",
          "data": [
            {
              "expression": {
                "key": "a",
                "value": "1"
              }
            }
          ]
        }
      ]
    }
  ]
}

There are two actions, both with failureMode set to deny, and the second one targets a missing cluster. On running a request, the request makes its way to the upstream and returns 200, regardless of the failureMode.

@didierofrivia didierofrivia force-pushed the operation-dispatcher-err-handling branch from f07915d to 541b3f3 Compare October 29, 2024 13:49
@didierofrivia didierofrivia changed the base branch from e2e-missing-cluster to main October 29, 2024 13:50
@didierofrivia (Member Author) commented Oct 29, 2024:

@eguzki the PR is still a draft, so many things won't work as you expect :) I'll definitely ping you and add you as a reviewer once I think it's done ;)

@didierofrivia didierofrivia force-pushed the operation-dispatcher-err-handling branch from e991d6a to 88fb257 Compare October 29, 2024 15:34
@didierofrivia didierofrivia marked this pull request as ready for review October 29, 2024 15:44
@didierofrivia didierofrivia requested a review from eguzki October 29, 2024 15:44
@didierofrivia (Member Author):

Now @eguzki! This is your time to shine!!! (and thanks in advance for the reviews 🙏🏼)

@didierofrivia (Member Author):

(I've included the commits from your test, just for easy testing... I could remove them from this PR if you want)

Comment on lines 70 to 81:

 },
 Ok(None) => {
     Action::Continue // No operations left to perform
 }
-} else {
-    Action::Continue
+Err(OperationError {
+    failure_mode: FailureMode::Deny,
+    ..
+}) => {
+    self.send_http_response(500, vec![], Some(b"Internal Server Error.\n"));
+    Action::Continue
+}
 _ => Action::Continue,
@didierofrivia (Member Author):

We could omit:

Ok(None) => {
    Action::Continue // No operations left to perform
}

I left it there just so we're explicit, in case we want to perform something else later.

Member:

So, for all other Err, we ... Action::Continue?

@alexsnaps (Member) commented Nov 4, 2024:

I guess I'm saying that I'd leave the Ok(None) => Action::Continue as an explicit branch, but would have three Err branches (roughly as sketched below):

  • FailureMode::Deny, as you do (log the actual error there? as this is where it all ends, right?)
  • FailureMode::Allow, explicit, with the actual error logged again
  • Err(_), which possibly panic!s or is marked unreachable!, as that is, I think, the assumption this builds upon
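
A rough sketch of those three branches; the OperationError pattern comes from the diff above, while the dispatcher call, the receiver names, and the Ok(Some(_)) arm are assumptions for illustration:

match self.operation_dispatcher.borrow_mut().next() {
    Ok(Some(_)) => Action::Pause,  // an operation was dispatched; wait for its response
    Ok(None) => Action::Continue,  // no operations left to perform
    Err(OperationError { failure_mode: FailureMode::Deny, status }) => {
        log::error!("operation failed, denying: {:?}", status);
        self.send_http_response(500, vec![], Some(b"Internal Server Error.\n"));
        Action::Continue
    }
    Err(OperationError { failure_mode: FailureMode::Allow, status }) => {
        log::error!("operation failed, allowing: {:?}", status);
        Action::Continue
    }
    // With only two FailureMode variants this arm can never match,
    // which is exactly what the compiler flags further down the thread.
    Err(_) => unreachable!(),
}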

@didierofrivia (Member Author) commented Nov 4, 2024:

I made the handling of both failure modes explicit, but the third branch makes clippy unhappy:

warning: unreachable pattern
  --> src/filter/http_context.rs:90:13
   |
90 |             Err(_) => unreachable!() 
   |             ^^^^^^
   |
   = note: `#[warn(unreachable_patterns)]` on by default

I could add an exception to our clippy default rules, though, if needed.

I'm also logging the status. Would that be enough?
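
If the explicit Err(_) arm were kept, one option would be to scope the lint exception to the single handler instead of relaxing the project-wide rules. A sketch, assuming the match lives in the Filter's request-headers hook (unreachable_patterns is a rustc lint, so a local allow silences it for clippy too):

#[allow(unreachable_patterns)]
fn on_http_request_headers(&mut self, _num_headers: usize, _end_of_stream: bool) -> Action {
    // ... the match with the explicit Err(_) => unreachable!() arm goes here ...
    Action::Continue
}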

@didierofrivia (Member Author):

Let me know and I can include it in a follow-up PR ;)


impl fmt::Display for OperationError {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        match self.status {
@didierofrivia (Member Author):

Depending on the status, we could add specific messages.
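
For instance, a sketch of status-specific messages; Status::NotFound and Status::Empty are proxy_wasm::types::Status variants, and the wording here is illustrative:

use core::fmt;
use proxy_wasm::types::Status;

impl fmt::Display for OperationError {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        match self.status {
            Status::NotFound => write!(f, "Operation failed: endpoint/cluster not found"),
            Status::Empty => write!(f, "Operation failed: empty gRPC response"),
            _ => write!(f, "Operation failed with status {:?}", self.status),
        }
    }
}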

@didierofrivia (Member Author):

By the way @eguzki, this PR also fixes your #122 ;)

@didierofrivia didierofrivia force-pushed the operation-dispatcher-err-handling branch 3 times, most recently from 5e13324 to b9a52d6 Compare October 30, 2024 16:37
Commit message excerpt:

* Mutating the state if found
* Returning Err with status NotFound and default failureMode

Signed-off-by: dd di cesare <[email protected]>
@didierofrivia didierofrivia force-pushed the operation-dispatcher-err-handling branch from b9a52d6 to 06f654f Compare October 31, 2024 07:42
@didierofrivia didierofrivia requested a review from eguzki October 31, 2024 07:43
@eguzki (Contributor) left a comment.

@alexsnaps (Member) left a comment:

Thinking the error handling from performing an operation could be pushed into the Operations themselves, making the orchestration of the pipeline/operation chain a little easier on the reader... not asking to change this, though.

@didierofrivia didierofrivia merged commit 983261d into main Nov 6, 2024
9 checks passed
@didierofrivia didierofrivia deleted the operation-dispatcher-err-handling branch November 6, 2024 08:49
Merging this pull request closed the following issue: missing cluster error handling.