Add TTL to Operator RPC Client #259
Conversation
Shows a clear data race
// TODO: We never close `httpRpcClient`
httpRpcClient, err := NewHTTPAggregatorRpcClient(c.AggregatorServerIpPortAddress, operatorId, registryCoordinatorAddress, logger)
Up to discussion: how do we want to handle this dependency, in particular closing it?
listener RpcClientEventListener

type RpcClient interface {
	Call(serviceMethod string, args any, reply any) error
	// TODO: Do we also want a `Close` method?
I'm against the idea of adding a Close method here, but it would allow us to add a Close method to the AggregatorRpcClient such that it would close the provided RpcClient. Ideally, a component should not be responsible for closing the provided dependencies, but it could be a quick & dirty solution.
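For reference, a minimal sketch of that quick & dirty option, assuming the interface and wrapper keep the names shown in the snippet above (the actual types in the PR may differ):

```go
// Hypothetical extension of the interface discussed above: Close lets the
// wrapper release the underlying connection. Go's *rpc.Client already has a
// matching Close() error method, so it would keep satisfying this interface.
type RpcClient interface {
	Call(serviceMethod string, args any, reply any) error
	Close() error
}

// AggregatorRpcClient wraps the low-level client; only the field relevant
// to closing is shown here.
type AggregatorRpcClient struct {
	rpcClient RpcClient
}

// Close forwards to the wrapped client. This is the "component closes its
// provided dependency" shortcut the comment above warns about.
func (c *AggregatorRpcClient) Close() error {
	return c.rpcClient.Close()
}
```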
nodeConfig, _, _ := genOperatorConfig(t, ctx, "3", mainnetAnvil, rollupAnvils, rabbitMq)
operator := startOperator(t, ctx, nodeConfig)
We had to change the order of initialization in the integration test: if the operator cannot connect to the aggregator on start, it will quickly crash. Note that this has no effect during processing: if the connection to the aggregator fails, requests will be retried according to the provided strategy.
operator/rpc_client.go
// By default, retry with a delay of 2 seconds between calls,
// at most 10 times, and only if the error is recent enough (24 hours)
// TODO: Discuss the "recent enough" part
func DefaultAggregatorRpcRetry() RetryStrategy {
	return RetryAnd(
		RetryWithDelay(2*time.Second), RetryAnd(
			RetryAtMost(10),
			RetryIfRecentEnough(24*time.Hour)))
}
What would be a reasonable TTL for messages? This is a very conservative default just to start the discussion.
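The diff only shows how these combinators compose, not how they are implemented. As a minimal sketch, assuming RetryStrategy is a predicate over the attempt number and the time of the first failure (the real type in the PR may differ):

```go
import "time"

// RetryStrategy (assumed shape): given the attempt number and the time the
// message first failed, decide whether another attempt should be made.
type RetryStrategy func(attempt int, firstFailure time.Time) bool

// RetryAnd retries only if both strategies agree.
func RetryAnd(a, b RetryStrategy) RetryStrategy {
	return func(attempt int, firstFailure time.Time) bool {
		return a(attempt, firstFailure) && b(attempt, firstFailure)
	}
}

// RetryWithDelay always retries, sleeping between attempts.
func RetryWithDelay(delay time.Duration) RetryStrategy {
	return func(attempt int, firstFailure time.Time) bool {
		time.Sleep(delay)
		return true
	}
}

// RetryAtMost retries up to n attempts.
func RetryAtMost(n int) RetryStrategy {
	return func(attempt int, firstFailure time.Time) bool {
		return attempt < n
	}
}

// RetryIfRecentEnough retries only while the first failure is within the TTL.
func RetryIfRecentEnough(ttl time.Duration) RetryStrategy {
	return func(attempt int, firstFailure time.Time) bool {
		return time.Since(firstFailure) < ttl
	}
}
```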
Let me first start with the current design and the misunderstandings regarding it, and then I will share my opinion about your approach. There have been some evolutions since the first rework, but I will describe the current one. Now we come to this part:
We were aware of this problem, and in one of the revisions @Hyodar introduced message expiration here, so a message is a candidate for resending only 10 times. Now about this proposal and its important flaw, which is the reason why, in my opinion, we can't proceed with it. It pretty much comes down to this part:
Instead of one goroutine trying to resend a message once in
Overall I'm open to discussions here, but I would also like to hear @Hyodar's opinion on this radical change and would propose to wait with this PR until then.
Due to the comment above
Appreciate the feedback. This is mentioned in the PR description, but I believe it is mostly a tradeoff:
It was my understanding that #212 was asking for three retry conditions: try at most 10 times, with a 2-second timeout between each retry, while the messages are recent enough (TTL). Note that
I don't particularly dislike this - the scalability thing is concerning, but maybe not that much; we would need a benchmark for that - but I think it's indeed not necessary, and the current solution is already more robust in this sense. In #212 I already described and suggested solutions - IMO for the operator we just need to tidy up the retry mechanism a bit, maybe, and use a queue that already drops expired messages automatically, and for the aggregator just ignore old messages and fix the timing check. Those are simple changes and should be tackled before anything more complex.
Maybe I'm missing something, but we're already dropping messages after 10 retries with 2-second delays in between. Do you suggest adding an additional goroutine that periodically inspects the queue and removes messages that are too old?
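For clarity on what is being asked, a hypothetical periodic sweep over such a queue could look like the following (queuedMessage and its fields are made-up names, not the repository's types):

```go
import (
	"sync"
	"time"
)

// queuedMessage is a hypothetical stand-in for whatever the retry queue stores.
type queuedMessage struct {
	enqueuedAt time.Time
	payload    any
}

// sweepExpired periodically drops queued messages older than ttl.
func sweepExpired(mu *sync.Mutex, queue *[]queuedMessage, ttl, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		mu.Lock()
		kept := (*queue)[:0]
		for _, m := range *queue {
			if time.Since(m.enqueuedAt) < ttl {
				kept = append(kept, m)
			}
		}
		*queue = kept
		mu.Unlock()
	}
}
```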
Moving the discussion regarding the aggregator back to #212.
Current Behavior
Currently, failed messages are stored in a queue to be processed later by a background thread. The current design does not take the age of a message into consideration, meaning that old messages can linger around for a long time. See #212 for more details.
New Behavior
This PR completely changes the design of the RPC client. It introduces a new interface, RpcClient, that performs the actual RPC call; Go's native rpc.Client implements this interface.
The reasoning behind these changes is as follows: calls to the aggregator are already made in their own goroutines (go client.SendX), so we can handle retrying in each individual goroutine without requiring a background goroutine. A rough sketch of this follows below.
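Purely as an illustration of the per-goroutine retry idea (this is not the PR's actual code; the retryStrategy and logger fields, and the RetryStrategy shape from the earlier sketch, are assumptions), a send could loop over attempts right where the call is made:

```go
// Illustrative sketch: each SendX call already runs in its own goroutine
// (go client.SendX(...)), so the retry loop can live next to the call
// instead of in a shared background worker.
func (c *AggregatorRpcClient) sendWithRetry(serviceMethod string, args any) {
	firstAttempt := time.Now()
	for attempt := 0; ; attempt++ {
		var reply bool
		err := c.rpcClient.Call(serviceMethod, args, &reply)
		if err == nil {
			return
		}
		// The strategy decides whether this goroutine keeps trying; once it
		// says no, the message is dropped (this is where the TTL applies).
		if !c.retryStrategy(attempt, firstAttempt) {
			c.logger.Warn("dropping message after retries", "method", serviceMethod, "err", err)
			return
		}
	}
}
```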
Breaking Changes
If the order of messages is a requirement, then this is a breaking change, since we now definitely do not guarantee it (the old design may have tried, unsuccessfully, to guarantee it).
Since we are now adding a TTL to messages, messages that used to be sent after large delays will now be dropped altogether.
As usual with refactors of this kind, there might be other implicit details that have been unintentionally changed, so a thorough review is required.