
Batch task queue user data persistence updates #7039

Open · wants to merge 7 commits into main
Conversation

@dnr (Member) commented Dec 30, 2024

What changed?

Multiple user data updates coming in for task queues in the same namespace within a short period of time get batched into a smaller number of persistence operations.

Why?

With deployments, we sometimes have to update user data on multiple task queues at once (all in the same namespace), and on Cassandra these updates go through a lightweight transaction (LWT). This could cause a backup, since LWT throughput is fairly low.

This change batches multiple updates into one persistence operation (an LWT on Cassandra or a transaction on SQL). The batching is transparent: updates that arrive within a short period of time are automatically batched in the matching engine.
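The shape of a batched request can be sketched as follows. All type, field, and function names here are illustrative, not the PR's actual API: several queues' updates travel in a single request for the namespace instead of one persistence call per task queue.

```go
package main

import "fmt"

// Hypothetical sketch of a batched user data update: instead of one
// persistence call (one Cassandra LWT) per task queue, several queues'
// updates are carried in a single request for the namespace.
type UserDataUpdate struct {
	Version int64 // expected version, for the conditional (LWT) check
	Data    []byte
}

type BatchedUpdateRequest struct {
	NamespaceID string
	// task queue name -> update; the whole map is applied in one
	// LWT (Cassandra) or one transaction (SQL)
	Updates map[string]UserDataUpdate
}

// batchUpdates coalesces pending per-queue updates into one request.
func batchUpdates(namespaceID string, pending map[string]UserDataUpdate) BatchedUpdateRequest {
	return BatchedUpdateRequest{NamespaceID: namespaceID, Updates: pending}
}

func main() {
	req := batchUpdates("ns-1", map[string]UserDataUpdate{
		"tq-1": {Version: 3, Data: []byte("a")},
		"tq-2": {Version: 7, Data: []byte("b")},
	})
	// two logical updates, but only one persistence operation
	fmt.Println(len(req.Updates))
}
```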

How did you test it?

  • unit test for the batcher component
  • existing tests for user data updates

Potential risks

  • small extra latency on all user data updates
  • retries/conflicts are not handled well yet: on a version conflict, all updates in the batch will fail and none will be retried; ideally, the non-conflicting ones would be retried

@dnr dnr requested a review from a team as a code owner December 30, 2024 20:52
@ShahabT (Collaborator) left a comment


Generally lgtm, but someone else should review line-by-line.

if err != nil {
	if m.Db.IsDupEntryError(err) {
		return &persistence.ConditionFailedError{Msg: err.Error()}
	}
	return err
}
Contributor

With this type of error handling you end up with a "partially updated" state, but you don't really know which ones succeeded (unless I'm missing something).
Is there a reason to stop on the first error? Or might it make sense to keep going and return an array of failed updates?
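The alternative this comment suggests might look like the sketch below: process every update, collect the failures, and report them instead of aborting on the first error. All names here are hypothetical, not the PR's code.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical per-queue update.
type update struct {
	taskQueue string
	data      []byte
}

// serialize stands in for the real serialization step that can fail.
func serialize(u update) error {
	if len(u.data) == 0 {
		return errors.New("empty payload")
	}
	return nil
}

// applyAll processes every update and returns the names of the ones that
// failed, rather than stopping at the first failure.
func applyAll(updates []update) (failed []string) {
	for _, u := range updates {
		if err := serialize(u); err != nil {
			failed = append(failed, u.taskQueue)
			continue
		}
		// ... persist u ...
	}
	return failed
}

func main() {
	failed := applyAll([]update{
		{taskQueue: "ok", data: []byte("x")},
		{taskQueue: "bad"},
	})
	fmt.Println(failed) // [bad]
}
```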

if err != nil {
return gocql.ConvertError("UpdateTaskQueueUserData", err)
}
defer iter.Close()
Contributor

A bit confused about this.
In the previous iteration, the code closed iter and checked the error.
You changed it to a defer and removed the error check,
so even if iter's error is not nil (which is what iter.Close() returns), you still execute the if !applied { section.
That's different from the previous behavior.
Intended?

for taskQueue, update := range request.Updates {
userData, err := m.serializer.TaskQueueUserDataToBlob(update.UserData.Data, enumspb.ENCODING_TYPE_PROTO3)
if err != nil {
return err
Contributor

Same here: are you sure you want to break out of this loop on a single failure, rather than just dropping that specific update?

@@ -0,0 +1,176 @@
// The MIT License
//
// Copyright (c) 2024 Temporal Technologies Inc. All rights reserved.
Contributor

2025 :)


// NewBatcher creates a Batcher. `fn` is the processing function, `opts` are the timing options.
// `clock` is usually clock.NewRealTimeSource but can be a fake time source for testing.
func NewBatcher[T, R any](fn func([]T) R, opts BatcherOptions, clock clock.TimeSource) *Batcher[T, R] {
Contributor

timeSource instead of "clock" as a variable name.

type Batcher[T, R any] struct {
fn func([]T) R // batch executor function
opts BatcherOptions // timing/size options
clock clock.TimeSource // clock for testing
Contributor

timeSource instead of clock, because it is actually TimeSource.

return err
}
tlm := m.getQueueManager(dbq)
tlm.Lock()
Contributor

I may be out of my depth here, but will this work? I think you want to lock/unlock per iteration?
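The per-iteration locking pattern this comment asks about could be sketched like this, with a mutex per queue manager so one queue's lock is released before the next queue's work begins. Types and names are illustrative, not the PR's actual code.

```go
package main

import (
	"fmt"
	"sync"
)

// Hypothetical per-task-queue manager guarding its own state with a mutex.
type queueManager struct {
	mu      sync.Mutex
	version int64
}

// applyUpdate takes the manager's lock only for the duration of this one
// update, so the lock is acquired and released inside the loop, per
// iteration, rather than once around the whole batch.
func (m *queueManager) applyUpdate() {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.version++
}

func main() {
	managers := map[string]*queueManager{
		"tq-1": {},
		"tq-2": {},
	}
	for _, tlm := range managers {
		tlm.applyUpdate() // lock/unlock happens per iteration
	}
	fmt.Println(managers["tq-1"].version, managers["tq-2"].version)
}
```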

// try to add more items. stop after a gap of MaxGap, total time of MaxTotalWait, or
// MaxItems items.
maxWaitC, maxWaitT := s.clock.NewTimer(s.opts.MaxDelay)
loop:
Contributor

I'm not sure I understand all the possible implications, so I'll trust that this works and is covered by the tests.
