feat(frontend): support idle in transaction session timeout #14566

chenzl25 · 2024-01-15T07:31:12Z

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Resolves #13940 (feat: timeout of transactions) and #13885 (Potentially pinning an epoch for long time in long transaction).
Introduces the session variable idle_in_transaction_session_timeout, with a default value of 60000 ms (1 min).
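A minimal sketch of how the timeout could be evaluated. It assumes the session records the instant it last went idle inside an open transaction, and holds None while a statement is running. The function and constant names are hypothetical. Treating 0 as "disabled" follows PostgreSQL's documented semantics for this variable; RisingWave's exact handling may differ.

```rust
use std::time::{Duration, Instant};

/// Default value of idle_in_transaction_session_timeout: 60000 ms (1 min).
pub const DEFAULT_IDLE_IN_TXN_TIMEOUT_MS: u32 = 60_000;

/// Returns true if the session has been idle inside an open transaction for
/// at least `timeout_ms` milliseconds (hypothetical helper).
///
/// - `last_idle` is `None` while a statement is running or no transaction is
///   open, so a running query never counts as idle.
/// - A timeout of 0 disables the check (PostgreSQL semantics).
pub fn idle_in_txn_timed_out(last_idle: Option<Instant>, timeout_ms: u32, now: Instant) -> bool {
    if timeout_ms == 0 {
        return false;
    }
    match last_idle {
        Some(idle_since) => {
            now.saturating_duration_since(idle_since) >= Duration::from_millis(u64::from(timeout_ms))
        }
        None => false,
    }
}
```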

Checklist

I have written necessary rustdoc comments
I have added necessary unit tests and integration tests
I have added test labels as necessary. See details.
I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
All checks passed in ./risedev check (or alias, ./risedev c)
My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)

My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

fuyufjh · 2024-01-15T07:37:56Z

IMO the main motivation of #13940 is to prevent a Hummock epoch from staying pinned forever, which requires proactively tracking the timeout and killing the transaction once it expires. Can this PR achieve that?

chenzl25 · 2024-01-15T07:54:28Z

> proactively

I would like to improve it. Currently, this PR needs another statement to arrive before the session/transaction is aborted.

BugenZhao · 2024-01-15T09:08:32Z

src/utils/pgwire/src/pg_protocol.rs

@@ -550,6 +550,12 @@ where
        record_sql_in_span(&sql);
        let session = self.session.clone().unwrap();

+        if session.check_idle_in_transaction_timeout() {
+            self.process_terminate();
+            return Err(PsqlError::SimpleQueryError(


What about adding a new error variant? I believe its handling logic could be quite similar to how Panic is handled here.

risingwave/src/utils/pgwire/src/pg_protocol.rs

Lines 350 to 360 in 240416f

    PsqlError::Panic(_) => {
        self.stream
            .write_no_flush(&BeMessage::ErrorResponse(Box::new(e)))
            .ok()?;
        let _ = self.stream.flush().await;

        // Catching the panic during message processing may leave the session in an
        // inconsistent state. We forcefully close the connection (then end the
        // session) here for safety.
        return None;
    }
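To illustrate the suggestion, here is a hedged sketch of a dedicated error variant handled the same way as Panic: the error response is written first, then the connection is closed. PsqlError and its variants below are simplified stand-ins, IdleInTransactionTimeout is a hypothetical name, and the client stream is modelled as a Vec<String> so the example runs without pgwire.

```rust
/// Simplified stand-in for pgwire's error type. `IdleInTransactionTimeout`
/// is a hypothetical name for the new variant suggested in the review.
#[derive(Debug)]
pub enum PsqlError {
    SimpleQueryError(String),
    Panic(String),
    IdleInTransactionTimeout,
}

/// What the connection loop should do after an error has been reported.
#[derive(Debug, PartialEq)]
pub enum Next {
    Continue,
    Close,
}

/// Writes an ErrorResponse (modelled as a string pushed to `out`) and decides
/// whether the connection must be closed.
pub fn handle_error(e: &PsqlError, out: &mut Vec<String>) -> Next {
    out.push(format!("ErrorResponse: {e:?}"));
    match e {
        // Like a panic, an idle-in-transaction timeout ends the session:
        // report the error to the client, then close the connection.
        PsqlError::Panic(_) | PsqlError::IdleInTransactionTimeout => Next::Close,
        // Ordinary query errors keep the connection open.
        PsqlError::SimpleQueryError(_) => Next::Continue,
    }
}
```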

BugenZhao · 2024-01-15T09:09:58Z

> > proactively
>
> I would like to improve it. Currently, this PR needs another statement to trigger the abortion of that session/transaction.

I assume we can start a background task to monitor the timeout when a transaction is started.

fuyufjh · 2024-01-15T09:46:29Z

> > > proactively
> >
> > I would like to improve it. Currently, this PR needs another statement to trigger the abortion of that session/transaction.
>
> I assume we can start a background task to monitor the timeout when a transaction is started.

Yeah, that's one approach. In my mind, we can set up a timer background task for each transaction.

Either way, I am afraid this implementation can't be compatible.
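The per-transaction timer idea could look roughly like the sketch below. It uses std threads and an mpsc channel with recv_timeout as a stand-in for tokio tasks; TxnEvent, spawn_txn_timer, and the aborted flag are hypothetical names, and setting the flag stands in for aborting the transaction and unpinning its snapshot.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{self, RecvTimeoutError, Sender};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Events a (hypothetical) session sends to its transaction's timer.
pub enum TxnEvent {
    /// A statement finished; the idle countdown restarts.
    Activity,
    /// The transaction committed or rolled back; the timer stops.
    End,
}

/// Spawns one timer per transaction. If no event arrives within `timeout`,
/// the timer sets `aborted` and exits.
pub fn spawn_txn_timer(timeout: Duration, aborted: Arc<AtomicBool>) -> (Sender<TxnEvent>, JoinHandle<()>) {
    let (tx, rx) = mpsc::channel();
    let handle = thread::spawn(move || loop {
        match rx.recv_timeout(timeout) {
            // A statement finished: wait a full `timeout` again.
            Ok(TxnEvent::Activity) => continue,
            // Transaction ended, or the session dropped its sender.
            Ok(TxnEvent::End) | Err(RecvTimeoutError::Disconnected) => break,
            // Idle for the whole timeout: abort (stand-in: set the flag).
            Err(RecvTimeoutError::Timeout) => {
                aborted.store(true, Ordering::SeqCst);
                break;
            }
        }
    });
    (tx, handle)
}
```

As written, the timer keeps counting while a long statement runs; since the timeout is meant to cover idle time only, a real implementation would need to pause it between statement start and completion.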

chenzl25 · 2024-01-15T13:15:27Z

src/frontend/src/session.rs

+        // Idle transaction background monitor
+        let join_handle = tokio::spawn(async move {
+            let mut check_idle_txn_interval =
+                tokio::time::interval(core::time::Duration::from_secs(5));
+            check_idle_txn_interval.set_missed_tick_behavior(tokio::time::MissedTickBehavior::Skip);
+            check_idle_txn_interval.reset();
+            loop {
+                check_idle_txn_interval.tick().await;
+                sessions.read().values().for_each(|session| {
+                    let _ = session.check_idle_in_transaction_timeout();
+                })
+            }
+        });


Add a background monitor to unpin snapshots proactively
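For comparison, the single background monitor in the diff above boils down to one periodic scan over all sessions. This std-only sketch replaces tokio::time::interval with a sleeping thread (so there is no burst of catch-up ticks, roughly like MissedTickBehavior::Skip). Session, scan_sessions, and spawn_monitor are hypothetical simplifications, and flipping snapshot_pinned stands in for unpinning the snapshot.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex, RwLock};
use std::thread::{self, JoinHandle};
use std::time::{Duration, Instant};

/// Minimal stand-in for a frontend session.
pub struct Session {
    /// When the session last went idle in a transaction; None while busy.
    pub idle_since: Mutex<Option<Instant>>,
    pub timeout: Duration,
    pub snapshot_pinned: AtomicBool,
}

impl Session {
    /// Unpins the snapshot if the session has been idle for too long.
    /// Returns true only on the call that actually unpinned it.
    pub fn check_idle_in_transaction_timeout(&self, now: Instant) -> bool {
        let idle_since = *self.idle_since.lock().unwrap();
        match idle_since {
            Some(t) if now.saturating_duration_since(t) >= self.timeout => {
                // `swap` returns the previous value, so repeats report false.
                self.snapshot_pinned.swap(false, Ordering::SeqCst)
            }
            _ => false,
        }
    }
}

pub type Sessions = Arc<RwLock<HashMap<u32, Arc<Session>>>>;

/// One pass of the monitor: check every session under a read lock and
/// return how many snapshots were unpinned.
pub fn scan_sessions(sessions: &Sessions, now: Instant) -> usize {
    let mut unpinned = 0;
    for session in sessions.read().unwrap().values() {
        if session.check_idle_in_transaction_timeout(now) {
            unpinned += 1;
        }
    }
    unpinned
}

/// Spawns a single background thread that scans all sessions every `period`.
pub fn spawn_monitor(sessions: Sessions, period: Duration) -> JoinHandle<()> {
    thread::spawn(move || loop {
        thread::sleep(period);
        scan_sessions(&sessions, Instant::now());
    })
}
```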

A little bit brute-force 🤣

It's okay to spawn many tasks in tokio. So what about one background task per session? 🥲 Just remember to cancel it when the session is dropped, or simply use a weak reference.
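The one-task-per-session alternative, holding only a weak reference so the task never keeps a dropped session alive, might be sketched as follows. Again std threads stand in for tokio tasks; Session and spawn_session_watcher are hypothetical, and the actual idle check is reduced to a counter.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Weak};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Minimal stand-in for a session; `checks` counts watcher wake-ups.
pub struct Session {
    pub checks: AtomicUsize,
}

/// Spawns one watcher per session. It holds a `Weak` reference, so it does
/// not keep the session alive and exits once the session is dropped.
/// Returns the number of checks it performed.
pub fn spawn_session_watcher(session: &Arc<Session>, period: Duration) -> JoinHandle<usize> {
    let weak: Weak<Session> = Arc::downgrade(session);
    thread::spawn(move || {
        let mut ticks: usize = 0;
        loop {
            thread::sleep(period);
            match weak.upgrade() {
                // Session still alive: run the (elided) idle check.
                Some(s) => {
                    s.checks.fetch_add(1, Ordering::Relaxed);
                    ticks += 1;
                }
                // Session dropped: the watcher cancels itself.
                None => return ticks,
            }
        }
    })
}
```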

Creating a task per session seems much heavier 🥵. A single monitor can run the check at a longer interval and is much friendlier to highly concurrent workloads, I think.

BugenZhao · 2024-01-15T13:41:44Z

I'll review it tomorrow. :)

fuyufjh · 2024-01-16T05:53:51Z

src/frontend/src/session.rs

A little bit brute-force 🤣

fuyufjh · 2024-01-16T05:57:49Z

src/utils/pgwire/src/pg_server.rs

@@ -132,6 +138,12 @@ impl ExecContextGuard {
    }
 }

+impl Drop for ExecContext {
+    fn drop(&mut self) {
+        *self.last_idle_instant.lock() = Some(Instant::now());
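The Drop impl above is an RAII pattern: the idle clock restarts whenever the per-statement context goes away, however the statement ends. Below is a self-contained sketch using std::sync::Mutex (the diff's .lock() without .unwrap() suggests a non-poisoning mutex such as parking_lot's). Clearing the instant when the context is created is an assumption, not something the diff shows.

```rust
use std::sync::{Arc, Mutex};
use std::time::Instant;

/// Hypothetical per-statement context: while it exists, a statement runs.
pub struct ExecContext {
    /// Shared with the session; None while a statement is executing.
    last_idle_instant: Arc<Mutex<Option<Instant>>>,
}

impl ExecContext {
    pub fn new(last_idle_instant: Arc<Mutex<Option<Instant>>>) -> Self {
        // Assumption: starting a statement means the session is not idle.
        *last_idle_instant.lock().unwrap() = None;
        Self { last_idle_instant }
    }
}

impl Drop for ExecContext {
    fn drop(&mut self) {
        // The statement is over (success, error, or unwinding), so the
        // idle period starts now.
        *self.last_idle_instant.lock().unwrap() = Some(Instant::now());
    }
}
```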


As the last_idle_instant is only updated when execution completes, if a batch query runs for a long time e.g. scanning a large amount of data, will it be killed?

> As the last_idle_instant is only updated when execution completes, if a batch query runs for a long time e.g. scanning a large amount of data, will it be killed?

According to the PostgreSQL docs, this variable doesn't care how long a statement has been running; that should be handled by parameters such as statement_timeout.

From PG's documentation:

> Terminate any session that has been idle (that is, waiting for a client query) within an open transaction for longer than the specified amount of time.

IIUC, that means a running batch query should not be killed. Do we behave the same way?

Yes, we won't kill any query; we only unpin the snapshot (either via the background monitor or via session termination triggered by a later statement). The session is terminated when that later statement arrives.

> but unpin the snapshot

By "killed" I mean anything that causes the query to end. When a snapshot is unpinned, the (running) iterator might run into errors and terminate; is that the case? If so, I was saying that we should avoid it.

I've just made sure there is no running SQL in the current session when check_idle_in_transaction_timeout calls unpin_snapshot, so we don't need to worry about that.
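One way to guarantee that no running SQL is affected is to keep the "is a statement running?" flag and the pinned-snapshot state behind the same lock, so the check and the unpin happen atomically. This is a hedged sketch, not the PR's code: Session, start_sql, finish_sql, and the snapshot_pinned flag are hypothetical, and only check_idle_in_transaction_timeout borrows its name from the discussion.

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

struct State {
    running_sql: bool,
    idle_since: Option<Instant>,
    snapshot_pinned: bool,
}

/// Hypothetical session model: all state lives behind one mutex, so a
/// statement cannot start between the "no running SQL" test and the unpin.
pub struct Session {
    timeout: Duration,
    state: Mutex<State>,
}

impl Session {
    /// A session that is inside a transaction and has pinned a snapshot.
    pub fn in_transaction(timeout: Duration, now: Instant) -> Self {
        let state = State { running_sql: false, idle_since: Some(now), snapshot_pinned: true };
        Self { timeout, state: Mutex::new(state) }
    }

    pub fn start_sql(&self) {
        let mut s = self.state.lock().unwrap();
        s.running_sql = true;
        s.idle_since = None;
    }

    pub fn finish_sql(&self, now: Instant) {
        let mut s = self.state.lock().unwrap();
        s.running_sql = false;
        s.idle_since = Some(now);
    }

    /// Unpins the snapshot only if no SQL is running and the session has been
    /// idle for at least `timeout`. Returns true if it unpinned the snapshot.
    pub fn check_idle_in_transaction_timeout(&self, now: Instant) -> bool {
        let mut s = self.state.lock().unwrap();
        if s.running_sql || !s.snapshot_pinned {
            return false;
        }
        let idle_since = s.idle_since;
        match idle_since {
            Some(t) if now.saturating_duration_since(t) >= self.timeout => {
                s.snapshot_pinned = false; // stand-in for unpin_snapshot()
                true
            }
            _ => false,
        }
    }
}
```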

BugenZhao · 2024-01-17T04:01:29Z

> Currently, this PR needs another statement to trigger the abortion of that session/transaction.

Oops, this seems to be the behavior of Postgres. 😂

fuyufjh

LGTM

support idle in transaction session timeout

9f18c51

chenzl25 requested a review from a team as a code owner January 15, 2024 07:31

chenzl25 requested a review from BugenZhao January 15, 2024 07:31

github-actions bot added the type/feature label Jan 15, 2024

chenzl25 requested a review from fuyufjh January 15, 2024 07:31

TennyZhuang approved these changes Jan 15, 2024

BugenZhao reviewed Jan 15, 2024

chenzl25 added 2 commits January 15, 2024 18:42

refine

19f4b9f

add a background monitor

8a74663

chenzl25 requested a review from BugenZhao January 15, 2024 13:14

chenzl25 commented Jan 15, 2024

chenzl25 requested a review from wenym1 January 15, 2024 13:18

remove the test which could cause other test cases to fail

0d39532

fuyufjh reviewed Jan 16, 2024

chenzl25 and others added 4 commits January 17, 2024 12:13

ensure no running sql when we decide to unpin a snapshot in check_idle_in_transaction_timeout

8a28391

Merge branch 'main' into dylan/support_idle_in_transaction_session_timeout

5eabe82

fix

4f75d1c

fix when idle_in_transaction_session_timeout = 0

dcacd31

fuyufjh approved these changes Jan 17, 2024

chenzl25 added this pull request to the merge queue Jan 17, 2024

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 17, 2024

resolve conflict

3a0edd2

chenzl25 added this pull request to the merge queue Jan 17, 2024

Merged via the queue into main with commit 95cdfe9 Jan 17, 2024
26 of 27 checks passed

chenzl25 deleted the dylan/support_idle_in_transaction_session_timeout branch January 17, 2024 09:32

wenym1 mentioned this pull request Jan 17, 2024

Potentially pinning an epoch for long time in long transaction #13885

Closed

cyliu0 mentioned this pull request Jan 18, 2024

nightly-20240117 compute node OOM during sysbench select-random-limits #14634

Closed

Little-Wallace pushed a commit that referenced this pull request Jan 20, 2024

feat(frontend): support idle in transaction session timeout (#14566)

2880ca6
