Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(frontend): support idle in transaction session timeout #14566

Merged
merged 9 commits into from
Jan 17, 2024

Conversation

chenzl25
Copy link
Contributor

@chenzl25 chenzl25 commented Jan 15, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

@chenzl25 chenzl25 requested a review from a team as a code owner January 15, 2024 07:31
@chenzl25 chenzl25 requested a review from BugenZhao January 15, 2024 07:31
@chenzl25 chenzl25 requested a review from fuyufjh January 15, 2024 07:31
@fuyufjh
Copy link
Member

fuyufjh commented Jan 15, 2024

IMO the major motivation of #13940 is avoiding "pinned Hummock epoch to last forever", which requires proactively timing and killing the transaction once it timed out. Can this PR achieve that?

@chenzl25
Copy link
Contributor Author

proactively

I would like to improve it. Currently, this PR needs another statement to trigger the abortion of that session/transaction.

@@ -550,6 +550,12 @@ where
record_sql_in_span(&sql);
let session = self.session.clone().unwrap();

if session.check_idle_in_transaction_timeout() {
self.process_terminate();
return Err(PsqlError::SimpleQueryError(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about adding a new error variant? I believe the handling logic can be quite similar to handling Panic here.

PsqlError::Panic(_) => {
self.stream
.write_no_flush(&BeMessage::ErrorResponse(Box::new(e)))
.ok()?;
let _ = self.stream.flush().await;
// Catching the panic during message processing may leave the session in an
// inconsistent state. We forcefully close the connection (then end the
// session) here for safety.
return None;
}

@BugenZhao
Copy link
Member

proactively

I would like to improve it. Currently, this PR needs another statement to trigger the abortion of that session/transaction.

I assume we can start a background task to monitor the timeout when a transaction is started.

@fuyufjh
Copy link
Member

fuyufjh commented Jan 15, 2024

proactively

I would like to improve it. Currently, this PR needs another statement to trigger the abortion of that session/transaction.

I assume we can start a background task to monitor the timeout when a transaction is started.

Yeah, that's one approach. In my mind, we can set up a timer background task for each transaction.

In either way, I am afraid this implementation can't be compatible.

@chenzl25 chenzl25 requested a review from BugenZhao January 15, 2024 13:14
Comment on lines 332 to 344
// Idle transaction background monitor
let join_handle = tokio::spawn(async move {
let mut check_idle_txn_interval =
tokio::time::interval(core::time::Duration::from_secs(5));
check_idle_txn_interval.set_missed_tick_behavior(tokio::time::MissedTickBehavior::Skip);
check_idle_txn_interval.reset();
loop {
check_idle_txn_interval.tick().await;
sessions.read().values().for_each(|session| {
let _ = session.check_idle_in_transaction_timeout();
})
}
});
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a background monitor to unpin snapshots proactively

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A little bit brute-force 🤣

Copy link
Member

@BugenZhao BugenZhao Jan 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's okay to spawn many tasks in tokio. So what about one background task for one session? 🥲 Just remember to cancel it when session gets dropped, or simply use a weak reference.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Creating a task for each session seems much more heavy 🥵. This check could be triggered in a longer period and much more friendly to high concurrent workload I think.

@chenzl25 chenzl25 requested a review from wenym1 January 15, 2024 13:18
@BugenZhao
Copy link
Member

Will review it the next day. :)

Comment on lines 332 to 344
// Idle transaction background monitor
let join_handle = tokio::spawn(async move {
let mut check_idle_txn_interval =
tokio::time::interval(core::time::Duration::from_secs(5));
check_idle_txn_interval.set_missed_tick_behavior(tokio::time::MissedTickBehavior::Skip);
check_idle_txn_interval.reset();
loop {
check_idle_txn_interval.tick().await;
sessions.read().values().for_each(|session| {
let _ = session.check_idle_in_transaction_timeout();
})
}
});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A little bit brute-force 🤣

@@ -132,6 +138,12 @@ impl ExecContextGuard {
}
}

impl Drop for ExecContext {
fn drop(&mut self) {
*self.last_idle_instant.lock() = Some(Instant::now());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the last_idle_instant is only updated when execution completes, if a batch query runs for a long time e.g. scanning a large amount of data, will it be killed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the last_idle_instant is only updated when execution completes, if a batch query runs for a long time e.g. scanning a large amount of data, will it be killed?

According to the postgresql docs, this variable doesn't care about how long a statement is running, which should be handled by parameters like statement timeout

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From PG's documents:

Terminate any session that has been idle (that is, waiting for a client query) within an open transaction for longer than the specified amount of time.

IIUC, that means a running batch query should not be killed. Do we behave at the same way?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we won't kill any query, but unpin the snapshot (either by background monitor or session termination triggered by a later statement). The session would be terminated when the later statement arrives.

Copy link
Member

@fuyufjh fuyufjh Jan 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but unpin the snapshot

By "killed" I mean any means that cause the query to end. When a snapshot is unpinned, it's possible that the (running) iterator runs into errors and terminates, is it the cases? If so, I was saying that we should avoid such thing.

Copy link
Contributor Author

@chenzl25 chenzl25 Jan 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just ensure there is no running SQL in the current session during calling unpin_snapshot in check_idle_in_transaction_timeout, so we don't need to worry about that.

@BugenZhao
Copy link
Member

Currently, this PR needs another statement to trigger the abortion of that session/transaction.

Oops, this seems to be the behavior of Postgres. 😂

Copy link
Member

@fuyufjh fuyufjh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@chenzl25 chenzl25 added this pull request to the merge queue Jan 17, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 17, 2024
@chenzl25 chenzl25 added this pull request to the merge queue Jan 17, 2024
Merged via the queue into main with commit 95cdfe9 Jan 17, 2024
26 of 27 checks passed
@chenzl25 chenzl25 deleted the dylan/support_idle_in_transaction_session_timeout branch January 17, 2024 09:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feat: timeout of transactions
4 participants