
jdbc sink stuck on jvm side #18372

Closed
wenym1 opened this issue Sep 3, 2024 · 5 comments

wenym1 commented Sep 3, 2024

Describe the bug

On a PostgreSQL JDBC sink, the log store read epoch does not increase for a long time.

await-tree dump

>> Actor 63591
Actor 63591: `xxx` [4584.598s]
  Epoch 7084692759511040 [380.010ms]
    Sink F86700000002 [380.010ms]
      consume_log (sink_id 17099) [!!! 4584.598s]
        log_sinker_send_chunk (chunk 21) [!!! 4579.358s]
        log_sinker_wait_next_response [!!! 4579.358s]
      Merge F86700000001 [380.010ms]
        LocalInput (actor 63596) [380.010ms]
        LocalInput (actor 63595) [380.010ms]
        LocalInput (actor 63594) [380.010ms]
        LocalInput (actor 63593) [380.010ms]

jstack

"Thread-6964" #7058 prio=5 os_prio=0 cpu=329.34ms elapsed=5299.86s tid=0x0000fffe4786c800 nid=0x14d1 runnable  [0x0000fffbbbab8000]
   java.lang.Thread.State: RUNNABLE
	at sun.nio.ch.Net.poll(java.base@.../Native Method)
	at sun.nio.ch.NioSocketImpl.park(java.base@.../NioSocketImpl.java:186)
	at sun.nio.ch.NioSocketImpl.park(java.base@.../NioSocketImpl.java:195)
	at sun.nio.ch.NioSocketImpl.implWrite(java.base@.../NioSocketImpl.java:420)
	at sun.nio.ch.NioSocketImpl.write(java.base@.../NioSocketImpl.java:445)
	at sun.nio.ch.NioSocketImpl$2.write(java.base@.../NioSocketImpl.java:831)
	at java.net.Socket$SocketOutputStream.write(java.base@.../Socket.java:1035)
	at sun.security.ssl.SSLSocketOutputRecord.deliver(java.base@.../SSLSocketOutputRecord.java:345)
	at sun.security.ssl.SSLSocketImpl$AppOutputStream.write(java.base@.../SSLSocketImpl.java:1308)
	at java.io.BufferedOutputStream.flushBuffer(java.base@.../BufferedOutputStream.java:81)
	at java.io.BufferedOutputStream.write(java.base@.../BufferedOutputStream.java:127)
	- locked <0x00000000ba305df8> (a java.io.BufferedOutputStream)
	at java.io.FilterOutputStream.write([email protected]/FilterOutputStream.java:108)
	at org.postgresql.core.PGStream.sendInteger2(PGStream.java:375)
	at org.postgresql.core.v3.QueryExecutorImpl.sendBind(QueryExecutorImpl.java:1707)
	at org.postgresql.core.v3.QueryExecutorImpl.sendOneQuery(QueryExecutorImpl.java:1968)
	at org.postgresql.core.v3.QueryExecutorImpl.sendQuery(QueryExecutorImpl.java:1488)
	at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:546)
	- locked <0x00000000bb87e4e8> (a org.postgresql.core.v3.QueryExecutorImpl)
	at org.postgresql.jdbc.PgStatement.internalExecuteBatch(PgStatement.java:893)
	at org.postgresql.jdbc.PgStatement.executeBatch(PgStatement.java:916)
	at org.postgresql.jdbc.PgPreparedStatement.executeBatch(PgPreparedStatement.java:1684)
	at com.risingwave.connector.JDBCSink$JdbcStatements.executeStatement(JDBCSink.java:343)
	at com.risingwave.connector.JDBCSink$JdbcStatements.execute(JDBCSink.java:324)
	at com.risingwave.connector.JDBCSink.write(JDBCSink.java:153)
	at com.risingwave.connector.SinkWriterStreamObserver.onNext(SinkWriterStreamObserver.java:132)
	at com.risingwave.connector.JniSinkWriterHandler.runJniSinkWriterThread(JniSinkWriterHandler.java:40)

The JDK code that gets stuck (sun.nio.ch.NioSocketImpl):

private void park(FileDescriptor fd, int event, long nanos) throws IOException {
    Thread t = Thread.currentThread();
    if (t.isVirtual()) {
        ...
    } else {
        long millis;
        if (nanos == 0) {
            millis = -1;   // -1 means poll with no timeout, i.e. block forever
        } else {
            ...
        }
        Net.poll(fd, event, millis);
    }
}

private void park(FileDescriptor fd, int event) throws IOException {
    park(fd, event, 0);    // the write path always passes 0 (no timeout)
}

The 0 timeout is hard-coded in the two-argument overload, and nanos == 0 maps to millis = -1 (poll with no timeout), so it is unlikely that a timeout can be added via some config.
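
For background, a minimal sketch (my addition, assuming only the standard Java socket API; the host and port are hypothetical) of why this cannot be configured away: Socket#setSoTimeout bounds blocking reads only, and the Java socket API has no write-side timeout, so a send that the peer stops draining can park forever in Net.poll, exactly as in the jstack above.

import java.io.OutputStream;
import java.net.Socket;

// Sketch: setSoTimeout applies to reads only; write() has no timeout and
// can block indefinitely once the peer stops draining the TCP buffer.
public class NoWriteTimeout {
    public static void main(String[] args) throws Exception {
        try (Socket s = new Socket("db.example.com", 5432)) { // hypothetical host
            s.setSoTimeout(30_000); // bounds blocking reads, not writes
            OutputStream out = s.getOutputStream();
            byte[] chunk = new byte[8192];
            while (true) {
                out.write(chunk); // may park forever in Net.poll
            }
        }
    }
}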

Error message/log

No response

To Reproduce

No response

Expected behavior

No response

How did you deploy RisingWave?

No response

The version of RisingWave

No response

Additional context

IMAGE: v1.10.1-patch-us-west-2-35-fix-serving-mapping

@wenym1 wenym1 added the type/bug Something isn't working label Sep 3, 2024
@github-actions github-actions bot added this to the release-2.1 milestone Sep 3, 2024

wenym1 commented Sep 3, 2024

The stuck state cannot be resolved automatically.

The workaround is to first run select * from pg_stat_activity; to inspect the status of all connections. From the query result, filter the rows whose application_name is PostgreSQL JDBC Driver. Among the remaining rows, those with wait_event ClientWrite, state active, and a prepared DML statement as the query are likely the problematic connections. Take the pid of each such connection and call SELECT pg_terminate_backend(<pid>) to kill it, so that the JDBC sink gets unstuck and triggers a retry.
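
As a single query, the diagnosis above looks like this (a sketch; the filter values are exactly those described above, and the pid must be substituted manually):

-- List likely-stuck JDBC connections.
SELECT pid, application_name, wait_event, state, query
FROM pg_stat_activity
WHERE application_name = 'PostgreSQL JDBC Driver'
  AND wait_event = 'ClientWrite'
  AND state = 'active';

-- Then, for each suspect pid:
-- SELECT pg_terminate_backend(<pid>);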

Nevertheless, we still need to figure out why the connection is stuck.


wenym1 commented Sep 3, 2024

cc @StrikeW @hzxa21


hzxa21 commented Sep 3, 2024

Maybe related: pgjdbc/pgjdbc#194


StrikeW commented Sep 4, 2024

> Maybe related: pgjdbc/pgjdbc#194

I think so. The stack trace is similar to pgjdbc/pgjdbc#194 (comment). I am not sure whether a rewrite in Rust would improve stability (#16745). cc @fuyufjh to take a look at the priority.


hzxa21 commented Sep 4, 2024

> > Maybe related: pgjdbc/pgjdbc#194
>
> I think so. The stack trace is similar to pgjdbc/pgjdbc#194 (comment). I am not sure whether a rewrite in Rust would improve stability (#16745). cc @fuyufjh to take a look at the priority.

I don't see users reporting stuck-query issues in https://github.com/sfackler/rust-postgres/issues, so I am optimistic that using Rust to sink to PG can be more stable.

The issue with Java/JDBC is that it is difficult (even impossible) to time out a query: stmt.executeBatch() is not async, so we cannot implement our own timeout mechanism as a safeguard to cancel the query and retry.
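
To make the point concrete, here is a hypothetical watchdog sketch (my addition, not RisingWave code): schedule java.sql.Statement#cancel from another thread if executeBatch() overruns. Statement.cancel() is a standard JDBC API, but whether it can unwedge a thread parked inside the socket write path, as in the jstack above, is exactly what is in doubt here.

import java.sql.PreparedStatement;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical safeguard: cancel the batch from a watchdog thread if it
// overruns; it may not help when the driver thread is blocked in a raw
// socket write rather than waiting on a server response.
public class BatchWatchdog {
    private static final ScheduledExecutorService TIMER =
            Executors.newSingleThreadScheduledExecutor();

    public static int[] executeBatchWithTimeout(PreparedStatement stmt, long timeoutSec)
            throws Exception {
        ScheduledFuture<?> cancelTask = TIMER.schedule(() -> {
            try {
                stmt.cancel(); // standard JDBC cancel; effectiveness is the open question
            } catch (Exception ignored) {
                // nothing more the watchdog can do if cancel itself fails
            }
        }, timeoutSec, TimeUnit.SECONDS);
        try {
            return stmt.executeBatch(); // the blocking call from the jstack above
        } finally {
            cancelTask.cancel(false); // disarm the watchdog on normal completion
        }
    }
}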
