remote sink writer gets stuck when connector node throws exception #12521

Closed
wenym1 opened this issue Sep 25, 2023 · 4 comments · Fixed by #12525


wenym1 commented Sep 25, 2023

Describe the bug

When using the JDBC sink, the sink writer gets stuck if the connector node throws an exception while writing a chunk to the external database.

Barriers pile up and the sink executor has no throughput.

Error message/log

No response

To Reproduce

No response

Expected behavior

Manually comment out some code and keep only the following logic:

  1. The remote sink writer ignores all checkpoint barriers, and on every barrier it only sends a barrier request that does not wait for a response.
  2. Remove all code of the JDBC sink and keep only some dummy logic so that the code compiles.
  3. On the connector node, in the sink stream observer, manually throw an exception when processing the 10th write batch (see the sketch after this list).
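
A minimal sketch of the modification in step 3, assuming a hypothetical batch counter inside the connector node's sink stream observer (the class name and the SinkRequest/SinkResponse types are illustrative placeholders, not the actual RisingWave types):

import io.grpc.stub.StreamObserver;

// Illustrative sketch only: fail on the 10th write batch.
public class FailingSinkStreamObserver implements StreamObserver<SinkRequest> {
    private final StreamObserver<SinkResponse> responseObserver;
    private int writeBatchCount = 0;

    public FailingSinkStreamObserver(StreamObserver<SinkResponse> responseObserver) {
        this.responseObserver = responseObserver;
    }

    @Override
    public void onNext(SinkRequest request) {
        try {
            // hasWriteBatch() is assumed for illustration.
            if (request.hasWriteBatch() && ++writeBatchCount == 10) {
                // Simulate a failure while writing to the external database.
                throw new RuntimeException("injected failure on the 10th write batch");
            }
            // Barrier requests get no response here, matching step 1 above.
        } catch (Exception ex) {
            responseObserver.onError(ex);
        }
    }

    @Override
    public void onError(Throwable t) {
        // Nothing to clean up in this sketch.
    }

    @Override
    public void onCompleted() {
        responseObserver.onCompleted();
    }
}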

When we create the following datagen source and JDBC sink:

CREATE SOURCE s1 (seq_id bigint, user_id bigint, user_name varchar)
WITH (
    connector = 'datagen',
    fields.seq_id.kind = 'sequence',
    fields.seq_id.start = '1',
    fields.seq_id.end = '100000000',
    fields.user_id.kind = 'random',
    fields.user_id.min = '1',
    fields.user_id.max = '10000000',
    fields.user_name.kind = 'random',
    fields.user_name.length = '10',
    datagen.rows.per.second = '500000'
) FORMAT PLAIN ENCODE JSON;

CREATE SINK s1_sink FROM s1
WITH (
    connector = 'jdbc',
    type = 'append-only'
    -- remaining JDBC options omitted in the original issue
);

The sink writer gets stuck.

How did you deploy RisingWave?

No response

The version of RisingWave

No response

Additional context

No response

wenym1 added the type/bug (Something isn't working) label on Sep 25, 2023
wenym1 self-assigned this on Sep 25, 2023
github-actions bot added this to the release-1.3 milestone on Sep 25, 2023

wenym1 commented Sep 25, 2023

Our current logic of the remote sink writer:

On the client side we have a bounded request channel. The receiver side is passed to the gRPC client when making a new request, and the remote sink writer holds a sender for sending requests. For WriteBatch and non-checkpoint Barrier requests, we only send the request without waiting for a response. For checkpoint Barrier requests, we wait for the commit response from the server.
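
The sink writer itself is implemented in Rust, but the flow above can be sketched conceptually in Java, using a bounded BlockingQueue to stand in for the request channel (all names and capacities here are illustrative, not the actual implementation):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Conceptual sketch of the request/response flow described above.
class RemoteSinkWriterSketch {
    // Bounded request channel; the receiver end is handed to the gRPC client.
    private final BlockingQueue<Object> requestTx = new ArrayBlockingQueue<>(16);
    // Response stream coming back from the connector node.
    private final BlockingQueue<Object> responseRx = new ArrayBlockingQueue<>(16);

    void writeBatch(Object batch) throws InterruptedException {
        // Fire-and-forget: this only blocks once the bounded channel is full.
        requestTx.put(batch);
    }

    void barrier(boolean isCheckpoint) throws InterruptedException {
        requestTx.put("barrier request");
        if (isCheckpoint) {
            // Only checkpoint barriers wait for the commit response, so this is
            // the only point where a server-side error can surface to the writer.
            Object commitResponse = responseRx.take();
        }
    }
}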

On the server side, the stream observer we implement looks like the following:

void onNext() {
    try {
        // handle the request; may throw some exception
    } catch (Exception ex) {
        // report the error on the response stream without rethrowing
        responseObserver.onError(ex);
    }
}


wenym1 commented Sep 25, 2023

I just ran a test on the behavior of the Java gRPC server.

In our current implementation, any exception thrown in onNext is caught by the try-catch. The exception is not rethrown, but is passed to responseObserver.onError.

Since responseObserver.onError is called, the stream handler stops handling any new requests. On the client side, even though responseObserver.onError has been called, we only poll for a new response when we see a checkpoint barrier, so if no checkpoint barrier comes we never see the error in the response stream. Meanwhile, it seems that although the Java stream handler has stopped working, the gRPC stream still exists, so the request receiver is not dropped; we can keep sending new requests and eventually get stuck when the bounded channel fills up.


wenym1 commented Sep 25, 2023

Some new test results.

If we change the logic to rethrow the caught exception:

void onNext() {
    try {
        // may throw some exception
    } catch (Exception ex) {
        // report the error on the response stream
        responseObserver.onError(ex);
        // and rethrow it to the gRPC runtime
        throw ex;
    }
}

The behavior is the same. The connector node's log contains an exception from the gRPC runtime:

SEVERE: Exception while executing runnable io.grpc.internal.ServerImpl

Neither onError nor onCompleted of the request observer will be called.


wenym1 commented Sep 25, 2023

In conclusion, whether responseObserver.onError is called or an exception is thrown out of onNext, the request handler is treated as finished, and neither onError nor onCompleted will be called on the request observer afterwards. Therefore we should also do the cleanup when any exception is thrown.

Either way, on the client side the request receiver is not dropped, so we can still send new requests through the sender without any error, and we will get stuck once the bounded channel fills up. The only way for the client to learn about an error in the stream is to poll the response stream.
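
On the server side, one possible shape of that cleanup is sketched below, assuming a hypothetical cleanup() helper that releases the sink writer's resources (this is only an illustration, not necessarily what the fix in #12525 does):

void onNext() {
    try {
        // may throw some exception
    } catch (Exception ex) {
        // After this point the handler is finished and gRPC will not deliver
        // onError/onCompleted to the request observer, so release resources here.
        cleanup();  // hypothetical helper
        responseObserver.onError(ex);
    }
}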
