
Query very large dataset by grpc server streaming #3429

Closed
MichaelScofield opened this issue Mar 5, 2024 · 4 comments

Comments

@MichaelScofield
Collaborator

What problem does the new feature solve?

Though unusual (and not recommended), users sometimes just want to query a very large dataset. However, such "big" queries now often result in timeouts. The problem may be that a huge volume of data has to be gathered inside GreptimeDB before anything is returned, or simply the large amount of network/disk IO involved.

Similar issue: #2223

What does the feature do?

Make querying very large datasets doable by using gRPC server streaming. Inside GreptimeDB, query results are already streams, so we could create a gRPC interface that polls (adapts) that stream.
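
For illustration only (the types and function names here are hypothetical, not our actual API), such an adapter could look roughly like this in a tonic server-streaming handler:

use futures::{Stream, StreamExt};
use tokio_stream::wrappers::ReceiverStream;
use tonic::{Response, Status};

// Hypothetical stand-in for an encoded chunk of query results.
struct QueryResponse;

// Adapt an internal result stream into a gRPC server-streaming response,
// so batches are sent to the client as soon as they are produced instead
// of being buffered until the whole result set is ready.
fn stream_query_results(
    mut batches: impl Stream<Item = Result<QueryResponse, Status>> + Unpin + Send + 'static,
) -> Response<ReceiverStream<Result<QueryResponse, Status>>> {
    let (tx, rx) = tokio::sync::mpsc::channel(16);
    tokio::spawn(async move {
        while let Some(batch) = batches.next().await {
            if tx.send(batch).await.is_err() {
                break; // client disconnected, stop polling the query
            }
        }
    });
    Response::new(ReceiverStream::new(rx))
}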

Implementation challenges

No response

@YCCDSZXH
Contributor

Need some help.

The original *.proto may need to be modified:

service GreptimeDatabase {
  rpc Handle(GreptimeRequest) returns (GreptimeResponse);
  rpc HandleRequests(stream GreptimeRequest) returns (GreptimeResponse);
}
...
message GreptimeResponse {
  ResponseHeader header = 1;
  oneof response { AffectedRows affected_rows = 2; }
}

To support streaming, we might modify it like this:

service GreptimeDatabase {
  rpc Handle(GreptimeRequest) returns (stream GreptimeResponse);
  rpc HandleRequests(stream GreptimeRequest) returns (stream GreptimeResponse);
}
...
message GreptimeResponse {
  ResponseHeader header = 1;
  oneof response {
    AffectedRows affected_rows = 2;
    ARowInQueryResults row = 3;
  }
}

Insert and delete would still work, and query results would be returned as a stream.
But if we do so, the query response header will be sent multiple times.

@evenyag
Contributor

evenyag commented Apr 25, 2024

@MichaelScofield Could you provide more information, such as the code that needs to be modified for this issue?

I remember that we are using the Arrow Flight service to handle requests from clients, which is already streaming:

let flight_data_stream = response.into_inner();
let mut decoder = FlightDecoder::default();
let mut flight_message_stream = flight_data_stream.map(move |flight_data| {
    flight_data
        .map_err(Error::from)
        .and_then(|data| decoder.try_decode(data).context(ConvertFlightDataSnafu))
});
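
Continuing that snippet (just a sketch; process_message is a hypothetical downstream handler), the decoded stream can then be drained batch by batch:

use futures::StreamExt;

// Consume the decoded messages incrementally; nothing requires the whole
// result set to be materialized in memory at once.
while let Some(message) = flight_message_stream.next().await {
    let message = message?;   // propagate transport/decode errors
    process_message(message); // hypothetical downstream handler
}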

@waynexia
Member

The query is not defined in the *.proto file you mentioned above, @YCCDSZXH. We are using Arrow Flight to define the query service; more specifically, the DoGet method:

https://github.com/apache/arrow-rs/blob/11450ae8ddf902b57cb42491a3d824d9550a05ea/format/Flight.proto#L108

It returns a stream, as proposed.
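
Roughly speaking (a sketch only; the actual service wiring in GreptimeDB differs), a DoGet handler just hands back a boxed stream of FlightData messages, so the client receives query results incrementally:

use arrow_flight::FlightData;
use futures::stream::{self, BoxStream, StreamExt};
use tonic::{Response, Status};

// Build a DoGet-style response: a boxed stream of FlightData messages that
// the gRPC layer sends to the client one by one.
fn make_do_get_response(
    flight_messages: Vec<FlightData>,
) -> Response<BoxStream<'static, Result<FlightData, Status>>> {
    let stream = stream::iter(flight_messages.into_iter().map(Ok)).boxed();
    Response::new(stream)
}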

@MichaelScofield
Collaborator Author

@YCCDSZXH Sorry, this issue is a little outdated: we decided not to let GreptimeDatabase handle query requests. The query requests go through the Arrow Flight service, which is already streamed. I'm going to close this issue; you are welcome to take these two related issues instead: #3799 and #3798
