Implement support for Google Spanner #271
base: master
Conversation
LGTM!
Resolved review threads:
- client-base/src/main/kotlin/app/cash/backfila/client/RealBackfillModule.kt (outdated)
- ...-misk-spanner/src/main/kotlin/app/cash/backfila/client/misk/spanner/SpannerBackfillModule.kt (outdated)
- client-misk-spanner/src/main/kotlin/app/cash/backfila/client/misk/spanner/SpannerBackfill.kt (outdated)
- ...isk-spanner/src/main/kotlin/app/cash/backfila/client/misk/spanner/internal/SpannerBackend.kt (outdated)
- ...isk-spanner/src/main/kotlin/app/cash/backfila/client/misk/spanner/internal/SpannerBackend.kt
- ...er/src/main/kotlin/app/cash/backfila/client/misk/spanner/internal/SpannerBackfillOperator.kt
- ...nt-misk-spanner/src/test/kotlin/app/cash/backfila/client/misk/spanner/SpannerBackfillTest.kt
```kotlin
val partitions = listOf(
  PrepareBackfillResponse.Partition.Builder()
    .backfill_range(request.range)
```
Are you requiring the range to be passed in? In other implementations we compute the ranges if you don't pass it in.
The range is actually completely ignored. Spanner is unlike many other DBs: for optimal performance, primary keys really can't be anything like a monotonically increasing range. I don't know how to compute a range without doing a full table scan, which seems... suboptimal.
you can't ask for min/max primary key value?
Primary keys are often random values like UUIDs and unordered for optimal performance. Min/max aren't valid concepts, as far as I can tell. Source: https://cloud.google.com/spanner/docs/schema-design#primary-key-prevent-hotspots
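To make that concrete, here is a minimal sketch (not code from this PR; the table and columns are hypothetical) of the kind of insert the linked doc recommends, written in Kotlin against the google-cloud-spanner Java client:

```kotlin
import com.google.cloud.spanner.Mutation
import java.util.UUID

// Hypothetical "users" table keyed by a random UUID. Because ids are
// generated unordered, there is no meaningful min/max key to seed a
// scan range from.
val insert: Mutation = Mutation.newInsertBuilder("users")
  .set("id").to(UUID.randomUUID().toString())
  .set("name").to("alice")
  .build()
```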
Backfila requires ordered key values to operate. I'm curious how you would use it if that's not the case. I haven't used Spanner, but my understanding was that it's ordered; you just want to avoid sequential writes.
And to answer the original question - we don’t require a range to be passed in. That’s optional.
Yeah, I'm well aware of how primary key design works in Spanner, and you can have items added within the range. That's true even with auto-increment, technically. It doesn't matter, since the expectation is that you are inserting new items that don't need backfilling.
It sounds like you can just ask Spanner for records and it will return them in some order; that should be fine, I guess.
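A minimal sketch of that ordered-scan behavior (illustrative names, not this PR's code): Spanner's read API returns rows in primary-key order, so even random UUID keys come back in a stable, sorted sequence.

```kotlin
import com.google.cloud.spanner.DatabaseClient
import com.google.cloud.spanner.KeySet
import com.google.cloud.spanner.Options

// Read the first batchSize keys of a hypothetical "users" table. Rows come
// back sorted by primary key, even when the key values themselves (e.g.
// UUIDs) were generated in no particular order.
fun firstBatch(db: DatabaseClient, batchSize: Long) {
  db.singleUse()
    .read("users", KeySet.all(), listOf("id"), Options.limit(batchSize))
    .use { rows ->
      while (rows.next()) println(rows.getString("id"))
    }
}
```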
Wouldn't this work like dynamo backfills? Dynamo is somewhat different, but has a scan mechanism we use, and I believe we don't do ranges on it either? You could check that.
Does this mean that you will essentially run your backfill single threaded?
So there must be some distributed way to process the whole data set in bulk? In Dynamo it is this idea of segments.
Force-pushed 87ad6fd to aaaebc0, then aaaebc0 to 1a6ad72.
LGTM
Resolved review thread (outdated): ...er/src/main/kotlin/app/cash/backfila/client/misk/spanner/internal/SpannerBackfillOperator.kt
```kotlin
override fun getNextBatchRange(request: GetNextBatchRangeRequest): GetNextBatchRangeResponse {
  // Establish a range to scan - either we want to start at the first key,
  // or start from (and exclude) the last key that was scanned.
  val range = if (request.previous_end_key == null) {
```
I guess we're not using the backfill_range at all; that's what would be passed in by the user (or I missed it somewhere).
Yes. If I'm not mistaken, the DynamoDB backend also ignores it.
DynamoDb is pretty limited because of Dynamo itself; the Hibernate one is pretty good to copy from. Obviously, build whatever features you want, I won't be using it :P
You need some guarantees around the end key otherwise you may be missing items, no? This was tricky with DynamoDb as well. We figured out some optimizations but since they weren't really documented we didn't add those to the client. In Dynamo we split up by segment but then don't complete the "batch" until the range is completed. Maybe Google has some better guarantees?
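One way to get that guarantee, sketched under assumptions (a single STRING primary key; names are illustrative, not this PR's actual code): resume strictly after the previous end key, and rely on Spanner's key-ordered results so nothing between batches is skipped or re-processed.

```kotlin
import com.google.cloud.spanner.DatabaseClient
import com.google.cloud.spanner.Statement

// Fetch the next batch of keys, starting strictly after previousEndKey.
// Assumes a hypothetical "users" table with a STRING primary key "id".
fun nextBatchKeys(db: DatabaseClient, previousEndKey: String?, batchSize: Long): List<String> {
  val statement = Statement
    .newBuilder("SELECT id FROM users WHERE id > @after ORDER BY id LIMIT @n")
    .bind("after").to(previousEndKey ?: "") // empty string sorts before all non-empty keys
    .bind("n").to(batchSize)
    .build()
  val keys = mutableListOf<String>()
  db.singleUse().executeQuery(statement).use { rows ->
    while (rows.next()) keys += rows.getString("id")
  }
  return keys
}
```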
Co-authored-by: Mike Gershunovsky <[email protected]>
Since this isn't urgent, I'll just comment.
I think this is a good start, but we'd want to make sure as much of Backfila as possible works as expected. Having a single partition is okay, but could be somewhat challenging to scale. Upper and lower bounds for ranges are a reasonable tradeoff.
Overall looking very good. Let's avoid misk except in test.
I wonder if you can use this to be more parallel?
https://cloud.google.com/spanner/docs/reference/rpc/google.spanner.v1#google.spanner.v1.Spanner.PartitionRead
Can you share a session among different machines? Your backfill might die if the session dies though.
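A rough sketch of what that could look like with the Java client's BatchClient (hypothetical table and columns; not this PR's code). Partitions could be fanned out to other workers by sharing the transaction's BatchTransactionId:

```kotlin
import com.google.cloud.spanner.BatchClient
import com.google.cloud.spanner.KeySet
import com.google.cloud.spanner.PartitionOptions
import com.google.cloud.spanner.TimestampBound

// Split a full-table read into partitions that can be executed in parallel.
fun partitionedScan(batch: BatchClient) {
  batch.batchReadOnlyTransaction(TimestampBound.strong()).use { txn ->
    val partitions = txn.partitionRead(
      PartitionOptions.getDefaultInstance(),
      "users",
      KeySet.all(),
      listOf("id")
    )
    // Each partition is independent; another machine holding the same
    // BatchTransactionId (txn.batchTransactionId) could execute its share.
    for (partition in partitions) {
      txn.execute(partition).use { rows ->
        while (rows.next()) { /* process rows.getString("id") */ }
      }
    }
  }
}
```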
```kotlin
// We do not want to leak client-base implementation details to customers.
implementation(project(":client-base"))

implementation(Dependencies.misk)
```
Can we limit our use of misk at least in non-test? Do we really need it?
Looking through your code, I think these only need to be testImplementation dependencies. Let's move those dependencies to test, rename the module, and add a comment so they don't leak into the main implementation.
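A sketch of the suggested move (assuming the repo's existing Dependencies constants and Gradle Kotlin DSL; illustrative, not the final build file):

```kotlin
// client-misk-spanner/build.gradle.kts (hypothetical excerpt)
dependencies {
  // We do not want to leak client-base implementation details to customers.
  implementation(project(":client-base"))

  // misk stays out of the main source set so it can't leak to customers.
  testImplementation(Dependencies.misk)
}
```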
```kotlin
val partitions = listOf(
  PrepareBackfillResponse.Partition.Builder()
    .backfill_range(request.range)
    .partition_name("partition")
```
I'd prefer something like `single` or `only`. This is exposed to the customer.
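The rename, sketched against the quoted snippet (builder chain completed for illustration):

```kotlin
val partitions = listOf(
  PrepareBackfillResponse.Partition.Builder()
    .backfill_range(request.range)
    .partition_name("only") // customer-visible; "only" reads better than "partition"
    .build()
)
```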
These changes add a new backend for backfilling Spanner databases, integrated into Misk services.
I'm still adding unit tests to show that it all works, but I figured I would put it up for some early review and to discover CI issues.