Snapshot Fetch optimisation #223
Comments
Can you expand a bit on the use case?
I'm not sure I understand what you mean? Our particular use case is that we're building a book editor at Reedsy. The editor is built on top of Quill, and authors just typing can easily hit 100,000+ ops in a single chapter alone, and we have some extreme cases hitting 1,000,000+ ops. One of the features coming up on our roadmap is the ability for authors to look back at previous versions of their book and restore them. To let them browse their books at arbitrary points in time, we need to fetch snapshots at a particular moment quickly, because they'll want to flick between chapters in one version and then compare against another version.
Again, I'm not really sure I follow...? What structure are you talking about?
I've thought about this problem from time to time, and there's an alternative solution not mentioned here, more of an algorithmic one, that would reduce the asymptotic running time from O(n) to O(log n). That would probably bring your performance problems within acceptable limits.

The idea is this: construct ops that bridge the gaps between distant versions, for every power-of-2-sized gap. The current solution for fetching a snapshot traverses all the ops from the beginning to the desired version, which takes O(n) time, where n is the number of ops. For example, to produce the snapshot at the last of 22 versions, 21 ops need to be traversed.

The data structure I propose adds "meta ops" (picture them as arcs over the op log) that contain the diffs between distant versions, where the distances are powers of 2. The algorithm traversing this data structure would find the fewest hops needed to reach the desired version. In the example above, instead of traversing all 21 ops to get a snapshot of the desired version, only 3 ops need to be traversed. My gut tells me that, in general, the fewest number of ops that need to be traversed to get a snapshot of any version will be O(log n).

Beyond that, the details are not clear. The algorithm itself may be similar to binary search or bisection. Also, if meta ops are stored for every power-of-2 gap, the storage overhead may be too high. The base could change, e.g. to powers of 4, 16, 32, or 64. That would make the meta ops much sparser while preserving the asymptotic running time.

Architecturally, this feature seems like something that could be implemented in a separate library outside of ShareDB core. It might be better to organize the code that way, so that any additional dependencies and complexity would be isolated from ShareDB core logic. This library could maintain an additional collection in the database, e.g. "metaOps", which would not interfere with ShareDB's original ops collections, and could expose its own snapshot-fetching methods.

If that approach is taken, any PRs against ShareDB would be to expose the primitives required. I believe the only primitive required is the ability to fetch the ops between two versions. One way the meta ops could be generated is by computing snapshots at two different versions and diffing them.
@curran that sounds like a great idea! It sounds a bit more complex than simply storing snapshots every n ops, but I'd be very happy to help implement this.
That sounds fine to me. I'd be slightly concerned about bypassing ShareDB middleware, but maybe it's the kind of thing where the constructor can accept an instance of ShareDB and copy the middleware across. What are your thoughts on how to move forward with this? Shall I try to put together a proposal for that library?
@curran I do have one reservation: is the complexity of applying one of these huge meta ops O(1), or still O(n) (where n is the number of underlying operations)? Because if it tends towards O(n), then we're not really saving any time. Obviously we still make huge gains on database retrieval, but applying the equivalent of 1,000,000+ ops could still be slow.
I think it would depend on the OT type and the ops being composed, with the complexity ranging from O(1) to O(n). If many ops cancel each other out, the complexity would tend towards O(1). If most ops add some new information, the complexity would tend towards O(n). The implementation of compose in the particular OT type is the deciding factor.
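For instance, here's a minimal sketch of collapsing a run of raw ops into one meta op via the type's compose() (json0 and rich-text both implement compose; the buildMetaOp helper and the stored shape are hypothetical):

// Hypothetical sketch: compose the raw ops covering versions [from, to)
// into a single equivalent "meta op", pairwise from left to right.
function buildMetaOp(type, ops, from, to) {
  const composed = ops.slice(from, to).reduce(function(acc, op) {
    return acc === null ? op : type.compose(acc, op);
  }, null);
  // The id scheme matches the metaOpId() function further down the thread.
  return { id: from + '-' + to, fromV: from, toV: to, op: composed };
}

If the ops largely cancel out (lots of deletions), the composed op stays small; if every op adds new information, it grows towards the size of the snapshot itself.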
Unfortunately, rich-text is not particularly efficient and its data types are not amenable to optimization. That's why I implemented https://github.com/Teamwork/ot-rich-text, which is based on the same principles as rich-text but is heavily optimized. I've seen performance improvements of up to 100x on processing edits on the client side (diff and apply do most of the work in that use case). The downside to that approach was that I had to bind the OT type to a contenteditable DIV (we use TinyMCE) completely from scratch. I think the result was worth the effort.
This is also the solution we've been leaning towards in Teamwork and I'd be happy to help. The most important feature for our use case would be a client API for creating a snapshot milestone at a particular document version.
Okay so I ran a really quick, not-very-scientific test on our 1M+ op document. Thinking about our particular use-case as well, I think it's highly likely we'll have a lot of deletions, so many of the ops should cancel each other out when composed.
For me, all of that, including the snapshot PR, is out of scope, at least considering the current state of ShareDB. To be efficient, you should implement it at the driver level.
To get a snapshot, you fetch only one stored document and apply at most a number of ops equal to the bucket size you decided on. You can also trigger a snapshot from your UI, so your users can tag a version without losing the full history.
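For illustration, a rough sketch of that read path, assuming a hypothetical milestone store sitting next to the regular driver (getMilestoneBefore is an assumed helper; db.getOps is the existing driver method and type.apply/type.create are standard OT type functions; del ops are ignored for brevity):

function getSnapshotAt(db, milestones, type, collection, id, version, callback) {
  // Find the newest stored snapshot at or below the requested version.
  milestones.getMilestoneBefore(collection, id, version, function(err, milestone) {
    if (err) return callback(err);
    const from = milestone ? milestone.v : 0;
    // Fetch and apply at most one bucket's worth of ops on top of it.
    db.getOps(collection, id, from, version, null, function(err, ops) {
      if (err) return callback(err);
      let data = milestone ? milestone.data : undefined;
      ops.forEach(function(op) {
        data = op.create ? type.create(op.create.data) : type.apply(data, op.op);
      });
      callback(null, { v: version, data: data });
    });
  });
}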
@dcharbonnier yes, this is basically the approach I mean when I talk about "snapshot milestones" (I hadn't gotten as far as thinking about the schema details just yet). The problem is that this doesn't necessarily plug in to ShareDB as it exists currently, because (if I'm not mistaken) there's nothing in the driver API that would support this sort of behaviour. We'd basically have to add something to the driver API ourselves.

But from what I can tell from what everyone's saying, we basically shouldn't be touching ShareDB? Which is a bit of a shame. That's what I was originally going to do (hence my original PR asking to be able to fetch metadata along with ops). I just think it would be really nice for the core library to support things like fetching snapshot versions, because it's one of the most powerful things this library could do, and it seems silly that there are a lot of people out there trying to solve this problem separately when it could be solved in one place.
This is where I think a contribution to ShareDB can be made. Perhaps the best approach is to think about what APIs need to be exposed in order to be able to implement this snapshot generation as a separate library. It may not be possible now, but exposing some internals may make it feasible. I think all we need for this really is solid support for fetching the ops between two versions.

Also, we need a way to compute the meta-ops in general for various OT types. The only way I can see right now for super-clean and minimal meta-ops is to use json0-ot-diff, but that only supports json0. It occurred to me that a meta-op that goes from the zeroth op to the end op should contain exactly the same data as the current snapshot. The thing about these meta ops is that they are not "additive" in the sense that, like @gkubisa pointed out, they tend to "cancel each other out".

I'd be enthusiastic to contribute to this feature, at least for json0 documents. Totally agree @alecgibson re: solving this in one place.
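As a toy example of that diffing approach (the document contents are made up; json0-ot-diff works on plain JSON data, so this only covers json0):

const jsondiff = require('json0-ot-diff');

// Two snapshots of the same document, many raw ops apart.
const atV0 = { title: 'Draft', wordCount: 0 };
const atV16 = { title: 'Chapter 1', wordCount: 942 };

// One json0 op that takes the document straight from v0 to v16,
// however many raw ops separate those two versions.
const metaOp = jsondiff(atV0, atV16);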
Let's try to identify the places where the current API is lacking. My guess is we'll need to add a method to the database driver API, then implement it in sharedb-mongo, then we can build the separate library on top of that, leveraging the new method as well as middleware hooks for maintaining the data structure.
Sounds good. There's also been some related discussion in Gitter.
I used
It occurred to me (after I did it), which is why I tried it with two ops instead, because "applying" one op is kind of boring. |
Thanks for the Gitter link! I didn't realize it's still active. That's great.
Can you provide a link for that, please?
Algorithm Nugget

Here's a small but critical piece of the overall puzzle: a function that computes the meta-ops required to reach (as close as possible to) a desired version:

function metaOpId(current, next) {
  return current + '-' + next;
}

// Greedily take the largest power-of-`base` hop that does not overshoot
// the target version, repeating until no hop of at least `base` fits.
function metaOpsToVersion({ version, base }) {
  const metaOpIds = [];
  let current = 0;
  while (current + base <= version) {
    let exponent = 1;
    while (current + Math.pow(base, exponent + 1) <= version) {
      exponent++;
    }
    const next = current + Math.pow(base, exponent);
    metaOpIds.push(metaOpId(current, next));
    current = next;
  }
  return {
    metaOpIds,
    reachesVersion: current
  };
}

console.log(metaOpsToVersion({
  version: 22,
  base: 2
}));

Output:

{ metaOpIds: [ '0-16', '16-20', '20-22' ], reachesVersion: 22 }
@alecgibson @gkubisa Here's what I see as a way we can move this forward:
|
I agree that the optimization, whatever it is, should be implemented as a separate project; however, I'd still keep the basic implementation in ShareDB, just like the in-memory DB and PubSub, so that ShareDB would work without any additional modules by default. IMHO the optimization should be added by hooking it up to a new middleware. So, I'd keep #220 as is, with an additional middleware which would allow hooking up optimizations. Regarding specific optimizations, I think we'll end up having several implementations, which might be suitable for different situations, e.g. meta ops, snapshot milestones, cache, etc.
Yes, I agree with @gkubisa that a very basic implementation of snapshot retrieval belongs in ShareDB core. @curran this is what I was thinking of moving forwards:
|
It all looks great to me, but for the optimization middleware I'd prefer not to add any new methods to the drivers. Ideally I'd like optimizations to be hooked up purely through middleware, along the lines of the sketch below. We'd just need to figure out which middleware actions we need to support and which params we need to pass in, to maximize flexibility and enable a wide range of optimization strategies.
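As a purely illustrative sketch of that kind of hookup (the sharedb-snapshot-milestones package and its options are invented for the example; backend.use is ShareDB's existing middleware API):

const ShareDB = require('sharedb');
const backend = new ShareDB({ db });

// The optimization module registers its own middleware internally
// and keeps its storage entirely out of the database driver.
require('sharedb-snapshot-milestones')(backend, {
  interval: 100000,
  store: milestoneStore // any store: Mongo, Redis, S3, in memory...
});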
I agree that would be nice, but how do you see this working without doing so? Let's take the example of the snapshot milestone optimisation. For this to work properly, our optimiser needs to:

- store snapshot milestones
- retrieve snapshot milestones, querying by version or timestamp
- fetch ops between two versions (already implemented as getOps)

How can the middleware do this without some method exposed on the driver that lets it do these things at the driver level?
Please just make the driver able to return a document and a list of ops of an arbitrary length to apply to this document.
@dcharbonnier do you mean you'd prefer for optimisation to live at the driver level rather than introduce any middleware? |
The key point is that the optimization middleware's internal storage is separate from the ShareDB database driver. For example, the driver could use MongoDB, and the middleware could store snapshot milestones in MongoDB, Redis, S3, in memory or anything else. We cannot foresee all the optimization strategies people might want to implement. It's certainly possible to reuse the same MongoDB connection and db between the driver and the optimization middleware; however, that should not be a requirement.

Given the above, "store snapshot milestones" and "retrieve snapshot milestones, querying by version or timestamp" could be implemented in the middleware using the MongoDB driver directly (not the ShareDB database driver). "Fetch ops between two versions (already implemented as getOps)" could be implemented by calling the driver's existing getOps.

It's also important to note that the optimization module could attach itself to more than one middleware action. For example, it could store snapshots as ops are committed and serve them when snapshots are fetched.
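The write side might then look roughly like this sketch, assuming an 'afterSubmit'-style middleware action and a hypothetical milestoneStore (neither the action's request shape nor the store is settled API):

backend.use('afterSubmit', function(request, next) {
  const snapshot = request.snapshot; // assumption: the committed snapshot is on the request
  if (snapshot && snapshot.v % 100000 === 0) {
    // Persist outside ShareDB's own collections: another Mongo
    // collection, Redis, S3... whatever suits the deployment.
    return milestoneStore.save(request.collection, request.id, snapshot, next);
  }
  next();
});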
I'd prefer not to do that, because it'd be an extra method in the driver's interface supporting one specific optimization strategy, which might not be optimal for all use cases. In my opinion the middleware approach is much more flexible, and could internally load a snapshot plus ops to apply, so that you'd end up with the same performance profile as if the action were supported directly on the driver.
You're making something easy complicated by anticipating potential future needs. You're also proposing a smart strategy where a dumb one is just as efficient; for me that's a typical case of premature optimisation. Your middlewares will either store all the possible snapshots, or all the ops, or work only from the second call onwards, like a dumb cache. If you end up doing what the backend does, I don't see the point. If you want optional support in the backend, that's also possible.
Aha @gkubisa I didn't realise you'd wanted to side-step the driver entirely. That makes far more sense. I'm happy with that approach. So I guess our example of the snapshot milestones would look something like this?
|
Yes, that's essentially what I had in mind.
More like "bypasses"
Great discussion! I still think that all we need is a primitive for fetching the ops between two versions. Also, we can avoid the middleware issue entirely by lazily initializing the data structure for optimization. Meaning, when a request comes in for a snapshot at a particular version, the algorithm can first run a function that ensures the required meta ops exist, building any missing ones on the fly.
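Something along these lines, where ensureMetaOps and getSnapshotViaMetaOps are hypothetical functions that build any missing meta ops on first use and then traverse them:

// The first request for a huge document pays the O(n) build cost once;
// every later request only traverses O(log n) meta ops.
function getSnapshotLazy(collection, id, version, callback) {
  ensureMetaOps(collection, id, version, function(err) {
    if (err) return callback(err);
    getSnapshotViaMetaOps(collection, id, version, callback);
  });
}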
At this point I'm optimizing the API, which I think is pretty important to get right the first time in order to avoid breaking changes in the future, or having to support legacy code indefinitely. I believe that exposing appropriate middleware hooks provides a great deal of flexibility and avoids committing to a specific solution which might not be optimal for all use cases, or even optimal in general.
I'm not sure yet what optimization strategy we're going to use at Teamwork.com. A kind of "intelligent" pre-caching (where cache entries are generated opportunistically based on what's going on in the app) is certainly a possibility, given that we already use this technique to speed up some of our APIs. Details aside, there are many different ways to optimize snapshot retrieval for specific situations, so I want the API to be flexible and avoid committing to a single solution, which might not be the best.
@curran to clarify, do you mean we should have this live in a separate library?
Although this is nice from an architectural/implementation point of view, it isn't from a consumer point of view. If today I install an optimiser on my 1,000,000+ op documents, then the first time I fetch, I'll still have appalling performance, and will continue to do so until the optimiser manages to process some queue of jobs to lazily initialise. Although arguably, I guess you run into the same problem with what I proposed on submitting an op (for example, if somebody fetches a snapshot before updating a document, or updates and fetches in quick succession, etc.). Ideally I'd like some sort of ability to migrate, but perhaps that could be achieved by just having this "lazy" approach and then running a script that actively fetches all the ShareDB documents I have (we do something very similar for migrating data structures stored in JSON0).
I believe the best place for this is a separate library, for the reasons discussed above. I totally get your concerns re: performance and the desire for a smooth migration. It's true that the lazy approach will result in appalling performance for the first invocation on 1,000,000+ op documents. Although, like you mention, you could run a script that requests all documents in order to initialize the data structure before exposing the feature to users.
In my opinion...
First solution merged in this PR: #236
Introduction
I recently opened a pull request to add the ability to fetch snapshots of a document at a given version. This was first discussed in this other issue.
In that issue, the possibility of some optimisations was discussed briefly, but deemed out-of-scope for an initial offering. Now that it seems like that feature is close to being merged, I would like to start discussing how to optimise it further.
Motivation
These optimisations are far from just an academic exercise. We're currently running ShareDB on a production server, and we have some documents hitting 1,000,000+ operations.
Running the new getSnapshot method on a document like this takes on the order of minutes to return on a development machine, which is not acceptable. This sort of operation would ideally take < 10s, independent of the number of operations.

Deliverable
What I would like to establish with this issue ticket is a sensible optimisation to start work on.
Possible approaches
There are a number of possible approaches, some of which were discussed in the original issue. Those are recapped here.
Reversible types
The current approach for getSnapshot only works forwards, starting with the very first op. Depending on use case, consumers may well be more interested in later versions of the document. In that case, starting from the current snapshot and working backwards would be much faster.

Pros
Cons

The rich-text OT type is not reversible. In cases like this, the optimisation would have no effect.

Caching ops
A very large amount of time can be spent fetching ops. Anecdotally, it's responsible for over half the time spent retrieving a snapshot (using sharedb-mongo).

Subsequent version queries could be sped up if the ops are cached. Any request for a version lower than the latest version in the cache can be fed directly from the cache, as sketched below.
However, potentially this sort of optimisation belongs at the driver level?
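A minimal sketch of such a cache sitting in front of the driver's existing getOps (the unbounded Map is a deliberate simplification; a real cache would need eviction and invalidation):

const opCache = new Map(); // key: "collection/id" -> ops indexed by version

function getOpsCached(db, collection, id, from, to, callback) {
  const key = collection + '/' + id;
  const cached = opCache.get(key) || [];
  if (to !== null && to <= cached.length) {
    // Served entirely from the cache: no database round trip at all.
    return callback(null, cached.slice(from, to));
  }
  // Only fetch the ops we don't already have.
  db.getOps(collection, id, cached.length, to, null, function(err, ops) {
    if (err) return callback(err);
    const all = cached.concat(ops);
    opCache.set(key, all);
    callback(null, all.slice(from, to === null ? all.length : to));
  });
}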
Pros
Cons
Current version
Whenever we try to fetch the current snapshot through getSnapshot, it currently works its way through all the ops. We could instead trivially return the current snapshot. This is easy in the case of providing a null version, but would require a lookup of the current snapshot first in all other cases.

Pros
Cons
Build from latest creation
When a document is deleted and then re-created, if we want a version after re-creation, we only need to fetch and apply the ops after that new creation.
However, as far as I'm aware, there's nothing defined in the driver API that lets us fetch the latest creation operation, which means we would still need to fetch all the ops, and then perform an initial pass through them to determine whether such an optimisation can be made; see the sketch below.
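That initial pass could be as simple as this sketch (a single backwards scan over ops already fetched into memory):

// Return only the ops from the latest create op onwards; everything
// before a re-creation is irrelevant to the final snapshot.
function opsSinceLatestCreate(ops) {
  for (let i = ops.length - 1; i >= 0; i--) {
    if (ops[i].create) return ops.slice(i);
  }
  return ops;
}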
Pros
Cons
Snapshot milestones
In a similar vein of thinking to building from the latest creation op, if fetching and applying ops is expensive, one of the best things you can do is fetch and apply fewer of them.
We could achieve this by periodically saving snapshots of documents (every 100,000 ops, say). Then we could fetch the latest applicable snapshot milestone and reconstruct the rest of the snapshot on top of that.
Pros
Cons
Conclusion
I think that the reversible types optimisation is probably the lowest-hanging fruit, but for purely selfish reasons this is the one I'm least likely to implement, because we use rich-text and our json0 ops fall into the trap I mentioned.

I also quite like the op caching idea, but the fact that first load is unaffected is quite a bummer.
I personally really like the idea of saving snapshot milestones. This is without doubt the optimisation that would be the most work, but I think the ability to put a cap on the retrieval time of any given document is absolutely invaluable. If people are willing to support me on this change, I'd be more than keen to do it, because honestly I think we'll have to fork and implement this ourselves if it doesn't happen here.
Obviously, if there are any other optimisations I've missed, please pile in with comments.