
Snapshot Fetch optimisation #223

Closed
alecgibson opened this issue Jul 12, 2018 · 30 comments
@alecgibson (Collaborator)

Introduction

I recently opened a pull request to add the ability to fetch snapshots of a document at a given version. This was first discussed in this other issue.

In that issue, the possibility of some optimisations was discussed briefly, but deemed out-of-scope for an initial offering. Now that it seems like that feature is close to being merged, I would like to start discussing how to optimise it further.

Motivation

These optimisations are far from just an academic exercise. We're currently running ShareDb on a production server, and we have some documents hitting 1,000,000+ operations.

Running the new getSnapshot method on a document like this takes on the order of minutes on a development machine, which is not acceptable. Ideally, this sort of operation would take < 10s, independent of the number of operations.

Deliverable

What I would like to establish with this issue ticket is a sensible optimisation to start work on.

Possible approaches

There are a number of possible approaches, some of which were discussed in the original issue. Those are recapped here.

Reversible types

The current approach for getSnapshot only works forwards, starting with the very first op and moving forwards. Depending on use case, it may often be likely that consumers are more interested in later versions of the document. In this case, starting from the current snapshot and working backwards would be much faster.
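As a sketch of the backward direction, here is a toy invertible "counter" type standing in for a real OT type; the only assumption is that the type exposes apply() and invert():

```javascript
// Sketch: rebuild an old version by inverting ops backwards from the
// current snapshot. The "counter" type is illustrative only; a real
// implementation would use an OT type with apply() and invert().
const counterType = {
  apply: (snapshot, op) => snapshot + op.add,
  invert: (op) => ({ add: -op.add })
};

// ops[i] transforms version i into version i + 1.
function snapshotAtVersion(type, currentSnapshot, ops, targetVersion) {
  let snapshot = currentSnapshot;
  // Walk backwards from the latest version, undoing one op at a time.
  for (let v = ops.length; v > targetVersion; v--) {
    snapshot = type.apply(snapshot, type.invert(ops[v - 1]));
  }
  return snapshot;
}

const ops = [{ add: 1 }, { add: 2 }, { add: 3 }, { add: 4 }];
const current = 10; // 0 + 1 + 2 + 3 + 4
console.log(snapshotAtVersion(counterType, current, ops, 2)); // → 3
```

Fetching version 22 of a 24-version document this way touches 2 ops instead of 22.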

Pros

  • Much faster to fetch recent versions of a document
  • Roughly halves the worst-case time

Cons

  • Only works if the OT type's ops are invertible
  • Ops that discard information (eg deletions) can't be inverted without extra book-keeping

Caching ops

A very large amount of time can be spent fetching ops. Anecdotally, it's responsible for over half the time spent retrieving a snapshot (using sharedb-mongo).

Subsequent version queries could be sped up if the ops are cached. Any requests for a version lower than the latest version in the cache can be fed directly from the cache.

However, this sort of optimisation arguably belongs at the driver level.

Pros

  • Would be a huge speed boost over repeated fetches from the database driver

Cons

  • ‘There are only two hard things in Computer Science: cache invalidation and naming things.’ — Phil Karlton
  • First fetch is still unaffected
  • Could lead to vastly increased memory usage
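For a single document, ops are append-only, so the cache mostly needs extending rather than invalidating. A minimal in-memory sketch (the names are illustrative, not a real driver API):

```javascript
// Sketch of an op cache in front of a driver's getOps. Hypothetical
// names; fetchOps stands in for the database round trip.
class OpCache {
  constructor(fetchOps) {
    this.fetchOps = fetchOps; // (from, to) => ops; hits the database
    this.ops = [];            // ops[v] transforms version v into v + 1
  }

  // Returns ops in [from, to), serving from the cache where possible.
  getOps(from, to) {
    if (to > this.ops.length) {
      // Cache miss: fetch only the tail we don't have yet and remember it.
      const fetched = this.fetchOps(this.ops.length, to);
      this.ops = this.ops.concat(fetched);
    }
    return this.ops.slice(from, to);
  }
}

// Usage: a second request for an already-seen range never hits the database.
let dbCalls = 0;
const cache = new OpCache((from, to) => {
  dbCalls++;
  return Array.from({ length: to - from }, (_, i) => ({ v: from + i }));
});
cache.getOps(0, 5);
cache.getOps(2, 4); // served entirely from cache
console.log(dbCalls); // → 1
```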

Current version

Whenever we try to fetch the current snapshot through getSnapshot, it currently works its way through all the ops. We could instead trivially return the current snapshot. This is easy in the case of providing a null version, but would require a lookup of the current snapshot first in all other cases.

Pros

  • Much faster for fetching current version

Cons

  • Would require an extra snapshot lookup before every fetch, which is added overhead for an (unlikely?) scenario

Build from latest creation

When a document is deleted and then re-created, if we want a version after re-creation, we only need to fetch and apply the ops after that new creation.

However, as far as I'm aware, there's currently nothing defined in the driver API that lets us fetch the latest creation operation, which means currently we would still need to fetch all the ops, and then perform an initial pass through the ops to determine if such an optimisation could be made.
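The "second pass" described above could look something like this hedged sketch, where op shapes follow ShareDB's create/op/del convention:

```javascript
// Sketch: given all the ops, find the last create op before the requested
// version, and keep only the ops needed from that point on. This is the
// extra pass the current driver API would force on us.
function opsFromLatestCreation(ops, version) {
  let start = 0;
  for (let i = 0; i < version; i++) {
    if (ops[i].create) start = i; // a create op resets the document
  }
  return ops.slice(start, version);
}

const ops = [
  { create: { data: {} } }, // v0 -> v1
  { op: ['a'] },            // v1 -> v2
  { del: true },            // v2 -> v3
  { create: { data: {} } }, // v3 -> v4: re-creation
  { op: ['b'] }             // v4 -> v5
];
console.log(opsFromLatestCreation(ops, 5).length); // → 2 (only ops from v3 on)
```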

Pros

  • Faster retrieval in some cases

Cons

  • Possibly requires a change to driver API
  • Otherwise still fetches all ops, and requires a second pass
  • (Re-creation may be relatively uncommon?)

Snapshot milestones

In a similar vein of thinking to building from the latest creation op, if fetching and applying ops is expensive, one of the best things you can do is fetch and apply fewer of them.

We could achieve this by periodically saving snapshots of documents (every 100,000 ops, say). Then we could fetch the latest applicable snapshot milestone and reconstruct the rest of the snapshot on top of that.
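The bookkeeping could be sketched like this, with an in-memory Map standing in for whatever persistent store a real implementation would use via the driver:

```javascript
// Sketch of milestone bookkeeping. The interval and store are assumptions.
const MILESTONE_INTERVAL = 3; // e.g. 100,000 in production

const milestones = new Map(); // version -> snapshot

function maybeSaveMilestone(snapshot, version) {
  if (version % MILESTONE_INTERVAL === 0) milestones.set(version, snapshot);
}

// Latest milestone at or below the requested version; the caller then only
// needs to fetch and apply the ops after it.
function nearestMilestone(version) {
  let best = { version: 0, snapshot: null };
  for (const [v, snapshot] of milestones) {
    if (v <= version && v > best.version) best = { version: v, snapshot };
  }
  return best;
}

for (let v = 1; v <= 10; v++) maybeSaveMilestone(`snapshot@${v}`, v);
console.log(nearestMilestone(8)); // → { version: 6, snapshot: 'snapshot@6' }
```

With a real interval of 100,000, any retrieval applies at most 99,999 ops, however long the document's history.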

Pros

  • Limits the time of any retrieval
  • Could be made configurable?

Cons

  • Large infrastructure change
  • Requires a change to the driver API
  • Hard to add retroactively to existing documents

Conclusion

I think that the reversible types optimisation is probably the lowest-hanging fruit, but for purely selfish reasons it's the one I'm least likely to implement, because we use rich-text and our json0 ops fall into the trap I mentioned.

I also quite like the op caching idea, but the fact that first load is unaffected is quite a bummer.

I personally really like the idea of saving snapshot milestones. This is without doubt the optimisation that would be the most work, but I think that the ability to put a cap on the retrieval of any given document is absolutely invaluable. If people would be willing to support me on this change, I'd be more than keen to do this, because honestly I think we'll have to fork and implement this if I don't do it here.

Obviously, if there are any other optimisations I've missed, please pile in with comments.

@dcharbonnier (Contributor)

Can you expand a bit on the use case?
From what I can see, the MongoDB structure is already a problem, so maybe we're trying to find a solution at the wrong level.

@alecgibson (Collaborator, Author)

Can you expand a bit on the use case?

I'm not sure I understand what you mean? Our particular use case is that we're building a book editor at Reedsy. The Editor is built on top of Quill, and rich-text ops are piped to ShareDb on top of sharedb-mongo.

Authors just typing can easily hit 100,000+ ops in a single chapter alone, and we have some extreme cases hitting 1,000,000+ ops.

One of the features coming up on our roadmap is the ability for authors to look back at previous versions of their book and restore them. In order to let them browse their books at arbitrary points in time, we need to be able to quickly fetch snapshots at a particular moment in time, because they'll want to quickly browse between chapters in a certain version, and then compare against another version.

From what I can see, the MongoDB structure is already a problem, so maybe we're trying to find a solution at the wrong level.

Again, I'm not really sure I follow...? What structure are you talking about?

@curran (Contributor) commented Jul 13, 2018

I've thought about this problem from time to time, and there's an alternative solution not mentioned here, more of an algorithmic solution, that would reduce the asymptotic running time from O(n) to O(log n). This would probably bring your performance problems within acceptable limits.

The idea is this: construct ops that bridge gaps between distant versions, for every power of 2 size gap.

Visually, here's the current data structure:

[image: the current data structure — a linear chain of ops between consecutive versions]

The current solution for fetching a snapshot traverses all the ops from the beginning to the desired version, which takes O(n) time where n is the number of ops. In this visual example, to produce the snapshot at the last version, there are 22 versions, so 21 ops need to be traversed.

Visually, here's the data structure I propose to introduce. The "meta ops" (shown here as arcs) would contain the diffs between distant versions, where distances are powers of 2:

[image: meta-op arcs bridging power-of-2 gaps between versions]

The algorithm to traverse this data structure would find the fewest hops needed to traverse. In the case shown visually here, instead of traversing all 21 ops to get a snapshot of the desired version, only 3 ops need to be traversed.

[image: traversal reaching the desired version in only 3 hops]

My gut tells me that generally, the fewest number of ops that need to be traversed in order to get a snapshot of any version will be O(log n). Beyond that the details are not clear. The algorithm itself may be similar to Binary Search or Bisection.

Also, storing ops for every power-of-2 gap may be too much storage overhead. The base could be changed, e.g. to powers of 4, 16, 32, or 64. This would make the "meta ops" much sparser, but would preserve the asymptotic running time.

Architecturally, this feature seems like something that could be implemented in a separate library outside of ShareDB core. It might be better to organize the code that way, so any additional dependencies and complexity that arises from this would be isolated from ShareDB core logic. This library could maintain an additional collection in the database, e.g. "metaOps" or something like that, which would not interfere with ShareDB's original ops collections. This library could expose methods like initializeIfNeeded which would populate the "meta ops" data structure, and also middleware that would maintain the data structure incrementally as new ops flowed in.

If that approach is taken, any PRs against ShareDB would be to expose the primitives required. I believe the only primitive required to implement this is getOps(from, to), which is actually optimized at the database level to use both from AND to in the query (which unfortunately the MongoDB driver does not do).

One way the "meta ops" could be generated is by computing snapshots at two different versions (from and to), then using this library json0-ot-diff to generate the diffs.
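To illustrate the idea without pulling in the library, here is a toy stand-in for json0-ot-diff that handles flat objects only (the real library handles nested JSON and text):

```javascript
// Toy stand-in for json0-ot-diff: diff two flat objects into json0-style
// components (od = object delete, oi = object insert). Illustration only;
// real code would use json0-ot-diff.
function flatJsonDiff(from, to) {
  const op = [];
  for (const key of Object.keys(from)) {
    if (!(key in to)) op.push({ p: [key], od: from[key] });
    else if (to[key] !== from[key]) op.push({ p: [key], od: from[key], oi: to[key] });
  }
  for (const key of Object.keys(to)) {
    if (!(key in from)) op.push({ p: [key], oi: to[key] });
  }
  return op;
}

// A meta-op bridging two distant versions is just the diff of their snapshots.
const v0 = { title: 'Draft', count: 1 };
const v16 = { title: 'Final', tags: 'done' };
console.log(flatJsonDiff(v0, v16));
// → [ { p: ['title'], od: 'Draft', oi: 'Final' },
//     { p: ['count'], od: 1 },
//     { p: ['tags'], oi: 'done' } ]
```

Note that a meta-op built this way is minimal by construction: any ops that cancelled each other out between the two versions leave no trace in the diff.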

@alecgibson (Collaborator, Author)

@curran that sounds like a great idea! It sounds a bit more complex than simply storing snapshots every x ops, but it would definitely give faster retrieval. You also need to store increasing numbers of meta ops for every power of 2, but I don't think it'll ever get too bad, since it's a log problem (I think fewer than 300 meta ops would be stored against an op whose version number equals the number of atoms in the observable universe...).

I'd be very happy to help implement this.

this feature seems like something that could be implemented in a separate library outside of ShareDB core

That sounds fine to me. I'd be slightly concerned about bypassing ShareDb middleware, but maybe it's the kind of thing where the constructor can accept an instance of ShareDb, and copy the middleware across.

What are your thoughts on how to move forward with this? I can try to put together a proposal for that library?

@alecgibson (Collaborator, Author)

@curran I do have one reservation: is the complexity of applying one of these huge meta ops O(1) or still O(n) (where n is the number of underlying operations)? Because if it tends towards O(n), then we're not really saving any time.

Obviously we still make huge gains on database retrieval, but applying 1,000,000+ rich-text ops even without fetching still takes on the order of minutes.

@gkubisa (Contributor) commented Jul 13, 2018

is the complexity of applying one of these huge meta ops O(1) or still O(n) (where n is the number of underlying operations)?

I think it would depend on the OT type and the ops being composed, with the complexity ranging from O(1) to O(n). If many ops cancel each other out, the complexity would tend towards O(1). If most ops add some new information, the complexity would tend towards O(n). The implementation of compose could also be a factor, depending on how good it is at producing a minimal op, which still behaves like all the input ops when applied.

applying 1,000,000+ rich-text ops even without fetching still takes on the order of minutes

Unfortunately, rich-text is not particularly efficient and its data types are not amenable to optimization. That's why I implemented https://github.com/Teamwork/ot-rich-text, which is based on the same principles as rich-text but is heavily optimized. I've seen performance improvements of up to 100x on processing edits on the client side (diff and apply do most of the work in that use case). The downside to that approach was that I had to bind the OT type to a contenteditable DIV (we use TinyMCE) completely from scratch. I think the result was worth the effort.

I personally really like the idea of saving snapshot milestones. This is without doubt the optimisation that would be the most work, but I think that the ability to put a cap on the retrieval of any given document is absolutely invaluable.

This is also the solution we've been leaning towards in Teamwork and I'd be happy to help. The most important feature for our use case would be a client API for creating a snapshot milestone at a particular document version.

@alecgibson (Collaborator, Author)

Okay so I ran a really quick, not-very-scientific test on our 1M+ json0 document. Ignoring fetch time, it takes ~80s to apply ops one-by-one. If we combine all ops into two huge meta-ops, then it's applied trivially quickly. So perhaps this is a solution?

Thinking about our particular use-case as well, I think it's highly likely we'll have a lot of deletions (our json0 is metadata, which will probably roughly have as many deletions as insertions; and for an author to hit 1M ops on a single chapter either means they've written a chapter longer than most books, or it's been heavily edited).

@dcharbonnier (Contributor) commented Jul 13, 2018

For me, all of this, including the snapshot PR, is out of scope, at least considering the current state of ShareDB. To be efficient, you should implement it at the driver level.
The correct MongoDB structure for this is buckets, which would make the complexity constant for any snapshot:

{
  v: Number,  // First version contained in this bucket
  s: {},      // Snapshot of the document at v
  ds: Date,   // Date start: first date in this bucket
  de: Date,   // Date end: last date in this bucket
  ops: []     // Fixed-length array of operations
}

To get a snapshot, you fetch only one document and apply at most as many operations as the bucket size you decided on.
You increase storage, but storage is cheap; you use less memory, and memory is the real problem with MongoDB.
You expose this through an API that has nothing to do with ShareDB and you solve your problem. You can even reuse the existing websocket and run an API over it so you don't lose any performance (check https://github.com/dcharbonnier/hydrated-ws, which I use for this purpose with ShareDB). You get down to a few ms in every case; you just need to develop a custom driver, which is easy and doesn't depend on ShareDB design decisions, which we know take years.
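The bucket retrieval can be sketched like this, with an in-memory array standing in for the MongoDB collection (field names follow the schema above; the bucket size and toy counter document are assumptions):

```javascript
// Sketch of bucket-based snapshot retrieval. One lookup, then at most
// BUCKET_SIZE ops to apply, whatever the total history length.
const BUCKET_SIZE = 4;

// buckets[i] covers versions [i * BUCKET_SIZE, (i + 1) * BUCKET_SIZE)
function getSnapshot(buckets, applyOp, version) {
  const bucket = buckets[Math.floor(version / BUCKET_SIZE)];
  let snapshot = bucket.s; // snapshot at the bucket's first version
  for (let v = bucket.v; v < version; v++) {
    snapshot = applyOp(snapshot, bucket.ops[v - bucket.v]);
  }
  return snapshot;
}

// Toy counter document: each op adds 1 to the snapshot.
const buckets = [
  { v: 0, s: 0, ops: [1, 1, 1, 1] },
  { v: 4, s: 4, ops: [1, 1, 1, 1] }
];
const apply = (snapshot, op) => snapshot + op;
console.log(getSnapshot(buckets, apply, 6)); // → 6
```

In MongoDB the lookup would be a single indexed query on the document id and bucket start version.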

You can also trigger a snapshot from your UI, so your users can tag a version without losing the full history.

@alecgibson (Collaborator, Author)

@dcharbonnier yes this is basically the approach I mean when I talk about "snapshot milestones" (I hadn't gotten as far as thinking about the schema details just yet).

The problem is that this doesn't necessarily plug in to ShareDb as it exists currently, because (if I'm not mistaken), there's nothing in the driver API that would support this sort of behaviour? We'd basically have to add something like fetchSnapshot to the driver API, and then yes the problem gets shunted down to the driver level.

But from what I can tell from what everyone's saying, we basically shouldn't be touching ShareDb? Which is a bit of a shame. That's what I was originally going to do (hence my very original PR of asking to be able to fetch metadata along with ops).

I just think it would be really nice for the core library to support things like fetching snapshot versions, because it's one of the most powerful things this library could do, and it seems silly that there are a lot of people out there trying to solve this problem separately, when it could be solved in one place.

@curran (Contributor) commented Jul 13, 2018

The problem is that this doesn't necessarily plug in to ShareDb as it exists currently

This is where I think a contribution to ShareDB can be made. Perhaps the best approach is to think about what APIs need to be exposed in order to be able to implement this snapshot generation as a separate library. It may not be possible now, but exposing some internals may make it feasible.

I think all we need for this really is solid support for getOps(from, to), optimized at the database driver level.

Also, we need a way to compute the meta-ops in general for various OT types. The only way I can see right now, for super-clean and minimal meta-ops is to use json0-ot-diff, but that only supports json0. I'm curious @alecgibson, in your experiment how did you "combine all ops into two huge meta-ops"? Did you just concatenate the original ops together? That will work I suppose as a fallback, but that preserves more information than necessary.

It occurred to me that a meta-op that goes from the zeroth op to the end op should contain exactly the same data as the current snapshot.

The thing about these meta ops is that they are not "additive" in the sense that, like @gkubisa pointed out, they tend to "cancel each other out".

I'd be enthusiastic to contribute to this feature, at least for json0, as I could make use of it.

Totally agree @alecgibson re:

would be really nice for the core library to support things like fetching snapshot versions, because it's one of the most powerful things this library could do, and it seems silly that there are a lot of people out there trying to solve this problem separately, when it could be solved in one place.

Let's try to identify the places where the current API is lacking. My guess is we'll need to add a method to the database driver API, then implement it in sharedb-mongo, then we can build the separate library on top of that, leveraging the new method as well as middleware hooks for maintaining the data structure.

@alecgibson (Collaborator, Author)

I think all we need for this really is solid support for getOps(from, to)

Sounds good. There's been discussion in Gitter about potentially having a getOps(from, to) method which calls back with (snapshot, ops), so that the underlying driver can provide an (optional?) snapshot to start the ops from, helping with eg snapshot milestones.

how did you "combine all ops into two huge meta-ops"

I used json0.combine (and then compared against the doc which had just run one at a time to validate the contents were correct). It's obviously expensive to generate right now, but if you can generate these things using the meta ops as well, then it'd be trivially fast.

It occurred to me that a meta-op that goes from the zeroth op to the end op should contain exactly the same data as the current snapshot.

It occurred to me (after I did it), which is why I tried it with two ops instead, because "applying" one op is kind of boring.

@curran (Contributor) commented Jul 14, 2018

Thanks for the Gitter link! I didn't realize it's still active. That's great.

getOps(from, to) : Promise<{snapshot, ops}> sounds like an interesting proposal, but I wonder, how would the DB driver generate snapshot internally?

Can you provide a link please for json0.combine? I can't see it anywhere in json0.js. Thanks!

if you can generate these things using the meta ops as well, then it'd be trivially fast.

That's a great idea! The meta-ops with larger gaps can be computed from the ones with smaller gaps, e.g. compute all gap-of-2 meta-ops, then all gap-of-4 meta-ops by combining the gap-of-2 meta-ops, then all gap-of-8 meta-ops by combining the gap-of-4 meta-ops, and so on.
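That bottom-up construction can be sketched as follows, assuming the OT type exposes a compose() that merges two ops; the toy compose here just concatenates, whereas a real type would also cancel edits out:

```javascript
// Sketch: build larger-gap meta-ops by composing smaller ones.
// Toy type: an op is a list of edits; compose() concatenates them.
const compose = (a, b) => a.concat(b);

// metaOps['from-to'] bridges version `from` to version `to`.
function buildMetaOps(ops, base) {
  const metaOps = {};
  // Gap-of-1 "meta-ops" are just the original ops.
  ops.forEach((op, v) => { metaOps[v + '-' + (v + 1)] = op; });
  for (let gap = base; gap <= ops.length; gap *= base) {
    const step = gap / base; // combine `base` meta-ops of the previous level
    for (let from = 0; from + gap <= ops.length; from += gap) {
      let combined = metaOps[from + '-' + (from + step)];
      for (let mid = from + step; mid < from + gap; mid += step) {
        combined = compose(combined, metaOps[mid + '-' + (mid + step)]);
      }
      metaOps[from + '-' + (from + gap)] = combined;
    }
  }
  return metaOps;
}

const ops = [['a'], ['b'], ['c'], ['d']];
const metaOps = buildMetaOps(ops, 2);
console.log(metaOps['0-4']); // → ['a', 'b', 'c', 'd']
```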

Algorithm Nugget

Here's a small but critical piece of the overall puzzle here - a function that computes the meta-ops required to reach (as close as possible to) a desired version:

function metaOpId(current, next) {
  // Meta-ops are keyed by the pair of versions they bridge, eg '0-16'
  return current + '-' + next;
}

function metaOpsToVersion({ version, base }) {
  const metaOpIds = [];
  let current = 0;
  while (current + base <= version) {
    // Greedily pick the largest power-of-base hop that doesn't overshoot
    let exponent = 1;
    while (current + Math.pow(base, exponent + 1) <= version) {
      exponent++;
    }
    const next = current + Math.pow(base, exponent);
    metaOpIds.push(metaOpId(current, next));
    current = next;
  }
  return {
    metaOpIds,
    reachesVersion: current
  };
}

console.log(metaOpsToVersion({
  version: 22,
  base: 2
}));

Output:

{ metaOpIds: [ '0-16', '16-20', '20-22' ], reachesVersion: 22 }

[image: the hops 0→16, 16→20, 20→22 highlighted on the meta-op diagram]

@curran (Contributor) commented Jul 14, 2018

This feels like the sort of thing the original ShareDB authors would get excited about.
Any thoughts on this thread @nateps @josephg ?

@curran (Contributor) commented Jul 14, 2018

@alecgibson @gkubisa Here's what I see as a way we can move this forward:

  1. Continue working on PR #220 (Add support for fetching a particular version of a snapshot), refactoring all the snapshot generation logic into a totally separate directory, as a sibling of ShareDB. Try to separate the logic as much as possible, and leave the minimal ShareDB changes required to make it work.
  2. Remove the snapshot generation directory from the PR, and submit what remains of the PR as changes to ShareDB that expose the API(s) required. This PR could probably get merged.
  3. Create a separate project with the content from the snapshot generation directory, including the linear search implementation as a method like getSnapshotLinear, which would give us a working un-optimized starting point.
  4. We implement the meta-ops solution within that project.

@gkubisa (Contributor) commented Jul 15, 2018

I agree that the optimization, whatever it is, should be implemented as a separate project, however, I'd still keep the basic implementation in ShareDB, just like the in-memory DB and PubSub, so that ShareDB would work without any additional modules by default. IMHO the optimization should be added by hooking it up to a new middleware.

So, I'd keep #220 as is, with an additional middleware which would allow hooking up optimizations. Re specific optimizations, I think we'll end up having several implementations, which might be suitable for different situations, eg meta ops, snapshots milestones, cache, etc.

@alecgibson (Collaborator, Author)

Yes, I agree with @gkubisa that a very basic implementation of snapshot retrieval belongs in ShareDB core. @curran this is what I was thinking of moving forwards:

  1. Finalise the consumer-facing API for getSnapshot and getSnapshotAtTime
  2. Merge #220 (Add support for fetching a particular version of a snapshot)
  3. Add a new (optional?) function to the driver API: getSnapshotOps (see below)
  4. Update the ShareDB core Backend._getSnapshot method to use this new driver method
  5. Implement this method on any appropriate drivers with any appropriate optimisations (see below)

getSnapshotOps driver API

There has been discussion on (the new) Gitter about adding a new method to the driver API that would look something like this: getSnapshotOps(from, to, callback) => void, where callback has the signature callback(startingSnapshot, ops) => void.

Given that getOps already exists on the API, we could shim drivers in ShareDB core by attempting to access this method, and then falling back to just using getOps if it doesn't exist. Drivers themselves can quickly become compliant with the new API by simply adding something like the following (using Node-style error-first callbacks):

getSnapshotOps(from, to, callback) {
  // Unoptimised fallback: no starting snapshot, just all the ops
  this.getOps(from, to, function (error, ops) {
    if (error) return callback(error);
    callback(null, null, ops);
  });
}

It is up to the driver (or middleware - see below) to determine the "appropriate" ops and snapshot to return to go from from to to. ShareDB core will start with the startingSnapshot (if present), and then apply whatever ops are provided. The return can be as simple as a completely unoptimised set of all ops, with an empty starting snapshot (which is what we have currently).

Both from and to are optional. If from is omitted, then we need the (optional) starting snapshot and ops to build a snapshot at the to version. If to is omitted, we simply return the current snapshot in startingSnapshot and an empty array of ops.

The reason for returning a startingSnapshot is to allow things like the bucket/milestone optimisation to work by returning a milestone snapshot, and then the (small number of) ops to transform that into the requested version.

Note that I see getSnapshotOps as a completely different API to getOps. To me, getOps(from, to) implies that I will get exactly one operation per version from from to to. I may want exactly one op per version (for example to view a history of edits), so I think this is a valuable API to leave intact. getSnapshotOps is a method that exists solely to get some ops (not necessarily one per version) that will take us from one snapshot to another snapshot, optimising for speed of snapshot retrieval.
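The consuming side can be sketched too: core starts from the driver's startingSnapshot (if any) and applies whatever ops came back. All names here are illustrative, since the driver API is still under discussion:

```javascript
// Sketch of how core could consume a getSnapshotOps result.
function buildSnapshot(type, startingSnapshot, ops) {
  let snapshot = startingSnapshot !== null ? startingSnapshot : type.create();
  for (const op of ops) snapshot = type.apply(snapshot, op);
  return snapshot;
}

// Toy counter type: snapshots are numbers, ops add to them.
const counterType = {
  create: () => 0,
  apply: (snapshot, op) => snapshot + op.add
};

// Unoptimised driver: no starting snapshot, all the ops.
console.log(buildSnapshot(counterType, null, [{ add: 2 }, { add: 3 }])); // → 5
// Milestone-style driver: a snapshot plus a few trailing ops.
console.log(buildSnapshot(counterType, 100, [{ add: 3 }])); // → 103
```

The point is that core is indifferent to where the starting snapshot came from; the driver (or middleware) is free to pick the cheapest combination.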

Where the optimisation takes place

As I see it, there are two places that the optimisation could take place. We could:

  1. Put optimisation in the drivers
  2. Add an optimisation layer between the drivers and ShareDB core

The pros and cons of each approach are discussed below, but in summary I believe we should introduce this intermediate optimisation layer.

Optimisation in drivers

Putting the optimisation in drivers is probably the simplest approach. To optimise the Mongo driver (for example), you would fork the unoptimised Mongo driver, and replace its unoptimised implementation of getSnapshotOps with whatever optimisation is desired (and add whatever other features are needed to support it, such as milestone snapshot generation; meta op generation; etc.).

To use it, you simply use that driver with ShareDB.

Pros:

  • simple

Cons:

  • There will be a huge amount of code duplicated between two otherwise identical libraries
  • If you ever wanted another option for your driver (eg optimising some other feature), then the number of driver variants to maintain grows as 2^n, where n is the number of options

Optimisation in middleware

I don't know what @gkubisa has in mind when he talks about optimisation middleware, but I think I see it as this:

  • We would need to add even more methods to the driver API to enable certain optimisations to work (eg storeMilestoneSnapshot, storeMetaOp, etc.)
  • The middleware would then implement the logic of maintaining, and providing the optimisation using these exposed methods
  • The middleware is plugged into ShareDB core and overrides calls to getSnapshotOps with its own optimised version

Pros:

  • no code duplication; the same optimisation can be applied to any driver
  • better separation of concerns

Cons:

  • we would have to add (many?) driver API methods, leading to a period of instability in the driver API
  • concern is still a little bit "leaky"; the driver will have methods related to many optimisation strategies

@gkubisa (Contributor) commented Jul 16, 2018

It all looks great to me but for the optimization middleware I'd prefer not to add any new methods to the drivers. Ideally I'd like optimizations to be added like this: share.use('requestSnapshot', snapshotOptimizationMiddleware). See https://github.com/share/sharedb#middlewares.

We'd just need to figure out what middleware actions we need to support and which params we need to pass in, to maximize flexibility and enable a wide range of optimization strategies.
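The shape of such a hook can be sketched against a minimal stand-in for a ShareDB backend. The 'requestSnapshot' action name is hypothetical (it comes from the proposal above, not from the current ShareDB API), and only use() and trigger() are modelled here:

```javascript
// Minimal mock of a middleware-capable backend, for illustration only.
class MiniBackend {
  constructor() { this.middleware = {}; }
  use(action, fn) {
    (this.middleware[action] = this.middleware[action] || []).push(fn);
  }
  trigger(action, context, callback) {
    const fns = this.middleware[action] || [];
    let i = 0;
    const next = (err) => {
      if (err || i === fns.length) return callback(err, context);
      fns[i++](context, next);
    };
    next();
  }
}

// An optimisation middleware could attach a pre-built snapshot to the
// context, letting the backend skip the op-by-op reconstruction.
const snapshotOptimizationMiddleware = (context, next) => {
  if (context.version === 16) context.snapshot = 'milestone@16';
  next();
};

const share = new MiniBackend();
share.use('requestSnapshot', snapshotOptimizationMiddleware);
share.trigger('requestSnapshot', { version: 16 }, (err, context) => {
  console.log(context.snapshot); // → 'milestone@16'
});
```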

@alecgibson (Collaborator, Author)

I'd prefer not to add any new methods to the drivers

I agree that would be nice, but how do you see this working without doing so? Let's take the example of the snapshot milestone optimisation. For this to work properly, our optimiser needs to:

  • store snapshot milestones
  • retrieve snapshot milestones, querying by version or timestamp
  • fetch ops between two versions (already implemented as getOps)

How can the middleware do this without some method exposed on the driver that lets it do these things at the driver level?

@dcharbonnier (Contributor)

Please just make the driver able to return a document and a list of ops of an arbitrary length to apply to this document.

@alecgibson (Collaborator, Author)

@dcharbonnier do you mean you'd prefer for optimisation to live at the driver level rather than introduce any middleware?

@gkubisa (Contributor) commented Jul 16, 2018

The key point is that the optimization middleware's internal storage is separate from the ShareDB database driver. For example, the driver could use MongoDB and the middleware could store snapshot milestones in MongoDB, Redis, S3, in memory or anything else. We cannot foresee all optimization strategies people might want to implement. It's certainly possible to reuse the same MongoDB connection and db between the driver and the optimization middleware, however, that should not be a requirement.

Given the above, "store snapshot milestones" and "retrieve snapshot milestones, querying by version or timestamp" could be implemented in the middleware using the MongoDB driver directly (not using ShareDB database driver). "fetch ops between two versions (already implemented as getOps)" could be implemented by calling getOps on ShareDB. getOps would have to be documented and made public, as you suggested earlier.

It's also important to note that the optimization module could attach itself to more than one middleware action. For example, it could store snapshots in afterSubmit and load them in requestSnapshot.

Please just make the driver able to return a document and a list of ops of an arbitrary length to apply to this document.

I'd prefer not to do that because it'd be an extra method in the driver's interface, which would support one specific optimization strategy, which might not be optimal for all use cases. In my opinion the middleware approach is much more flexible and could internally load a snapshot plus ops to apply, so that you'd end up with the same performance profile, as if the action was supported directly on the driver.

@dcharbonnier (Contributor)

You're making something easy complicated by anticipating potential future needs. You're also proposing a smart strategy where a dumb one is just as efficient; for me that's a typical case of premature optimisation. Your middleware will store all the possible snapshots, or all the ops, or only work on the second call like a dumb cache. If you end up doing what the backend does, I don't see the point. If you want optional support in the backend, that's also possible.

@alecgibson (Collaborator, Author)

Aha @gkubisa I didn't realise you'd wanted to side-step the driver entirely. That makes far more sense. I'm happy with that approach. So I guess our example of the snapshot milestones would look something like this?

  1. We npm install the completely independent optimisation middleware
  2. We require it, and then share.use it. It registers with appropriate (existing) ShareDb middleware hooks.
  3. When I submit an op, the optimisation middleware checks whether my doc has all the appropriate snapshots and, if one or more are required, generates them.
  4. When I then come back to fetch a versioned snapshot, the optimisation middleware overrides the core library's getSnapshot method, and uses its own storage, as well as whatever methods are available on Backend to reconstruct the snapshot and return it.

@gkubisa (Contributor) commented Jul 16, 2018

Yes, that's essentially what I had in mind.

overrides the core library's getSnapshot method

More like "bypasses" getSnapshot - if the middleware provides a snapshot, it will be used, otherwise the library's getSnapshot will be called to get a snapshot.

@curran (Contributor) commented Jul 16, 2018

Great discussion!

I still think that all we need is a getSnapshotOps(from, to, callback) that allows the database to query against from and to.

Also, we can avoid the middleware issue entirely by lazily initializing the data structure for optimization. Meaning, when a request comes in for a snapshot at a particular version, the algorithm can run a function like initializeMetaOpsAsNeeded, which would initialize the entire data structure the first time it runs, and would fill in the missing parts for recently added ops on subsequent invocations. The same approach could conceivably work for the Milestone Snapshot idea as well. This way, the optimization module won't need any middleware at all.
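The lazy-initialization pattern described here can be sketched as follows. `makeLazyIndex` and `loadOps` are hypothetical names; the idea is simply that the first call builds the whole structure and subsequent calls only fill in ops appended since the previous run.

```javascript
// Sketch of lazy initialization: the optimization index is built in full on
// the first request, and later requests load only the newly appended ops.
function makeLazyIndex(loadOps) {
  var index = [];      // accumulated per-op data
  var builtUpTo = 0;   // highest version the index already covers
  return function ensureInitialized(toVersion) {
    if (toVersion > builtUpTo) {
      loadOps(builtUpTo, toVersion).forEach(function (op) {
        index.push(op);
      });
      builtUpTo = toVersion;
    }
    return index;
  };
}
```

This keeps the optimization module free of middleware hooks, at the cost of a slow first invocation on large documents.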

@gkubisa
Contributor

gkubisa commented Jul 16, 2018

You're making something easy complicated by anticipating future potential needs. You're also proposing a smart strategy where a dumb one is just as efficient; for me, that's a typical case of premature optimisation.

At this point I'm optimizing the API, which I think is pretty important to get right the first time in order to avoid breaking changes in the future, or having to support legacy code indefinitely. I believe that exposing appropriate middleware hooks provides a great deal of flexibility and avoids committing to a specific solution which might not be optimal for all use cases, or even optimal in general.

Your middlewares will store all the possible snapshots, or all the ops, or work only on the second call like a dumb cache. If you end up doing what the backend does, I don't see the point. If you want optional support in the backend, that's also possible.

I'm not sure yet what optimization strategy we're going to use at Teamwork.com. A kind of "intelligent" pre-caching (where cache entries are generated opportunistically based on what's going on in the app) is certainly a possibility, given that we already use this technique to speed up some of our APIs. Details aside, there are many different ways to optimize snapshot retrieval for specific situations, so I want the API to be flexible and avoid committing to a single solution, which might not be the best.

@alecgibson
Collaborator Author

@curran to clarify, do you mean we should have getSnapshotOps on the driver API, or on a completely separate module?

we can avoid the middleware issue entirely by lazily initializing the data structure for optimization

Although this is nice from an architectural/implementation point of view, it isn't from a consumer point of view.

If today I install an optimiser on my 1,000,000+ op documents, then the first time I fetch, I'll still have appalling performance, and will continue to do so until the optimiser manages to process some queue of jobs to lazily initialise.

Although arguably, I guess you run into the same problem with what I proposed on submitting an op (for example if somebody fetches a snapshot before updating a document, or updates and fetches in quick succession, etc.).

Ideally I'd like some sort of ability to migrate, but perhaps that could be achieved by just having this "lazy" approach, and then running a script that actively fetches all the ShareDB documents I have (we do something very similar for migrating data structures stored in JSON0).
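The migration script hinted at here could be as simple as touching every document once so a lazy optimizer initializes its structures up front. A sketch with a hypothetical `fetchSnapshot` standing in for whatever API triggers the lazy build:

```javascript
// Hypothetical warm-up script: fetch every document once so the lazily
// initialized optimizer builds its data before real users hit the feature.
function warmUp(docIds, fetchSnapshot, done) {
  var remaining = docIds.length;
  if (remaining === 0) return done();
  docIds.forEach(function (id) {
    fetchSnapshot(id, function () {
      if (--remaining === 0) done();
    });
  });
}
```

In practice this would be batched and rate-limited to avoid saturating the database during the migration.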

@curran
Contributor

curran commented Jul 16, 2018

I believe the best place for getSnapshotOps would indeed be in the driver API, not a separate module.

I totally get your concerns re: performance, and the desire for a smooth migration. It's true that the lazy approach will result in appalling performance for the first invocation on 1,000,000+ op documents. Although, like you mention, you could run a script that requests all documents in order to initialize the data structure, before exposing the feature to users.

@gkubisa
Contributor

gkubisa commented Jul 16, 2018

I believe the best place for getSnapshotOps would indeed be in the driver API, not a separate module.

In my opinion getSnapshotOps is too specific to a single optimization strategy to be included in the driver's interface. I'd much prefer all optimization strategies to be implemented in the same way - as ShareDB middleware.

@alecgibson
Collaborator Author

First solution merged in this PR: #236
