
Does LSIO need to exist?! Does object_store already do everything we need? If not, can we extend object_store instead of creating LSIO? #27

Closed
JackKelly opened this issue Jan 23, 2024 · 8 comments
JackKelly commented Jan 23, 2024

object_store is a mature, well-supported crate which already does a lot of what we need:

  • Provides an async, common interface to local filesystems and cloud filesystems (gcp, aws, azure, HTTP, WebDAV).
  • ObjectStore::get_ranges reads multiple byte ranges for a given path and, I think, returns multiple buffers.
  • ObjectStore::put_multipart already splits large uploads into parallel chunks

Some things that object_store doesn't (yet) appear to support:

  • io_uring
  • In general, what happens if the user tries to launch, say, 1 million in-flight I/O operations? Will object_store submit all million operations to the underlying storage subsystem? Or will it submit, say, 64 requests, and keep the number of in-flight requests roughly constant at around 64 at any moment?
  • Reading into an existing buffer (such as reading into an existing numpy array in memory)
  • get_multipart (to download large objects in parallel)

Relevant links:

object_store issues:

Discussion on object_store PRs:

@JackKelly JackKelly self-assigned this Jan 23, 2024
@JackKelly JackKelly moved this to Todo in light-speed-io Jan 23, 2024
@JackKelly JackKelly moved this from Todo to In Progress in light-speed-io Jan 23, 2024
JackKelly commented Jan 23, 2024

What would "io_uring in object_store" look like?

Options which don't change object_store's public API:

  • impl ObjectStore for IoUringLocal
  • get* methods would stat the file, create a buffer, then chain open-read-close in io_uring.
  • Perhaps the default would be for the kernel to have a thread which checks the submission queue, so we don't have to do a system call for every file.
  • We'd then have to figure out how to "unblock" the correct Future when a CQE appears on io_uring's completion queue. Maybe it's as simple as using a raw pointer to the Future in the io_uring user_data? Not sure.
  • I guess the IoUringLocal struct would own a single io_uring instance? (Is that possible? Or would each Future returned by get* have its own io_uring instance?? Having one io_uring per Future sounds slow!)

What happens if the user naively tries to read one million files at once (by calling get* one million times)?:

  • When running in tokio, every call to get* will be run via tokio::task::spawn_blocking, which dispatches to tokio's blocking thread pool and spawns new threads up to a configurable limit. This is fine for a few hundred calls. But might not be fine for a few million calls??
    • Although, maybe I could change that behaviour?
  • Whichever async runtime is used, it'll have to keep around one Future for every file.
  • We'd have at least one system call per file.
  • We can't re-use buffers.
  • But we can't submit one million SQEs in one go (and it's not performant to submit more than about 128). So I guess I'd need to implement some logic to create a queue of requests. Every time the user calls get*, that request would first go into a queue, and then we'd keep the io_uring SQ topped up with up to, say, 64 SQEs in flight at any one time.
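The "queue of requests" idea above can be sketched as a small piece of pure bookkeeping, independent of io_uring itself. This is a minimal, hypothetical sketch (the names `SubmissionLimiter`, `push`, `take_ready`, and `on_completion` are all illustrative, not real io_uring or object_store APIs): user requests land in an unbounded queue, and at most `max_in_flight` of them are handed to the ring at once.

```rust
use std::collections::VecDeque;

/// Hypothetical sketch of the "keep the SQ topped up" logic: requests queue
/// up without limit, but only `max_in_flight` are ever handed to io_uring.
struct SubmissionLimiter<T> {
    pending: VecDeque<T>,
    in_flight: usize,
    max_in_flight: usize,
}

impl<T> SubmissionLimiter<T> {
    fn new(max_in_flight: usize) -> Self {
        Self { pending: VecDeque::new(), in_flight: 0, max_in_flight }
    }

    /// Called whenever the user submits a new operation (e.g. a `get`).
    fn push(&mut self, op: T) {
        self.pending.push_back(op);
    }

    /// Drain as many queued operations as the cap allows. In the real
    /// implementation these would become SQEs pushed onto the ring.
    fn take_ready(&mut self) -> Vec<T> {
        let mut ready = Vec::new();
        while self.in_flight < self.max_in_flight {
            match self.pending.pop_front() {
                Some(op) => {
                    self.in_flight += 1;
                    ready.push(op);
                }
                None => break,
            }
        }
        ready
    }

    /// Called once per CQE: frees a slot so another SQE can be submitted.
    fn on_completion(&mut self) {
        self.in_flight -= 1;
    }
}
```

With this shape, the user can call get* a million times: the first take_ready hands 64 ops to the ring, and each completion tops the ring back up by one.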

What happens if the user takes responsibility for only having, say, a max of 64 operations in flight at any one time?

  • We no longer need to worry about spawning too many tokio threads, or having too many Futures around.
  • We'd still have to worry about having one system call per file. Which somewhat reduces the benefits of using io_uring!
  • We can't re-use buffers.
  • UPDATE: As described in the section above, let's work on the assumption that we can implement the functionality required, so the user can call get* as many times as they want, and IoUringLocal will do the right thing.

Options which do change object_store's public API:

Add these methods to the ObjectStore trait:

  • get_objects: returns a Stream of buffers.
    • Provide a simple default implementation of get_objects which just calls ObjectStore::get in a loop, whilst being careful not to have more than some max number of operations in flight at any moment.
  • get_with_buffer: Reads data directly into a user-supplied buffer
    • Provide a simple default implementation which calls get and memcopies into the user-supplied buffer.
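The proposed default implementation of get_objects ("call get in a loop, with a cap on in-flight operations") can be sketched with plain threads and a channel. This is a synchronous stand-in, not the real async API: `get` here is a caller-supplied closure standing in for `ObjectStore::get`, and the real version would return a `Stream` rather than a `Vec`.

```rust
use std::sync::mpsc;
use std::thread;

/// Illustrative default for the proposed `get_objects`: run a per-object
/// `get` while capping the number of operations in flight. Results arrive
/// in completion order, not submission order.
fn get_objects<T, F>(paths: Vec<String>, max_in_flight: usize, get: F) -> Vec<T>
where
    T: Send + 'static,
    F: Fn(String) -> T + Send + Sync + Copy + 'static,
{
    let (tx, rx) = mpsc::channel();
    let mut results = Vec::with_capacity(paths.len());
    let mut paths = paths.into_iter();
    let mut in_flight = 0;

    loop {
        // Top up to the cap.
        while in_flight < max_in_flight {
            match paths.next() {
                Some(path) => {
                    let tx = tx.clone();
                    thread::spawn(move || tx.send(get(path)).unwrap());
                    in_flight += 1;
                }
                None => break,
            }
        }
        if in_flight == 0 {
            break;
        }
        // Wait for one completion before submitting more.
        results.push(rx.recv().unwrap());
        in_flight -= 1;
    }
    results
}
```

An async version would express the same cap more idiomatically with something like futures' `buffer_unordered`, but the control flow is the same: one new submission per completion once the cap is reached.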

Some initial conclusions:

  • My hunch is that users won't see much benefit from io_uring in object_store unless we add a new get_objects method. But I could be wrong! UPDATE: I should try my hardest to build a high-performance io_uring implementation which does not change the public API. I've added some more thoughts to the text above.
  • If io_uring demonstrates significant performance improvements, I would prefer to help add io_uring to object_store rather than build a completely separate LSIO crate (which would duplicate a lot of object_store's functionality). But this is all pending the object_store maintainer's blessing, of course! (more on that in my next comment...) UPDATE: Maybe the ideal way forwards is, at least to start with, have LSIO be an "extension crate" to object_store.

JackKelly commented Jan 23, 2024

Next steps

Quite rightly, the object_store maintainer would like to see some benchmarks. If io_uring isn't demonstrably faster than existing object_store, then what's the point!

I should order my new workstation! (Threadripper 7000 + at least one PCIe gen5 SSD)

UPDATE: Specific thoughts about benchmarks moved to issue #31

  • If io_uring shows significant performance improvements then:
    • Talk with the object_store maintainer.
    • Probably prepare a second set of benchmarks to compare:
      • object_store with io_uring, using existing object_store API
      • object_store with io_uring, using a new get_objects method.
    • If the object_store folks don't want io_uring within object_store itself, then maybe build a separate crate which depends on object_store, and implements ObjectStore for IoUringLocal.
    • Further down the line, we could add (either as a PR to object_store, or in an extension crate):
      • kqueue for async local files on Mac OS X
      • Windows IO ring for async local files on Windows
      • Cloud bucket access using io_uring and kqueue and Windows IO ring (although there might be less performance "left on the table" for cloud buckets)
  • If io_uring doesn't show a performance improvement then don't implement LSIO or a PR to object_store!

JackKelly commented Jan 23, 2024

Actually, maybe the first step should be to compare the performance of fio (using io_uring) to object_store? (Although I don't think fio can decompress data? So I'd just be benchmarking "pure" IO, without any processing). This will give me a relatively quick feel for how io_uring compares to object_store; and allows me to make sure my io_uring code isn't performing substantially below fio's io_uring implementation!

JackKelly added a commit that referenced this issue Jan 23, 2024
@JackKelly

Should I fork object_store? Or should LSIO be an extension crate for object_store?

Advantages of forking object_store (with the ultimate aim being to merge a PR into object_store)

  • Existing users of object_store would see performance benefits immediately, without any additional dependencies.

Advantages of LSIO being an extension crate

  • Doesn't burden object_store with io_uring code
  • I think that it should still be possible to implement everything I need. I can define a new GetObjects trait, with a get_objects method, with a default implementation that just calls get in a loop, and I can impl GetObjects for all existing object_store structs.
  • Once we have benchmarks and a clearer idea of how to add io_uring to object_store then maybe we can consider merging the relevant code into object_store. It's not uncommon in the Rust community for new ideas to begin life as a separate crate, and then be merged into another crate.
  • I can iterate faster
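The "extension crate" shape described above — a new trait with a default implementation in terms of the existing single-object get, plus a blanket impl — can be sketched in a few lines. Here `SingleGet` is a simplified synchronous stand-in for the real `ObjectStore` trait, and `EchoStore` is a toy store for illustration only.

```rust
/// Stand-in for the existing `ObjectStore` trait (real one is async).
trait SingleGet {
    fn get(&self, path: &str) -> Vec<u8>;
}

/// The proposed extension trait.
trait GetObjects: SingleGet {
    /// Default: just call `get` in a loop. An io_uring-backed store would
    /// override this with a genuinely batched implementation.
    fn get_objects(&self, paths: &[&str]) -> Vec<Vec<u8>> {
        paths.iter().map(|p| self.get(p)).collect()
    }
}

// Blanket impl: every existing store gains `get_objects` for free.
impl<T: SingleGet> GetObjects for T {}

/// Toy store used purely to demonstrate the blanket impl.
struct EchoStore;
impl SingleGet for EchoStore {
    fn get(&self, path: &str) -> Vec<u8> {
        path.as_bytes().to_vec()
    }
}
```

The blanket impl is what makes this viable as a separate crate: no changes to object_store are needed, yet every existing `ObjectStore` implementation picks up the new method.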

JackKelly commented Jan 24, 2024

Do we need a get_objects(objects) -> Stream method? Or will the existing ObjectStore::get suffice?

It'd be lovely to not have to change object_store's public API. But there are several questions that need answering. I'm not yet sure of the answers:

If we don't have a get_objects method, then:

  1. Can we efficiently submit operations to io_uring, and efficiently receive operations? Specifically:
    • is there an elegant way to limit the number of SQEs in-flight to, say, 64? And to keep io_uring's SQ topped up?
    • when a CQE arrives, how do we "unblock" the associated Future?
  2. How would users elegantly interleave IO with compute? Data might arrive from IO in any order. As soon as a buffer is filled with data from IO, we want to schedule computation on rayon's threadpool. Maybe it's as simple as something like the code outlined in Try interleaving compute with IO #37

My current hope is that the challenges outlined in this comment are solvable, and so it should be possible to just impl ObjectStore for IoUringLocal, and to not need a new get_objects method. But I should try both ways, and benchmark.
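The interleaving question in point 2 can be sketched with a channel: the IO side sends filled buffers as they complete, in any order, and the compute side processes each buffer the moment it arrives. This is a deliberately simplified stand-in (plain threads and `Vec<u8>` buffers); in the real design the receiving side would hand each buffer to rayon's threadpool, as discussed in #37.

```rust
use std::sync::mpsc;
use std::thread;

/// Sketch: IO completions stream down a channel; compute starts on each
/// buffer as soon as it's ready, without waiting for the rest.
fn interleave_io_and_compute() -> usize {
    let (tx, rx) = mpsc::channel::<Vec<u8>>();

    // Stand-in for the IO side: buffers complete in arbitrary order.
    let io = thread::spawn(move || {
        for buf in [vec![3u8; 3], vec![1u8; 1], vec![2u8; 2]] {
            tx.send(buf).unwrap();
        }
    });

    // Compute side: the channel iterator yields each buffer as it lands.
    let mut total_bytes = 0;
    for buf in rx {
        total_bytes += buf.len(); // stand-in for real decompression/processing
    }
    io.join().unwrap();
    total_bytes
}
```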

JackKelly commented Jan 27, 2024

To save us from having to re-implement every ObjectStore method for IoUringLocal, maybe the IoUringLocal struct can contain a LocalFileSystem, and we can use the delegate crate to delegate method calls to the inner LocalFileSystem. That way, the IoUringLocal struct can be fully functional from day one, and we can slowly replace delegated functions with custom io_uring implementations.
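The wrapper-and-delegate idea can be sketched like this. `Store` here is a simplified synchronous stand-in for `ObjectStore`, and the forwarding is hand-written for clarity; the real code could use the delegate crate to generate it.

```rust
/// Stand-in for the `ObjectStore` trait (real one is async and larger).
trait Store {
    fn get(&self, path: &str) -> Vec<u8>;
    fn head(&self, path: &str) -> usize;
}

/// Stand-in for object_store's LocalFileSystem.
struct LocalFileSystem;
impl Store for LocalFileSystem {
    fn get(&self, path: &str) -> Vec<u8> {
        path.as_bytes().to_vec() // toy behaviour for illustration
    }
    fn head(&self, path: &str) -> usize {
        path.len() // toy behaviour for illustration
    }
}

/// Fully functional from day one: every method forwards to the inner
/// LocalFileSystem until an io_uring implementation replaces it.
struct IoUringLocal {
    inner: LocalFileSystem,
}

impl Store for IoUringLocal {
    fn get(&self, path: &str) -> Vec<u8> {
        // TODO: replace this delegation with a real io_uring read.
        self.inner.get(path)
    }
    fn head(&self, path: &str) -> usize {
        self.inner.head(path)
    }
}
```

Each delegated method can then be replaced one at a time, with the rest of the trait still working throughout.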

Also, we could have a BatchObjectStore trait with batch equivalents of ObjectStore methods, which accept a vector of operations, and return a stream. The trait would include default implementations which just call the single-op ObjectStore methods in a loop, like the code in the comment above.

@JackKelly

I think I've moved all the ideas from this issue into a bunch of separate tasks (each described by GitHub issues). I'll close this (very long) issue now, in favour of the new milestone: https://github.com/JackKelly/light-speed-io/milestone/1

@github-project-automation github-project-automation bot moved this from In Progress to Done in light-speed-io Jan 29, 2024