
Update the update message (!) to refer to issue #27
JackKelly committed Jan 29, 2024
1 parent 069569e commit fce5029
Showing 1 changed file (`design.md`) with 3 additions and 3 deletions.
@@ -2,14 +2,14 @@

`light-speed-io` (or "LSIO", for short) will be a Rust library crate for loading and processing many chunks of files, as fast as the storage system will allow. **The aim is to allow users to load and process on the order of 1 million 4 kB chunks per second from a single local SSD**.

-**UPDATE (2024-01-23): THE DESIGN IS LIKELY TO CHANGE A LOT! SPECIFICALLY, MY PLAN IS TO SIMPLIFY LSIO SO THAT IT IS ONLY RESPONSIBLE FOR I/O (NOT FOR PROCESSING CHUNKS). USERS WILL STILL BE ABLE TO INTERLEAVE I/O WITH PROCESSING BECAUSE LSIO WILL RETURN A Rust `Stream` (AKA `AsyncIterator`) OF CHUNKS (see [this GitHub comment](https://github.com/JackKelly/light-speed-io/issues/25#issuecomment-1900536618)). AFTER BUILDING AN MVP OF LSIO, I PLAN TO BUILD A SECOND CRATE WHICH MAKES IT EASY TO APPLY AN ARBITRARY PROCESSING FUNCTION TO A STREAM, IN PARALLEL ACROSS CPU CORES. (See [this comment](https://github.com/JackKelly/light-speed-io/issues/26#issuecomment-1902182033))**

Why aim for 1 million chunks per second? See [this spreadsheet](https://docs.google.com/spreadsheets/d/1DSNeU--dDlNSFyOrHhejXvTl9tEWvUAJYl-YavUdkmo/edit#gid=0) for an ML training use-case that comfortably requires hundreds of thousands of chunks per second.

-But, wait, isn't it inefficient to load tiny chunks? [Dask recommends chunk sizes between 100 MB and 1 GB](https://blog.dask.org/2021/11/02/choosing-dask-chunk-sizes)! Modern SSDs are turning the tables: modern SSDs can sustain over 1 million input/output operations per second. And cloud storage looks like it is speeding up (for example, see the recent announcement of [AWS Express Zone One](https://aws.amazon.com/blogs/aws/new-amazon-s3-express-one-zone-high-performance-storage-class/); and there may be [ways to get high performance from existing cloud storage buckets](https://github.com/JackKelly/light-speed-io/issues/10), too). One reason that Dask recommends large chunk sizes is that Dask's scheduler takes on the order of 1 ms to plan each task. LSIO's data processing should be faster (see below).
+But, wait, isn't it inefficient to load tiny chunks? [Dask recommends chunk sizes between 100 MB and 1 GB](https://blog.dask.org/2021/11/02/choosing-dask-chunk-sizes)! Modern SSDs are turning the tables: modern SSDs can sustain over 1 million input/output operations per second. And cloud storage is speeding up (for example, see the recent announcement of [AWS Express Zone One](https://aws.amazon.com/blogs/aws/new-amazon-s3-express-one-zone-high-performance-storage-class/); and there may be [ways to get high performance from existing cloud storage buckets](https://github.com/JackKelly/light-speed-io/issues/10), too). One reason that Dask recommends large chunk sizes is that Dask's scheduler takes on the order of 1 ms to plan each task. LSIO's data processing should be faster.

(See [this Google Doc](https://docs.google.com/document/d/1_T0ay9wXozgqq334E2w1SROdlAM7y6JSgL1rmXJnIO0/edit) for a longer discussion of LSIO.)

+**UPDATE (2024-01-29): LSIO'S DESIGN IS LIKELY TO CHANGE A _LOT_! SPECIFICALLY, MY PLAN IS TO SIMPLIFY LSIO SO THAT IT IS ONLY RESPONSIBLE FOR I/O (NOT FOR PROCESSING CHUNKS) AND THAT LSIO WILL BE AN EXTENSION CRATE FOR [`object_store`](https://docs.rs/object_store/latest/object_store/). USERS WILL STILL BE ABLE TO INTERLEAVE I/O WITH PROCESSING IN TWO WAYS: IF USERS ARE USING THE EXISTING `object_store` API THEN I _THINK_ USERS CAN INTERLEAVE I/O WITH PROCESSING AS OUTLINED IN [THIS COMMENT](https://github.com/JackKelly/light-speed-io/issues/27#issuecomment-1907955443). LSIO MIGHT ALSO DEFINE A NEW `BatchObjectStore` TRAIT FOR PROCESSING MANY REQUESTS, AND LSIO WILL RETURN A Rust `Stream` (AKA `AsyncIterator`) OF CHUNKS (see [this GitHub comment](https://github.com/JackKelly/light-speed-io/issues/25#issuecomment-1900536618)). AFTER BUILDING AN MVP OF LSIO, I PLAN TO BUILD A SECOND CRATE WHICH MAKES IT EASY TO APPLY AN ARBITRARY PROCESSING FUNCTION TO A STREAM, IN PARALLEL ACROSS CPU CORES. (See [this comment](https://github.com/JackKelly/light-speed-io/issues/26#issuecomment-1902182033)).**

## Planned features

- [ ] Provide a simple API for reading and writing many chunks of files (and/or many files) with a single API call. Users will be able to ask LSIO: "_Please get me these million file chunks, and apply this function to each chunk, and then move the resulting data to these array locations._".
