
Tips for parallelization #190

Open
schell opened this issue Oct 2, 2023 · 2 comments
schell commented Oct 2, 2023

First off, great work! wonnx has been very easy to use, and besides a few missing operators it "just works".

I'm in the optimization phase of building an app that does inference using wonnx. When I benchmark wonnx running a model (with criterion), I've found it's just about as fast as onnxruntime. I figured this probably has to do with marshaling the data to the GPU (maybe the shader created by wonnx runs a little faster, but the marshaling time is a little longer). If that's the case, I figured I could get a throughput improvement by running my model in parallel. Unfortunately I am not in control of the models and cannot retrain or re-export them with a dynamic batch size, so instead I opted to edit the ONNX model itself and clone the graph into 64 subgraphs, each with its own input. Even though this worked as expected and validated, it provided no gain in throughput (or latency, for that matter). My guess is that the shader wonnx produces is probably not executing the subgraphs in parallel, but I don't know.
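For context, the benchmark harness is shaped roughly like this (a minimal sketch: the model path, tensor name, and input shape are placeholders, and `pollster` is just one way to block on wonnx's async API):

```rust
use std::collections::HashMap;

use criterion::{criterion_group, criterion_main, Criterion};

fn bench_inference(c: &mut Criterion) {
    // Build the session once, outside the measured loop, so each
    // iteration times only the per-run cost: uploading the input,
    // dispatching the shaders, and reading back the output.
    let session = pollster::block_on(wonnx::Session::from_path("model.onnx"))
        .expect("failed to load model");

    let data: Vec<f32> = vec![0.0; 3 * 224 * 224]; // placeholder input shape
    let mut inputs = HashMap::new();
    inputs.insert("input".to_string(), data.as_slice().into());

    c.bench_function("wonnx inference", |b| {
        b.iter(|| pollster::block_on(session.run(&inputs)).expect("inference failed"))
    });
}

criterion_group!(benches, bench_inference);
criterion_main!(benches);
```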

My question is: is there a general method of parallelization that might yield "pretty good" results without re-training or other Python tasks? I don't mind editing the ONNX model to use another method like SequenceMap (if that's supported), or something similar. Or maybe there's an opportunity to expand the wonnx API to support this out of the box, possibly by issuing multiple draw calls over an offset buffer? What do you think?

pixelspark (Collaborator) commented

Thanks and good to hear wonnx is actually being used in production!

The way wonnx runs an ONNX graph is actually pretty simple: after a pass over the graph to perform some optimizations, the graph nodes are topologically sorted, and we generate a shader + invocation for each node (using a coloring algorithm to re-use buffers for intermediate values). The shaders are then invoked in series, which means there is only parallelism within the execution of a single node. For many models this is fine, as their graphs are more or less serial and the GPU can be saturated by running just one node's shader at a time.
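To make the ordering step concrete, here is a self-contained sketch of the kind of topological sort involved (illustrative only, not wonnx's actual code). Note that in a diamond-shaped graph the two middle nodes are independent, yet the serial schedule still runs them one after the other:

```rust
use std::collections::{HashMap, VecDeque};

/// Kahn's algorithm: returns node indices in an order where every node
/// comes after all of the nodes it depends on, or None if the graph
/// has a cycle.
fn topological_sort(edges: &[(usize, usize)], n: usize) -> Option<Vec<usize>> {
    let mut indegree = vec![0usize; n];
    let mut successors: HashMap<usize, Vec<usize>> = HashMap::new();
    for &(from, to) in edges {
        successors.entry(from).or_default().push(to);
        indegree[to] += 1;
    }
    // Start from nodes with no unmet dependencies (the graph inputs).
    let mut ready: VecDeque<usize> = (0..n).filter(|&i| indegree[i] == 0).collect();
    let mut order = Vec::with_capacity(n);
    while let Some(node) = ready.pop_front() {
        order.push(node);
        for &next in successors.get(&node).into_iter().flatten() {
            indegree[next] -= 1;
            if indegree[next] == 0 {
                ready.push_back(next);
            }
        }
    }
    (order.len() == n).then_some(order)
}

fn main() {
    // Diamond graph: node 0 feeds 1 and 2, which both feed 3.
    let order = topological_sort(&[(0, 1), (0, 2), (1, 3), (2, 3)], 4).unwrap();
    println!("{order:?}"); // [0, 1, 2, 3] — nodes 1 and 2 still run in series
}
```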

In theory we could improve on this by also executing independent nodes in parallel (it would however complicate the coloring algorithm and require some sort of waiting mechanism for when parallel branches join back together). Another option would be to attempt to generate one big shader containing all the ops (even more complicated).

Your other options are (1) to implement a custom ONNX op with its own shader, or (2) to run multiple wonnx graphs (subgraphs of a bigger model) in parallel yourself, doing the joining/waiting at that level (see the sketch below). The latter solution would also be useful to gain parallelism in a multi-GPU system.
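Today, (2) could look roughly like the following sketch (the model path, tensor name, and input shape are placeholders; whether this actually improves throughput depends on the driver scheduling the per-session queues concurrently):

```rust
use std::collections::HashMap;
use std::thread;

fn main() {
    // One independent wonnx session per worker thread. Each session
    // compiles and dispatches its own (sub)graph; the joining/waiting
    // happens at the application level via thread::join.
    let handles: Vec<_> = (0..4)
        .map(|worker| {
            thread::spawn(move || {
                let session = pollster::block_on(wonnx::Session::from_path("model.onnx"))
                    .expect("failed to load model");

                let data: Vec<f32> = vec![worker as f32; 3 * 224 * 224];
                let mut inputs = HashMap::new();
                inputs.insert("input".to_string(), data.as_slice().into());

                let outputs = pollster::block_on(session.run(&inputs))
                    .expect("inference failed");
                println!("worker {worker}: {} output tensor(s)", outputs.len());
            })
        })
        .collect();

    for handle in handles {
        handle.join().expect("worker panicked");
    }
}
```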


schell commented Oct 2, 2023

Thanks for the explainer! I'm definitely interested in (1) and (2). Can you elaborate on how I might get started with (1) as well as what (2) would look like today?
