-
Notifications
You must be signed in to change notification settings - Fork 62
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEA] Launch recursive tasks. #808
Comments
A few thoughts:
If you really feel like you need recursive tasks, Legion has always supported them and you can probably use Legion's APIs directly inside your Legate tasks to launch recursive sub-tasks. You'd almost certainly need to implement your own mapper though to avoid confusing Legate's mapper I think has some built-in assumptions that it's mapping a two-level tree of tasks.
Just because you can use recursive tasks doesn't always mean that you should. In general, all programs with sequential semantics are just a standard program starting with a main function which invokes sub-functions and those sub-functions invoke their own sub-functions, etc. In a sequential task-based programming model, all we're doing is marking some of those functions as special functions called tasks, which we use to look for parallelism and provide the opportunity to map them onto different compute resources. Just marking all functions as tasks will probably result in a too many fine-grained tasks to execute efficiently, so it's good to be judicious in deciding which functions to mark as tasks. Rules like "mark all entries into a library as a new task" is probably not what you want. You probably want to keep a good deal of logic for things like the global optimizer in the top-level task because it requires information from all al the sub-tasks to make decisions about what to do next. Recursive tasks are good for pure-functional things that decompose nicely in a recursive manner (e.g. matrix-multiply).
The dynamic evaluation of Legate means you don't need to be worried about saying what a task is going to do when you launch. It might launch many sub-tasks and it might launch none, Legate doesn't care.
Counter-intuitively, Legion has these restrictions precisely because it has a more expressive programming model than Dask. Dask defines away the problem by mandating that all futures be immutable. If everything is read-only after it is produced, well then any task can read it at any time and you don't have to worry about races or distributed coherence of the data. Like Dask, all Legion's restrictions will disappear too if you're willing to say that all of your data is always read-only immediately after it is produced. 😉 While Legion can express all the programs that Dask can, it can also express a much broader class of programs with hierarchical effects (both writes and arbitrary reductions) on mutable region data which Dask does not allow. In order to make the dependence analysis scalable, Legion requires that effects be encapsulated so that the mutation effects that a child task has is encapsulated in the declared effects of the parent task. This makes it possible for Legion to perform hierarchical dependence analysis that doesn't require global communication in a distributed system (potentially spanning thousands of nodes). If you want the nitty-gritty details, this paper gives a formal operational semantics of Legion's effect system and proofs that its dependence is analysis both sound and scalable.
Perhaps surprisingly, this has very little to with distributed computing. The same principles would apply even if you were working on a shared memory machine, because you'd still need to do the same analyses to avoid data races. The fact that the data might need to be kept coherent can be handled trivially if you can infer where there are data dependences. The same rules that Legion requires users to abide by are closely related to the rules that the Rust borrow checker enforces in a shared memory environment. Legion requires users to say what data they plan to access (what references are needed in Rust for calling a function), whether that data is going to be read/written/reduced (whether those references are mutable or not), and which task is going to be running (implicitly scoping the lifetime of the data access). Legion does its checking of these properties dynamically while Rust checks them statically, but the core theoretical principles are nearly identical. |
Thank you for the informative reply @lightsighter !
It's not absolutely necessary, strictly speaking, but it is a really nice tool to have for expressing computation.
Thank you for discussing the underlying theories of Legion and making a comparison between Legion and Dask. Now I have a clearer picture of what's going on. I will learn more about Legion in general, it's something new to me and quite exciting! |
I have been walking through the Legion tutorial. Out of curiosity (and for educational purposes), is there any particular reason for Legate to NOT support recursive tasks? I understand Legion is powerful and has a rigorous theoretical background, but what's the reason behind the decision that Legate should not inherit the recursive task feature from Legion? |
Others on the Legate team might have different opinions on this, but at least as far as I'm concerned, it's always been about simplicity. There really aren't many algorithms outside those that are found in linear algebra and dynamic programming that tend to decompose hierarchically. In order for algorithms to decompose well hierarchically, they need to have self-contained sub-problems that can be solved in isolation and then aggregated back together to produce a bigger solution. Many of the kinds of things we do with Legate don't have that property (at least not in a way that doesn't involve a giant amount of communication in the merge of sub-problems). For this reason, a two-level programming model makes a little bit more sense. I ultimately think Legate should have a hierarchical model eventually to support the problems that can be recursively decomposed, but it is a matter of priorities: a two-level model is good enough for now and nothing we're doing precludes generalizing it later. |
Currently, legate supports only top-level tasks and leaf tasks. It would be helpful to have trees of tasks (nested parallelism) as a more accessible and easier-to-compose programming model. One example use case is hyper-parameter tuning in machine learning, where we have three to four layers of libraries, from user application to HPO implementation to cross-validation to the actual ML algorithm, each of them may want to launch tasks, and it's difficult for the top-level application to have all these sorted out (how do I start partitioning cunumeric array for downstream tasks that may or may not get launched?). In addition, the experiment task launched by the HPO library might itself launch new experiments depending on the previous result, which is a recursive pattern.
Besides, since the task is usually modeled after functions in programming languages, calling other functions is just a natural thing to do. Also, whether a function (task) is called is entirely dynamic and usually cannot be known in advance by the top-level function (main).
In an offline conversation, @manopapad shared some insight into the difficulties of the feature, including resource management, region permission tracking, synchronization mechanisms between parents and children, etc. I'm no expert in distributed computing, and this issue is created for future reference. I would love to help if there's anything I can do. ;-)
Related project
The text was updated successfully, but these errors were encountered: