
Easier graph modifications #417

Open
eu9ene opened this issue Feb 2, 2024 · 0 comments
Labels
taskcluster Issues related to the Taskcluster implementation of the training pipeline

Comments

Collaborator

eu9ene commented Feb 2, 2024

We have a number of use cases where the DAG should be modified, and we will have more in the future:

  1. Changing the number of chunks for translation
  2. Skipping bicleaner for a dataset (see Do not run bicleaner step if the threshold is 0 #415)
  3. Skipping bicleaner entirely
  4. Downloading and using a pre-trained backward model (changes upstream dependencies and required fetches, see When using a pre-trained model there shouldn't be dependencies on dataset steps #395)
  5. Downloading and using pre-trained teacher models (skips the whole data augmentation part of the graph; the backward model is still needed for the cross-entropy filtering step)
  6. Skipping data augmentation when we want to check how much of a boost it brings, or when we don't have monolingual data (similar to the previous one, but the merge-corpus step loses its dependencies on the translated mono corpus)
  7. Adding extra pre-processing steps when using pre-trained models of different architectures (see opusmt pr)

The conclusion is that the graph is not static: it should be adjusted based on the experiment config. Currently the only way to do this is via transforms, which we already use to find upstream tasks related to dataset cleaning. Each transform corresponds to only one kind. This approach makes it hard to say what the graph will look like in the end, and the code that changes the graph for each kind independently is hard to understand and maintain. It also limits us: for example, we can't easily introduce a new kind for pre-trained model downloading, and instead have to modify the train- kinds, which leads to redundant dependencies and incorrect caching.

The ideal solution would be similar to Snakemake's graph definition: one Python file where we can fully form and connect the graph based on conditions. Instead of removing dependencies for some steps in the transforms, we could skip their definition in kind.yaml and add the dependencies that are needed in this new "connect graph" script. We can think of it as a graph-level transform.
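A minimal sketch of what such a "connect graph" step could look like. All kind names and config keys here are illustrative assumptions, not the actual pipeline's kinds or config schema; the point is that every conditional use case lives in one function instead of being scattered across per-kind transforms.

```python
# Hypothetical graph-level transform: given a mapping of kind -> set of
# upstream kinds and the experiment config, return the final dependency
# edges. Kind names and config keys are made up for illustration.

def connect_graph(edges, config):
    """Wire the training graph conditionally, based on the experiment config."""
    edges = dict(edges)

    if config.get("pretrained-backward-model"):
        # Use case 4: a downloaded backward model replaces train-backward,
        # so none of its dataset dependencies are ever connected.
        edges["ce-filter"] = {"download-backward-model"}
    else:
        edges["ce-filter"] = {"train-backward"}

    if config.get("bicleaner-threshold", 0) == 0:
        # Use cases 2-3: skip bicleaner; cleaning feeds merge-corpus directly.
        edges["merge-corpus"] = {"clean"}
    else:
        edges["merge-corpus"] = {"bicleaner"}
        edges["bicleaner"] = {"clean"}

    return edges
```

With this shape, reading one function answers "what will the graph look like for this config" without tracing every kind's transform.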

If we define all the possible tasks/kinds in YAML and only need to connect the graph, the ideal place for this operation would be somewhere before the full_task_graph step in graph generation. It would be somewhere here in the TaskGraphGenerator code, before we start adding edges to the graph based on kind dependencies. If a task is disconnected from the graph, we should skip its analysis entirely. This requires adding support for this feature in TaskGraph.
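Skipping disconnected tasks amounts to a reachability check before edge analysis. A sketch, assuming hypothetical function and parameter names rather than the real TaskGraphGenerator API:

```python
# Illustrative pruning step: drop any task that is not reachable from the
# target tasks via upstream dependency edges, so the generator never
# analyses it. Names are assumptions, not the actual Taskcluster API.

def prune_disconnected(tasks, dependencies, targets):
    """Return only the tasks reachable from `targets` via `dependencies`.

    tasks: mapping of task name -> task definition
    dependencies: mapping of task name -> iterable of upstream task names
    targets: task names the run actually asks for
    """
    reachable = set()
    stack = list(targets)
    while stack:
        name = stack.pop()
        if name in reachable:
            continue
        reachable.add(name)
        stack.extend(dependencies.get(name, ()))
    return {name: task for name, task in tasks.items() if name in reachable}
```

This keeps kind.yaml free to declare every possible task while the generated graph only ever contains the ones the config connects.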

Defining graphs in Python is a popular approach in other workflow managers with explicit graph definition (see Dagster, Metaflow, Airflow, Kubeflow). The aforementioned Snakemake uses an implicit approach based on declared inputs and outputs, but we control the flow by adding rule definitions based on conditions (example).
