
Easier graph modifications #417

Open
eu9ene opened this issue Feb 2, 2024 · 0 comments
Labels
taskcluster Issues related to the Taskcluster implementation of the training pipeline

Comments

Collaborator

eu9ene commented Feb 2, 2024

We have a number of use cases where the DAG should be modified, and we will have more in the future:

  1. Changing the number of chunks for translation
  2. Skipping bicleaner for a dataset (see Do not run bicleaner step if the threshold is 0 #415)
  3. Skipping bicleaner entirely
  4. Downloading and using a pre-trained backward model (changes upstream dependencies and required fetches, see When using a pre-trained model there shouldn't be dependencies on dataset steps #395)
  5. Downloading and using pre-trained teacher models (skips the whole data augmentation part of the graph; the backward model is still needed for the cross-entropy filtering step)
  6. Skipping data augmentation when we want to check how much of a boost it brings, or when we don't have monolingual data (similar to the previous one, but the merge-corpus step loses its dependencies on the translated mono corpus)
  7. Adding extra pre-processing steps when using pre-trained models of different architectures (see opusmt pr)

The conclusion is that the graph is not static: it should be adjusted based on the experiment config. Currently the only way to do this is via transforms, which we already use to find upstream tasks related to dataset cleaning. Each transform corresponds to only one kind. This approach makes it hard to say what the graph will look like in the end, and the code that changes the graph for each kind independently is hard to understand and maintain. It also limits us: for example, we can't easily introduce a new kind for pre-trained model downloading, and instead have to modify the train- kinds, which leads to redundant dependencies and incorrect caching.

The ideal solution would be similar to Snakemake's graph definition: one Python file where we can fully form and connect the graph based on conditions. Instead of removing dependencies for some steps in the transforms, we could skip their definition in kind.yaml and add the dependencies that are needed in this new "connect graph" script. We can think of it as a graph-level transform.
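A minimal sketch of what such a "connect graph" step could look like. All kind names and config keys here are illustrative assumptions, not the actual pipeline's kinds or config schema; the point is that every conditional use case lives in one function instead of being scattered across per-kind transforms.

```python
# Hypothetical graph-level transform: given a mapping of kind -> set of
# upstream kinds and the experiment config, return the final dependency
# edges. Kind names and config keys are made up for illustration.

def connect_graph(edges, config):
    """Wire the training graph conditionally, based on the experiment config."""
    edges = dict(edges)

    if config.get("pretrained-backward-model"):
        # Use case 4: a downloaded backward model replaces train-backward,
        # so none of its dataset dependencies are ever connected.
        edges["ce-filter"] = {"download-backward-model"}
    else:
        edges["ce-filter"] = {"train-backward"}

    if config.get("bicleaner-threshold", 0) == 0:
        # Use cases 2-3: skip bicleaner; cleaning feeds merge-corpus directly.
        edges["merge-corpus"] = {"clean"}
    else:
        edges["merge-corpus"] = {"bicleaner"}
        edges["bicleaner"] = {"clean"}

    return edges
```

With this shape, reading one function answers "what will the graph look like for this config" without tracing every kind's transform.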

If we define all the possible tasks/kinds in YAML and only need to connect the graph, the ideal place for this operation would be somewhere before the full_task_graph step in graph generation. It would be somewhere here in the TaskGraphGenerator code, before we start adding edges to the graph based on kind dependencies. If a task is disconnected from the graph, we should skip its analysis entirely. This requires adding support for this feature in TaskGraph.
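Skipping disconnected tasks amounts to a reachability check before edge analysis. A sketch, assuming hypothetical function and parameter names rather than the real TaskGraphGenerator API:

```python
# Illustrative pruning step: drop any task that is not reachable from the
# target tasks via upstream dependency edges, so the generator never
# analyses it. Names are assumptions, not the actual Taskcluster API.

def prune_disconnected(tasks, dependencies, targets):
    """Return only the tasks reachable from `targets` via `dependencies`.

    tasks: mapping of task name -> task definition
    dependencies: mapping of task name -> iterable of upstream task names
    targets: task names the run actually asks for
    """
    reachable = set()
    stack = list(targets)
    while stack:
        name = stack.pop()
        if name in reachable:
            continue
        reachable.add(name)
        stack.extend(dependencies.get(name, ()))
    return {name: task for name, task in tasks.items() if name in reachable}
```

This keeps kind.yaml free to declare every possible task while the generated graph only ever contains the ones the config connects.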

Defining graphs in Python is a popular approach in other workflow managers with explicit graph definition (see Dagster, Metaflow, Airflow, Kubeflow). The aforementioned Snakemake uses an implicit approach based on declared inputs and outputs, but we control the flow by adding rule definitions based on conditions (example).
