We have a bunch of use cases where the DAG should be modified, and we will have more in the future:

- Downloading and using the pre-trained teacher models (skip the whole data augmentation part of the graph; the backward model is still needed for the cross-entropy filtering step)
- Skipping data augmentation when we want to check how much of a boost it brings, or when we don't have monolingual data (similar to the previous one, but the `merge-corpus` step loses its dependencies on the translated mono corpus)
- Adding extra pre-processing steps when using pre-trained models of different architectures (see the OpusMT PR)
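The use cases above boil down to config-driven toggles over subgraphs. A minimal sketch of the idea (all flag and kind names here are hypothetical and do not match the real experiment config schema):

```python
# A sketch of config-driven kind selection; flag and kind names are
# illustrative only, not the actual experiment config schema.
def wanted_kinds(config):
    """Decide which kinds belong in the task graph for this experiment."""
    kinds = {"merge-corpus", "train-student", "evaluate"}
    if config.get("pretrained-teacher"):
        # Pre-trained teacher: skip data augmentation entirely, but keep
        # the backward model for the cross-entropy filtering step.
        kinds |= {"download-teacher", "train-backward"}
    elif config.get("skip-augmentation") or not config.get("mono-datasets"):
        # No augmentation: the teacher is trained, but mono data is not
        # translated, so merge-corpus has fewer dependencies.
        kinds |= {"train-teacher", "train-backward"}
    else:
        # Full pipeline, including translation of monolingual data.
        kinds |= {"train-teacher", "train-backward", "translate-mono"}
    return kinds
```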
The conclusion is that the graph is not static: it should be adjusted based on the experiment config. Currently the only way to do this is with transforms, which we already use to find upstream tasks related to dataset cleaning. Each transform corresponds to only one kind. This approach makes it hard to say how exactly the graph will look in the end, and the code that changes the graph for each task independently is hard to understand and maintain. It also limits us: for example, we can't easily introduce a new kind for pre-trained model downloading and instead have to modify the `train-` kinds, which leads to redundant dependencies and incorrect caching.
The ideal solution would be similar to Snakemake's graph definition: one Python file where we can fully form and connect the graph based on conditions. Instead of removing dependencies for some steps in the transforms, we could skip their definition in `kind.yaml` and add the dependencies that are needed in this new "connect graph" script. We can think of it as a graph-level transform.
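Such a graph-level transform could look roughly like this (a sketch; `connect_graph`, the edge structure, and the config keys are all hypothetical, not existing taskgraph APIs):

```python
# Hypothetical "connect graph" step: kinds would be defined without
# dependencies in kind.yaml, and all edges added here in one place.
def connect_graph(config, tasks):
    """Return {task: set(dependencies)} based on the experiment config."""
    edges = {task: set() for task in tasks}
    use_augmentation = (
        not config.get("pretrained-teacher")
        and bool(config.get("mono-datasets"))
    )
    if use_augmentation:
        # merge-corpus only depends on translated mono corpora when data
        # augmentation is part of the experiment.
        edges["merge-corpus"].add("translate-mono")
        edges["translate-mono"].add("train-teacher")
    # The backward model is always needed for cross-entropy filtering.
    edges["ce-filter"].add("train-backward")
    return edges
```

The point is that every conditional rewiring lives in a single function, instead of being spread across per-kind transforms.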
If we define all the possible tasks/kinds in YAML and only need to connect the graph, the ideal place for this operation would be somewhere before the full_task_graph step in graph generation. It would be somewhere here in the TaskGraphGenerator code, before we start adding edges to the graph based on kind dependencies. If a task is disconnected from the graph, we should skip its analysis entirely. This requires adding support for this feature in TaskGraph.
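Skipping disconnected tasks could then be a simple reachability check before per-task analysis starts (a sketch only; the real TaskGraph and TaskGraphGenerator interfaces differ):

```python
from collections import deque

def reachable_tasks(edges, targets):
    """Return the tasks reachable from the target tasks by following
    dependency edges; anything outside this set can be dropped from the
    graph before its dependencies are analysed."""
    seen = set()
    queue = deque(targets)
    while queue:
        task = queue.popleft()
        if task in seen:
            continue
        seen.add(task)
        queue.extend(edges.get(task, ()))
    return seen
```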
Defining graphs in Python is a popular approach in other workflow managers with explicit graph definition (see Dagster, Metaflow, Airflow, Kubeflow). The aforementioned Snakemake uses an implicit approach based on the defined inputs and outputs, but we can control the flow by adding rule definitions based on conditions (example).
eu9ene added the taskcluster label (Issues related to the Taskcluster implementation of the training pipeline) on Feb 2, 2024