Possibility to use the output of a task in a workflow? #21

chrstn-hntschl · 2016-05-19T09:18:55Z

Hi,

I am using sciluigi in classification experiments where I would like to have one task per model trainer. The number of model trainers is determined by a list of categories/labels for which models need to be trained, and that list is defined in a YAML dataset descriptor file. I would like to either pass a subset of categories as a parameter (easy) or, if no categories are given, load the dataset from the YAML file and extract the full list of categories from that descriptor. Because loading is lengthy due to some verification, DatasetProvider is a task in itself, which validates the descriptor and stores a pickled version.
I.e. in my workflow() routine I have something like:

import luigi
import sciluigi as sl

class MyWorkflow(sl.WorkflowTask):

    dataset_path = luigi.Parameter(description="path to the dataset descriptor file")
    categories = TupleParameter(default=(), description="tuple with all category labels for which model files should be trained")

    def workflow(self):
        if not self.categories:
            # FIXME: load categories from dataset_path (using a DatasetProviderTask) and set self.categories accordingly....
            pass
        ...
        for c in self.categories:
            ...
            model_trainer = self.new_task('model_trainer_' + c,
                                          ModelTrainer,
                                          trainer_params=...
                                          )

Any idea on how to solve this?
Many thanks in advance!

samuell · 2016-11-08T17:37:55Z

Hi @chrstn-hntschl!

I don't know how I have managed to miss your issue 😕 ...

Did you solve this?

There is an inherent problem in Luigi: scheduling and running the workflow happen separately, and (as far as I know) you can't really access parameter values during the scheduling phase of the workflow, only during the running phase.

Thus, you can't easily set up the workflow differently based on parameter values, but have to rely on information that can be read by normal Python code during scheduling (in your workflow() method).
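As a rough illustration of "information that can be read in by normal Python code during scheduling", here is a minimal sketch of resolving the category list before any tasks are created. The helper names (`resolve_categories`, `task_ids`), the injected `load_descriptor` callable, and the `"categories"` key are all assumptions made for illustration; the actual descriptor layout is not shown in this thread.

```python
from typing import Callable, List, Sequence, Tuple


def resolve_categories(
    categories: Sequence[str],
    dataset_path: str,
    load_descriptor: Callable[[str], dict],
) -> Tuple[str, ...]:
    """Return the explicit categories, or fall back to the descriptor file.

    `load_descriptor` is a hypothetical callable that parses the descriptor
    at `dataset_path` into a dict (for example a thin wrapper around a YAML
    parser). It is only called when no categories were given.
    """
    if categories:
        return tuple(categories)
    descriptor = load_descriptor(dataset_path)
    # "categories" is an assumed key name, not taken from the original post.
    return tuple(descriptor["categories"])


def task_ids(categories: Sequence[str]) -> List[str]:
    """Build one task name per category, mirroring 'model_trainer_' + c."""
    return ["model_trainer_" + c for c in categories]


if __name__ == "__main__":
    # Stand-in loader so the sketch runs without any descriptor file.
    fake_loader = lambda path: {"categories": ["cat", "dog"]}
    print(task_ids(resolve_categories((), "dataset.yaml", fake_loader)))
```

Inside workflow(), something like `resolve_categories(self.categories, self.dataset_path, my_loader)` could then drive the `for c in ...` loop. Note the trade-off: this reads the descriptor directly at scheduling time and so bypasses the validation and pickling done by the DatasetProvider task.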

Luigi has had functionality for dynamic dependencies for some time, but it only lets you specify upstream tasks dynamically, not downstream tasks, which is what I think is most often needed.

This constraint of Luigi's scheduling model is what made us start experimenting with SciPipe, a workflow engine based on the dataflow paradigm instead, where scheduling and execution happen concurrently all the time, which makes these kinds of things possible.

It is still a bit crude and not yet used in production, but it has quite a few tests and example workflows, and it is the tool we plan to use for our upcoming computational projects in the near future.

Hope these pointers are of some help!
