This repository has been archived by the owner on Oct 11, 2022. It is now read-only.

Updating extract_workflows_data.sh to be a python script #6

Open
CKrawczyk opened this issue Sep 12, 2019 · 0 comments

All of the aggregation code's command line API can be called from within a Python script. This can make interfacing with convert_to_ibcc easier in the long run. The Python version of this script would look like:

from panoptes_aggregation.scripts.config_workflow_panoptes import config_workflow
from panoptes_aggregation.scripts.extract_panoptes_csv import extract_csv
from io import StringIO
import pandas
import os

WORKFLOW_FILE = ''
DATA_OUT_DIR = ''
DATA_IN_DIR = ''

config_dir = os.path.join(DATA_OUT_DIR, 'config')
classification_csv_file = os.path.join(DATA_IN_DIR, 'classifications.csv')
subject_csv_file = os.path.join(DATA_IN_DIR, 'subjects.csv')

workflows = pandas.read_csv(WORKFLOW_FILE)
workflow_ids = workflows.workflow_id.unique()

for workflow_id in workflow_ids:
    extractor_config, _, task_label_config = config_workflow(
        WORKFLOW_FILE,
        workflow_id
    )
    print('Exporting data from workflow: {0}'.format(workflow_id))
    extract_filenames = extract_csv(
        classification_csv_file,
        StringIO(str(extractor_config)),
        output_dir=DATA_OUT_DIR,
        order=True,
        output_name='workflow_{0}_classifications'.format(workflow_id)
    )
    if len(extract_filenames) > 0:
        point_extracts_filename = [filename for filename in extract_filenames if 'point_extractor_by_frame' in filename][0]
        question_extracts_filename = [filename for filename in extract_filenames if 'question_extractor' in filename][0]
        ## Call convert_to_ibcc here

The three empty strings at the top can be read in from the command line (e.g. via sys.argv or argparse).
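As a rough sketch of the argparse option, the three values could be filled in like this. The argument names and help strings are my own placeholders, not anything the existing shell script defines:

```python
import argparse


def parse_args(argv=None):
    """Parse the three paths the script needs (argument names are illustrative)."""
    parser = argparse.ArgumentParser(
        description='Extract per-workflow classification data.'
    )
    parser.add_argument('workflow_file', help='path to the workflows CSV export')
    parser.add_argument('data_in_dir', help='directory holding classifications.csv')
    parser.add_argument('data_out_dir', help='directory to write configs and extracts to')
    # With argv=None argparse falls back to sys.argv[1:]
    return parser.parse_args(argv)


# Example call with an explicit argument list instead of sys.argv
args = parse_args(['workflows.csv', 'data_in', 'data_out'])
WORKFLOW_FILE = args.workflow_file
DATA_IN_DIR = args.data_in_dir
DATA_OUT_DIR = args.data_out_dir
```

In the real script you would call parse_args() with no arguments so the values come from the command line.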

The config_workflow function returns the extractor config as a dict, the reducer configs as a list of dicts (there is currently a bug, so only the last reducer config is returned instead of all of them), and the task labels as a dict.

The extract_csv function returns a list of file paths, one for each extraction file written to disk. The second argument to this function is expected to be a filename (or any file-like object), so I use StringIO to turn the extractor config dict into a string that can be read like a file. You could also construct the path to the config file that was written to disk if you wanted.
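To show what the StringIO step is doing on its own, here is a small sketch. The contents of the dict are made up for illustration; the real one comes back from config_workflow:

```python
from io import StringIO

# Made-up stand-in for the dict returned by config_workflow
extractor_config = {'workflow_id': 1234, 'workflow_version': '1.1'}

# Wrap the dict's string form in an in-memory, file-like object
config_file = StringIO(str(extractor_config))

# Anything expecting an open text file can now call .read() on it
config_text = config_file.read()
```

Nothing is written to disk here; the object only behaves like a file that has already been opened for reading.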

The call to the convert_to_ibcc code should go at the end of this script. It might be best if convert_to_ibcc's argparse section was converted to a Python function that can be imported and called directly.
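One possible shape for that refactor is sketched below. I have not based this on the actual convert_to_ibcc code, so the function signature, the argument names, and the placeholder body are all assumptions; the point is only the split between an importable function and a thin argparse wrapper:

```python
import argparse


def convert_to_ibcc(point_extracts_filename, question_extracts_filename, output_dir='.'):
    """Hypothetical importable entry point; the real conversion logic would go here."""
    # Placeholder body: just echo the inputs back
    return {
        'points': point_extracts_filename,
        'questions': question_extracts_filename,
        'output_dir': output_dir,
    }


def main(argv=None):
    """Thin command line wrapper that only parses arguments and delegates."""
    parser = argparse.ArgumentParser(description='Convert extracts for IBCC.')
    parser.add_argument('points', help='point extractor CSV file')
    parser.add_argument('questions', help='question extractor CSV file')
    parser.add_argument('--output-dir', default='.', help='where to write results')
    args = parser.parse_args(argv)
    return convert_to_ibcc(args.points, args.questions, output_dir=args.output_dir)


# Direct call from another script, no command line involved
result = convert_to_ibcc('points.csv', 'questions.csv', output_dir='out')
```

With this layout the loop in the script above could call convert_to_ibcc(point_extracts_filename, question_extracts_filename) directly, while the command line usage keeps working through main.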
