Using sweep in the train pipeline errors out #115

lhd0430 · 2023-09-14T16:30:25Z

Describe the bug or the issue that you are facing

I'm trying to implement hyperparameter tuning in the default train pipeline by setting up a sweep job. When I run the deploy-model-training-pipeline workflow, it fails at the run-model-training-pipeline / run-pipeline step.

Steps/Code to Reproduce

Replace the train_model job in mlops/azureml/train/pipeline.yml with the following code:

train_model:
  name: train_model
  display_name: train-model
  type: sweep
  trial:
    code: ../../../data-science/src
    command: >-
      python train.py
      --train_data ${{inputs.train_data}}
      --model_output ${{outputs.model_output}}
      --regressor__n_estimators ${{search_space.regressor__n_estimators}}
    environment: azureml:taxi-train-env@latest
  inputs:
    train_data: ${{parent.jobs.prep_data.outputs.train_data}}
  outputs:
    model_output: ${{parent.outputs.trained_model}}
  sampling_algorithm: random
  search_space:
    regressor__n_estimators:
      type: choice
      values: [100, 200]
  objective:
    goal: minimize
    primary_metric: train_mse
  limits:
    max_total_trials: 4
    max_concurrent_trials: 2
    timeout: 7200

Revise the main function in data-science/src/train.py as follows:

def main(args):
    '''Read train dataset, train model, save trained model'''

    # Read train data
    train_data = pd.read_parquet(Path(args.train_data))

    # Split the data into input(X) and output(y)
    y_train = train_data[TARGET_COL]
    X_train = train_data[NUMERIC_COLS + CAT_NOM_COLS + CAT_ORD_COLS]

    # Train a Random Forest Regression Model with the training set
    model = RandomForestRegressor(n_estimators = args.regressor__n_estimators,
                                  bootstrap = args.regressor__bootstrap,
                                  max_depth = args.regressor__max_depth,
                                  max_features = args.regressor__max_features,
                                  min_samples_leaf = args.regressor__min_samples_leaf,
                                  min_samples_split = args.regressor__min_samples_split,
                                  random_state=0)

    # log model hyperparameters
    mlflow.log_param("model", "RandomForestRegressor")
    mlflow.log_param("n_estimators", args.regressor__n_estimators)
    mlflow.log_param("bootstrap", args.regressor__bootstrap)
    mlflow.log_param("max_depth", args.regressor__max_depth)
    mlflow.log_param("max_features", args.regressor__max_features)
    mlflow.log_param("min_samples_leaf", args.regressor__min_samples_leaf)
    mlflow.log_param("min_samples_split", args.regressor__min_samples_split)

    # Train model with the train set
    model.fit(X_train, y_train)

    # Predict using the Regression Model
    yhat_train = model.predict(X_train)

    # Evaluate Regression performance with the train set
    r2 = r2_score(y_train, yhat_train)
    mse = mean_squared_error(y_train, yhat_train)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_train, yhat_train)

    # log model performance metrics
    mlflow.log_metric("train r2", r2)
    mlflow.log_metric("train_mse", mse)
    mlflow.log_metric("train rmse", rmse)
    mlflow.log_metric("train mae", mae)

    # Visualize results
    plt.scatter(y_train, yhat_train, color='black')
    plt.plot(y_train, y_train, color='blue', linewidth=3)
    plt.xlabel("Real value")
    plt.ylabel("Predicted value")
    plt.savefig("regression_results.png")
    mlflow.log_artifact("regression_results.png")

    # Save the model
    mlflow.sklearn.save_model(sk_model=model, path="model")

    from distutils.dir_util import copy_tree

    # copy subdirectory example
    from_directory = "model"
    to_directory = args.model_output

    copy_tree(from_directory, to_directory)

Run .github/workflows/tf-gha-deploy-infra.yml in GitHub Actions
Run .github/workflows/deploy-model-training-pipeline-classical.yml in GitHub Actions
The run fails during run-model-training-pipeline / run-pipeline with the following message:

Run run_id=$(az ml job create --file /home/runner/work/Azure_mlops_v2_demo/Azure_mlops_v2_demo/mlops/azureml/train/pipeline.yml --resource-group rg-mlopsv2-0040dev --workspace-name mlw-mlopsv2-0040dev --query name -o tsv)
Class WorkspaceHubOperations: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class AutoDeleteSettingSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class AutoDeleteConditionSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class BaseAutoDeleteSettingSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class IntellectualPropertySchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class ProtectionLevelSchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class BaseIntellectualPropertySchema: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
ERROR: Failed to find referenced source for input binding $parent.jobs.train_model.outputs.model_output
Error: Process completed with exit code 1.

Expected Output

The .github/workflows/deploy-model-training-pipeline-classical.yml workflow runs with no errors.

Versions

I'm using GitHub Actions and created my own repository following your guide and created a new dev branch.

Terraform

Azure ML CLI v2

Pre built examples from Tabular

Classic

Which platform are you using for deploying your infrastructure?

GitHub Actions (GitHub)

If you mentioned Others, please mention which platform you are using.

No response

What are you using for deploying your infrastructure?

Terraform

Are you using Azure ML CLI v2 or Azure ML Python SDK v2?

Azure ML CLI v2

Describe the example that you are trying to run?

Pre built examples from Tabular

setuc · 2023-09-18T02:39:25Z

The error message you've received,
ERROR: Failed to find referenced source for input binding $parent.jobs.train_model.outputs.model_output

indicates a problem with how the model_output binding of your train_model job is defined. The error refers to $parent.jobs.train_model.outputs.model_output, which looks wrong if it comes from train_model itself, because a job cannot reference its own outputs this way. A job's outputs are normally consumed as inputs by subsequent jobs, so I strongly suspect that this error is due to a circular reference between your steps.

To resolve this, make sure that model_output is defined in the outputs section of the train_model job and that every job using it as an input references it correctly. From the snippet you've posted, model_output seems to be correctly defined in the outputs section of the train_model job:

outputs:
  model_output: ${{parent.outputs.trained_model}}

Now, you need to ensure that in the subsequent jobs where model_output is being used as an input, it is correctly referenced. For instance, if it is used in a job called evaluate_job, it should be referenced as:

inputs:
  model_input: ${{parent.jobs.train_model.outputs.model_output}}

Check where you are using model_output and correct the appropriate references.
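
One way to look for a dangling reference of this kind before submitting is to scan the job definitions locally. The following is only a minimal sketch and not how the az ml CLI validates bindings: the function find_unresolved_bindings, the evaluate_job entry, and the dictionary layout (which mirrors the jobs section of pipeline.yml) are assumptions made for illustration.

```python
import re

# Matches bindings such as ${{parent.jobs.train_model.outputs.model_output}}
BINDING = re.compile(r"\$\{\{\s*parent\.jobs\.(\w+)\.outputs\.(\w+)\s*\}\}")


def find_unresolved_bindings(jobs):
    """Return (consumer, producer, output) for every input binding whose
    producer job or output name is not declared in the pipeline."""
    unresolved = []
    for consumer, spec in jobs.items():
        for value in spec.get("inputs", {}).values():
            match = BINDING.fullmatch(str(value).strip())
            if match is None:
                continue  # literal value or another kind of expression
            producer, output = match.groups()
            declared = jobs.get(producer, {}).get("outputs", {})
            if output not in declared:
                unresolved.append((consumer, producer, output))
    return unresolved


# Hypothetical pipeline; evaluate_job and its input name are illustrative only.
jobs = {
    "prep_data": {
        "outputs": {"train_data": "${{parent.outputs.train_data}}"},
    },
    "train_model": {
        "inputs": {"train_data": "${{parent.jobs.prep_data.outputs.train_data}}"},
        "outputs": {"model_output": "${{parent.outputs.trained_model}}"},
    },
    "evaluate_job": {
        "inputs": {"model_input": "${{parent.jobs.train_model.outputs.model_output}}"},
    },
}

print(find_unresolved_bindings(jobs))
```

An empty list means every parent.jobs binding points at an output that its producer job declares. If train_model lost its outputs section, the evaluate_job binding would be reported.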

On a different note, I noticed that train.py uses several hyperparameters, such as args.regressor__bootstrap and args.regressor__max_depth, that are not passed as command-line arguments in the command section of your train_model job definition. You will need to add every one of them to the command. Here is some sample code that you can adapt:

command: >-
  python train.py
  --train_data ${{inputs.train_data}}
  --model_output ${{outputs.model_output}}
  --regressor__n_estimators ${{search_space.regressor__n_estimators}}
  --regressor__bootstrap ${{search_space.regressor__bootstrap}}
  --regressor__max_depth ${{search_space.regressor__max_depth}}
  --regressor__max_features ${{search_space.regressor__max_features}}
  --regressor__min_samples_leaf ${{search_space.regressor__min_samples_leaf}}
  --regressor__min_samples_split ${{search_space.regressor__min_samples_split}}

Then define these hyperparameters in the search_space section of your YAML file:

search_space:
  regressor__n_estimators:
    type: choice
    values: [100, 200]
  regressor__bootstrap:
    type: choice
    values: [true, false]
  regressor__max_depth:
    type: choice
    values: [10, 20, 30, None]
  regressor__max_features:
    type: choice
    values: ["auto", "sqrt", "log2", None]
  regressor__min_samples_leaf:
    type: choice
    values: [1, 2, 4]
  regressor__min_samples_split:
    type: choice
    values: [2, 5, 10]
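
To get a feel for how much of this search space a run can cover, it helps to count the combinations. The snippet below is a rough sketch: it copies the values from the YAML above into a plain Python dictionary and compares the count with the max_total_trials: 4 limit from the sweep job in the report. Only the number of values per hyperparameter matters here, not how the sweep passes them to the script.

```python
from math import prod

# Values copied from the search_space above; only their counts are used.
search_space = {
    "regressor__n_estimators": [100, 200],
    "regressor__bootstrap": [True, False],
    "regressor__max_depth": [10, 20, 30, None],
    "regressor__max_features": ["auto", "sqrt", "log2", None],
    "regressor__min_samples_leaf": [1, 2, 4],
    "regressor__min_samples_split": [2, 5, 10],
}

max_total_trials = 4  # from the limits section of the sweep job in the report

# Number of distinct combinations a choice-only search space can produce.
total = prod(len(values) for values in search_space.values())
coverage = max_total_trials / total

print(f"{total} combinations, {coverage:.2%} covered by {max_total_trials} trials")
```

With random sampling and only a handful of trials, most combinations are never tried, so you may want to raise max_total_trials or narrow the search space.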

I also recommend using ArgumentParser in your Python file, something like this:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--regressor__n_estimators', type=int, default=100)
parser.add_argument('--regressor__bootstrap', type=bool, default=True)
parser.add_argument('--regressor__max_depth', type=int, default=None)
parser.add_argument('--regressor__max_features', type=str, default='auto')
parser.add_argument('--regressor__min_samples_leaf', type=int, default=1)
parser.add_argument('--regressor__min_samples_split', type=int, default=2)
# Add other arguments here...

args = parser.parse_args()
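
One caveat with a parser like the one above: argparse calls the type callable on the raw command-line string, so type=bool turns any non-empty string, including "False", into True, and type=int rejects the string "None". The sketch below shows one possible way to handle both cases. The helper names str_to_bool and int_or_none are hypothetical, and it is an assumption that the sweep renders these values as plain strings such as "False" or "None" on the command line.

```python
import argparse


def str_to_bool(value):
    """Parse 'true'/'false' style strings; type=bool would treat 'False' as True."""
    text = str(value).strip().lower()
    if text in ("true", "1", "yes"):
        return True
    if text in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def int_or_none(value):
    """Parse an integer, mapping the literal string 'None' to Python None."""
    return None if str(value).strip().lower() == "none" else int(value)


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("--regressor__bootstrap", type=str_to_bool, default=True)
    parser.add_argument("--regressor__max_depth", type=int_or_none, default=None)
    return parser


# Simulate the arguments a trial might receive.
args = build_parser().parse_args(
    ["--regressor__bootstrap", "False", "--regressor__max_depth", "None"]
)
print(args.regressor__bootstrap, args.regressor__max_depth)
```

The same pattern could be extended to max_features, which also mixes strings with None in the search space.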

Hope this helps

lhd0430 added the Bug and Needs Triage labels Sep 14, 2023

lhd0430 assigned setuc Sep 14, 2023

setuc added the ✅ resolved label and removed the Needs Triage and Bug labels Sep 18, 2023
