This code supports the three-part solution Predicting Customer Lifetime Value with Cloud ML Engine, published on cloud.google.com.
It is also used in the updated solution Predicting Customer Lifetime Value with AutoML Tables.
This project shows how to use ML models to predict customer lifetime value in the following context:
- We apply the models to this data set [1].
- We provide an implementation using a TensorFlow DNN model with batch normalization and dropout.
- We provide an implementation, using the Lifetimes library in Python, of statistical models commonly used in industry to perform lifetime value prediction.
- We also provide an implementation using AutoML Tables.
The project also shows how to deploy a production-ready data processing pipeline for lifetime value prediction on Google Cloud Platform, using BigQuery and Datastore with orchestration provided by Cloud Composer.
The code works with Python 2 and 3. To set up an environment using Miniconda2:
sudo apt-get install -y git bzip2
wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh
bash Miniconda2-latest-Linux-x86_64.sh -b
export PATH=~/miniconda2/bin:$PATH
conda create -y -n clv
source activate clv
conda install -y -n clv python=2.7 pip
pip install -r requirements.txt
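As a quick sanity check, you can confirm that the core libraries import correctly. This assumes TensorFlow and Lifetimes are listed in requirements.txt:
# Optional: verify that the main dependencies are importable.
python -c "import tensorflow; print(tensorflow.__version__)"
python -c "import lifetimes; print(lifetimes.__version__)"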
Enable the following APIs for your project (a gcloud example follows this list):
- Cloud Composer API
- Machine Learning API (for TensorFlow / Lifetimes models)
- Dataflow API
- AutoML Tables API (for AutoML Tables models)
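You can enable these APIs from the Cloud Console or with gcloud, for example (using the standard service names for these products):
# Enable the APIs used by the solution.
gcloud services enable composer.googleapis.com ml.googleapis.com \
dataflow.googleapis.com automl.googleapis.com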
Before running the training and Airflow scripts, set the following environment variables:
export PROJECT=$(gcloud config get-value project 2> /dev/null)
export BUCKET=gs://${PROJECT}_data_final
export REGION=us-central1
export DATASET_NAME=ltv
export COMPOSER_NAME="clv-final"
export COMPOSER_BUCKET_NAME=${PROJECT}_composer_final
export COMPOSER_BUCKET=gs://${COMPOSER_BUCKET_NAME}
export DF_STAGING=${COMPOSER_BUCKET}/dataflow_staging
export DF_ZONE=${REGION}-a
export SQL_MP_LOCATION="sql"
export LOCAL_FOLDER=$(pwd)
Create the Cloud Storage buckets and the BigQuery dataset:
gsutil mb -l ${REGION} -p ${PROJECT} ${BUCKET}
gsutil mb -l ${REGION} -p ${PROJECT} ${COMPOSER_BUCKET}
bq --location=US mk --dataset ${PROJECT}:${DATASET_NAME}
Create a Datastore database as detailed in the Datastore documentation.
Copy the sample data and the prediction input files to the buckets:
gsutil cp gs://solutions-public-assets/ml-clv/db_dump.csv ${BUCKET}
gsutil cp ${BUCKET}/db_dump.csv ${COMPOSER_BUCKET}
gsutil cp clv_mle/to_predict.json ${BUCKET}/predictions/
gsutil cp ${BUCKET}/predictions/to_predict.json ${COMPOSER_BUCKET}/predictions/
gsutil cp clv_mle/to_predict.csv ${BUCKET}/predictions/
gsutil cp ${BUCKET}/predictions/to_predict.csv ${COMPOSER_BUCKET}/predictions/
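If you want to double-check that the files landed where the DAGs expect them, you can list the Composer bucket (optional):
# Optional: confirm the copied files are in place.
gsutil ls ${COMPOSER_BUCKET}
gsutil ls ${COMPOSER_BUCKET}/predictions/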
Creating a service account is important to make sure that your Cloud Composer instance can perform the required tasks within BigQuery, AutoML Tables, ML Engine, Dataflow, Cloud Storage and Datastore. It is also needed to run training for AutoML locally.
The following creates a service account called composer@[YOUR-PROJECT-ID].iam.gserviceaccount.com and assigns it the required roles.
gcloud iam service-accounts create composer --display-name composer --project ${PROJECT}
gcloud projects add-iam-policy-binding ${PROJECT} \
--member serviceAccount:composer@${PROJECT}.iam.gserviceaccount.com \
--role roles/composer.worker
gcloud projects add-iam-policy-binding ${PROJECT} \
--member serviceAccount:composer@${PROJECT}.iam.gserviceaccount.com \
--role roles/bigquery.dataEditor
gcloud projects add-iam-policy-binding ${PROJECT} \
--member serviceAccount:composer@${PROJECT}.iam.gserviceaccount.com \
--role roles/bigquery.jobUser
gcloud projects add-iam-policy-binding ${PROJECT} \
--member serviceAccount:composer@${PROJECT}.iam.gserviceaccount.com \
--role roles/storage.admin
gcloud projects add-iam-policy-binding ${PROJECT} \
--member serviceAccount:composer@${PROJECT}.iam.gserviceaccount.com \
--role roles/ml.developer
gcloud projects add-iam-policy-binding ${PROJECT} \
--member serviceAccount:composer@${PROJECT}.iam.gserviceaccount.com \
--role roles/dataflow.developer
gcloud projects add-iam-policy-binding ${PROJECT} \
--member serviceAccount:composer@${PROJECT}.iam.gserviceaccount.com \
--role roles/compute.viewer
gcloud projects add-iam-policy-binding ${PROJECT} \
--member serviceAccount:composer@${PROJECT}.iam.gserviceaccount.com \
--role roles/storage.objectAdmin
gcloud projects add-iam-policy-binding ${PROJECT} \
--member serviceAccount:composer@${PROJECT}.iam.gserviceaccount.com \
--role='roles/automl.editor'
Wait until the service account has all the proper roles set up.
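One way to verify the bindings is to list the roles granted to the service account (an optional check using standard gcloud filtering):
# List the roles currently bound to the composer service account.
gcloud projects get-iam-policy ${PROJECT} \
--flatten="bindings[].members" \
--filter="bindings.members:composer@${PROJECT}.iam.gserviceaccount.com" \
--format="value(bindings.role)"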
Create a service account key and download the JSON key file; it is needed to run AutoML training locally.
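For example, using the mykey.json file name that the AutoML training script expects by default:
# Create and download a JSON key for the composer service account.
gcloud iam service-accounts keys create mykey.json \
--iam-account composer@${PROJECT}.iam.gserviceaccount.com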
If you use the TensorFlow or Lifetimes models, package and upload the trainer code once as shown below. If you change any of the Python files in clv_mle, repeat these steps.
cd ${LOCAL_FOLDER}/clv_mle
rm -rf clv_ml_engine.egg-info/
rm -rf dist
python setup.py sdist
gsutil cp dist/* ${COMPOSER_BUCKET}/code/
The notebooks folder contains notebooks for data exploration and modeling with linear models and AutoML Tables. To open them, start Jupyter:
jupyter notebook
If you are interested in using Jupyter with Datalab, you can do the following:
jupyter nbextension install --py datalab.notebook --sys-prefix
jupyter notebook
Then enter the following in the first cell of your notebook:
%load_ext google.datalab.kernel
You can train the model using the script clv_automl/clv_automl.py. The script takes several parameters; see its usage message for the full list of parameters and their default values.
Make sure you have downloaded the JSON key file for the service account. By default the script expects it to be named mykey.json and located in the same directory as the script.
For example:
cd ${LOCAL_FOLDER}/clv_automl
python clv_automl.py --project_id [YOUR_PROJECT]
To run training or hyperparameter tuning for the non-AutoML models, use the mltrain.sh script. It must be run from the top-level directory, as in the examples below. For ML Engine jobs you must supply a bucket on GCS. The job data folder will be gs://bucket/data and the job directory will be gs://bucket/jobs, so your data files must already be in gs://bucket/data. If you use ${COMPOSER_BUCKET} and the DAG has been run at least once, the data files will be present. For DNN models the data files should be named 'train.csv', 'eval.csv' and 'test.csv'; for probabilistic models the file must be 'btyd.csv'.
For example:
Train the DNN model on local data:
cd ${LOCAL_FOLDER}
gsutil -m cp -r ${COMPOSER_BUCKET}/data .
run/mltrain.sh local data
Train the DNN model in a Cloud ML Engine job on data in the ${COMPOSER_BUCKET}:
run/mltrain.sh train ${COMPOSER_BUCKET}
Run hyperparameter tuning on Cloud ML Engine:
run/mltrain.sh tune gs://your-bucket
For statistical models:
run/mltrain.sh local data --model_type paretonbd_model --threshold_date 2011-08-08 --predict_end 2011-12-12
This code is set up to run automatically using Cloud Composer, a Google-managed version of Apache Airflow. The following steps describe how to go from your own copy of the data to a deployed model, with results exported to both Datastore and BigQuery.
See part three of the solution for more details.
Create the Cloud Composer environment; this will take a while. This project assumes Airflow 1.9.0, which was the default for Cloud Composer as of March 2019.
gcloud composer environments create ${COMPOSER_NAME} \
--location ${REGION} \
--zone ${REGION}-f \
--machine-type n1-standard-1 \
--service-account=composer@${PROJECT}.iam.gserviceaccount.com
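While you wait, you can poll the environment state; it should eventually report RUNNING:
# Check the provisioning state of the Cloud Composer environment.
gcloud composer environments describe ${COMPOSER_NAME} \
--location ${REGION} \
--format="value(state)"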
There are various ways of running BigQuery queries. This solution reads the SQL files directly, so for the DAGs to access them they need to be in the DAGs folder.
The following command copies the entire sql folder as a subfolder of the Airflow dags folder.
cd ${LOCAL_FOLDER}/preparation
gcloud composer environments storage dags import \
--environment ${COMPOSER_NAME} \
--source ${SQL_MP_LOCATION} \
--location ${REGION} \
--project ${PROJECT}
Some additional files are needed when running the DAGs. They must be placed in the Composer bucket:
1 - The BigQuery schema file used to load data into BigQuery
cd ${LOCAL_FOLDER}
gsutil cp ./run/airflow/schema_source.json ${COMPOSER_BUCKET}
2 - A JavaScript file used by the Dataflow template for processing.
gsutil cp ./run/airflow/gcs_datastore_transform.js ${COMPOSER_BUCKET}
Set the following Airflow variables, which are used by the DAGs.
Region where the pipeline runs
gcloud composer environments run ${COMPOSER_NAME} --location ${REGION} variables set \
-- \
region ${REGION}
Staging location for Dataflow
gcloud composer environments run ${COMPOSER_NAME} --location ${REGION} variables set \
-- \
df_temp_location ${DF_STAGING}
Zone where Dataflow should run
gcloud composer environments run ${COMPOSER_NAME} --location ${REGION} variables set \
-- \
df_zone ${DF_ZONE}
BigQuery working dataset
gcloud composer environments run ${COMPOSER_NAME} --location ${REGION} variables set \
-- \
dataset ${DATASET_NAME}
Composer bucket
gcloud composer environments run ${COMPOSER_NAME} --location ${REGION} variables set \
-- \
composer_bucket_name ${COMPOSER_BUCKET_NAME}
AutoML Dataset name
gcloud composer environments run ${COMPOSER_NAME} --location ${REGION} variables set \
-- \
automl_dataset "clv_solution"
AutoML Model name
gcloud composer environments run ${COMPOSER_NAME} --location ${REGION} variables set \
-- \
automl_model "clv_model"
AutoML training budget
gcloud composer environments run ${COMPOSER_NAME} --location ${REGION} variables set \
-- \
automl_training_budget "1"
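To confirm that a variable was stored, you can read it back. The command below assumes the same Airflow CLI syntax as the variables set calls above; adjust it if your Composer environment runs a different Airflow version:
# Optional: read back one of the Airflow variables.
gcloud composer environments run ${COMPOSER_NAME} --location ${REGION} variables get \
-- \
dataset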
Copy the clv_automl module into the DAGs folder so it is available to the DAGs:
gcloud composer environments storage dags import \
--environment ${COMPOSER_NAME} \
--source clv_automl \
--location ${REGION} \
--project ${PROJECT}
Install the additional Python packages required by the DAGs into the Cloud Composer environment:
gcloud composer environments update ${COMPOSER_NAME} \
--update-pypi-packages-from-file run/airflow/requirements.txt \
--location ${REGION} \
--project ${PROJECT}
You need to import each of your DAG files. This solution has two DAGs, located in the run/airflow/dags folder.
gcloud composer environments storage dags import \
--environment ${COMPOSER_NAME} \
--source run/airflow/dags/01_build_train_deploy.py \
--location ${REGION} \
--project ${PROJECT}
gcloud composer environments storage dags import \
--environment ${COMPOSER_NAME} \
--source run/airflow/dags/02_predict_serve.py \
--location ${REGION} \
--project ${PROJECT}
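To verify that both DAG files were uploaded, list the objects in the environment's dags folder:
# Optional: list the DAG files known to the Cloud Composer environment.
gcloud composer environments storage dags list \
--environment ${COMPOSER_NAME} \
--location ${REGION} \
--project ${PROJECT}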
You should now have both DAGs and the SQL files in the Cloud Composer bucket. Because you probably want to run training and prediction tasks independently, you can trigger the DAGs below as needed. For more automated runs (daily, for example), refer to the Airflow documentation to set up your DAGs accordingly.
The DAGs accept various parameters as inputs.
The following are used within the .sql files through the syntax {{ dag_run.conf['PARAMETER-NAME'] }}:
- project: Project ID where the data is located
- dataset: Dataset that is used to write and read the data
- predict_end: The final date of the whole sales dataset
- threshold_date: The date used to split the data
Other parameters depend on your environment and are passed directly to the operators:
- model_type: Name of the model that you want to use. Should be either 'automl' or one of the options from model.py
- project: Your project id
- dataset: Your dataset id
- threshold_date: Date that separates features from target
- predict_end: End date of the dataset
- model_name: Name of the model saved to AutoML Tables or Machine Learning Engine
- model_version: Name of the version of model_name saved to Machine Learning Engine (not used for AutoML Tables)
- tf_version: TensorFlow version to be used
- max_monetary: Monetary cap to discard all customers that spend more than that amount
Run the build_train_deploy DAG:
gcloud composer environments run ${COMPOSER_NAME} \
--project ${PROJECT} \
--location ${REGION} \
dags trigger \
-- \
build_train_deploy \
--conf '{"model_type":"automl", "project":"'${PROJECT}'", "dataset":"'${DATASET_NAME}'", "threshold_date":"2011-08-08", "predict_end":"2011-12-12", "model_name":"clv_automl", "model_version":"v1", "tf_version":"1.10", "max_monetary":"15000"}'
Run the predict_serve DAG:
gcloud composer environments run ${COMPOSER_NAME} \
--project ${PROJECT} \
--location ${REGION} \
dags trigger \
-- \
predict_serve \
--conf '{"model_name":"clv_automl", "model_version":"v1", "dataset":"'${DATASET_NAME}'"}'
[1]: Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.