
v1.0.1 release
jitendra42 committed Dec 29, 2022
1 parent 52d9781 commit d96507f
Showing 17 changed files with 905 additions and 8 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -1,6 +1,8 @@
 analytics/classical-ml/recsys/training/analytics-with-python
 analytics/classical-ml/synthetic/inference/wafer-insights
 analytics/tensorflow/ssd_resnet34/inference/video-streamer
+big-data/aiok-ray/training/AIOK_Ray/
+big-data/aiok-ray/inference/AIOK_Ray/
 big-data/ppml/chronos
 big-data/friesian/training/BigDL
 big-data/ppml/bigDL-ppml
6 changes: 6 additions & 0 deletions .gitmodules
@@ -37,3 +37,9 @@
 [submodule "transfer_learning/tensorflow/resnet50/training/transfer-learning-training"]
 path = transfer_learning/tensorflow/resnet50/training/transfer-learning-training
 url = https://github.com/intel/vision-based-transfer-learning-and-inference
+[submodule "big-data/aiok-ray/inference/AIOK_Ray"]
+path = big-data/aiok-ray/inference/AIOK_Ray
+url = https://github.com/intel/e2eAIOK/
+[submodule "big-data/aiok-ray/training/AIOK_Ray"]
+path = big-data/aiok-ray/training/AIOK_Ray
+url = https://github.com/intel/e2eAIOK/
1 change: 1 addition & 0 deletions README.md
@@ -46,6 +46,7 @@ Where `KEY` and `VALUE` pairs are environment variables that can be used to cust
 |NLP workflow for Azure ML|PyTorch* and Jupyter|[Training](./language_modeling/pytorch/bert_base/training/) \| [Inference](./language_modeling/pytorch/bert_base/inference/)|
 |Protein Structure Prediction|PyTorch*|[Inference](./protein-folding/pytorch/alphafold2/inference/)
 |Quantization Aware Training and Inference|OpenVINO™|[Quantization Aware Training(QAT)](https://github.com/intel/nlp-training-and-inference-openvino/tree/v1.0/question-answering-bert-qat)|
+|Ray Recommendation System|Ray with PyTorch*|[Training](./big-data/aiok-ray/training/) \| [Inference](./big-data/aiok-ray/inference)|
 |RecSys Challenge Analytics With Python|Hadoop and Spark|[Training](./analytics/classical-ml/recsys/training/)|
 |Video Streamer|TensorFlow|[Inference](./analytics/tensorflow/ssd_resnet34/inference/)|
 |Vision Based Transfer Learning|TensorFlow|[Training](./transfer_learning/tensorflow/resnet50/training/) \| [Inference](./transfer_learning/tensorflow/resnet50/inference/)|
2 changes: 1 addition & 1 deletion analytics/classical-ml/synthetic/inference/README.md
@@ -59,7 +59,7 @@ End2End AI Workflow utilizing a Flask dashboard to allow users to predict FMAX/I
## Quick Start
* Pull and configure the dependent repo submodule `git submodule update --init --recursive`.

-* Install [Pipeline Repository Dependencies]()
+* Install [Pipeline Repository Dependencies](../../../../README.md)

* Get the public IP address of your machine with either `ip a` or https://whatismyipaddress.com/
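  For example, from a shell (a quick sketch; `ifconfig.me` is an assumed public-IP echo service, one of several alternatives to the site above):
  ```bash
  ip a                         # list local interface addresses
  curl -s https://ifconfig.me  # print the public IP (assumed third-party service)
  ```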

1 change: 1 addition & 0 deletions big-data/aiok-ray/inference/AIOK_Ray
Submodule AIOK_Ray added at e559cd
171 changes: 171 additions & 0 deletions big-data/aiok-ray/inference/DEVCATALOG.md
@@ -0,0 +1,171 @@
# Distributed training and inference on Ray and Spark with Intel® End-to-End Optimization Kit

## Overview
Modern recommendation systems require a complex pipeline that handles both data processing and feature engineering at tremendous scale, while promising tight service level agreements for complex deep learning models. Usually this leads to two separate clusters: a data processing cluster, typically CPU based, that processes huge datasets (terabytes or petabytes) stored in a distributed storage system, and a training cluster, typically GPU based. Keeping data processing and training on separate clusters results in a complex software stack and heavy data movement costs.
Meanwhile, the deep learning models commonly used in recommendation systems are quite often over-parameterized; training such a heavy model on CPU can take several days or even weeks.
This workflow addresses those pain points: it unifies data processing and training on Ray – the open source project that makes it simple to scale any compute-intensive Python workload – and then optimizes the end-to-end process with Intel® End-to-End Optimization Kit, especially the training part: shortening training time with lighter-weight models that maintain the same accuracy, and delivering high-throughput models for inference.

> Important: original source disclosure - [Deep Learning Recommendation Model](https://github.com/facebookresearch/dlrm)

## How it works
![image](https://user-images.githubusercontent.com/18349036/209234932-12100303-16d7-4352-9d7b-ed23f4cf7028.png)

## Get Started
> NOTE: Before you get started, make sure you have the trained model from the training pipeline available.

### Prerequisites
```bash
git clone https://github.com/intel/e2eAIOK/ AIOK_Ray
cd AIOK_Ray
git checkout tags/aiok-ray-v0.2
git submodule update --init --recursive
sh dlrm_all/dlrm/patch_dlrm.sh
cd ..
```
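To sanity-check the checkout before continuing, you can confirm the expected tag is present (an optional verification step, not part of the original guide):
```bash
# Should print aiok-ray-v0.2, the tag checked out above
git -C AIOK_Ray describe --tags
```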

### Docker
The setup and how-to-run sections below are for users who want to use the provided docker image to run the entire pipeline.

#### Pull Docker Image
```bash
docker pull intel/ai-workflows:recommendation-ray-inference
```
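To confirm the pull succeeded, you can list the local copies of the image (an optional check, not part of the original steps):
```bash
# The recommendation-ray-inference tag should appear in this list
docker images intel/ai-workflows
```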

#### How to run

The code snippet below runs the inference session. The model files will be generated and stored in the `/output/result` folder.

(Optional) Export the relevant proxy settings into the docker environment.
```bash
export DOCKER_RUN_ENVS="-e ftp_proxy=${ftp_proxy} \
-e FTP_PROXY=${FTP_PROXY} -e http_proxy=${http_proxy} \
-e HTTP_PROXY=${HTTP_PROXY} -e https_proxy=${https_proxy} \
-e HTTPS_PROXY=${HTTPS_PROXY} -e no_proxy=${no_proxy} \
-e NO_PROXY=${NO_PROXY} -e socks_proxy=${socks_proxy} \
-e SOCKS_PROXY=${SOCKS_PROXY}"
```

```bash
if [ ! "$(docker network ls | grep ray-inference)" ]; then
docker network create --driver=bridge ray-inference;
fi
export OUTPUT_DIR=/output
export DATASET_DIR=<Path to Dataset>
export RUN_MODE=<Pick from kaggle/criteo_small/criteo_full>

docker run \
-a stdout $DOCKER_RUN_ENVS \
--env RUN_MODE=${RUN_MODE} \
--env APP_DIR=/home/vmagent/app/e2eaiok \
--env OUTPUT_DIR=/output \
--volume ${DATASET_DIR}:/home/vmagent/app/dataset/criteo \
--volume $(pwd)/AIOK_Ray:/home/vmagent/app/e2eaiok \
--volume ${OUTPUT_DIR}:/output \
--workdir /home/vmagent/app/e2eaiok/dlrm_all/dlrm/ \
--privileged --init -it \
--shm-size=300g --network ray-inference \
--device=/dev/dri \
--hostname ray \
--name ray-inference \
intel/ai-workflows:recommendation-ray-inference \
bash -c 'bash $APP_DIR/scripts/run_inference_docker.sh $RUN_MODE'
```
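When the container exits, a quick way to confirm the run produced results (assuming the same `OUTPUT_DIR` exported above):
```bash
# Inspect the generated model files and the tail of the container logs
ls -l ${OUTPUT_DIR}/result
docker logs ray-inference | tail -n 20
```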
------

## Useful Resources

## Recommended Hardware and OS

- Operating System: Ubuntu 20.04
- Memory: minimum 250 GB

| Dataset Name | Recommended Disk Size |
| --- | --- |
| Kaggle | 90G |
| Criteo Small | 800G |
| Criteo Full | 4500G |

### Dataset
> Note: For the Kaggle run, `train.csv` and `test.csv` are required.

Kaggle: https://go.criteo.net/criteo-research-kaggle-display-advertising-challenge-dataset.tar.gz

> Note: For the Criteo small run, `day_0`, `day_1`, `day_2`, `day_3`, and `day_23` are required.
> Note: For the Criteo full run, `day_0` through `day_23` are required.

Criteo small and Criteo full: https://labs.criteo.com/2013/12/download-terabyte-click-logs/
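For the Kaggle dataset, a minimal download-and-extract sketch (assumes `DATASET_DIR` is the dataset folder mounted into the container, and that the archive provides the required train/test files):
```bash
# Download and unpack the Criteo Kaggle display-advertising dataset
wget https://go.criteo.net/criteo-research-kaggle-display-advertising-challenge-dataset.tar.gz -P ${DATASET_DIR}
tar -xzf ${DATASET_DIR}/criteo-research-kaggle-display-advertising-challenge-dataset.tar.gz -C ${DATASET_DIR}
```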

### Step by Step Guide

[Option 1] Build the docker image for a single node or for multiple nodes, and enter the container with the click-to-run script:
```bash
python3 scripts/start_e2eaiok_docker.py
sshpass -p docker ssh ${local_host_name} -p 12346
# If you hit a network error or a package-not-found error, follow the log output to fix it, then re-run the command above.
# If you are behind a proxy, use the commands below instead:
# python3 scripts/start_e2eaiok_docker.py --proxy "http://ip:port"
# sshpass -p docker ssh ${local_host_name} -p 12346
# If your disk space is limited, point spark_shuffle_dir and dataset_path at a different mounted volume:
# python3 scripts/start_e2eaiok_docker.py --spark_shuffle_dir "" --dataset_path ""
# sshpass -p docker ssh ${local_host_name} -p 12346
```
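The commands above assume `local_host_name` is already set; a minimal way to set it beforehand (an assumption, not part of the original script):
```bash
# Assumption: ssh back into the current host, where port 12346 was exposed
export local_host_name=$(hostname)
```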

[Option 2] Build the docker image manually:
```bash
# prepare a folder for the dataset
cd frameworks.bigdata.AIOK
cur_path=`pwd`
mkdir -p ../e2eaiok_dataset
# download miniconda
wget https://repo.anaconda.com/miniconda/Miniconda3-py37_4.12.0-Linux-x86_64.sh -P Dockerfile-ubuntu18.04/ -O Dockerfile-ubuntu18.04/miniconda.sh
# build the docker image from the dockerfile
docker build -t e2eaiok-ray-pytorch Dockerfile-ubuntu18.04 -f Dockerfile-ubuntu18.04/DockerfilePytorch
# if you are behind a proxy:
# docker build -t e2eaiok-ray-pytorch Dockerfile-ubuntu18.04 -f Dockerfile-ubuntu18.04/DockerfilePytorch --build-arg http_proxy=http://{ip}:{port} --build-arg https_proxy=http://{ip}:{port}
# run the container
docker run -it --shm-size=300g --privileged --network host --device=/dev/dri \
  -v ${cur_path}/../e2eaiok_dataset/:/home/vmagent/app/dataset \
  -v ${cur_path}:/home/vmagent/app/e2eaiok \
  -v ${cur_path}/../spark_local_dir/:/home/vmagent/app/spark_local_dir \
  -w /home/vmagent/app/ \
  --name e2eaiok-ray-pytorch e2eaiok-ray-pytorch /bin/bash
```

### Test with another dataset (run the commands inside the docker container)
```bash
# activate the conda environment
conda activate pytorch_mlperf
# if you are behind a proxy, set it first
# export https_proxy=http://{ip}:{port}
# criteo test
cd /home/vmagent/app/e2eaiok/dlrm_all/dlrm/; bash run_aiokray_dlrm.sh criteo_small ${current_node_ip}
```

### Test the full pipeline manually (run the commands inside the docker container)
```bash
# activate the conda environment
conda activate pytorch_mlperf
# if you are behind a proxy, set it first
# export https_proxy=http://{ip}:{port}
# prepare the environment
bash run_prepare_env.sh ${run_mode} ${current_node_ip}
# process the data
bash run_data_process.sh ${run_mode} ${current_node_ip}
# train
bash run_train.sh ${run_mode} ${current_node_ip}
# run inference
bash run_inference.sh ${run_mode} ${current_node_ip}
```
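A concrete invocation might look like the sketch below, assuming the kaggle run mode and that the first address reported by `hostname -I` is the node IP to use:
```bash
run_mode=kaggle                                    # one of kaggle/criteo_small/criteo_full
current_node_ip=$(hostname -I | awk '{print $1}')  # assumption: first local IP is the node IP
bash run_prepare_env.sh ${run_mode} ${current_node_ip}
```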

## Support

For questions and support, please contact Jian Shang at [email protected].

29 changes: 29 additions & 0 deletions big-data/aiok-ray/inference/Makefile
@@ -0,0 +1,29 @@
DATASET_DIR ?= ./data
FINAL_IMAGE_NAME ?= recommendation-ray
CHECKPOINT_DIR ?= /output
RUN_MODE ?= kaggle
DOCKER_NETWORK_NAME = ray-inference

recommendation-ray:
	if [ ! -d "AIOK_Ray/dlrm_all/dlrm/dlrm" ]; then \
		CWD=${PWD}; \
		cd AIOK_Ray/; \
		sh dlrm_all/dlrm/patch_dlrm.sh; \
		cd ${CWD}; \
	fi
	@wget https://repo.anaconda.com/miniconda/Miniconda3-py37_4.12.0-Linux-x86_64.sh \
		-P AIOK_Ray/Dockerfile-ubuntu18.04/ \
		-O AIOK_Ray/Dockerfile-ubuntu18.04/miniconda.sh
	if [ ! "$(shell docker network ls | grep ${DOCKER_NETWORK_NAME})" ]; then \
		docker network create --driver=bridge ${DOCKER_NETWORK_NAME} ; \
	fi
	@DATASET_DIR=${DATASET_DIR} \
	FINAL_IMAGE_NAME=${FINAL_IMAGE_NAME} \
	CHECKPOINT_DIR=${CHECKPOINT_DIR} \
	RUN_MODE=${RUN_MODE} \
	docker compose up recommendation-ray --build

clean:
	docker network rm ${DOCKER_NETWORK_NAME}
	# OUTPUT_DIR is not defined in this Makefile; assume CHECKPOINT_DIR (default /output) is the intended output location
	DATASET_DIR=${DATASET_DIR} OUTPUT_DIR=${CHECKPOINT_DIR} docker compose down
	sudo rm -rf ${CHECKPOINT_DIR}
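A typical invocation, with hypothetical paths supplied for the overridable variables:
```bash
# Build the image and run the inference pipeline against a kaggle dataset
make recommendation-ray DATASET_DIR=/data/criteo CHECKPOINT_DIR=/output RUN_MODE=kaggle
# Tear down the compose services and network (note: this also removes the output directory)
make clean
```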