diff --git a/examples/ML+DL-Examples/Optuna-Spark/optuna-examples/optuna-dataframe.ipynb b/examples/ML+DL-Examples/Optuna-Spark/optuna-examples/optuna-dataframe.ipynb
index 94c70f03..a80ad8c5 100644
--- a/examples/ML+DL-Examples/Optuna-Spark/optuna-examples/optuna-dataframe.ipynb
+++ b/examples/ML+DL-Examples/Optuna-Spark/optuna-examples/optuna-dataframe.ipynb
@@ -92,15 +92,7 @@
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "File downloaded and saved to /home/rishic/Code/myforks/spark-rapids-examples/examples/ML+DL-Examples/Optuna-Spark/optuna-examples/data/winequality-red.csv\n"
-     ]
-    }
-   ],
+   "outputs": [],
    "source": [
     "url = \"https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv\"\n",
     "\n",
@@ -1008,8 +1000,8 @@
    "source": [
     "This second implementation uses **Spark I/O**.\n",
     "\n",
-    "By this we mean Spark reads the dataset and creates a duplicate of the dataset for each worker (1 partition = 1 duplicate), then maps the tuning task onto each partition. \n",
-    "In practice, this enables the code to be chained to other Dataframe operations (e.g. ETL stages) without the intermediate step of writing to DBFS, at the cost of memory overhead during duplication.\n"
+    "This means that Spark will read the dataset and create a duplicate of the dataset for each worker (1 partition = 1 duplicate), then map the tuning task onto each partition. \n",
+    "In practice, this enables the code to be chained to other Dataframe operations (e.g. ETL stages) without the intermediate step of writing to DBFS, at the cost of some overhead during duplication.\n"
    ]
   },
   {
@@ -1019,7 +1011,7 @@
     "### Optuna Task\n",
     "\n",
     "We'll use the same task as before, but instead of reading the dataset from the filepath, the task_udf will be mapped onto the dataframe partition. \n",
-    "The task_udf will be given an iterator of batches over the partition. See the [mapInPandas docs](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.mapInPandas.html) for more info."
+    "The task_udf will be given an iterator of batches over the partition. See the [mapInPandas docs](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.mapInPandas.html) for more info on how data is passed to the UDF."
    ]
   },
   {
@@ -1051,7 +1043,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "We'll create a new study for this run using the MySQL database, and define the number of tasks/trials."
+    "Like before, we'll create a new study for this run using the MySQL database and define the number of tasks/trials."
    ]
   },
   {
@@ -1101,7 +1093,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "We'll also define the following helper function, which will create *n* duplicates of a dataframe in separate partitions."
+    "We'll also define the following helper function, which will create duplicates of the dataframe held in separate partitions."
    ]
   },
   {
diff --git a/examples/ML+DL-Examples/Optuna-Spark/optuna-examples/optuna-joblibspark.ipynb b/examples/ML+DL-Examples/Optuna-Spark/optuna-examples/optuna-joblibspark.ipynb
index bf69be32..4cda30e4 100644
--- a/examples/ML+DL-Examples/Optuna-Spark/optuna-examples/optuna-joblibspark.ipynb
+++ b/examples/ML+DL-Examples/Optuna-Spark/optuna-examples/optuna-joblibspark.ipynb
@@ -32,8 +32,7 @@
     "# Distributed Hyperparameter Tuning: Optuna + JoblibSpark\n",
     "\n",
     "\n",
-    "This demo demonstrates distributed hyperparameter tuning for XGBoost using the [JoblibSpark backend](https://github.com/joblib/joblib-spark). \n",
-    "This builds on the [example from Databricks](https://docs.databricks.com/en/machine-learning/automl-hyperparam-tuning/optuna.html). \n",
+    "This notebook demonstrates distributed hyperparameter tuning for XGBoost using the [JoblibSpark backend](https://github.com/joblib/joblib-spark), building on this [example from Databricks](https://docs.databricks.com/en/machine-learning/automl-hyperparam-tuning/optuna.html). \n",
     "We implement best practices to precompute data and maximize computations on the GPU. \n",
     "\n",
     "\n",
@@ -92,17 +91,9 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 4,
+   "execution_count": null,
    "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "File downloaded and saved to /home/rishic/Code/myforks/spark-rapids-examples/examples/ML+DL-Examples/Optuna-Spark/optuna-examples/data/winequality-red.csv\n"
-     ]
-    }
-   ],
+   "outputs": [],
    "source": [
     "url = \"https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv\"\n",
     "\n",
@@ -526,19 +517,7 @@
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
-   "outputs": [
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "24/12/06 19:47:10 WARN Utils: Your hostname, cb4ae00-lcedt resolves to a loopback address: 127.0.1.1; using 10.110.47.100 instead (on interface eno1)\n",
-      "24/12/06 19:47:10 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address\n",
-      "Setting default log level to \"WARN\".\n",
-      "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n",
-      "24/12/06 19:47:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n"
-     ]
-    }
-   ],
+   "outputs": [],
    "source": [
     "from pyspark.sql import SparkSession\n",
     "from pyspark import SparkConf\n",
@@ -677,7 +656,7 @@
    "source": [
     "Define the number of tasks, number of trials, and trials per task. \n",
     "\n",
-    "**NOTE**: for standalone users running on a single worker, the 4 tasks will all be assigned to the same worker and executed sequentially in this demonstration. However, this can easily be scaled up by adding more workers."
+    "**NOTE**: for standalone users running on a single worker, the 4 tasks will all be assigned to the same worker and executed sequentially in this demonstration. This can easily be scaled up by adding more workers."
    ]
   },
   {
@@ -752,21 +731,9 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 20,
+   "execution_count": null,
    "metadata": {},
-   "outputs": [
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "/home/rishic/anaconda3/envs/optuna-spark/lib/python3.10/site-packages/joblibspark/backend.py:115: UserWarning: Spark version does not support stage-level scheduling.\n",
-      "  warnings.warn(\"Spark version does not support stage-level scheduling.\")\n",
-      "/home/rishic/anaconda3/envs/optuna-spark/lib/python3.10/site-packages/joblibspark/backend.py:154: UserWarning: User-specified n_jobs (4) is greater than the max number of concurrent tasks (1) this cluster can run now.If dynamic allocation is enabled for the cluster, you might see more executors allocated.\n",
-      "  warnings.warn(f\"User-specified n_jobs ({n_jobs}) is greater than the max number of \"\n",
-      " \r"
-     ]
-    }
-   ],
+   "outputs": [],
    "source": [
     "with joblib.parallel_backend(\"spark\", n_jobs=num_tasks):\n",
     "    results = joblib.Parallel()(\n",
@@ -820,7 +787,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.10.15"
+   "version": "undefined.undefined.undefined"
   }
  },
 "nbformat": 4,
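
For readers of this diff, the following is a minimal, illustrative sketch of the Spark I/O pattern that the updated markdown in optuna-dataframe.ipynb describes: read the dataset once, create one duplicate of it per tuning task (one duplicate per partition), then map the tuning task onto each partition with mapInPandas, which hands the UDF an iterator of pandas batches. The names `num_tasks`, `duplicate_df`, and `task_udf`, the local CSV path and read options, and the placeholder trial logic are assumptions for illustration only, not code taken from either notebook.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
import pandas as pd

spark = SparkSession.builder.getOrCreate()

num_tasks = 2  # assumption: one Optuna tuning task per duplicated partition

# Read the dataset once with Spark I/O (path and options are assumptions).
df = spark.read.csv("data/winequality-red.csv", header=True, inferSchema=True, sep=";")

def duplicate_df(df, n):
    """Hypothetical helper: make n tagged copies of df so each copy can land in its own partition."""
    copies = [df.withColumn("task_id", F.lit(i)) for i in range(n)]
    out = copies[0]
    for c in copies[1:]:
        out = out.unionByName(c)
    # Hash-partition on the tag; for small n this typically yields one copy per partition.
    return out.repartition(n, "task_id")

def task_udf(batches):
    """Hypothetical tuning task: mapInPandas passes an iterator of pandas batches for one partition."""
    pdf_list = list(batches)
    if not pdf_list:
        return  # empty partition: nothing to tune
    pdf = pd.concat(pdf_list, ignore_index=True)  # reassemble the full copy held in this partition
    # ... run Optuna trials here against the shared study, using pdf as the training data ...
    yield pd.DataFrame({"task_id": [int(pdf["task_id"].iloc[0])], "rows_seen": [len(pdf)]})

result = duplicate_df(df, num_tasks).mapInPandas(task_udf, schema="task_id long, rows_seen long")
result.show()
```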