From 9bd24628cafa154178c23b0ed83be0277e60aea2 Mon Sep 17 00:00:00 2001 From: Partho Sarthi Date: Wed, 30 Oct 2024 12:13:20 -0700 Subject: [PATCH 1/3] Add Tools Notebooks for EMR --- tools/emr/README.md | 21 + ...rk] Profiling Tool Notebook Template.ipynb | 666 +++++++++++++++ ...Qualification Tool Notebook Template.ipynb | 802 ++++++++++++++++++ 3 files changed, 1489 insertions(+) create mode 100644 tools/emr/README.md create mode 100644 tools/emr/[RAPIDS Accelerator for Apache Spark] Profiling Tool Notebook Template.ipynb create mode 100644 tools/emr/[RAPIDS Accelerator for Apache Spark] Qualification Tool Notebook Template.ipynb diff --git a/tools/emr/README.md b/tools/emr/README.md new file mode 100644 index 00000000..dd3e4595 --- /dev/null +++ b/tools/emr/README.md @@ -0,0 +1,21 @@ +# EMR Tools Demo Notebooks + +The RAPIDS Accelerator for Apache Spark includes two key tools for understanding the benefits of +GPU acceleration as well as analyzing GPU Spark jobs. For customers on EMR, the demo +notebooks offer a simple interface for running the tools given a set of Spark event logs from +CPU (qualification) or GPU (profiling) application runs. + +To use a demo notebook, you can import the notebook in the EMR Workspace. + +Once the demo notebook is imported, you can enter in the log path location in the cell below the `User Input` in the +notebook. After that, click on the `fast-forward` icon which says *Restart the kernel, then re-run the whole notebook* to execute the tools for the specific logs in the log path. + +## Limitations +1. Currently, local and S3 event log paths are supported. +1. Eventlog path must follow the formats `/local/path/to/eventlog` for local logs or `s3://my-bucket/path/to/eventlog` for logs stored in S3. +1. The specified path can also be a directory. In such cases, the tool will recursively search for event logs within the directory. + - For example: `/path/to/clusterlogs` +1. To specify multiple event logs, separate the paths with commas. + - For example: `s3://my-bucket/path/to/eventlog1,s3://my-bucket/path/to/eventlog2` + +**Latest Tools Version Supported** 24.08.2 \ No newline at end of file diff --git a/tools/emr/[RAPIDS Accelerator for Apache Spark] Profiling Tool Notebook Template.ipynb b/tools/emr/[RAPIDS Accelerator for Apache Spark] Profiling Tool Notebook Template.ipynb new file mode 100644 index 00000000..012c7e91 --- /dev/null +++ b/tools/emr/[RAPIDS Accelerator for Apache Spark] Profiling Tool Notebook Template.ipynb @@ -0,0 +1,666 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "df33c614-2ecc-47a0-8600-bc891681997f", + "showTitle": false, + "title": "" + } + }, + "source": [ + "## Profiling Tool for the RAPIDS Accelerator for Apache Spark\n", + "\n", + "To run the profiling tool, enter the log path that represents the location of your Spark GPU event logs. Then, select \"Run all\" to execute the notebook. Once the notebook completes, various output tables will appear below. 
For more options on running the profiling tool, please refer to the [Profiling Tool User Guide](https://docs.nvidia.com/spark-rapids/user-guide/latest/profiling/quickstart.html#running-the-tool).\n", + "\n", + "### Note\n", + "- Currently, local and S3 event log paths are supported.\n", + "- Eventlog path must follow the formats `/local/path/to/eventlog` for local logs or `s3://my-bucket/path/to/eventlog` for logs stored in S3.\n", + "- The specified path can also be a directory. In such cases, the tool will recursively search for event logs within the directory.\n", + " - For example: `/path/to/clusterlogs`\n", + "- To specify multiple event logs, separate the paths with commas.\n", + " - For example: `s3://my-bucket/path/to/eventlog1,s3://my-bucket/path/to/eventlog2`" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## User Input" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Path to the event log in S3 (or local path)\n", + "EVENTLOG_PATH = \"s3://my-bucket/path/to/eventlog\" # or \"/local/path/to/eventlog\"\n", + "\n", + "# S3 path with write access where the output will be copied. \n", + "S3_OUTPUT_PATH = \"s3://my-bucket/path/to/output\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "## Setup Environment" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "jupyter": { + "source_hidden": true + }, + "tags": [] + }, + "outputs": [], + "source": [ + "from IPython.display import display, Markdown\n", + "\n", + "TOOLS_VER = \"24.08.2\"\n", + "display(Markdown(f\"**Using Spark RAPIDS Tools Version:** {TOOLS_VER}\"))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "%pip install spark-rapids-user-tools==$TOOLS_VER --user > /dev/null 2>&1" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "acf401a3-12d3-4236-a6c5-8fe8990b153a", + "showTitle": true, + "title": "Environment Setup" + }, + "jupyter": { + "source_hidden": true + }, + "tags": [] + }, + "outputs": [], + "source": [ + "import os\n", + "import pandas as pd\n", + "\n", + "# Update PATH to include local binaries\n", + "os.environ['PATH'] += os.pathsep + os.path.expanduser(\"~/.local/bin\")\n", + "\n", + "OUTPUT_PATH = \"/tmp\"\n", + "DEST_FOLDER_NAME = \"prof-tool-result\"\n", + "\n", + "# Set environment variables\n", + "os.environ[\"EVENTLOG_PATH\"] = EVENTLOG_PATH \n", + "os.environ[\"OUTPUT_PATH\"] = OUTPUT_PATH\n", + "\n", + "CONSOLE_OUTPUT_PATH = os.path.join(OUTPUT_PATH, 'console_output.log')\n", + "CONSOLE_ERROR_PATH = os.path.join(OUTPUT_PATH, 'console_error.log')\n", + "\n", + "os.environ['CONSOLE_OUTPUT_PATH'] = CONSOLE_OUTPUT_PATH\n", + "os.environ['CONSOLE_ERROR_PATH'] = CONSOLE_ERROR_PATH\n", + "\n", + "print(f'Console output will be stored at {CONSOLE_OUTPUT_PATH} and errors will be stored at {CONSOLE_ERROR_PATH}')\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "execution": { + "iopub.execute_input": "2024-10-24T21:27:00.924906Z", + "iopub.status.busy": "2024-10-24T21:27:00.924587Z", + "iopub.status.idle": "2024-10-24T21:27:00.928129Z", + "shell.execute_reply": "2024-10-24T21:27:00.927454Z", + "shell.execute_reply.started": "2024-10-24T21:27:00.924879Z" + } + }, + "source": 
[ + "## Run Profiling Tool" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "693b5ee0-7500-43f3-b3e2-717fd5468aa8", + "showTitle": true, + "title": "Run Profiling Tool" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "%%sh\n", + "spark_rapids profiling --platform emr --eventlogs \"$EVENTLOG_PATH\" -o \"$OUTPUT_PATH\" --verbose > \"$CONSOLE_OUTPUT_PATH\" 2> \"$CONSOLE_ERROR_PATH\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "f83af6c8-5a79-4a46-965b-38a4cb621877", + "showTitle": false, + "title": "" + } + }, + "source": [ + "## Console Output\n", + "Console output shows the top candidates and their estimated GPU speedup.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "c61527b7-a21a-492c-bab8-77f83dc5cabf", + "showTitle": true, + "title": "Show Console Output" + }, + "scrolled": true, + "tags": [] + }, + "outputs": [], + "source": [ + "%%sh\n", + "cat $CONSOLE_OUTPUT_PATH" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "f3c68b28-fc62-40ae-8528-799f3fc7507e", + "showTitle": true, + "title": "Show Logs" + }, + "scrolled": true, + "tags": [] + }, + "outputs": [], + "source": [ + "%%sh\n", + "cat $CONSOLE_ERROR_PATH" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "05f96ca1-1b08-494c-a12b-7e6cc3dcc546", + "showTitle": true, + "title": "Parse Output" + }, + "jupyter": { + "source_hidden": true + }, + "tags": [] + }, + "outputs": [], + "source": [ + "import re\n", + "import shutil\n", + "import os\n", + "\n", + "\n", + "def extract_file_info(console_output_path, output_base_path):\n", + " try:\n", + " with open(console_output_path, 'r') as file:\n", + " stdout_text = file.read()\n", + "\n", + " # Extract log file location\n", + " location_match = re.search(r\"Location: (.+)\", stdout_text)\n", + " if not location_match:\n", + " raise ValueError(\n", + " \"Log file location not found in the provided text.\")\n", + "\n", + " log_file_location = location_match.group(1)\n", + "\n", + " # Extract profiling output folder\n", + " qual_match = re.search(r\"prof_[^/]+(?=\\.log)\", log_file_location)\n", + " if not qual_match:\n", + " raise ValueError(\n", + " \"Output folder not found in the log file location.\")\n", + "\n", + " output_folder_name = qual_match.group(0)\n", + " output_folder = os.path.join(output_base_path, output_folder_name)\n", + " return output_folder, log_file_location\n", + "\n", + " except Exception as e:\n", + " raise RuntimeError(f\"Cannot parse console output. 
Reason: {e}\")\n", + "\n", + "\n", + "def copy_logs(destination_folder, *log_files):\n", + " try:\n", + " log_folder = os.path.join(destination_folder, \"logs\")\n", + " os.makedirs(log_folder, exist_ok=True)\n", + "\n", + " for log_file in log_files:\n", + " if os.path.exists(log_file):\n", + " shutil.copy2(log_file, log_folder)\n", + " else:\n", + " print(f\"Log file not found: {log_file}\")\n", + " except Exception as e:\n", + " raise RuntimeError(f\"Cannot copy logs to output. Reason: {e}\")\n", + "\n", + "\n", + "try:\n", + " output_folder, log_file_location = extract_file_info(\n", + " CONSOLE_OUTPUT_PATH, OUTPUT_PATH)\n", + " jar_output_folder = os.path.join(output_folder,\n", + " \"rapids_4_spark_profile\")\n", + " print(f\"Output folder detected {output_folder}\")\n", + " copy_logs(output_folder, log_file_location, CONSOLE_OUTPUT_PATH,\n", + " CONSOLE_ERROR_PATH)\n", + " print(f\"Logs successfully copied to {output_folder}\")\n", + "except Exception as e:\n", + " print(e)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Download Output" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "8c65adcd-a933-482e-a50b-d40fa8f50e16", + "showTitle": true, + "title": "Download Output" + }, + "jupyter": { + "source_hidden": true + }, + "tags": [] + }, + "outputs": [], + "source": [ + "import shutil\n", + "import os\n", + "import subprocess\n", + "from IPython.display import HTML, display\n", + "from urllib.parse import urlparse\n", + "\n", + "def display_error_message(error_message, exception):\n", + " error_message_html = f\"\"\"\n", + "
\n", + " Error: {error_message}.\n", + "
\n", + " Exception: {exception}\n", + "
\n", + " \"\"\"\n", + " display(HTML(error_message_html))\n", + "\n", + "def copy_file_to_s3(local_file: str, bucket: str, destination_folder_name: str):\n", + " try:\n", + " file_name = os.path.basename(local_file)\n", + " s3_path = f\"s3://{bucket}/{destination_folder_name}/{file_name}\"\n", + " subprocess.run([\"aws\", \"s3\", \"cp\", local_file, s3_path], check=True, capture_output=True, text=True)\n", + " return construct_download_url(file_name, bucket, destination_folder_name)\n", + " except subprocess.CalledProcessError as e:\n", + " raise Exception(f\"Error copying file to S3: {e.stderr}\") from e\n", + "\n", + "def get_default_aws_region():\n", + " try:\n", + " return subprocess.check_output(\n", + " \"aws configure list | grep region | awk '{print $2}'\",\n", + " shell=True,\n", + " text=True\n", + " ).strip()\n", + " except subprocess.CalledProcessError:\n", + " return \"Error: Unable to retrieve the region.\"\n", + "\n", + "def construct_download_url(file_name: str, bucket_name: str, destination_folder_name: str):\n", + " region = get_default_aws_region()\n", + " return f\"https://{region}.console.aws.amazon.com/s3/object/{bucket_name}?region={region}&prefix={destination_folder_name}/{file_name}\"\n", + "\n", + "def create_download_link(source_folder, bucket_name, destination_folder_name):\n", + " folder_to_compress = os.path.join(\"/tmp\", os.path.basename(source_folder))\n", + " local_zip_file_path = shutil.make_archive(folder_to_compress, 'zip', source_folder)\n", + " download_url = copy_file_to_s3(local_zip_file_path, bucket_name, destination_folder_name)\n", + "\n", + " download_button_html = f\"\"\"\n", + " \n", + "\n", + "
\n", + " Zipped output file created at {download_url}\n", + "
\n", + "
\n", + " Download Output\n", + "
\n", + " \"\"\"\n", + " display(HTML(download_button_html))\n", + "\n", + "try:\n", + " current_working_directory = os.getcwd()\n", + " parsed_s3_output_path = urlparse(S3_OUTPUT_PATH)\n", + " bucket_name = parsed_s3_output_path.netloc\n", + " destination_path = os.path.join(parsed_s3_output_path.path.strip(\"/\"), DEST_FOLDER_NAME.strip(\"/\"))\n", + " create_download_link(output_folder, bucket_name, destination_path)\n", + " \n", + "except Exception as e:\n", + " error_msg = f\"Failed to create download link for {output_folder}\"\n", + " display_error_message(error_msg, e)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "73b5e0b0-3a96-4cc6-8e6c-840e4b0d9d43", + "showTitle": false, + "title": "" + } + }, + "source": [ + "\n", + "## Application Status\n", + "\n", + "The report show the status of each eventlog file that was provided\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "c9ffbfdb-dbb6-4736-b9cb-2ac457cc6714", + "showTitle": true, + "title": "profiling_status.csv" + }, + "jupyter": { + "source_hidden": true + }, + "tags": [] + }, + "outputs": [], + "source": [ + "try:\n", + " status_output = pd.read_csv(\n", + " os.path.join(jar_output_folder, \"profiling_status.csv\"))\n", + "\n", + " # Set options to display the full content of the DataFrame\n", + " pd.set_option('display.max_rows', None) # Show all rows\n", + " pd.set_option('display.max_columns', None) # Show all columns\n", + " pd.set_option('display.width', None) # Adjust column width to fit the display\n", + " pd.set_option('display.max_colwidth', None) # Display full content of each column\n", + "\n", + " display(status_output)\n", + "except Exception as e:\n", + " error_msg = \"Unable to show Application Status\"\n", + " display_error_message(error_msg, e) \n", + " \n", + " " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "6756159b-30ca-407a-ab6b-9c29ced01ea6", + "showTitle": false, + "title": "" + } + }, + "source": [ + "## GPU Job Tuning Recommendations\n", + "This has general suggestions for tuning your applications to run optimally on GPUs.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "cdde6177-db5f-434a-995b-776678a64a3a", + "showTitle": true, + "title": "application_information.csv" + }, + "jupyter": { + "source_hidden": true + }, + "scrolled": true, + "tags": [] + }, + "outputs": [], + "source": [ + "try:\n", + " jar_output_folder = os.path.join(output_folder, \"rapids_4_spark_profile\")\n", + " app_df = pd.DataFrame(columns=['appId', 'appName'])\n", + "\n", + " for x in os.scandir(jar_output_folder):\n", + " if x.is_dir():\n", + " csv_path = os.path.join(x.path, \"application_information.csv\")\n", + " if os.path.exists(csv_path):\n", + " tmp_df = pd.read_csv(csv_path)\n", + " app_df = pd.concat([app_df, tmp_df[['appId', 'appName']]])\n", + "\n", + "\n", + " app_list = app_df[\"appId\"].tolist()\n", + " app_recommendations = pd.DataFrame(columns=['App', 'Recommended Configuration'])\n", + "\n", + " for app in 
app_list:\n", + " app_file = open(os.path.join(jar_output_folder, app, \"profile.log\"))\n", + " recommendations_start = 0\n", + " recommendations_str = \"\"\n", + " for line in app_file:\n", + " if recommendations_start == 1:\n", + " recommendations_str = recommendations_str + line\n", + " if \"### D. Recommended Configuration ###\" in line:\n", + " recommendations_start = 1\n", + " app_recommendations = pd.concat([app_recommendations, pd.DataFrame({'App': [app], 'Recommended Configuration': [recommendations_str]})], ignore_index=True)\n", + " html = app_recommendations.to_html().replace(\"\\\\n\", \"
\")\n", + " style = \"\"\n", + " display(HTML(html + style))\n", + "except Exception as e:\n", + " error_msg = \"Unable to show stage output\"\n", + " display_error_message(error_msg, e) " + ] + } + ], + "metadata": { + "application/vnd.databricks.v1+notebook": { + "dashboards": [ + { + "elements": [], + "globalVars": {}, + "guid": "", + "layoutOption": { + "grid": true, + "stack": true + }, + "nuid": "91c1bfb2-695a-4e5c-8a25-848a433108dc", + "origId": 2173122769183715, + "title": "Executive View", + "version": "DashboardViewV1", + "width": 1600 + }, + { + "elements": [], + "globalVars": {}, + "guid": "", + "layoutOption": { + "grid": true, + "stack": true + }, + "nuid": "62243296-4562-4f06-90ac-d7a609f19c16", + "origId": 2173122769183716, + "title": "App View", + "version": "DashboardViewV1", + "width": 1920 + }, + { + "elements": [], + "globalVars": {}, + "guid": "", + "layoutOption": { + "grid": true, + "stack": true + }, + "nuid": "854f9c75-5977-42aa-b3dd-c680b8331f19", + "origId": 2173122769183722, + "title": "Untitled", + "version": "DashboardViewV1", + "width": 1024 + } + ], + "environmentMetadata": null, + "language": "python", + "notebookMetadata": { + "mostRecentlyExecutedCommandWithImplicitDF": { + "commandId": 2173122769183704, + "dataframes": [ + "_sqldf" + ] + }, + "pythonIndentUnit": 2, + "widgetLayout": [ + { + "breakBefore": false, + "name": "Eventlog Path", + "width": 778 + }, + { + "breakBefore": false, + "name": "Output Path", + "width": 302 + } + ] + }, + "notebookName": "[RAPIDS Accelerator for Apache Spark] Profiling Tool Notebook Template" + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.18" + }, + "widgets": { + "application/vnd.jupyter.widget-state+json": { + "state": {}, + "version_major": 2, + "version_minor": 0 + } + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/tools/emr/[RAPIDS Accelerator for Apache Spark] Qualification Tool Notebook Template.ipynb b/tools/emr/[RAPIDS Accelerator for Apache Spark] Qualification Tool Notebook Template.ipynb new file mode 100644 index 00000000..f66e14fb --- /dev/null +++ b/tools/emr/[RAPIDS Accelerator for Apache Spark] Qualification Tool Notebook Template.ipynb @@ -0,0 +1,802 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "df33c614-2ecc-47a0-8600-bc891681997f", + "showTitle": false, + "title": "" + } + }, + "source": [ + "## Qualification Tool for the RAPIDS Accelerator for Apache Spark\n", + "\n", + "To run the qualification tool, enter the log path that represents the location of your Spark CPU event logs. Then, select \"Run all\" to execute the notebook. Once the notebook completes, various output tables will appear below. 
For more options on running the qualification tool, please refer to the [Qualification Tool User Guide](https://docs.nvidia.com/spark-rapids/user-guide/latest/qualification/quickstart.html#running-the-tool).\n", + "\n", + "### Note\n", + "- Currently, local and S3 event log paths are supported.\n", + "- Eventlog path must follow the formats `/local/path/to/eventlog` for local logs or `s3://my-bucket/path/to/eventlog` for logs stored in S3.\n", + "- The specified path can also be a directory. In such cases, the tool will recursively search for event logs within the directory.\n", + " - For example: `/path/to/clusterlogs`\n", + "- To specify multiple event logs, separate the paths with commas.\n", + " - For example: `s3://my-bucket/path/to/eventlog1,s3://my-bucket/path/to/eventlog2`\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## User Input" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Path to the event log in S3 (or local path)\n", + "EVENTLOG_PATH = \"s3://my-bucket/path/to/eventlog\" # or \"/local/path/to/eventlog\"\n", + "\n", + "# S3 path with write access where the output will be copied. \n", + "S3_OUTPUT_PATH = \"s3://my-bucket/path/to/output\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "## Setup Environment" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "jupyter": { + "source_hidden": true + }, + "tags": [] + }, + "outputs": [], + "source": [ + "from IPython.display import display, Markdown\n", + "\n", + "TOOLS_VER = \"24.08.2\"\n", + "display(Markdown(f\"**Using Spark RAPIDS Tools Version:** {TOOLS_VER}\"))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "%pip install spark-rapids-user-tools==$TOOLS_VER --user > /dev/null 2>&1" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "acf401a3-12d3-4236-a6c5-8fe8990b153a", + "showTitle": true, + "title": "Environment Setup" + }, + "jupyter": { + "source_hidden": true + }, + "tags": [] + }, + "outputs": [], + "source": [ + "import os\n", + "import pandas as pd\n", + "\n", + "# Update PATH to include local binaries\n", + "os.environ['PATH'] += os.pathsep + os.path.expanduser(\"~/.local/bin\")\n", + "\n", + "OUTPUT_PATH = \"/tmp\"\n", + "DEST_FOLDER_NAME = \"qual-tool-result\"\n", + "\n", + "# Set environment variables\n", + "os.environ[\"EVENTLOG_PATH\"] = EVENTLOG_PATH \n", + "os.environ[\"OUTPUT_PATH\"] = OUTPUT_PATH\n", + "\n", + "CONSOLE_OUTPUT_PATH = os.path.join(OUTPUT_PATH, 'console_output.log')\n", + "CONSOLE_ERROR_PATH = os.path.join(OUTPUT_PATH, 'console_error.log')\n", + "\n", + "os.environ['CONSOLE_OUTPUT_PATH'] = CONSOLE_OUTPUT_PATH\n", + "os.environ['CONSOLE_ERROR_PATH'] = CONSOLE_ERROR_PATH\n", + "\n", + "print(f'Console output will be stored at {CONSOLE_OUTPUT_PATH} and errors will be stored at {CONSOLE_ERROR_PATH}')\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "execution": { + "iopub.execute_input": "2024-10-24T21:27:00.924906Z", + "iopub.status.busy": "2024-10-24T21:27:00.924587Z", + "iopub.status.idle": "2024-10-24T21:27:00.928129Z", + "shell.execute_reply": "2024-10-24T21:27:00.927454Z", + "shell.execute_reply.started": "2024-10-24T21:27:00.924879Z" + } + 
}, + "source": [ + "## Run Qualification Tool" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "693b5ee0-7500-43f3-b3e2-717fd5468aa8", + "showTitle": true, + "title": "Run Qualification Tool" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "%%sh\n", + "spark_rapids qualification --platform emr --eventlogs \"$EVENTLOG_PATH\" -o \"$OUTPUT_PATH\" --verbose > \"$CONSOLE_OUTPUT_PATH\" 2> \"$CONSOLE_ERROR_PATH\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "f83af6c8-5a79-4a46-965b-38a4cb621877", + "showTitle": false, + "title": "" + } + }, + "source": [ + "## Console Output\n", + "Console output shows the top candidates and their estimated GPU speedup.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "c61527b7-a21a-492c-bab8-77f83dc5cabf", + "showTitle": true, + "title": "Show Console Output" + }, + "scrolled": true, + "tags": [] + }, + "outputs": [], + "source": [ + "%%sh\n", + "cat $CONSOLE_OUTPUT_PATH" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "f3c68b28-fc62-40ae-8528-799f3fc7507e", + "showTitle": true, + "title": "Show Logs" + }, + "scrolled": true, + "tags": [] + }, + "outputs": [], + "source": [ + "%%sh\n", + "cat $CONSOLE_ERROR_PATH" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "05f96ca1-1b08-494c-a12b-7e6cc3dcc546", + "showTitle": true, + "title": "Parse Output" + }, + "jupyter": { + "source_hidden": true + }, + "tags": [] + }, + "outputs": [], + "source": [ + "import re\n", + "import shutil\n", + "import os\n", + "\n", + "\n", + "def extract_file_info(console_output_path, output_base_path):\n", + " try:\n", + " with open(console_output_path, 'r') as file:\n", + " stdout_text = file.read()\n", + "\n", + " # Extract log file location\n", + " location_match = re.search(r\"Location: (.+)\", stdout_text)\n", + " if not location_match:\n", + " raise ValueError(\n", + " \"Log file location not found in the provided text.\")\n", + "\n", + " log_file_location = location_match.group(1)\n", + "\n", + " # Extract qualification output folder\n", + " qual_match = re.search(r\"qual_[^/]+(?=\\.log)\", log_file_location)\n", + " if not qual_match:\n", + " raise ValueError(\n", + " \"Output folder not found in the log file location.\")\n", + "\n", + " output_folder_name = qual_match.group(0)\n", + " output_folder = os.path.join(output_base_path, output_folder_name)\n", + " return output_folder, log_file_location\n", + "\n", + " except Exception as e:\n", + " raise RuntimeError(f\"Cannot parse console output. 
Reason: {e}\")\n", + "\n", + "\n", + "def copy_logs(destination_folder, *log_files):\n", + " try:\n", + " log_folder = os.path.join(destination_folder, \"logs\")\n", + " os.makedirs(log_folder, exist_ok=True)\n", + "\n", + " for log_file in log_files:\n", + " if os.path.exists(log_file):\n", + " shutil.copy2(log_file, log_folder)\n", + " else:\n", + " print(f\"Log file not found: {log_file}\")\n", + " except Exception as e:\n", + " raise RuntimeError(f\"Cannot copy logs to output. Reason: {e}\")\n", + "\n", + "\n", + "try:\n", + " output_folder, log_file_location = extract_file_info(\n", + " CONSOLE_OUTPUT_PATH, OUTPUT_PATH)\n", + " jar_output_folder = os.path.join(output_folder,\n", + " \"rapids_4_spark_qualification_output\")\n", + " print(f\"Output folder detected {output_folder}\")\n", + " copy_logs(output_folder, log_file_location, CONSOLE_OUTPUT_PATH,\n", + " CONSOLE_ERROR_PATH)\n", + " print(f\"Logs successfully copied to {output_folder}\")\n", + "except Exception as e:\n", + " print(e)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Download Output" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "8c65adcd-a933-482e-a50b-d40fa8f50e16", + "showTitle": true, + "title": "Download Output" + }, + "jupyter": { + "source_hidden": true + }, + "tags": [] + }, + "outputs": [], + "source": [ + "import shutil\n", + "import os\n", + "import subprocess\n", + "from IPython.display import HTML, display\n", + "from urllib.parse import urlparse\n", + "\n", + "def display_error_message(error_message, exception):\n", + " error_message_html = f\"\"\"\n", + "
\n", + " Error: {error_message}.\n", + "
\n", + " Exception: {exception}\n", + "
\n", + " \"\"\"\n", + " display(HTML(error_message_html))\n", + "\n", + "def copy_file_to_s3(local_file: str, bucket: str, destination_folder_name: str):\n", + " try:\n", + " file_name = os.path.basename(local_file)\n", + " s3_path = f\"s3://{bucket}/{destination_folder_name}/{file_name}\"\n", + " subprocess.run([\"aws\", \"s3\", \"cp\", local_file, s3_path], check=True, capture_output=True, text=True)\n", + " return construct_download_url(file_name, bucket, destination_folder_name)\n", + " except subprocess.CalledProcessError as e:\n", + " raise Exception(f\"Error copying file to S3: {e.stderr}\") from e\n", + "\n", + "def get_default_aws_region():\n", + " try:\n", + " return subprocess.check_output(\n", + " \"aws configure list | grep region | awk '{print $2}'\",\n", + " shell=True,\n", + " text=True\n", + " ).strip()\n", + " except subprocess.CalledProcessError:\n", + " return \"Error: Unable to retrieve the region.\"\n", + "\n", + "def construct_download_url(file_name: str, bucket_name: str, destination_folder_name: str):\n", + " region = get_default_aws_region()\n", + " return f\"https://{region}.console.aws.amazon.com/s3/object/{bucket_name}?region={region}&prefix={destination_folder_name}/{file_name}\"\n", + "\n", + "def create_download_link(source_folder, bucket_name, destination_folder_name):\n", + " folder_to_compress = os.path.join(\"/tmp\", os.path.basename(source_folder))\n", + " local_zip_file_path = shutil.make_archive(folder_to_compress, 'zip', source_folder)\n", + " download_url = copy_file_to_s3(local_zip_file_path, bucket_name, destination_folder_name)\n", + "\n", + " download_button_html = f\"\"\"\n", + " \n", + "\n", + "
\n", + " Zipped output file created at {download_url}\n", + "
\n", + "
\n", + " Download Output\n", + "
\n", + " \"\"\"\n", + " display(HTML(download_button_html))\n", + "\n", + "try:\n", + " current_working_directory = os.getcwd()\n", + " parsed_s3_output_path = urlparse(S3_OUTPUT_PATH)\n", + " bucket_name = parsed_s3_output_path.netloc\n", + " destination_path = os.path.join(parsed_s3_output_path.path.strip(\"/\"), DEST_FOLDER_NAME.strip(\"/\"))\n", + " create_download_link(output_folder, bucket_name, destination_path)\n", + " \n", + "except Exception as e:\n", + " error_msg = f\"Failed to create download link for {output_folder}\"\n", + " display_error_message(error_msg, e)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "bbe50fde-0bd6-4281-95fd-6a1ec6f17ab2", + "showTitle": false, + "title": "" + } + }, + "source": [ + "\n", + "## Summary\n", + "\n", + "The report provides a comprehensive overview of the entire application execution, estimated speedup, including unsupported operators and non-SQL operations. By default, the applications and queries are sorted in descending order based on the following fields:\n", + "\n", + "- Estimated GPU Speedup Category\n", + "- Estimated GPU Speedup" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "b8bca4a6-16d8-4b60-ba7b-9aff64bdcaa1", + "showTitle": true, + "title": "qualification_summary.csv" + }, + "jupyter": { + "source_hidden": true + }, + "tags": [] + }, + "outputs": [], + "source": [ + "def millis_to_human_readable(millis):\n", + " seconds = int(millis) / 1000\n", + " if seconds < 60:\n", + " return f\"{seconds:.2f} sec\"\n", + " else:\n", + " minutes = seconds / 60\n", + " if minutes < 60:\n", + " return f\"{minutes:.2f} min\"\n", + " else:\n", + " hours = minutes / 60\n", + " return f\"{hours:.2f} hr\"\n", + "\n", + "try: \n", + " # Read qualification summary \n", + " summary_output = pd.read_csv(os.path.join(output_folder, \"qualification_summary.csv\"))\n", + " summary_output = summary_output.drop(columns=[\"Unnamed: 0\"]).rename_axis('Index').reset_index()\n", + " summary_output['Estimated GPU Duration'] = summary_output['Estimated GPU Duration'].apply(millis_to_human_readable)\n", + " summary_output['App Duration'] = summary_output['App Duration'].apply(millis_to_human_readable)\n", + " \n", + " summary_output = summary_output[[\n", + " 'App Name', 'App ID', 'Estimated GPU Speedup Category', 'Estimated GPU Speedup', \n", + " 'Estimated GPU Duration', 'App Duration'\n", + " ]]\n", + " \n", + " # Read cluster information\n", + " cluster_df = pd.read_json(os.path.join(output_folder, \"app_metadata.json\"))\n", + " cluster_df['Recommended GPU Cluster'] = cluster_df['clusterInfo'].apply(\n", + " lambda x: f\"{x['recommendedCluster']['numWorkerNodes']} x {x['recommendedCluster']['workerNodeType']}\"\n", + " )\n", + " cluster_df['App ID'] = cluster_df['appId']\n", + " cluster_df = cluster_df[['App ID', 'Recommended GPU Cluster']]\n", + " \n", + " # Merge the results\n", + " results = pd.merge(summary_output, cluster_df, on='App ID', how='left')\n", + " display(results)\n", + "except Exception as e:\n", + " error_msg = \"Unable to show summary\"\n", + " display_error_message(error_msg, e)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": 
{}, + "nuid": "73b5e0b0-3a96-4cc6-8e6c-840e4b0d9d43", + "showTitle": false, + "title": "" + } + }, + "source": [ + "\n", + "## Application Status\n", + "\n", + "The report show the status of each eventlog file that was provided\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "c9ffbfdb-dbb6-4736-b9cb-2ac457cc6714", + "showTitle": true, + "title": "rapids_4_spark_qualification_output_status.csv" + }, + "jupyter": { + "source_hidden": true + }, + "tags": [] + }, + "outputs": [], + "source": [ + "try:\n", + " status_output = pd.read_csv(\n", + " os.path.join(jar_output_folder,\n", + " \"rapids_4_spark_qualification_output_status.csv\"))\n", + "\n", + " # Set options to display the full content of the DataFrame\n", + " pd.set_option('display.max_rows', None) # Show all rows\n", + " pd.set_option('display.max_columns', None) # Show all columns\n", + " pd.set_option('display.width', None) # Adjust column width to fit the display\n", + " pd.set_option('display.max_colwidth', None) # Display full content of each column\n", + "\n", + " display(status_output)\n", + "except Exception as e:\n", + " error_msg = \"Unable to show Application Status\"\n", + " display_error_message(error_msg, e) \n", + " \n", + " " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "6756159b-30ca-407a-ab6b-9c29ced01ea6", + "showTitle": false, + "title": "" + } + }, + "source": [ + "## Stages Output\n", + "\n", + "For each stage used in SQL operations, the Qualification tool generates the following information:\n", + "\n", + "1. App ID\n", + "2. Stage ID\n", + "3. Average Speedup Factor: The average estimated speed-up of all the operators in the given stage.\n", + "4. Stage Task Duration: The amount of time spent in tasks of SQL DataFrame operations for the given stage.\n", + "5. Unsupported Task Duration: The sum of task durations for the unsupported operators. For more details, see [Supported Operators](https://nvidia.github.io/spark-rapids/docs/supported_ops.html).\n", + "6. 
Stage Estimated: Indicates if the stage duration had to be estimated (True or False).\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "cdde6177-db5f-434a-995b-776678a64a3a", + "showTitle": true, + "title": "rapids_4_spark_qualification_output_stages.csv" + }, + "jupyter": { + "source_hidden": true + }, + "scrolled": true, + "tags": [] + }, + "outputs": [], + "source": [ + "try:\n", + " stages_output = pd.read_csv(\n", + " os.path.join(jar_output_folder,\n", + " \"rapids_4_spark_qualification_output_stages.csv\"))\n", + " display(stages_output)\n", + "except Exception as e:\n", + " error_msg = \"Unable to show stage output\"\n", + " display_error_message(error_msg, e) " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "4d7ce219-ae75-4a0c-a78c-4e7f25b8cd6f", + "showTitle": false, + "title": "" + } + }, + "source": [ + "## Execs Output\n", + "\n", + "The Qualification tool generates a report of the “Exec” in the “SparkPlan” or “Executor Nodes” along with the estimated acceleration on the GPU. Please refer to the [Supported Operators guide](https://nvidia.github.io/spark-rapids/docs/supported_ops.html) for more details on limitations on UDFs and unsupported operators.\n", + "\n", + "1. App ID\n", + "2. SQL ID\n", + "3. Exec Name: Example: Filter, HashAggregate\n", + "4. Expression Name\n", + "5. Task Speedup Factor: The average acceleration of the operators based on the original CPU duration of the operator divided by the GPU duration. The tool uses historical queries and benchmarks to estimate a speed-up at an individual operator level to calculate how much a specific operator would accelerate on GPU.\n", + "6. Exec Duration: Wall-clock time measured from when the operator starts until it is completed.\n", + "7. SQL Node ID\n", + "8. Exec Is Supported: Indicates whether the Exec is supported by RAPIDS. Refer to the Supported Operators section for details.\n", + "9. Exec Stages: An array of stage IDs.\n", + "10. Exec Children\n", + "11. Exec Children Node IDs\n", + "12. 
Exec Should Remove: Indicates whether the Op is removed from the migrated plan.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, + "inputWidgets": {}, + "nuid": "998b0c51-0cb6-408e-a01a-d1f5b1a61e1f", + "showTitle": true, + "title": "rapids_4_spark_qualification_output_execs.csv" + }, + "jupyter": { + "source_hidden": true + }, + "scrolled": true, + "tags": [] + }, + "outputs": [], + "source": [ + "try:\n", + " execs_output = pd.read_csv(\n", + " os.path.join(jar_output_folder,\n", + " \"rapids_4_spark_qualification_output_execs.csv\"))\n", + " display(execs_output)\n", + "except Exception as e:\n", + " error_msg = \"Unable to show Execs output\"\n", + " display_error_message(error_msg, e) " + ] + } + ], + "metadata": { + "application/vnd.databricks.v1+notebook": { + "dashboards": [ + { + "elements": [], + "globalVars": {}, + "guid": "", + "layoutOption": { + "grid": true, + "stack": true + }, + "nuid": "91c1bfb2-695a-4e5c-8a25-848a433108dc", + "origId": 2173122769183715, + "title": "Executive View", + "version": "DashboardViewV1", + "width": 1600 + }, + { + "elements": [], + "globalVars": {}, + "guid": "", + "layoutOption": { + "grid": true, + "stack": true + }, + "nuid": "62243296-4562-4f06-90ac-d7a609f19c16", + "origId": 2173122769183716, + "title": "App View", + "version": "DashboardViewV1", + "width": 1920 + }, + { + "elements": [], + "globalVars": {}, + "guid": "", + "layoutOption": { + "grid": true, + "stack": true + }, + "nuid": "854f9c75-5977-42aa-b3dd-c680b8331f19", + "origId": 2173122769183722, + "title": "Untitled", + "version": "DashboardViewV1", + "width": 1024 + } + ], + "environmentMetadata": null, + "language": "python", + "notebookMetadata": { + "mostRecentlyExecutedCommandWithImplicitDF": { + "commandId": 2173122769183704, + "dataframes": [ + "_sqldf" + ] + }, + "pythonIndentUnit": 2, + "widgetLayout": [ + { + "breakBefore": false, + "name": "Eventlog Path", + "width": 778 + }, + { + "breakBefore": false, + "name": "Output Path", + "width": 302 + } + ] + }, + "notebookName": "[RAPIDS Accelerator for Apache Spark] Qualification Tool Notebook Template" + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.18" + }, + "widgets": { + "application/vnd.jupyter.widget-state+json": { + "state": {}, + "version_major": 2, + "version_minor": 0 + } + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} From 870467ddf1acb488617fc87c59e64b4ebbd4a48d Mon Sep 17 00:00:00 2001 From: Partho Sarthi Date: Wed, 30 Oct 2024 13:33:46 -0700 Subject: [PATCH 2/3] Update README --- tools/emr/README.md | 17 +++++++++++++---- 1 file changed, 13 insertions(+), 4 deletions(-) diff --git a/tools/emr/README.md b/tools/emr/README.md index dd3e4595..53cad02d 100644 --- a/tools/emr/README.md +++ b/tools/emr/README.md @@ -5,10 +5,19 @@ GPU acceleration as well as analyzing GPU Spark jobs. For customers on EMR, the notebooks offer a simple interface for running the tools given a set of Spark event logs from CPU (qualification) or GPU (profiling) application runs. -To use a demo notebook, you can import the notebook in the EMR Workspace. 
+## Usage -Once the demo notebook is imported, you can enter in the log path location in the cell below the `User Input` in the -notebook. After that, click on the `fast-forward` icon which says *Restart the kernel, then re-run the whole notebook* to execute the tools for the specific logs in the log path. +### Pre-requisites: Setup EMR Studio and Workspace +1. Ensure that you have an **EMR cluster** running. +2. Set up **EMR Studio** and **Workspace** by following the instructions in the [AWS Documentation](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-studio-create-studio.html): + - Select **Custom Settings** while creating the Studio. + - Choose the **VPC** and **Subnet** where the EMR cluster is running. +3. Attach the Workspace to the running EMR cluster. For more details, refer to the [AWS Documentation](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-studio-create-use-clusters.html). + +### Running the Notebook +1. Import the notebook into the EMR Workspace by dragging and dropping the notebook file. +2. In the **User Input** section of the notebook, enter the path to event log files. +3. Click the **fast-forward** icon labeled *Restart the kernel, then re-run the whole notebook* to process the logs at the specified path. ## Limitations 1. Currently, local and S3 event log paths are supported. @@ -18,4 +27,4 @@ notebook. After that, click on the `fast-forward` icon which says *Restart the 1. To specify multiple event logs, separate the paths with commas. - For example: `s3://my-bucket/path/to/eventlog1,s3://my-bucket/path/to/eventlog2` -**Latest Tools Version Supported** 24.08.2 \ No newline at end of file +**Latest Tools Version Supported** 24.08.2 From 1e250e017ae3720182dacc1506f34cd1ca85744c Mon Sep 17 00:00:00 2001 From: Partho Sarthi Date: Wed, 30 Oct 2024 15:44:19 -0700 Subject: [PATCH 3/3] Sign-off commit Signed-off-by: Partho Sarthi
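For reference, both notebooks wrap the `spark_rapids` CLI from the `spark-rapids-user-tools` package. A minimal sketch of the equivalent terminal workflow, assembled from the commands used in the notebooks above (the event log and output paths are placeholders, and it assumes `pip` and AWS credentials with S3 access are already available on the node):

```sh
# Install the pinned tools version used by the notebooks
pip install spark-rapids-user-tools==24.08.2 --user
export PATH="$PATH:$HOME/.local/bin"

# Placeholder paths: CPU event logs for qualification, GPU event logs for profiling.
# Local paths and comma-separated lists of paths also work.
EVENTLOG_PATH="s3://my-bucket/path/to/eventlog"
OUTPUT_PATH="/tmp"

# Qualification (CPU logs): estimates GPU speedup and recommends a GPU cluster shape
spark_rapids qualification --platform emr --eventlogs "$EVENTLOG_PATH" -o "$OUTPUT_PATH" --verbose

# Profiling (GPU logs): analyzes GPU runs and reports tuning recommendations
spark_rapids profiling --platform emr --eventlogs "$EVENTLOG_PATH" -o "$OUTPUT_PATH" --verbose
```

The notebooks add convenience steps around these commands: they copy the console output and tool logs next to the generated result folder, zip it, and upload the archive to the user-provided S3 output path with a download link.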