diff --git a/docs/2.developers/4.user-guide/60.deployment/25.azure-aci-deploy.md b/docs/2.developers/4.user-guide/60.deployment/25.azure-aci-deploy.md index b65452dd..e1838ce6 100644 --- a/docs/2.developers/4.user-guide/60.deployment/25.azure-aci-deploy.md +++ b/docs/2.developers/4.user-guide/60.deployment/25.azure-aci-deploy.md @@ -1,33 +1,33 @@ --- title: "Deploy to Azure" -description: "How to deploy Pathway in the cloud with Azure Container Instances" +description: "How to deploy Pathway in the cloud within the Azure ecosystem" author: 'sergey' article: - date: '2024-09-09' + date: '2024-11-20' tags: ['showcase', 'data-pipeline'] thumbnail: '/assets/content/documentation/azure/azure-aci-overview-th.png' -keywords: ['Azure', 'ACI', 'cloud deployment', 'Docker', 'Azure ACI'] +keywords: ['Azure', 'ACI', 'cloud deployment', 'Docker', 'Azure ACI', 'Azure Marketplace', 'deployment'] docker_github_link: "https://github.com/pathwaycom/pathway/tree/main/examples/projects/azure-aci-deploy" deployButtons: false --- -# Running Pathway Program in Azure with Azure Container Instances +# Deploying Pathway Programs on Azure Made Easy -If you've already gone through the [AWS Deployment tutorial](/developers/user-guide/deployment/aws-fargate-deploy), feel free to skip the "ETL Example Pipeline" and "Pathway CLI" sections. You can jump directly to the sections on [**Pathway Dockerhub Container**](#pathway-dockerhub-container) and [**Running the Example in Azure Container Instances**](#running-the-example-in-azure-container-instances) for more advanced content. +If you've already gone through the [AWS Deployment tutorial](/developers/user-guide/deployment/aws-fargate-deploy), feel free to skip the "ETL Example Pipeline" and "Pathway CLI" sections. You can jump directly to the [**Running the Example in Azure**](#running-the-example-in-azure) section for more advanced content. The Pathway framework enables you to define and run various data processing pipelines. 
You can find numerous tutorials that guide you through building systems like [log monitoring](/developers/templates/realtime-log-monitoring), [ETL pipelines with Kafka](/developers/templates/kafka-etl), or [data preparation for Spark analytics](/developers/templates/delta_lake_etl). Once you've developed and tested these pipelines locally, the next logical step is to deploy them in the cloud. Cloud deployment allows your code to run remotely, minimizing interruptions from local machine issues. This step is crucial for moving your code into a production-ready environment. -There are several ways to deploy your code to the cloud. You can deploy it on [GCP](/developers/user-guide/deployment/gcp-deploy), using [Render](/developers/user-guide/deployment/render-deploy) or on [AWS Fargate](/developers/user-guide/deployment/aws-fargate-deploy), for example. In this tutorial, you will learn how to deploy your Pathway code on [Azure Container Instances](https://azure.microsoft.com/en-us/products/container-instances) using Pathway's tools and [Dockerhub](https://hub.docker.com/) as an image storage. +There are several ways to deploy your code to the cloud. You can deploy it on [GCP](/developers/user-guide/deployment/gcp-deploy), using [Render](/developers/user-guide/deployment/render-deploy) or on [AWS Fargate](/developers/user-guide/deployment/aws-fargate-deploy), for example. In this tutorial, you will learn how to deploy your Pathway code in the Azure ecosystem using the [Azure Marketplace offering](https://azuremarketplace.microsoft.com/en-us/marketplace/apps/navalgo1695057418511.pathway-byol?tab=Overview) or [Azure Container Instances](https://azure.microsoft.com/en-us/products/container-instances) and Pathway's tools. -![Running Pathway ETL pipeline in Azure](/assets/content/documentation/azure/azure-aci-overview.svg) +![Deploying Pathway Programs on Azure Made Easy](/assets/content/documentation/azure/azure-aci-overview.svg) The tutorial is structured as follows: 1. 
Description of the ETL example pipeline. -2. Instructions on getting Pathway image from Dockerhub. -3. Step-by-step guide to setting up a deployment on Azure Container Instances. +2. Instructions on using the Pathway CLI to run GitHub-hosted code. +3. Step-by-step guide to setting up a deployment with either the Azure Marketplace offer or Azure Container Instances. 4. Results verifications. 5. Conclusions. @@ -49,13 +49,11 @@ Additionally, the README file has been updated to offer more guidance on using P There's an important point to consider regarding the task's output. Originally, there were two possible output modes: storing data in a locally-based Delta Lake or in an S3-based Delta Lake. In cloud deployment, using a locally-based Delta Lake isn't practical because it only exists within the container on a remote cloud worker and isn't accessible to the user. Therefore, this tutorial uses an S3-based Delta Lake to store the results, as it provides easy access afterward. This approach requires additional environment variables for the container to access the S3 service, which will be discussed further. -## Pathway CLI and the BYOL container - -### Pathway CLI +## Pathway CLI Pathway provides several tools that simplify both cloud deployment and development in general. -The first tool is the **Pathway CLI**. When you install Pathway, it comes with a command-line tool that helps you launch Pathway programs. For example, the `spawn` command lets you run code using multiple computational threads or processes. For example, `pathway spawn python main.py` runs your locally hosted `main.py` file using Pathway. +One of these tools is the **Pathway CLI**. When you install Pathway, it comes with a command-line tool that helps you launch Pathway programs. For example, the `spawn` command lets you run code using multiple computational threads or processes. For instance, `pathway spawn python main.py` runs your locally hosted `main.py` file using Pathway. 
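For scripting around the CLI, a small launcher can assemble the `pathway spawn` invocation programmatically. The sketch below is illustrative and not part of Pathway itself; the `--processes`/`--threads` flag names are assumptions to verify against `pathway spawn --help` on your installed version.

```python
import shlex

def spawn_command(script: str, processes: int = 1, threads: int = 1) -> list:
    # Build a `pathway spawn` command line for a locally hosted script.
    # Flag names are illustrative; verify them with `pathway spawn --help`.
    cmd = f"pathway spawn --processes {processes} --threads {threads} python {script}"
    return shlex.split(cmd)

# Example: run main.py on 2 processes with 4 threads each.
print(spawn_command("main.py", processes=2, threads=4))
# → ['pathway', 'spawn', '--processes', '2', '--threads', '4', 'python', 'main.py']
```

A wrapper like this can then be handed to `subprocess.run` once the flags are confirmed for your Pathway version.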
This tutorial highlights another feature: the ability to run code directly from a GitHub repository, even if it's not hosted locally. @@ -82,16 +80,88 @@ GITHUB_PERSONAL_ACCESS_TOKEN=YOUR_GITHUB_PERSONAL_ACCESS_TOKEN \ pathway spawn-from-env ``` -### Pathway Dockerhub Container +## Running the Example in Azure + +The Pathway framework makes it simple to deploy programs on Azure using the [**Pathway - BYOL**](https://azuremarketplace.microsoft.com/en-us/marketplace/apps/navalgo1695057418511.pathway-byol?tab=Overview) listing, available in the Azure Marketplace. This listing is free. + +We recommend using the Azure Marketplace offering because it's straightforward: just follow a four-step deployment wizard and set up your project-specific settings. For detailed instructions on using this wizard, refer to the first dropdown section. + +If the Marketplace solution doesn't meet your needs, you can also deploy using [**Azure Container Instances**](https://azure.microsoft.com/en-us/products/container-instances). This method is outlined in the second dropdown section of this tutorial. However, keep in mind that it's more complex. + +::callout{type="basic"} +#summary +Easy deployment with Azure Marketplace + +#content +Pathway offers a **BYOL (Bring Your Own License) Container** on the [**Azure Marketplace**](https://azuremarketplace.microsoft.com/en-us/marketplace/apps/navalgo1695057418511.pathway-byol?tab=Overview). This container provides a ready-to-use Docker image with Pathway and all required dependencies pre-installed, along with a configured Kubernetes environment for easy deployment within Azure. + +To use this container, you can get a free license key from the [Pathway website](https://www.pathway.com/get-license). **Access to the listing is free, so you won't incur any marketplace costs.** + +The container runs the `pathway spawn-from-env` command, making it easy to launch your code straight from the listing.
Simply provide the necessary spawn arguments and environment variables, and your code will run in the cloud. + +Keep in mind that the container operates like a standard cloud deployment. This means it expects the program to run continuously. If the program finishes for any reason, it will automatically restart with the same launch parameters and environment variables as before. + +The next section will guide you through setting up Pathway processes in Azure. + +### Performing Service Configuration + +To create the container, start by clicking the **"Create"** button on the listing page. This opens a setup wizard with four steps: + +1. **Basics**: + Select your subscription and resource group for the deployment using the dropdown menus. If you don't have a subscription, create one on the [Azure Subscriptions page](https://portal.azure.com/#view/Microsoft_Azure_Billing/SubscriptionsBladeV2). For a new resource group, click the **"Create new"** link under the **"Resource group"** dropdown, or use the Azure CLI: + + ```bash + az group create --name myResourceGroup --location eastus + ``` + + If using the CLI, log in first with `az login`. After selecting these fields, specify if you need a development cluster by selecting **"Yes"**, and then choose the cluster's region from the dropdown. +2. **Cluster Details**: + Enter an alphanumeric name for your deployment cluster in the **"AKS cluster name"** field. Next, select the latest Kubernetes version. Adjust hardware settings if needed; the defaults are generally suitable for most Pathway programs, but you can modify the vCPU and memory settings via the **"Change size"** link. Autoscaling is enabled by default, and you can set up multiple VMs if desired. + +3. **Application Details**: + First, enter a name for the cluster extension and a title for your application. In the **"Pathway App License"** field, insert your Pathway license key. 
Then, add the `PATHWAY_SPAWN_ARGS` to specify the `pathway spawn` command, directing the application to the required repository upon startup. Since the tutorial launches the pipeline from the `airbyte-to-deltalake` repository, you can specify: `--repository-url https://github.com/pathway-labs/airbyte-to-deltalake python main.py`. + + **Additional Environment Variables**: Define all necessary environment variables. For instance, if using S3 and GitHub, like in the example pipeline, you'll need: + * `AWS_S3_OUTPUT_PATH`: The S3 path for output storage (e.g., `s3://your-bucket/output-path`). + * `AWS_S3_ACCESS_KEY`: S3 access key. + * `AWS_S3_SECRET_ACCESS_KEY`: S3 secret access key. + * `AWS_BUCKET_NAME`: S3 bucket name. + * `AWS_REGION`: Region of your S3 bucket. + * `GITHUB_PERSONAL_ACCESS_TOKEN`: GitHub access token (available [here](https://github.com/settings/tokens)). + * `INPUT_CONNECTOR_MODE`: Determines how commits are polled from GitHub. It can be set to either `"static"` or `"streaming"`. The default mode, `"static"`, scans all commits once, processes them, and then exits. In `"streaming"` mode, the program runs continuously, waiting for new commits and appending them to the output collection as they arrive. When deploying in Azure, the program restarts automatically after it finishes. Therefore, set it to `"streaming"` so the program runs without exiting and appends new commits to the existing collections as they appear. + +4. **Review + Create**: + Review the Pathway BYOL Container's terms and privacy policy. Check all your entries, and once you confirm they're correct, press **"Create"**. + +Congratulations! Your service is now being created on Azure. + +### After the Service is Created + +After creating the service, you'll see a notification in the top-right corner indicating that the deployment is in progress. If everything goes smoothly, a **"Deployment succeeded"** message will soon appear in the same spot. 
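If the deployment does not succeed, a first thing to re-check is the environment configuration entered in the "Application Details" step. The values can be sanity-checked locally before (or after) pasting them into the wizard; the helper below is a hypothetical sketch for the example pipeline's variables, not part of Pathway or the wizard.

```python
# Variables expected by the example pipeline (see the "Application Details" step).
REQUIRED = [
    "AWS_S3_OUTPUT_PATH",
    "AWS_S3_ACCESS_KEY",
    "AWS_S3_SECRET_ACCESS_KEY",
    "AWS_BUCKET_NAME",
    "AWS_REGION",
    "GITHUB_PERSONAL_ACCESS_TOKEN",
]

def missing_settings(env: dict) -> list:
    """Return the names of required variables that are absent or empty,
    plus INPUT_CONNECTOR_MODE if it is set to an unsupported value."""
    problems = [name for name in REQUIRED if not env.get(name)]
    # INPUT_CONNECTOR_MODE is optional and defaults to "static".
    if env.get("INPUT_CONNECTOR_MODE", "static") not in ("static", "streaming"):
        problems.append("INPUT_CONNECTOR_MODE")
    return problems

print(missing_settings({"AWS_REGION": "eu-west-1"}))
```

Pass in `dict(os.environ)` to check the shell you prepared the values in; an empty list means every required setting is present.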
+ +In the **"Deployment details"** section on the page, you can view the resources created by the cluster. If any resource fails to deploy, click on it to view error details. A common error is "Insufficient regional vCPU quota left"; if this appears, request a vCPU quota increase for that region in your subscription. + +The service creation process can take 5-10 minutes to complete. Once finished, you'll see a **"Your deployment is complete"** message. In the **"Next steps"** section, click the **"Go to resource group"** button. From there, navigate to the Kubernetes service linked to the app you configured. Your service is up and running at this point. + +### Conclusion + +That's all for the Azure Marketplace option! As you can see, it's straightforward - there's no need for complex steps as long as you have a subscription and a resource group. This is the recommended method for deploying your Pathway programs in the cloud. + +If you need to deploy your Pathway program to Azure Container Instances and set everything up from scratch, don't worry - we've provided detailed guidance in the next dropdown section. +:: + +::callout{type="basic"} +#summary +Running a publicly available container in Azure Container Instances + +#content Another useful resource from Pathway is the Docker container, listed at [**Dockerhub**](https://hub.docker.com/r/pathwaycom/pathway). This listing offers a ready-to-use Docker image with Pathway and all its dependencies pre-installed, and without binding to a particular ecosystem. You can use the container without a license key, but entering one unlocks the full features of the framework. 
**The listing is free to use, so there's no cost associated with accessing it.** ![Pathway Dockerhub container](/assets/content/documentation/azure/pathway-dockerhub.svg) The container runs the `pathway spawn-from-env` command, allowing you to easily execute it on the marketplace by passing the `PATHWAY_SPAWN_ARGS` and other required environment variables into the container. This gets your code running in the cloud. The next section will guide you through setting up Pathway processes using [Azure Container Instances](https://azure.microsoft.com/en-us/products/container-instances), the recommended Azure solution for the task. -## Running the Example in Azure Container Instances - Since the container originates outside the Azure ecosystem, there are no steps required to acquire it from a specific marketplace. However, several steps are necessary to use the container: logging in, configuring Azure Container Instances, specifying the required variables for S3 data storage, and finally running the Pathway instance. All of these steps can be performed using a single launcher script (referred to as `launch.py` in this example), which should be run locally. This script is provided in [Pathway's repository](https://github.com/pathwaycom/pathway/tree/main/examples/projects/azure-aci-deploy). @@ -100,11 +170,11 @@ The process involves first configuring the system and obtaining tokens from all ### Step 1: Performing Azure Configuration -The **Azure Command-Line Interface (CLI)** is a powerful tool for managing Azure services. If you haven’t installed it yet, follow the installation guide [here](https://learn.microsoft.com/en-us/cli/azure/install-azure-cli). Once installed, you can move forward with this tutorial. +The **Azure Command-Line Interface (CLI)** is a powerful tool for managing Azure services. If you haven't installed it yet, follow the installation guide [here](https://learn.microsoft.com/en-us/cli/azure/install-azure-cli). 
Once installed, you can move forward with this tutorial. We will go through a few key steps to set up the necessary variables. -First, let’s set up the variables required for Azure. +First, let's set up the variables required for Azure. 1. **Log in to Azure**: Run the following command to log in to Azure: @@ -113,7 +183,7 @@ First, let’s set up the variables required for Azure. ``` This will open a browser for authentication. After you log in, choose the subscription tenant you want to use. Copy the **Subscription ID** (in UUID4 format) from the `Subscription ID` column. Set this value as the `AZURE_SUBSCRIPTION_ID` variable. - > *Note*: If you don’t have a subscription, you can create one on the [Subscriptions](https://portal.azure.com/#view/Microsoft_Azure_Billing/SubscriptionsBladeV2) page in the Azure Portal. + > *Note*: If you don't have a subscription, you can create one on the [Subscriptions](https://portal.azure.com/#view/Microsoft_Azure_Billing/SubscriptionsBladeV2) page in the Azure Portal. 2. **Get an Access Token**: Run this command to get a session token: @@ -122,7 +192,7 @@ First, let’s set up the variables required for Azure. ``` Copy the `accessToken` value from the resulting JSON and assign it to the `AZURE_TOKEN_CREDENTIAL` variable. - Since the token is quite long, it’s a good idea to store it as an environment variable. Keep in mind, this token expires every hour, so make sure it’s up to date before starting the container. + Since the token is quite long, it's a good idea to store it as an environment variable. Keep in mind, this token expires every hour, so make sure it's up to date before starting the container. 3. **Resource Group ID**: To find your resource group, list the existing ones by running: @@ -133,11 +203,11 @@ First, let’s set up the variables required for Azure. ```bash az group create --name myResourceGroup --location eastus ``` - Assign the resource group’s name to the `AZURE_RESOURCE_GROUP` variable. 
+ Assign the resource group's name to the `AZURE_RESOURCE_GROUP` variable. -The remaining parameters are pre-defined and don’t need to be changed: +The remaining parameters are pre-defined and don't need to be changed: -- `AZURE_CONTAINER_GROUP_NAME` sets the name of the container group (in this case, there’s only one container). +- `AZURE_CONTAINER_GROUP_NAME` sets the name of the container group (in this case, there's only one container). - `AZURE_CONTAINER_NAME` specifies the name of the container itself. - `AZURE_LOCATION` indicates the Azure data center location (e.g., "eastus"). @@ -170,7 +240,7 @@ DOCKER_IMAGE_NAME = "pathwaycom/pathway:latest" ### Step 3: Configuring Backend for Delta Lake Storage -As mentioned earlier, the results must be stored in durable storage since the container’s files will be deleted once it finishes. For this, the tutorial uses **Amazon S3** to store the resulting Delta Lake. Hence, there is a need to configure S3-related variables. +As mentioned earlier, the results must be stored in durable storage since the container's files will be deleted once it finishes. For this, the tutorial uses **Amazon S3** to store the resulting Delta Lake. Hence, you need to configure S3-related variables. 1. **S3 Output Path, Bucket Name, and Region**: You need to set up the full output path, bucket name, and the region where your S3 bucket is located. Store these values in the following variables: - `AWS_S3_OUTPUT_PATH` (e.g., `s3://your-bucket/output-path`) @@ -191,7 +261,7 @@ AWS_S3_SECRET_ACCESS_KEY = "YOUR_AWS_S3_SECRET_ACCESS_KEY" ### Step 4: Providing Pathway License Key and Github PAT -To enable Delta Lake features and parse commits from GitHub, you’ll need two last remaining pieces: the **Pathway License Key** and a **GitHub Personal Access Token**. +To enable Delta Lake features and parse commits from GitHub, you'll need the last two pieces: the **Pathway License Key** and a **GitHub Personal Access Token**. 
You can get a free-tier Pathway license key from the [Pathway website](https://www.pathway.com/features). @@ -206,7 +276,7 @@ GITHUB_PERSONAL_ACCESS_TOKEN = "YOUR_GITHUB_PERSONAL_ACCESS_TOKEN" ### Step 5: Configuring a Container in Azure Container Instances -In **Azure Container Instances (ACI)**, a **Container** is a lightweight, standalone, and executable software package that includes everything needed to run an application: code, runtime, libraries, and dependencies. Each container runs in isolation but shares the host system’s kernel. ACI allows you to easily deploy and run containers without managing underlying infrastructure, offering a simple way to run applications in the cloud. +In **Azure Container Instances (ACI)**, a **Container** is a lightweight, standalone, and executable software package that includes everything needed to run an application: code, runtime, libraries, and dependencies. Each container runs in isolation but shares the host system's kernel. ACI allows you to easily deploy and run containers without managing underlying infrastructure, offering a simple way to run applications in the cloud. To manage containers and other resources in Azure efficiently, you can use the Azure Python SDK. For this tutorial, you'll need to install the `azure-identity` and `azure-mgmt-containerinstance` Python packages. You can install them using `pip`. @@ -348,7 +418,7 @@ This command will construct the Container Group settings, which are sufficient t ### Step 7: Launch The Container -Now that everything is set up, you’re ready to run the task. This involves creating an Azure cloud client instance and calling specific methods on it. +Now that everything is set up, you're ready to run the task. This involves creating an Azure cloud client instance and calling specific methods on it. To create the client, you first need to provide credentials. 
In an isolated environment, such as the Docker image used in this tutorial, the simplest way to handle authentication is by using a code wrapper that manages authentication for the Azure SDK. @@ -381,9 +451,14 @@ client.container_groups.begin_create_or_update( You can now go to the Azure Portal to view the execution stages and related metrics, such as resource usage. You can also stop the execution from the portal. +### Conclusion + +That's all for the Azure Container Instances option. As you can see, it's a bit more complex than using the Azure Marketplace. However, if you can't use the Marketplace or need to customize your deployment as a container instance, this is a viable alternative. +:: + ## Accessing the Execution Results -After the execution is complete, you can verify that the results are in the S3-based Delta Lake using the [`delta-rs`](https://pypi.org/project/deltalake/) Python package. +After the service has successfully started and performed the ETL step, you can verify that the results are in the S3-based Delta Lake using the [`delta-rs`](https://pypi.org/project/deltalake/) Python package. ```python [launch.py] from deltalake import DeltaTable @@ -412,10 +487,10 @@ pd_table_from_delta.shape[0] ``` ``` -700 +862 ``` -You can also verify the count: there were indeed 700 commits in the [`pathwaycom/pathway`](https://github.com/pathwaycom/pathway/) repository as of the time this text was written. +You can also verify the count: there were indeed 862 commits in the [`pathwaycom/pathway`](https://github.com/pathwaycom/pathway/) repository as of the time this text was written. ## Conclusions @@ -423,6 +498,8 @@ Cloud deployment is a key part of developing advanced projects. It lets you depl However, it can be complex, especially for beginners who might face a system with containers, cloud services, virtual machines, and many other components. 
-This tutorial taught you how to simplify program deployment on Azure cloud using Pathway CLI and Pathway Dockerhub container. At the end, you need to run a container from Dockerhub with the usage of the powerful Microsoft Azure instruments. +This tutorial explains how to simplify program deployment on the Azure cloud using the Pathway CLI and the Azure Marketplace listing. It also provides guidance on deploying with Azure Container Instances as an alternative if the Marketplace listing isn't an option for you. + +![Two Deployment Options Comparison](/assets/content/documentation/azure/azure-comparison.svg) Feel free to try it out and clone the example repository to develop your own data extraction solutions. We also welcome your feedback in our [Discord](https://discord.com/invite/pathway) community!