kubernetes_pod_example fails w/ ERROR - Failed to execute job 3 for task pod-task #312

Closed
1 task done
exs208 opened this issue Sep 6, 2023 · 4 comments
Labels
bug Something isn't working

Comments

exs208 commented Sep 6, 2023

Description

Please provide a clear and concise description of the issue you are encountering, and a reproduction of your configuration.

If your request is for a new feature, please use the Feature request template.

  • ✋ I have searched the open/closed issues and my issue is not listed.

⚠️ Note

Before you submit an issue, please perform the following for Terraform examples:

  1. Remove the local .terraform directory (! ONLY if state is stored remotely, which hopefully you are following that best practice!): rm -rf .terraform/
  2. Re-initialize the project root to pull down modules: terraform init
  3. Re-attempt your terraform plan or apply and check if the issue still persists
    ^^completed this step

Versions

  • Module version [Required]:

  • Terraform version:
    v1.5.5

  • Provider version(s):

  • provider registry.terraform.io/hashicorp/aws v5.15.0
  • provider registry.terraform.io/hashicorp/cloudinit v2.3.2
  • provider registry.terraform.io/hashicorp/helm v2.11.0
  • provider registry.terraform.io/hashicorp/kubernetes v2.23.0
  • provider registry.terraform.io/hashicorp/random v3.5.1
  • provider registry.terraform.io/hashicorp/time v0.9.1
  • provider registry.terraform.io/hashicorp/tls v4.0.4

Reproduction Code [Required]

Same code as kubernetes_pod_example in schedulers/terraform/managed-airflow-mwaa

Steps to reproduce the behavior:
I followed https://awslabs.github.io/data-on-eks/docs/blueprints/job-schedulers/aws-managed-airflow and got a timeout error because the DAG could not connect to EKS.

Expected behavior

Pod Example runs (authentication is successful)

Actual behavior

Timeout error

[2023-09-06, 16:53:10 UTC] {{taskinstance.py:1083}} INFO - Dependencies all met for <TaskInstance: kubernetes_pod_example.pod-task manual__2023-09-06T16:53:06.944898+00:00 [queued]>
[2023-09-06, 16:53:10 UTC] {{taskinstance.py:1083}} INFO - Dependencies all met for <TaskInstance: kubernetes_pod_example.pod-task manual__2023-09-06T16:53:06.944898+00:00 [queued]>
[2023-09-06, 16:53:10 UTC] {{taskinstance.py:1279}} INFO - 
--------------------------------------------------------------------------------
[2023-09-06, 16:53:10 UTC] {{taskinstance.py:1280}} INFO - Starting attempt 1 of 1
[2023-09-06, 16:53:10 UTC] {{taskinstance.py:1281}} INFO - 
--------------------------------------------------------------------------------
[2023-09-06, 16:53:10 UTC] {{taskinstance.py:1300}} INFO - Executing <Task(KubernetesPodOperator): pod-task> on 2023-09-06 16:53:06.944898+00:00
[2023-09-06, 16:53:10 UTC] {{standard_task_runner.py:55}} INFO - Started process 436 to run task
[2023-09-06, 16:53:10 UTC] {{standard_task_runner.py:82}} INFO - Running: ['airflow', 'tasks', 'run', 'kubernetes_pod_example', 'pod-task', 'manual__2023-09-06T16:53:06.944898+00:00', '--job-id', '3', '--raw', '--subdir', 'DAGS_FOLDER/mwaa_pod_example.py', '--cfg-path', '/tmp/tmp1mz298b4']
[2023-09-06, 16:53:10 UTC] {{standard_task_runner.py:83}} INFO - Job 3: Subtask pod-task
[2023-09-06, 16:53:10 UTC] {{task_command.py:388}} INFO - Running <TaskInstance: kubernetes_pod_example.pod-task manual__2023-09-06T16:53:06.944898+00:00 [running]> on host ip-10-0-24-245.us-west-2.compute.internal
[2023-09-06, 16:53:10 UTC] {{taskinstance.py:1507}} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_OWNER=aws
AIRFLOW_CTX_DAG_ID=kubernetes_pod_example
AIRFLOW_CTX_TASK_ID=pod-task
AIRFLOW_CTX_EXECUTION_DATE=2023-09-06T16:53:06.944898+00:00
AIRFLOW_CTX_TRY_NUMBER=1
AIRFLOW_CTX_DAG_RUN_ID=manual__2023-09-06T16:53:06.944898+00:00
[2023-09-06, 16:57:33 UTC] {{connectionpool.py:812}} WARNING - Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f17b7f0b4c0>, 'Connection to b80346ff3be18abadbf5fc89bd601cbc.gr7.us-west-2.eks.amazonaws.com timed out. (connect timeout=None)')': /api/v1/namespaces/mwaa/pods?labelSelector=dag_id%3Dkubernetes_pod_example%2Cexecution_date%3D2023-09-06T165306.9448980000-d1eddeb04%2Ctask_id%3Dpod-task
[2023-09-06, 17:01:55 UTC] {{connectionpool.py:812}} WARNING - Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f17b7f0b5e0>, 'Connection to b80346ff3be18abadbf5fc89bd601cbc.gr7.us-west-2.eks.amazonaws.com timed out. (connect timeout=None)')': /api/v1/namespaces/mwaa/pods?labelSelector=dag_id%3Dkubernetes_pod_example%2Cexecution_date%3D2023-09-06T165306.9448980000-d1eddeb04%2Ctask_id%3Dpod-task
[2023-09-06, 17:06:17 UTC] {{connectionpool.py:812}} WARNING - Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f17b7f0ab60>, 'Connection to b80346ff3be18abadbf5fc89bd601cbc.gr7.us-west-2.eks.amazonaws.com timed out. (connect timeout=None)')': /api/v1/namespaces/mwaa/pods?labelSelector=dag_id%3Dkubernetes_pod_example%2Cexecution_date%3D2023-09-06T165306.9448980000-d1eddeb04%2Ctask_id%3Dpod-task
[2023-09-06, 17:10:39 UTC] {{taskinstance.py:1768}} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/urllib3/connection.py", line 174, in _new_conn
    conn = connection.create_connection(
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/urllib3/util/connection.py", line 95, in create_connection
    raise err
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/urllib3/util/connection.py", line 85, in create_connection
    sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
    conn.connect()
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/urllib3/connection.py", line 358, in connect
    self.sock = conn = self._new_conn()
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/urllib3/connection.py", line 179, in _new_conn
    raise ConnectTimeoutError(
urllib3.exceptions.ConnectTimeoutError: (<urllib3.connection.HTTPSConnection object at 0x7f17b7f0bd90>, 'Connection to b80346ff3be18abadbf5fc89bd601cbc.gr7.us-west-2.eks.amazonaws.com timed out. (connect timeout=None)')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py", line 351, in execute
    pod_list = self.client.list_namespaced_pod(self.namespace, label_selector=label_selector)
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/kubernetes/client/api/core_v1_api.py", line 12803, in list_namespaced_pod
    (data) = self.list_namespaced_pod_with_http_info(namespace, **kwargs)  # noqa: E501
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/kubernetes/client/api/core_v1_api.py", line 12891, in list_namespaced_pod_with_http_info
    return self.api_client.call_api(
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 340, in call_api
    return self.__call_api(resource_path, method,
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 172, in __call_api
    response_data = self.request(
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 362, in request
    return self.rest_client.GET(url,
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/kubernetes/client/rest.py", line 237, in GET
    return self.request("GET", url,
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/kubernetes/client/rest.py", line 210, in request
    r = self.pool_manager.request(method, url,
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/urllib3/request.py", line 74, in request
    return self.request_encode_url(
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/urllib3/request.py", line 96, in request_encode_url
    return self.urlopen(method, url, **extra_kw)
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/urllib3/poolmanager.py", line 376, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 815, in urlopen
    return self.urlopen(
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 815, in urlopen
    return self.urlopen(
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 815, in urlopen
    return self.urlopen(
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    retries = retries.increment(
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='b80346ff3be18abadbf5fc89bd601cbc.gr7.us-west-2.eks.amazonaws.com', port=443): Max retries exceeded with url: /api/v1/namespaces/mwaa/pods?labelSelector=dag_id%3Dkubernetes_pod_example%2Cexecution_date%3D2023-09-06T165306.9448980000-d1eddeb04%2Ctask_id%3Dpod-task (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f17b7f0bd90>, 'Connection to b80346ff3be18abadbf5fc89bd601cbc.gr7.us-west-2.eks.amazonaws.com timed out. (connect timeout=None)'))
[2023-09-06, 17:10:39 UTC] {{taskinstance.py:1318}} INFO - Marking task as FAILED. dag_id=kubernetes_pod_example, task_id=pod-task, execution_date=20230906T165306, start_date=20230906T165310, end_date=20230906T171039
[2023-09-06, 17:10:39 UTC] {{standard_task_runner.py:100}} ERROR - Failed to execute job 3 for task pod-task (HTTPSConnectionPool(host='b80346ff3be18abadbf5fc89bd601cbc.gr7.us-west-2.eks.amazonaws.com', port=443): Max retries exceeded with url: /api/v1/namespaces/mwaa/pods?labelSelector=dag_id%3Dkubernetes_pod_example%2Cexecution_date%3D2023-09-06T165306.9448980000-d1eddeb04%2Ctask_id%3Dpod-task (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f17b7f0bd90>, 'Connection to b80346ff3be18abadbf5fc89bd601cbc.gr7.us-west-2.eks.amazonaws.com timed out. (connect timeout=None)')); 436)
[2023-09-06, 17:10:40 UTC] {{local_task_job.py:208}} INFO - Task exited with return code 1
[2023-09-06, 17:10:40 UTC] {{taskinstance.py:2578}} INFO - 0 downstream tasks scheduled from follow-on schedule check
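
The traceback above fails inside sock.connect, before any TLS handshake or Kubernetes API response, which suggests a network-path problem rather than an authentication one. A minimal sketch of that triage, assuming the task log is available as plain text; classify_failure and its result keys are hypothetical names for illustration, not part of Airflow or urllib3, and the auth check is only a rough heuristic:

```python
import re

# Matches the host in urllib3's "Connection to <host> timed out" message.
_HOST_RE = re.compile(r"Connection to (\S+) timed out")
# Rough heuristic for authentication/authorization failures (assumption).
_AUTH_RE = re.compile(r"\b(?:401|403)\b|Unauthorized|Forbidden")


def classify_failure(log_text: str) -> dict:
    """Return a rough diagnosis of a failed task log.

    kind is "network_timeout" when the TCP connection never opened,
    "auth" when the API server answered but rejected the request,
    and "unknown" otherwise.
    """
    hosts = sorted(set(_HOST_RE.findall(log_text)))
    if "ConnectTimeoutError" in log_text:
        kind = "network_timeout"
    elif _AUTH_RE.search(log_text):
        kind = "auth"
    else:
        kind = "unknown"
    return {"kind": kind, "hosts": hosts}
```

A "network_timeout" result points at routing, security groups, or endpoint access settings; an "auth" result would point at IAM or RBAC mappings instead.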

Terminal Output Screenshot(s)

Additional context

I have a hunch this has to do with the providers list that MWAA uses by default. I had to adjust requirements.txt to get the default EMR task to run. I can confirm that the EMR virtual cluster and the EKS cluster both build. I can connect to the cluster via kubectl from my local machine (listing namespaces and pods works). I have tried editing the kube_config file to match my local kubeconfig created by the aws eks update-kubeconfig command. I also tried both us-east-2 and us-west-2 across multiple builds when running the build.sh command.
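
A working kubectl session from a local machine only shows that the endpoint is reachable from that machine; the MWAA workers sit on a different network path. A minimal sketch of a TCP reachability probe that could be run from wherever the task executes (for example inside a Python task in the same MWAA environment, which is an assumption about the setup). probe_tcp is a hypothetical helper name and uses only the standard library:

```python
import socket


def probe_tcp(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Try to open a TCP connection and report how the attempt ended."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except (socket.timeout, TimeoutError):
        # No answer at all: typically a security group, NACL, or routing block.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except socket.gaierror:
        # The hostname could not be resolved.
        return "dns_error"
    except OSError:
        return "error"
```

Calling probe_tcp with the EKS endpoint host from the log and port 443 would be expected to return "timeout" in the failing environment, while the same call from a machine where kubectl works should return "open".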

Here is a list of providers too...

[Screenshot: list of providers (Screenshot 2023-09-06 at 1 34 12 PM)]
exs208 changed the title from "kubernetes_pod_example fails w/" to "kubernetes_pod_example fails w/ ERROR - Failed to execute job 3 for task pod-task" on Sep 6, 2023

exs208 commented Sep 6, 2023

This issue from the kubernetes-sigs repo seems to address this problem: kubernetes-sigs/aws-iam-authenticator#268

From reading the Known Issue portion of that document, I think I will need to add a setup.sh script to the MWAA configuration that runs after the Terraform deployment and configures a new service role that works with both MWAA and EKS.

vara-bonthu (Collaborator) commented

@jagpk Have you seen this before?


exs208 commented Sep 18, 2023

I found that I need to expand the inbound rules of the security groups so that EKS and MWAA can communicate with each other. I will try to add this as a PR.
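
The thread does not say which rules were added. As a rough sketch only, assuming the missing rule is inbound HTTPS (port 443, the port shown in the timeout above) on the EKS cluster security group from the MWAA security group, the change could look like this with boto3. The function names, the security group IDs passed in, and the rule description are all hypothetical; authorize_security_group_ingress is the boto3 EC2 client call for adding inbound rules:

```python
def build_https_ingress(source_sg_id: str, description: str) -> list:
    """Build an IpPermissions list allowing TCP 443 from another security group."""
    return [
        {
            "IpProtocol": "tcp",
            "FromPort": 443,
            "ToPort": 443,
            "UserIdGroupPairs": [
                {"GroupId": source_sg_id, "Description": description}
            ],
        }
    ]


def allow_mwaa_to_eks(cluster_sg_id: str, mwaa_sg_id: str, region: str) -> dict:
    """Add the inbound rule to the cluster security group (assumed direction)."""
    import boto3  # third-party; imported lazily so the builder works without it

    ec2 = boto3.client("ec2", region_name=region)
    return ec2.authorize_security_group_ingress(
        GroupId=cluster_sg_id,
        IpPermissions=build_https_ingress(mwaa_sg_id, "MWAA workers to EKS API"),
    )
```

In a Terraform-managed blueprint the same rule would more naturally live in the Terraform code, so that a later apply does not remove it; the Python form here is only meant to make the shape of the rule concrete.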

vara-bonthu (Collaborator) commented

Thanks @exs208! Please raise a PR or add details here so that users can learn from your issue.
