fix: tcp_keepalive socket #3140

Open
wants to merge 1 commit into
base: develop

ShaneNolan

When the botocore.config.Config option tcp_keepalive=True is used, the TCP socket is configured with the keepalive socket option (socket.SO_KEEPALIVE). By default, Linux sets the TCP keepalive time parameter to 7200 seconds, which exceeds the AWS NAT Gateway default idle timeout of 350 seconds [source]. With those defaults, the first keepalive probe is not sent until long after the NAT gateway has dropped the idle connection.
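The gap between the two timers can be checked on a Linux host. This is a minimal sketch, assuming the standard procfs path for the tcp_keepalive_time sysctl; the helper name and the constant are illustrative and not part of botocore.

```python
from pathlib import Path

# 350 seconds is the NAT gateway idle timeout quoted above.
NAT_GATEWAY_IDLE_TIMEOUT = 350


def linux_keepalive_time(path="/proc/sys/net/ipv4/tcp_keepalive_time"):
    """Return the kernel keepalive idle time in seconds, or None if unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        # Non-Linux platforms or restricted environments may not expose the file.
        return None


idle = linux_keepalive_time()
if idle is None:
    print("tcp_keepalive_time is not readable on this platform")
elif idle > NAT_GATEWAY_IDLE_TIMEOUT:
    print(f"first probe after {idle}s, later than the {NAT_GATEWAY_IDLE_TIMEOUT}s NAT idle timeout")
else:
    print(f"first probe after {idle}s, within the {NAT_GATEWAY_IDLE_TIMEOUT}s NAT idle timeout")
```

On a host with the stock Linux setting, the second branch is the one expected to fire.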

As a result, the client never receives the response from a Lambda function when all of the following conditions hold:

  • The Lambda function is invoked in synchronous mode (InvocationType='RequestResponse').
  • The invocation occurs within a VPC where a NAT gateway is required to access the internet from a private subnet.
  • The execution time of the Lambda function exceeds 350 seconds.
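The conditions above can be reproduced with a call along these lines. This is only a sketch: the function name invoke_sync, the 900 second read timeout, and the decision to disable retries are assumptions made for illustration, and the code needs boto3, AWS credentials, and a Lambda function that runs longer than 350 seconds behind a NAT gateway.

```python
def invoke_sync(function_name, payload=b"{}", read_timeout=900):
    """Invoke a Lambda function synchronously and return the raw response body.

    Hypothetical helper for reproducing the timeout; not part of boto3.
    """
    # Imported lazily so the sketch can be loaded without boto3 installed.
    import boto3
    from botocore.config import Config

    config = Config(
        tcp_keepalive=True,           # enables SO_KEEPALIVE on the socket
        read_timeout=read_timeout,    # must cover the function's run time
        retries={"max_attempts": 0},  # assumption: avoid re-invoking on timeout
    )
    client = boto3.client("lambda", config=config)
    response = client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",
        Payload=payload,
    )
    return response["Payload"].read()
```

Without shorter keepalive timers, a call such as invoke_sync("my-long-running-function") (a placeholder name) is expected to end in a ReadTimeoutError once the read timeout expires, even though the function itself completed.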

Therefore, by configuring socket.TCP_KEEPIDLE, socket.TCP_KEEPINTVL and socket.TCP_KEEPCNT during the _compute_socket_options function call when tcp_keepalive is enabled, we can overcome this limitation.

socket.IPPROTO_TCP is used to support cross-platform compatibility.

The submitted code automatically calculates these values based on the read timeout. Another option would be to have them supplied in the scope/client object.
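As a rough illustration of the idea, the options could be derived from the read timeout like this. The helper name, its parameters, and the formula are assumptions chosen for the example; they are not the calculation in this PR and not botocore's _compute_socket_options.

```python
import socket


def keepalive_socket_options(read_timeout, nat_idle_timeout=350, probes=3):
    """Build (level, optname, value) tuples for a keepalive-enabled socket.

    Hypothetical helper: the formula below is an illustrative assumption.
    """
    options = [(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)]
    # Assumption: send the first probe at half of whichever is shorter,
    # the read timeout or the NAT gateway idle timeout.
    idle = max(1, int(min(read_timeout, nat_idle_timeout) // 2))
    # Assumption: space the follow-up probes across one more idle period.
    interval = max(1, idle // probes)
    # Not every platform exposes these constants, so look them up defensively.
    for name, value in (
        ("TCP_KEEPIDLE", idle),
        ("TCP_KEEPINTVL", interval),
        ("TCP_KEEPCNT", probes),
    ):
        optname = getattr(socket, name, None)
        if optname is not None:
            options.append((socket.IPPROTO_TCP, optname, value))
    return options


# Apply the options to a plain TCP socket to show they are accepted.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
for level, optname, value in keepalive_socket_options(read_timeout=900):
    sock.setsockopt(level, optname, value)
sock.close()
```

Keeping the idle time below 350 seconds means a probe crosses the NAT gateway before its idle timer expires, which is the effect the change is after.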

Fixes issues: boto/boto3#2424, boto/boto3#2510 and #2916.

Fargate recently had a similar solution implemented to support this use case: https://aws.amazon.com/blogs/containers/announcing-additional-linux-controls-for-amazon-ecs-tasks-on-aws-fargate/.

@adammcdonagh

This is also impacting me. Unfortunately we are invoking Lambda from ECS via AWS Batch, which doesn't support adding these new options in the task definition yet.

@smasa1112

smasa1112 commented Oct 3, 2024

I have the same issue.
In my case, the connection to Lambda hits a read timeout when an EC2 instance launched by CodeBuild invokes the Lambda function.
It works when the Lambda function sleeps for 300 seconds, but a read timeout occurs when it sleeps for 450 seconds.
The EC2 instance launched by CodeBuild is not attached to any VPC (the CodeBuild default).

@spawn-guy

experiencing similar issues

  File "/var/app/venv/staging-LQM1lest/lib64/python3.11/site-packages/botocore/retryhandler.py", line 247, in __call__
    return self._check_caught_exception(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/app/venv/staging-LQM1lest/lib64/python3.11/site-packages/botocore/retryhandler.py", line 416, in _check_caught_exception
    raise caught_exception
  File "/var/app/venv/staging-LQM1lest/lib64/python3.11/site-packages/aiobotocore/endpoint.py", line 181, in _do_get_response
    http_response = await self._send(request)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/app/venv/staging-LQM1lest/lib64/python3.11/site-packages/aiobotocore/endpoint.py", line 294, in _send
    return await self.http_session.send(request)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/app/venv/staging-LQM1lest/lib64/python3.11/site-packages/aiobotocore/httpsession.py", line 261, in send
    raise ReadTimeoutError(endpoint_url=request.url, error=e)
botocore.exceptions.ReadTimeoutError: Read timeout on endpoint URL: "https://lambda.eu-west-1.amazonaws.com/2015-03-31/functions/redacted/invocations"

@pankajastro

pankajastro commented Oct 14, 2024

Hi @nateprewitt / @jonathan343 / @alexgromero / @SamRemis,

I have an Airflow instance running on AWS and I'm using the Airflow LambdaInvokeFunctionOperator to run AWS Lambda functions. When a Lambda function takes 5 minutes or longer to execute, we encounter a ReadTimeoutError. There is an issue in the Airflow repo with more information: apache/airflow#41498.

I’ve tested the changes in this PR, and they work as expected, handling Lambda functions that take up to 15 minutes to run without issues. Is there anything else needed for the review and merging process? I would appreciate any feedback and updates on its status. Thank you!

@spawn-guy

bump. any movement on this PR? my 200s lambda sync invocations are constantly failing with botocore.ReadTimeoutError on Amazon Linux 2023
