
yamux: keepalive failed: i/o deadline reached #11509

Open
pgrmega opened this issue Jan 20, 2022 · 5 comments
Labels: bug, stage/thinking, stage/waiting-on-upstream

Comments

@pgrmega (Contributor) commented Jan 20, 2022

Hello,

Overview of the Issue

We use Packer to automatically build Hyper-V VMs.
The scripts are executed directly on the Hyper-V server, which is hosted in our own local infrastructure.
We build several VMs in parallel (up to 7).

Sometimes we encounter the error "yamux: keepalive failed: i/o deadline reached".
We cannot correlate this error with any particular resource (CPU, memory, network, disk, etc.).

We looked for known issues resembling what we are experiencing and found #3378, #5360, #6410, and #6475, but they didn't help us: we are not running in Azure but on Windows Server 2016.

Reproduction Steps

We have not identified a specific way to reproduce the issue.
We lowered the number of parallel VMs and checked whether rebooting the server improves the behavior, but we could not find a determining factor.

Packer version

Packer version 1.7.8

Operating system and Environment details

Windows Server 2016
Intel Xeon CPU X5650 @ 2.67GHz (2 processors)
RAM: 192 GB

Log Fragments and crash.log files

2022/01/19 23:01:22 [ERR] yamux: keepalive failed: i/o deadline reached
2022/01/19 23:02:07 [INFO] (telemetry) ending powershell
2022/01/19 23:02:22 [ERR] yamux: keepalive failed: i/o deadline reached
2022/01/19 23:02:22 [INFO] (telemetry) ending hyperv-vmcx
2022/01/19 23:02:22 [INFO] 537 bytes written for 'stderr'
2022/01/19 23:02:22 [INFO] 20095 bytes written for 'stdout'
2022/01/19 23:02:22 [ERR] Error decoding response stream 6: EOF
2022/01/19 23:02:22 [INFO] RPC endpoint: Communicator ended with: 123
2022/01/19 23:03:12 packer.exe plugin: [INFO] 537 bytes written for 'stderr'
2022/01/19 23:03:12 packer.exe plugin: [INFO] 20095 bytes written for 'stdout'
2022/01/19 23:03:12 packer.exe plugin: [ERR] Error decoding response stream 5: EOF
2022/01/19 23:03:12 packer.exe plugin: c:/Windows/Temp/script-61e85116-8b32-7360-3555-f83ea4f243bc.ps1 returned with exit code 123
2022/01/19 23:03:12 packer.exe plugin: waiting for all plugin processes to complete...
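
For context, this log line is emitted by the yamux connection-multiplexing library that Packer uses for its plugin RPC transport: a session pings its peer periodically and closes the connection when a ping goes unanswered. Below is a minimal Go sketch that reproduces the exact log line in isolation, assuming the current github.com/hashicorp/yamux API; the silent peer and the shortened timeouts are illustrative assumptions, not taken from our setup.

package main

import (
    "io"
    "net"
    "time"

    "github.com/hashicorp/yamux"
)

func main() {
    clientConn, serverConn := net.Pipe()

    // A peer that swallows bytes but never replies: keepalive pings go
    // out, pongs never come back.
    go io.Copy(io.Discard, serverConn)

    cfg := yamux.DefaultConfig()
    cfg.KeepAliveInterval = 1 * time.Second      // library default is 30s
    cfg.ConnectionWriteTimeout = 2 * time.Second // library default is 10s; bounds the wait for a ping reply

    session, err := yamux.Client(clientConn, cfg)
    if err != nil {
        panic(err)
    }

    // The session closes itself once a ping times out; by then stderr shows:
    //   [ERR] yamux: keepalive failed: i/o deadline reached
    <-session.CloseChan()
}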

Thank you for any help.

pgrmega added the bug label on Jan 20, 2022
@nywilken (Contributor) commented

Thanks for reaching out. We’ll take a look to see what might be causing this. Thanks for linking the other issues in the description.

nywilken added the stage/thinking label on Jan 26, 2022
@azr (Contributor) commented Feb 15, 2022

Hey @pgrmega, do you have more info for us to be able to repro this, like a simplified Packer build config? And maybe the system and shell this is running on?

We will also need the full log.

Thanks!

@pgrmega (Contributor, Author) commented Mar 1, 2022

Hello,

We migrated the server on which this was running from Windows Server 2016 to Windows Server 2019, and the problem has not occurred since the migration.

This was running on PowerShell 5.2 on Windows Server 2016.

JSON config file:

  "variables": {
	  ...
	},
  "builders": [
    {
	"type": "hyperv-vmcx",
	"clone_from_vmcx_path": "{{user `PackerVMCXPath`}}",
	"vm_name": "{{user `VMName`}}",
	"temp_path": "{{user `PackerTempPath`}}",
	"output_directory": "{{user `PackerExportPath`}}",
	"boot_wait": "5m",
	"headless": true,

	"generation": "{{user `VMGeneration`}}",
	"memory": 12288,
	"enable_dynamic_memory": true,
	"cpus": 6,
	"switch_name": "Microsoft Network Adapter Multiplexor Driver - Virtual Switch",
		
	"communicator": "winrm",
	"winrm_use_ssl": false, 
	"winrm_insecure": true,
	"winrm_port": 5985,
	"winrm_timeout": "1h",
	"winrm_username": "{{user `VMAdminUser`}}",
    "winrm_password": "{{user `VMAdminPwd`}}",
    
    "shutdown_command": "IF EXIST C:\\exploitation\\run-sysprep.cmd (CALL C:\\exploitation\\run-sysprep.cmd)",
    "shutdown_timeout": "30m"
    
    }
  ],
  "provisioners": [
    ...
  ]
}

azr added the stage/waiting-on-upstream label on Mar 1, 2022
@Hi-Angel commented Dec 1, 2023

I'm not even sure where this error is coming from. I cloned both the Packer repo and the plugin I use, and neither has the text "deadline reached" anywhere in its code 🤷‍♂️
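
For what it's worth, the string appears to come from the github.com/hashicorp/yamux dependency that both Packer and its plugins pull in for their RPC transport, rather than from either codebase itself: yamux's exported ErrTimeout error carries exactly this message and is returned when a keepalive ping goes unanswered. A minimal check, assuming the current yamux API:

package main

import (
    "fmt"

    "github.com/hashicorp/yamux"
)

func main() {
    // Prints "i/o deadline reached": the message belongs to the yamux
    // dependency, not to Packer or to any plugin's own sources.
    fmt.Println(yamux.ErrTimeout)
}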

@rsergeevbgc commented Jun 17, 2024

We seem to experience the same issue with the following Nomad version, running on an M3 MacBook Pro.

nomad -v
Nomad v1.7.7
BuildDate 2024-04-16T19:26:43Z
Revision 0f34c85ee63f6472bd2db1e2487611f4b176c70c

This happens when we run Nomad in dev mode to construct a local cluster for testing. No other jobs run on the cluster. This is a portion of the log:

2024-06-17T11:51:57.163-0400 [DEBUG] worker: ack evaluation: worker_id=6b2571cd-ad0e-4bbe-eae8-8a1cf854b5d1 eval_id=550ddcd7-83f9-fc31-b812-cd91929c61b0 type=_core namespace=- job_id=deployment-gc node_id="" triggered_by=scheduled
2024-06-17T11:51:57.163-0400 [DEBUG] core.sched: node GC scanning before cutoff index: index=0 node_gc_threshold=24h0m0s
2024-06-17T11:51:57.163-0400 [DEBUG] worker: ack evaluation: worker_id=a024c56d-604e-12af-60dc-b3f050845abd eval_id=48563a36-3321-9500-dbe1-6a23c0906474 type=_core namespace=- job_id=job-gc node_id="" triggered_by=scheduled
2024-06-17T11:51:57.164-0400 [DEBUG] worker: ack evaluation: worker_id=1c06a282-c128-3f0e-3d3b-1dd79d22c8a9 eval_id=2df3a1de-a8f2-e7a2-a570-2deadf6815fd type=_core namespace=- job_id=node-gc node_id="" triggered_by=scheduled
2024-06-17T11:51:57.290-0400 [WARN]  nomad.heartbeat: node TTL expired: node_id=68db21fe-a70b-ae80-16d4-1a4465059846
2024-06-17T11:52:09.858-0400 [ERROR] client: yamux: keepalive failed: i/o deadline reached
2024-06-17T11:52:09.858-0400 [ERROR] nomad.rpc: yamux: keepalive failed: i/o deadline reached
2024-06-17T11:52:09.858-0400 [ERROR] client.rpc: error performing RPC to server: error="rpc error: EOF" rpc=Node.GetClientAllocs server=XX.XXX.XX.XXX:PPPP
2024-06-17T11:52:09.859-0400 [ERROR] client.rpc: error performing RPC to server: error="rpc error: EOF" rpc=Node.UpdateStatus server=XX.XXX.XX.XXX:PPPP
2024-06-17T11:52:09.859-0400 [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: EOF" rpc=Node.UpdateStatus server=XX.XXX.XX.XXX:PPPP
2024-06-17T11:52:09.859-0400 [ERROR] client: error heartbeating. retrying: error="failed to update status: rpc error: EOF" period=1s
2024-06-17T11:52:09.858-0400 [ERROR] nomad.rpc: multiplex_v2 conn accept failed: error="keepalive timeout"
2024-06-17T11:52:09.865-0400 [DEBUG] client.consul: bootstrap contacting Consul DCs: consul_dcs=["local"]
2024-06-17T11:52:09.869-0400 [INFO]  client.consul: discovered following servers: servers=[XX.XXX.XX.XXX:PPPP]
2024-06-17T11:52:10.862-0400 [DEBUG] client: state updated: node_status=ready
2024-06-17T11:52:10.862-0400 [WARN]  client: missed heartbeat: req_latency=1.723458ms heartbeat_ttl=17.502723309s since_last_heartbeat=41.074529s
2024-06-17T11:52:20.394-0400 [DEBUG] http: request complete: method=GET path=/v1/namespaces?index=1 duration=5m7.06801275s
2024-06-17T11:52:36.185-0400 [DEBUG] http: request complete: method=GET path="/v1/jobs?meta=true&namespace=%2A&index=1" duration=5m16.53492425s
2024-06-17T11:56:36.761-0400 [WARN]  nomad.heartbeat: node TTL expired: node_id=68db21fe-a70b-ae80-16d4-1a4465059846
2024-06-17T11:56:49.877-0400 [ERROR] client: yamux: keepalive failed: i/o deadline reached
2024-06-17T11:56:49.877-0400 [ERROR] nomad.rpc: yamux: keepalive failed: i/o deadline reached
2024-06-17T11:56:49.877-0400 [ERROR] nomad.rpc: multiplex_v2 conn accept failed: error="keepalive timeout"
2024-06-17T11:56:49.878-0400 [ERROR] client.rpc: error performing RPC to server: error="rpc error: EOF" rpc=Node.GetClientAllocs server=XX.XXX.XX.XXX:PPPP
2024-06-17T11:56:49.878-0400 [ERROR] client.rpc: error performing RPC to server: error="rpc error: EOF" rpc=Node.UpdateStatus server=XX.XXX.XX.XXX:PPPP
2024-06-17T11:56:49.878-0400 [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: EOF" rpc=Node.UpdateStatus server=XX.XXX.XX.XXX:PPPP
2024-06-17T11:56:49.878-0400 [ERROR] client: error heartbeating. retrying: error="failed to update status: rpc error: EOF" period=1s
2024-06-17T11:56:49.882-0400 [DEBUG] client.consul: bootstrap contacting Consul DCs: consul_dcs=["local"]
2024-06-17T11:56:49.887-0400 [INFO]  client.consul: discovered following servers: servers=[XX.XXX.XX.XXX:PPPP]
2024-06-17T11:56:50.880-0400 [DEBUG] client: state updated: node_status=ready

This works normally when run on our CentOS servers.
