exec driver leaks executor process after StartTask error #11958

Open
tantra35 opened this issue Jan 28, 2022 · 4 comments · May be fixed by #24495
Labels
hcc/jira stage/accepted Confirmed, and intend to work on. No timeline committment though. theme/driver/exec theme/driver/java type/bug

Comments

@tantra35
Contributor

Nomad version

Output from Nomad v1.1.10 (2f08fe230da05e1b179710ebe0e2582249599a4b+CHANGES)

Operating system and Environment details

Ubuntu 20.04

Issue

If a task requests capabilities that the exec driver does not allow, the task fails to start (as expected), but the nomad executor process that was launched for it is leaked.

Reproduction steps

For example, request the net_raw capability, which the exec driver does not allow by default:

job testnetworknamespace
{
	region = "global"
	datacenters = ["test"]

	update
	{
		stagger = "1m"
		min_healthy_time = "1m"
		max_parallel = 1
		health_check="checks"
		healthy_deadline = "3m"
		progress_deadline = "6m"
		auto_revert = true
	}

	group testservicecheck
	{
		restart {
			attempts = 2
			delay    = "15s"
		}

		task testservicecheck
		{
			driver = "exec"
			leader=true

			config
			{
				cap_add = ["net_raw"]

				command = "sleep"
				args = ["6000"]
			}

			logs
			{
				max_files = 3
				max_file_size = 10
			}

			resources
			{
				memory = 300
				cpu = 100
			}
		}
	}
} 

After the allocation is placed on a node, it fails with the following task events (which is absolutely the expected behavior):

Recent Events:
Time                       Type            Description
2022-01-28T20:22:47+03:00  Killing         Sent interrupt. Waiting 5s before force killing
2022-01-28T20:22:47+03:00  Not Restarting  Error was unrecoverable
2022-01-28T20:22:47+03:00  Driver Failure  driver does not allow the following capabilities: net_raw
2022-01-28T20:22:45+03:00  Task Setup      Building Task Directory
2022-01-28T20:22:40+03:00  Received        Task received by client
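
For readers unfamiliar with the check behind the "Driver Failure" event above, here is a minimal Go sketch of an allow-list comparison that produces the same message. It is not Nomad's code: `disallowedCaps`, `validateCaps`, and the allow list in `main` are hypothetical names and values chosen for illustration only.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// disallowedCaps returns the requested capabilities that are absent from the
// allow list, lowercased and sorted. Hypothetical helper, not a Nomad API.
func disallowedCaps(allowed, requested []string) []string {
	allow := make(map[string]struct{}, len(allowed))
	for _, c := range allowed {
		allow[strings.ToLower(c)] = struct{}{}
	}
	var missing []string
	for _, c := range requested {
		name := strings.ToLower(c)
		if _, ok := allow[name]; !ok {
			missing = append(missing, name)
		}
	}
	sort.Strings(missing)
	return missing
}

// validateCaps returns an error worded like the task event in this issue
// when any requested capability is not allowed, and nil otherwise.
func validateCaps(allowed, requested []string) error {
	if missing := disallowedCaps(allowed, requested); len(missing) > 0 {
		return fmt.Errorf("driver does not allow the following capabilities: %s",
			strings.Join(missing, ", "))
	}
	return nil
}

func main() {
	// Illustrative subset only; this is NOT the driver's real default list.
	allowed := []string{"chown", "kill", "net_bind_service"}
	fmt.Println(validateCaps(allowed, []string{"net_raw"}))
}
```

The point relevant to this issue is that such a check can fail after other resources for the task have already been created, so whatever was created before it has to be torn down on the error path.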

On the client node, the nomad executor processes are leaked (partial output of ps axuf):

dnsmasq    33659  0.0  0.2  13932  2088 ?        S    19:16   0:00 /usr/sbin/dnsmasq -x /run/dnsmasq/dnsmasq.pid -u dnsmasq -7 /etc/dnsmasq.d,.dpkg-dist,.dpkg-old,.dpkg-new --local-service --trust-anchor=.,20326,8,2,e0
root       33756  0.6  5.6 1363452 56400 ?       Ssl  19:16   0:25 /opt/nomad/nomad agent -config=/etc/nomad/
root       34470  0.0  3.0 1287848 30340 ?       Ssl  19:23   0:00  \_ /opt/nomad/nomad_1.1.10-playrix/nomad executor {"LogFile":"/var/lib/nomad/alloc/0d35c0b9-5a61-adca-d070-413a1ee7ede6/testservicecheck/executor.out"
root       34893  0.0  3.0 1287848 30184 ?       Ssl  19:26   0:00  \_ /opt/nomad/nomad_1.1.10-playrix/nomad executor {"LogFile":"/var/lib/nomad/alloc/ca50587b-fa49-e422-2a7e-84f582147343/testservicecheck/executor.out"
root       38194  0.0  2.9 1509044 29924 ?       Ssl  20:05   0:00  \_ /opt/nomad/nomad_1.1.10-playrix/nomad executor {"LogFile":"/var/lib/nomad/alloc/006bd711-10c8-c230-9da1-b4182f826f8a/testservicecheck/executor.out"
root       38460  0.0  3.0 1287848 30892 ?       Ssl  20:07   0:00  \_ /opt/nomad/nomad_1.1.10-playrix/nomad executor {"LogFile":"/var/lib/nomad/alloc/6586763d-2fe8-9a89-a9e0-591d26461739/testservicecheck/executor.out"
root       38764  0.0  3.0 1287848 31008 ?       Ssl  20:09   0:00  \_ /opt/nomad/nomad_1.1.10-playrix/nomad executor {"LogFile":"/var/lib/nomad/alloc/52ccc789-89d0-23b4-d3ac-1408e6254ded/testservicecheck/executor.out"
root       40194  0.0  3.0 1361580 30492 ?       Ssl  20:22   0:00  \_ /opt/nomad/nomad_1.1.10-playrix/nomad executor {"LogFile":"/var/lib/nomad/alloc/c0d99d3f-3d47-dbb7-833c-054a4ef25721/testservicecheck/executor.out"
root       33760  0.2  2.6 175836 27048 ?        Ssl  19:16   0:12 /opt/consul/consul agent -config-dir=/etc/consul -advertise=192.168.102.22
@lgfa29
Contributor

lgfa29 commented Feb 2, 2022

Thanks for raising this, @tantra35. From a quick look at the information you provided (thanks for all the details!), I suspect we're missing some clean-up in an error code path.

@tantra35
Contributor Author

tantra35 commented Feb 3, 2022

@lgfa29 could you please tell us whether we can expect a fix soon?

@lgfa29
Contributor

lgfa29 commented Feb 3, 2022

We don't have a date for a fix. I placed this into our backlog for further triaging.

@tgross tgross added stage/needs-verification Issue needs verifying it still exists and removed stage/needs-investigation labels Jun 24, 2024
@tgross tgross moved this from Triaging to Needs Roadmapping in Nomad - Community Issues Triage Jun 24, 2024
@tgross tgross removed the stage/needs-verification Issue needs verifying it still exists label Jun 24, 2024
@tgross tgross moved this from Needs Roadmapping to Needs Triage in Nomad - Community Issues Triage Jun 24, 2024
@tgross
Member

tgross commented Jun 24, 2024

Doing some issue cleanup and wanted to confirm that this is still the case even after some improvements we've made recently to the exec driver's process cleanup. Using the following jobspec:

minimal jobspec
job "example" {
  group "sleep" {
    task "sleep" {

      driver = "exec"
      user   = "ubuntu"

      config {
        command = "sleep"
        args    = ["300"]
        cap_add = ["net_raw"]
      }
    }
  }
}

We get task events like the following (as expected):

Recent Events:
Time                       Type            Description
2024-06-24T14:40:00-04:00  Not Restarting  Error was unrecoverable
2024-06-24T14:40:00-04:00  Driver Failure  driver does not allow the following capabilities: net_raw
2024-06-24T14:40:00-04:00  Task Setup      Building Task Directory
2024-06-24T14:40:00-04:00  Received        Task received by client

But after a couple of restarts we get leaked executor processes as reported above:

$ ps afx
...
   1997 ?        Ssl    0:01 /usr/local/bin/nomad agent -config /etc/nomad.d
   2131 ?        Ssl    0:00  \_ /usr/local/bin/nomad executor {"LogFile":"/var/nomad/data/alloc/91bdfcf2-9972-5985-8cd7-62a5d566e193/sleep/executor.out
   2166 ?        Ssl    0:00  \_ /usr/local/bin/nomad executor {"LogFile":"/var/nomad/data/alloc/7599c82e-831f-7699-33f4-c6ab8da2655f/sleep/executor.out
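
For anyone who wants to check a client for the same symptom, the following Go sketch extracts allocation IDs from `nomad executor` lines in `ps` output like the listing above. It is only an illustration: `executorAllocs` and the regular expression are hypothetical, and they assume the executor command line contains a `LogFile` path of the form `.../alloc/<alloc-id>/...`, as it does in the output shown in this issue.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// allocRe captures the 36-character allocation ID from the LogFile path on a
// "nomad executor" command line. The path layout is assumed from the ps
// output quoted in this issue.
var allocRe = regexp.MustCompile(`nomad executor .*/alloc/([0-9a-f-]{36})/`)

// executorAllocs returns the allocation ID of every executor process found
// in the given ps output, in order of appearance.
func executorAllocs(psOutput string) []string {
	var ids []string
	for _, line := range strings.Split(psOutput, "\n") {
		if m := allocRe.FindStringSubmatch(line); m != nil {
			ids = append(ids, m[1])
		}
	}
	return ids
}

func main() {
	sample := `   1997 ?        Ssl    0:01 /usr/local/bin/nomad agent -config /etc/nomad.d
   2131 ?        Ssl    0:00  \_ /usr/local/bin/nomad executor {"LogFile":"/var/nomad/data/alloc/91bdfcf2-9972-5985-8cd7-62a5d566e193/sleep/executor.out
   2166 ?        Ssl    0:00  \_ /usr/local/bin/nomad executor {"LogFile":"/var/nomad/data/alloc/7599c82e-831f-7699-33f4-c6ab8da2655f/sleep/executor.out`

	for _, id := range executorAllocs(sample) {
		fmt.Println(id)
	}
}
```

An executor whose allocation the client already reports as failed or complete (for example via `nomad alloc status <id>`) is a candidate for a leaked process.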

I'm going to re-title this slightly and mark it for roadmapping. I'll also note from a quick look at the code that it almost certainly impacts the java driver and possibly the raw_exec driver as well, but haven't tested that.

@tgross tgross added the stage/accepted Confirmed, and intend to work on. No timeline committment though. label Jun 24, 2024
@tgross tgross changed the title from "resource leaking when unsupported caps used for exec driver" to "exec driver leaks executor process after StartTask error" Jun 24, 2024
@tgross tgross moved this from Needs Triage to Needs Roadmapping in Nomad - Community Issues Triage Jun 24, 2024
Projects
Status: Needs Roadmapping

4 participants