This repository has been archived by the owner on Mar 16, 2022. It is now read-only.

network_based slow for job_type=local #38

Open
cdunn2001 opened this issue Nov 5, 2016 · 4 comments

@cdunn2001 (Contributor)

On synth5k, the blocking pwatcher with job_type=local and job_queue=bash -C ${CMD} takes about 20 seconds; fs_based takes about 1.5 minutes; network_based takes about 3 minutes.

@cdunn2001 (Contributor, Author)

I found the reason for the slow-down of fs_based on my Mac. This line is slow:

hostname = socket.getfqdn()

In fs_based, that is purely informational, and it takes a few seconds on my Mac. I'll remove it...

Actually, I was able to make that call fast by running scutil --set HostName $(scutil --get LocalHostName), as mentioned here, but that did not solve the problem with getaddrinfo().
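Since the getfqdn() result is purely informational here, another option (a sketch, not pwatcher code; cached_fqdn is an illustrative helper name) is to cache it so the potentially slow reverse-DNS lookup is paid at most once per process:

```python
import socket

# Cache the result of socket.getfqdn(), which can block for seconds on a Mac
# when reverse DNS is misconfigured; after the first call it is free.
_cached_fqdn = None

def cached_fqdn():
    global _cached_fqdn
    if _cached_fqdn is None:
        _cached_fqdn = socket.getfqdn()  # slow at most once
    return _cached_fqdn
```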

@cdunn2001 (Contributor, Author)

I think the slow-down in network_based is mostly from waiting on read/write buffers for logging, where we send stderr/stdout over the network.

If that's the reason, then it's a fair price. Flowers found a worse slow-down when many remote jobs write a lot of debug output into the filesystem simultaneously. (Still, I would rather solve that by logging less.)

But if getaddrinfo() is called repeatedly, that could be the cause instead. We need to try logging directly to disk, just so we know for sure.
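The direct-to-disk experiment could look something like this (a sketch, assuming a stand-in command and log filename, not pwatcher's actual names): give the child a real file for stdout/stderr instead of a pipe, so nothing is forwarded over the network.

```python
import shlex
import subprocess

# Experiment: send the child's stdout/stderr straight to a local file
# instead of piping it back over the socket. If the slow-down disappears,
# the network log-forwarding was the bottleneck.
call = "echo hello"  # placeholder for the real blocking call
with open("job_stdout.log", "wb") as logf:
    sp = subprocess.Popen(shlex.split(call),
                          stdout=logf, stderr=subprocess.STDOUT)
    rc = sp.wait()  # exit code, as in the heartbeat loop
```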

@cdunn2001 (Contributor, Author)

This made network_based take the same amount of time as the other pwatchers:

diff --git pwatcher/mains/network_heartbeat.py pwatcher/mains/network_heartbeat.py
index 40bee48..821f010 100644
--- pwatcher/mains/network_heartbeat.py
+++ pwatcher/mains/network_heartbeat.py
@@ -140,18 +140,19 @@ sleep_s={sleep_s!r}""".format(
     call = ' '.join(args.command)
     log(heartbeat_server, jobid, 'In cwd: {}, Blocking call: {!r}'.format(
         os.getcwd(), call))
-    sp = subprocess.Popen(shlex.split(call), stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
+    #sp = subprocess.Popen(shlex.split(call), stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
+    sp = subprocess.Popen(shlex.split(call))
     # forward all output to server until job ends, then get exit value
-    with sp.stdout as f:
-        for line in iter(f.readline, b''):
-            # can't use log() for this because it appends a \n

It could be the repeated socket connections, or it could be the log-sending.
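To separate those two suspects, one can micro-benchmark the connection pattern alone (a sketch under assumptions: the echo server below is a stand-in for the heartbeat server, and the line counts are arbitrary), comparing one connection per line against a single persistent connection:

```python
import socket
import threading
import time

def run_server(srv, stop):
    # Minimal stand-in server: accept connections and drain them until EOF.
    srv.settimeout(0.5)
    while not stop.is_set():
        try:
            conn, _ = srv.accept()
        except socket.timeout:
            continue
        with conn:
            while conn.recv(4096):
                pass

srv = socket.socket()
srv.bind(('127.0.0.1', 0))
srv.listen(50)
addr = srv.getsockname()
stop = threading.Event()
t = threading.Thread(target=run_server, args=(srv, stop))
t.start()

N = 200

# One connection per log line (what the original patch did).
t0 = time.time()
for _ in range(N):
    s = socket.socket()
    s.connect(addr)
    s.sendall(b'line\n')
    s.close()
per_line = time.time() - t0

# One persistent connection for all lines.
t0 = time.time()
s = socket.socket()
s.connect(addr)
for _ in range(N):
    s.sendall(b'line\n')
s.close()
persistent = time.time() - t0

stop.set()
t.join()
srv.close()
```

If per_line dwarfs persistent, the repeated connects are the cost; if both are cheap, the blame shifts to the log-sending itself.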

@cdunn2001 (Contributor, Author)

The slow-down is only 20% with this:

diff --git pwatcher/mains/network_heartbeat.py pwatcher/mains/network_heartbeat.py
index 40bee48..0eab693 100644
--- pwatcher/mains/network_heartbeat.py
+++ pwatcher/mains/network_heartbeat.py
@@ -141,17 +141,17 @@ sleep_s={sleep_s!r}""".format(
     log(heartbeat_server, jobid, 'In cwd: {}, Blocking call: {!r}'.format(
         os.getcwd(), call))
     sp = subprocess.Popen(shlex.split(call), stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
+    hsocket = socket.socket()
+    hsocket.connect(heartbeat_server)
     # forward all output to server until job ends, then get exit value
     with sp.stdout as f:
         for line in iter(f.readline, b''):
             # can't use log() for this because it appends a \n
-            hsocket = socket.socket()
                 socket_send(hsocket, 's {} {}'.format(jobid, line))
-                hsocket.close()
             except IOError:            # better to miss a line than terminate
                 pass
+    hsocket.close()
     rc = sp.wait()

But that change causes only a few lines of logging to be sent. Much of stdout is lost. What is my mistake there? And could some stdout be dropped even without it?
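A hedged guess at the dropped output: if the heartbeat server treats one connection as one message (reading until EOF before parsing), then reusing a single socket concatenates every line into one stream and most of them are never parsed as separate messages. One fix, sketched below with an illustrative helper (send_line is not pwatcher's API), is to frame messages explicitly over the persistent connection, e.g. newline-delimited:

```python
import socket

# Frame each log line as one newline-terminated message, so the server can
# keep a single recv loop per client and split complete lines out of its
# buffer, instead of relying on connection close as the message boundary.
def send_line(sock, jobid, line):
    msg = 's {} {}'.format(jobid, line.rstrip('\n'))
    sock.sendall(msg.encode('ascii', 'replace') + b'\n')
```

The alternative is to keep the original one-connection-per-line scheme and accept its overhead; either way, lines can also be silently dropped whenever the except IOError branch fires, so some stdout loss is possible even without the patch.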
