Stop current worker #411

Open
arjunrajlab opened this issue May 26, 2023 · 6 comments · May be fixed by #690

Comments

@arjunrajlab
Collaborator

arjunrajlab commented May 26, 2023

People in lab have noted that they want an option to stop a worker (annotation or property, but mostly annotation). They sometimes start a large job and realize it's doing the wrong thing and want to stop it, but there's no way to do that currently. Can we incorporate a stop button that would kill the process? I think a UI which just gives the option to stop right on the "Compute" button would be sufficient.

@arjunrajlab arjunrajlab added the enhancement New feature or request label May 26, 2023
@arjunrajlab arjunrajlab added this to the Alpha-Version milestone May 26, 2023
@arjunrajlab
Collaborator Author

@bruyeret Someone in lab was asking about this again. I just played with it, and I can't get the "X compute" button for cancelling a job to show up.

@bruyeret
Contributor

Yes, this issue has not been solved.
I made a branch, cancel-workers, 6 months ago, but I ran into an issue that I discussed with David.
I just rebased this branch on master (it was 154 commits behind).
We can resume our discussion here, @manthey.

Here is what I had last time:


You can check out this branch, open a dataset, and create an annotation worker, for example the random square one.

  • If I open the worker and choose to create 1000 annotations, it works perfectly fine:
    The worker creates the annotations, and the front end downloads the new annotations once the worker is done.

  • If I click cancel during the computation, there are some issues:
    The worker keeps going and computes all the annotations, and I get a 500 error from girder even though girder says that the job is cancelled.
    The output of the worker is the following:

Executed the code in: 6.64365798100016 seconds
Invalid state transition to '3', Current state is '824'.

State 3 is success and 824 is cancelling
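
To make the message concrete, here is a minimal sketch of a transition check that rejects moving a job from "cancelling" to "success". Only the two codes (3 and 824) and the wording of the message come from this thread; the table `ALLOWED_FROM` and the functions `check_transition` and `describe` are hypothetical illustrations, not the girder implementation.

```python
# Sketch only: the codes 3 (success) and 824 (cancelling) are taken from the
# thread; everything else here is a hypothetical illustration.
SUCCESS = 3
CANCELING = 824

# Assumed rule for this sketch: a job that is cancelling may not be moved
# to any of the states listed here (the set is empty, so nothing is allowed).
ALLOWED_FROM = {
    CANCELING: set(),
}


def check_transition(current: int, new: int) -> None:
    """Raise ValueError when `new` is not allowed from `current`."""
    allowed = ALLOWED_FROM.get(current)
    if allowed is not None and new not in allowed:
        raise ValueError(
            f"Invalid state transition to '{new}', Current state is '{current}'."
        )


def describe(current: int, new: int) -> str:
    """Return 'ok' for an allowed transition, else the error message."""
    try:
        check_transition(current, new)
        return "ok"
    except ValueError as exc:
        return str(exc)
```

Under that assumed rule, a worker that finishes normally after a cancel request would hit exactly this check when it reports success, which matches the output above.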

In the browser, I get an error from girder for the PUT request to the endpoint /job/${jobId}/cancel:

[2024-04-11 10:14:27,239] ERROR: 500 Error
Traceback (most recent call last):
  File "/venv/lib/python3.11/site-packages/kombu/connection.py", line 472, in _reraise_as_library_errors
    yield
  File "/venv/lib/python3.11/site-packages/kombu/connection.py", line 459, in _ensure_connection
    return retry_over_time(
           ^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/kombu/utils/functional.py", line 318, in retry_over_time
    return fun(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/kombu/connection.py", line 934, in _connection_factory
    self._connection = self._establish_connection()
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/kombu/connection.py", line 860, in _establish_connection
    conn = self.transport.establish_connection()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/kombu/transport/pyamqp.py", line 203, in establish_connection
    conn.connect()
  File "/venv/lib/python3.11/site-packages/amqp/connection.py", line 324, in connect
    self.transport.connect()
  File "/venv/lib/python3.11/site-packages/amqp/transport.py", line 129, in connect
    self._connect(self.host, self.port, self.connect_timeout)
  File "/venv/lib/python3.11/site-packages/amqp/transport.py", line 184, in _connect
    self.sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/venv/lib/python3.11/site-packages/girder/api/rest.py", line 655, in endpointDecorator
    val = fun(self, path, params)
          ^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/girder/api/rest.py", line 1251, in PUT
    return self.handleRoute('PUT', path, params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/girder/api/rest.py", line 983, in handleRoute
    val = handler(**kwargs)
          ^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/girder/api/access.py", line 56, in wrapped
    return fun(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/girder/api/rest.py", line 436, in wrapped
    val = fun(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/girder/api/describe.py", line 736, in wrapped
    return fun(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/girder_jobs/job_rest.py", line 203, in cancelJob
    return self._model.cancelJob(job)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/girder_jobs/models/job.py", line 145, in cancelJob
    event = events.trigger('jobs.cancel', info=job)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/girder/events.py", line 291, in trigger
    handler(e)
  File "/venv/lib/python3.11/site-packages/girder_worker/girder_plugin/event_handlers.py", line 141, in cancel
    asyncResult.revoke()
  File "/venv/lib/python3.11/site-packages/celery/result.py", line 160, in revoke
    self.app.control.revoke(self.id, connection=connection,
  File "/venv/lib/python3.11/site-packages/celery/app/control.py", line 496, in revoke
    return self.broadcast('revoke', destination=destination, arguments={
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/celery/app/control.py", line 776, in broadcast
    return self.mailbox(conn)._broadcast(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/kombu/pidbox.py", line 330, in _broadcast
    chan = channel or self.connection.default_channel
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/kombu/connection.py", line 953, in default_channel
    self._ensure_connection(**conn_opts)
  File "/venv/lib/python3.11/site-packages/kombu/connection.py", line 458, in _ensure_connection
    with ctx():
  File "/.pyenv/versions/3.11.9/lib/python3.11/contextlib.py", line 158, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/venv/lib/python3.11/site-packages/kombu/connection.py", line 476, in _reraise_as_library_errors
    raise ConnectionError(str(exc)) from exc
kombu.exceptions.OperationalError: [Errno 111] Connection refused
Additional info:
  Request URL: PUT http://localhost:8080/api/v1/job/6617b7fcc2e0ea61cecb39f5/cancel
  Query string: 
  Remote IP: 172.17.0.1
  Request UID: 9249b22f-3c15-4605-a5a2-e247b74f0e3a
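
The traceback ends with ConnectionRefusedError raised inside kombu while Celery's revoke call tries to open a broker connection, so one quick check is whether the process handling the cancel request can open a TCP connection to the broker at all. A minimal sketch using only the standard library; the function name `broker_reachable` is hypothetical, and the host and port to pass in are whatever your own broker configuration uses, not values taken from this thread.

```python
import socket


def broker_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection raises OSError (including ConnectionRefusedError)
        # when nothing is listening or the host cannot be reached.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Running this from inside the girder container and from inside the worker container, with the same broker address each one is configured with, would show whether only one side fails to reach the broker. That is only a guess at where to look, not a confirmed diagnosis.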

@arjunrajlab
Collaborator Author

@manthey I also tried this again just now. I see the same error in the Girder logs:

[2024-04-23 12:17:25,617: INFO/MainProcess] Received task: girder_worker.docker.tasks.docker_run[cb510362-e700-4e17-b006-272744c06867]
/usr/local/lib/python3.6/dist-packages/celery/platforms.py:801: RuntimeWarning: You're running the worker with superuser privileges: this is
absolutely not recommended!
Please specify a different user using the --uid option.
User information: uid=0 euid=0 gid=0 egid=0
  uid=uid, euid=euid, gid=gid, egid=egid,
[2024-04-23 12:17:30,459: WARNING/ForkPoolWorker-16] creating new log file
2024-04-23 12:17:30,452 [INFO] WRITING LOG OUTPUT TO /root/.cellpose/run.log
2024-04-23 12:17:30,452 [INFO] 
cellpose version: 	2.2.3 
platform:       	linux 
python version: 	3.10.12 
torch version:  	2.1.0+cu121
[2024-04-23 12:17:30,466: WARNING/ForkPoolWorker-16] 2024-04-23 12:17:30,465 [INFO] TORCH CUDA version not installed/working.
2024-04-23 12:17:30,465 [INFO] >>>> using CPU
[2024-04-23 12:17:30,470: WARNING/ForkPoolWorker-16] 2024-04-23 12:17:30,466 [INFO] >> nuclei << model set to be used
[2024-04-23 12:17:30,580: WARNING/ForkPoolWorker-16] 2024-04-23 12:17:30,580 [INFO] >>>> model diam_mean =  17.000 (ROIs rescaled to this size during training)
[2024-04-23 12:17:30,729: WARNING/ForkPoolWorker-16] 2024-04-23 12:17:30,728 [INFO] ~~~ FINDING MASKS ~~~
[2024-04-23 12:17:34,042: WARNING/ForkPoolWorker-16] 2024-04-23 12:17:34,036 [INFO] >>>> TOTAL TIME 3.31 sec
[2024-04-23 12:17:35,364: WARNING/ForkPoolWorker-16] Uploading 513 annotations
progress=1 title=Running Cellpose info=1/1
[2024-04-23 12:17:36,131: WARNING/ForkPoolWorker-16] Invalid state transition to '3', Current state is '824'.
[2024-04-23 12:17:36,140: INFO/ForkPoolWorker-16] Task girder_worker.docker.tasks.docker_run[cb510362-e700-4e17-b006-272744c06867] succeeded in 10.495001778006554s: None

This looks like the same "Invalid state transition to '3', Current state is '824'." issue. See also the comment above from @bruyeret for more context on the PUT request. I am not sure why this is not reproducing on your setup. I am using localhost:5173 for the server and localhost:8080 for the girder domain, but we have noticed the error in a number of other setups as well.

@arjunrajlab
Collaborator Author

@manthey I could also get this up on AWS if you want to give it a try there.

@arjunrajlab
Collaborator Author

Update: @manthey has now been able to see the problem and is trying to get to the bottom of it.

@arjunrajlab arjunrajlab moved this to In Progress in Alpha Release May 18, 2024
@bruyeret bruyeret linked a pull request Jun 5, 2024 that will close this issue
@bruyeret
Contributor

bruyeret commented Jun 6, 2024

I fixed the endpoints for uploading annotations that we suspected to be the cause of the issue.
I made a worker upload 3000 annotations and sleep 2 seconds every 100 annotations (30 pauses, so about 60 s in total).
It works as expected and uploads everything in about 1 min.
But when I try to cancel, I get a 500 error from girder, and in the logs I see the same error as above.
What do you think, @manthey?
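
For anyone who wants to reproduce a slow, batched upload like this, a rough sketch of such a loop is below. It is not the test worker from the branch: `upload_batch` and `is_cancelled` are hypothetical callbacks standing in for whatever upload call and cancellation signal the worker has, and checking `is_cancelled` between batches is an assumption about how a worker could stop cooperatively, not something the current workers are known to do.

```python
import time
from typing import Callable, Sequence


def upload_in_batches(
    annotations: Sequence[dict],
    upload_batch: Callable[[Sequence[dict]], None],
    is_cancelled: Callable[[], bool],
    batch_size: int = 100,
    pause_seconds: float = 2.0,
) -> int:
    """Upload annotations batch by batch and return how many were uploaded.

    The loop stops before the next batch as soon as is_cancelled() is true.
    """
    uploaded = 0
    for start in range(0, len(annotations), batch_size):
        if is_cancelled():
            break
        batch = annotations[start:start + batch_size]
        upload_batch(batch)
        uploaded += len(batch)
        # Artificial delay so there is time to click cancel during the run.
        time.sleep(pause_seconds)
    return uploaded
```

With 3000 annotations and the defaults, the loop runs 30 batches and sleeps about 60 s in total, so a cancel request has plenty of time to arrive between batches.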

3 participants