
Don't skip training test #751

Open
constantinpape opened this issue Oct 19, 2024 · 17 comments

@constantinpape (Contributor)

The training test is very slow since yesterday. I am skipping it for now, see:
https://github.com/computational-cell-analytics/micro-sam/blob/master/test/test_training.py#L13-L14

But we should figure out why it's slower and reactivate it so that we notice if some change affects training.

I am not sure what the reason for this slowdown is. Nothing has changed in the training code, it seems to happen irrespective of the PyTorch version (so it is not specific to PyTorch 2.5), and everything still works normally for me locally.

@DavidMed13

Hi, I'm David Medina, from México. I'm trying to train my own model, but the training is not moving forward. I don't know if this comment is related to the issue I'm having. By the way, great work with micro-sam!

[Screenshot: 2024-10-24, 10:10:54 p.m.]

@anwai98 (Contributor) commented Oct 25, 2024

Hi @DavidMed13,

Thanks for your interest in micro-sam.

Can you run the following code in a cell in your notebook and share the output with us?

from micro_sam.util import _get_default_device
print(_get_default_device())
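For context, helpers like this typically prefer a CUDA GPU, then Apple's MPS backend, and fall back to the CPU. Below is a hypothetical sketch of such a helper, not micro-sam's actual implementation; the `torch` import is guarded so the snippet also runs where PyTorch is absent:

```python
def pick_default_device() -> str:
    """Return the name of the best available compute device.

    Hypothetical sketch: prefer CUDA, then Apple's MPS backend,
    then fall back to the CPU. Not micro-sam's actual code.
    """
    try:
        import torch  # guarded: fall back to "cpu" if torch is missing
        if torch.cuda.is_available():
            return "cuda"
        mps = getattr(torch.backends, "mps", None)
        if mps is not None and mps.is_available():
            return "mps"
    except ImportError:
        pass
    return "cpu"

print(pick_default_device())
```

On an Apple-silicon MacBook with PyTorch installed this would typically print `mps`, which matters for the discussion below.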

@DavidMed13

Sure!
[Screenshot: 2024-10-25, 1:35:06 p.m.]

@constantinpape (Contributor, Author)

Hi @DavidMed13, your issue is likely due to the fact that training on the MPS device is very slow. You can try to train on the CPU by setting device="cpu". But please note that training on the CPU will still be quite slow: you will likely have to wait several hours, and you will also need sufficient main memory (ideally >=32 GB).

This is why we recommend using a GPU for training. If you don't have access to a GPU you can use cloud resources for this, see https://github.com/computational-cell-analytics/micro-sam/blob/master/notebooks/sam_finetuning.ipynb for details.

@DavidMed13

Hello, thank you for the information. I have tried putting it on the CPU, but it still won't advance; I don't know if I'm doing something wrong.

If I use the same code from the notebook, I get this error:
[Screenshot: 2024-11-08, 1:07:35 p.m.]

But if I delete "rois" it works, though I don't know if that modification is the problem, because I guess it shouldn't take so much time; it is a small training run with only 5 images with labels.
[Screenshot: 2024-11-08, 1:06:13 p.m.]

I have tried to replicate this in Google Colab and Kaggle, but I always get errors importing the packages.

I really want to create this fine-tuned model, because it is for microglia. I would really appreciate it if you could help me; I feel like I have tried everything, haha.

Thank you in advance :D

@anwai98 (Contributor) commented Nov 8, 2024

Hi @DavidMed13,

I think we fixed this issue in our latest release (where the dataloader accepts all supported arguments).

To make sure of this, could you run the following command in your terminal and share the output with us?

python -c "import micro_sam; print(micro_sam.__version__)"
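To interpret the printed version, dotted version strings can be compared as tuples of integers. The sketch below is generic illustration only; the threshold `(1, 1, 0)` is a placeholder, not the actual release that contains the fix (the thread does not state one):

```python
def parse_version(version: str) -> tuple:
    """Turn a dotted version string like '1.1.2' into a comparable tuple.

    Simplified sketch: pre-release suffixes (e.g. 'rc1') are not handled.
    """
    parts = []
    for piece in version.split("."):
        digits = "".join(ch for ch in piece if ch.isdigit())
        parts.append(int(digits) if digits else 0)
    return tuple(parts)

# Placeholder threshold -- check the release notes for the real minimum.
MINIMUM = (1, 1, 0)
print(parse_version("1.1.2") >= MINIMUM)  # → True: 1.1.2 is at least 1.1.0
```

In practice you would pass `micro_sam.__version__` instead of the literal string.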

@DavidMed13

Sure :)
[Screenshot: 2024-11-08, 2:49:38 p.m.]

@anwai98 (Contributor) commented Nov 8, 2024

Okay, looks like you are using the latest micro-sam already. Hmm, that's strange.

I'll try to reproduce the issue you mentioned above.

Meanwhile, could you confirm two things for us: how did you install micro-sam, and are you using the latest version of the finetuning notebook?

@DavidMed13

The installation was via mamba in an environment, and yes, I'm using the latest fine-tuning notebook :)

@anwai98 (Contributor) commented Nov 8, 2024

Hi @DavidMed13,

Thanks for sharing the details.

Another request: could you run this and send me the outputs?

python -c "import inspect; from micro_sam.training.training import default_sam_loader; print(inspect.getsource(default_sam_loader))"

@DavidMed13

Of course

[Screenshot: 2024-11-08, 3:19:12 p.m.]

And thank you a lot @anwai98

@anwai98 (Contributor) commented Nov 8, 2024

Ah yeah, I see the issue now. It seems we recently updated this part.

Could you install micro-sam from source and try again? (See the suggestion below.)

Since you already have micro-sam installed, it is fairly easy to install the package from source:

  • Clone our repo
  • Enter the repo and install micro-sam in development mode
mamba activate <INSTALLED_ENVIRONMENT_NAME>  # make sure to activate the environment where micro-sam is already installed
git clone https://github.com/computational-cell-analytics/micro-sam
cd micro-sam
pip install -e .
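After the editable install, one way to confirm that Python now resolves the package from the cloned repo rather than the previously installed copy is to check where the module would be imported from. This is a generic, stdlib-only check; `micro_sam` and the repo path in the commented example are just the assumed names from the steps above:

```python
import importlib.util


def resolves_inside(package: str, directory: str) -> bool:
    """Return True if `package` would be imported from a file under `directory`.

    Useful after `pip install -e .` to confirm that the editable (source)
    checkout is the copy Python will actually load.
    """
    spec = importlib.util.find_spec(package)
    return (
        spec is not None
        and spec.origin is not None
        and spec.origin.startswith(directory)
    )


# Hypothetical usage with the names assumed from the steps above:
# print(resolves_inside("micro_sam", "/path/to/micro-sam"))
```

If this returns False after the editable install, the wrong environment is probably active.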

@DavidMed13

Okay, done! Should I try the notebook again?

[Screenshot: 2024-11-08, 3:31:15 p.m.]

@anwai98 (Contributor) commented Nov 8, 2024

Yes, let's try training the model again and see if the error is (hopefully) gone!

EDIT: We should now try with the rois parameter in the default_sam_loader.

@DavidMed13

Okay, it worked!
Now I don't know how much time I should wait, since I'm on my MacBook. Like 4 hrs?
Yes, I used the rois hehe :)
[Screenshot: 2024-11-08, 3:39:08 p.m.]

And now I'm here:
[Screenshot: 2024-11-08, 3:39:42 p.m.]

@anwai98 (Contributor) commented Nov 8, 2024

Okay, that's great to see that the error is gone.

Regarding the training runtime: you should see the progress bar moving forward (i.e. processing a few iterations) within a couple of minutes, and the overall training would probably take a couple of hours. If it is slower than you would like, I can suggest a) using Kaggle (it provides reasonable GPU resources free of charge) or b) using a compute cluster, in case you have access to one. Either would take the training load off your laptop.

@DavidMed13 commented Nov 8, 2024 via email
