Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Determine FSDP/deepspeed settings on device select. #883

Merged
merged 3 commits into from
Nov 29, 2023

Conversation

kallewoof
Copy link
Contributor

Without this, the OS env check for accelerate will fail.

Without this, the OS env check for accelerate will fail.
@NanoCode012
Copy link
Collaborator

Could you provide an example of the error that you're trying to fix and how to reproduce that?

@kallewoof
Copy link
Contributor Author

It's not a direct error, but a bug. Perhaps the behavior is intended, in which case it should be fixed in some other way.

In either case, unless I'm mistaken, the if case in

https://github.com/OpenAccess-AI-Collective/axolotl/blob/7ee3c4cacb1c7a0d92810247f014a7394e07db80/src/axolotl/utils/config.py#L37-L41

will never be true, unless the user manually does ACCELERATE_USE_something=true python -m ....

because choose_device() is called before setup_trainer() in every case.

@NanoCode012
Copy link
Collaborator

I'm not so clear about that line. As I recall, the only env for that is ACCELERATE_USE_DEEPSPEED.

@winglian
Copy link
Collaborator

thanks @kallewoof, I made a change to your PR, mostly on naming the new function and calling it between the config validation and normalization step (where the device setup happens)

@kallewoof
Copy link
Contributor Author

@winglian Looks good to me.

@kallewoof
Copy link
Contributor Author

kallewoof commented Nov 26, 2023

Actually, the normalization and validation should be swapped. Right now if you have e.g. a None eval_batch_size, you will get a warning about it being different from the batch size, even though the normalization part sets it to the same value if it is None. (This is unrelated to the fix in this PR, but it is something I've noticed and figured I'd mention since we're touching those lines.)

Edit: this:

eval_batch_size != micro_batch_size. This can lead to VRAM instability.

@winglian winglian merged commit 71b7ea3 into axolotl-ai-cloud:main Nov 29, 2023
4 checks passed
@winglian
Copy link
Collaborator

Actually, the normalization and validation should be swapped. Right now if you have e.g. a None eval_batch_size, you will get a warning about it being different from the batch size, even though the normalization part sets it to the same value if it is None. (This is unrelated to the fix in this PR, but it is something I've noticed and figured I'd mention since we're touching those lines.)

Edit: this:

eval_batch_size != micro_batch_size. This can lead to VRAM instability.

I'll address this in a different PR

@NanoCode012
Copy link
Collaborator

NanoCode012 commented Nov 29, 2023

Actually, the normalization and validation should be swapped. Right now if you have e.g. a None eval_batch_size, you will get a warning about it being different from the batch size, even though the normalization part sets it to the same value if it is None. (This is unrelated to the fix in this PR, but it is something I've noticed and figured I'd mention since we're touching those lines.)
Edit: this:

eval_batch_size != micro_batch_size. This can lead to VRAM instability.

I'll address this in a different PR

I think I just fixed this recently. #896

@kallewoof
Copy link
Contributor Author

Actually, the normalization and validation should be swapped. Right now if you have e.g. a None eval_batch_size, you will get a warning about it being different from the batch size, even though the normalization part sets it to the same value if it is None. (This is unrelated to the fix in this PR, but it is something I've noticed and figured I'd mention since we're touching those lines.)
Edit: this:

eval_batch_size != micro_batch_size. This can lead to VRAM instability.

I'll address this in a different PR

I think I just fixed this recently. #896

No, you circumvented one of the symptoms, but the underlying issue still remains. Normalization should happen before validation, to avoid having to put in code like in #896.

@kallewoof kallewoof deleted the 202311-prep-loader branch November 29, 2023 14:20
@winglian
Copy link
Collaborator

No, you circumvented one of the symptoms, but the underlying issue still remains. Normalization should happen before validation, to avoid having to put in code like in #896.

The reason we validate before normalization is because any errors that need to be surfaced to the user could be harder to understand since what they input could be indirectly different from the validation after normalization. I'm open to figuring out a better way.

@kallewoof
Copy link
Contributor Author

That makes sense. I think the opposite is the case too, though, where a user may become unnecessarily confused because an error or warning is raised that ultimately does not reflect the actual parameters used in training. Such as the VRAM warning mentioned above.

mkeoliya pushed a commit to mkeoliya/axolotl that referenced this pull request Dec 15, 2023
…#883)

* Determine FSDP/deepspeed settings on device select.

Without this, the OS env check for accelerate will fail.

* rename and move env setup call

* chore: lint

---------

Co-authored-by: Karl-Johan Alm <[email protected]>
Co-authored-by: Wing Lian <[email protected]>
djsaunde pushed a commit that referenced this pull request Dec 17, 2024
* Determine FSDP/deepspeed settings on device select.

Without this, the OS env check for accelerate will fail.

* rename and move env setup call

* chore: lint

---------

Co-authored-by: Karl-Johan Alm <[email protected]>
Co-authored-by: Wing Lian <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants