-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[#4] Add VESSL Callback to Post Metrics to VESSL AI #6
Conversation
src/axolotl/utils/vessl_.py
Outdated
# default credential inside a VESSL Run | ||
credential_path = os.environ.get("VESSL_RUN_INITIAL_CONFIG") | ||
if credential_path: | ||
cfg.use_vessl = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In what case cfg.vessl_credential_path
is not None
and cfg.use_vessl
is False
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both are None
by default if not set to a value. I will check if adding the parameter to training yaml file may cause them to have a value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- if you want to add configs, you need to touch this file: https://github.com/Y-IAB/axolotl/blob/main/src/axolotl/utils/config/models/input/v0_4_1/__init__.py
- do we need both values? if cfg.vessl_credential_path is set, can't we consider it that vessl is enabled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- No, I don't want to add config. I just want it to be enabled automatically inside a VESSL Run.
- True, I will update it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but you are using cfg
here. for type checking, I think you need to add one
@@ -836,6 +836,13 @@ def get_callbacks(self) -> List[TrainerCallback]: | |||
SaveAxolotlConfigtoWandBCallback(self.cfg.axolotl_config_path) | |||
) | |||
|
|||
if self.cfg.use_vessl: | |||
from axolotl.utils.callbacks.vessl_ import VesslLogMetricsCallback |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you tell me why you added a suffix _
to vessl_
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They added underscore too on mlflow (mlflow_
) and wandb (wandb_
) package, so I think it's the convention for external integration in this repo.
@@ -369,6 +370,8 @@ def load_cfg(config: Union[str, Path] = Path("examples/"), **kwargs): | |||
|
|||
setup_mlflow_env_vars(cfg) | |||
|
|||
setup_vessl_env_vars(cfg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please note that usually, it is better you check the condition from where you call the function when the function does nothing when the condition is not met.
LGTM for now to keep it consistent since line 371 does the same as yours.
@@ -9,6 +9,6 @@ generated-members=numpy.*, torch.* | |||
|
|||
|
|||
[pylint.messages_control] | |||
disable=missing-function-docstring, line-too-long, import-error, | |||
disable=missing-function-docstring, line-too-long, import-error, too-many-lines, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added too-many-lines
because data.py
is already almost hitting the 1000 lines limit, and with puree dataset logic it exceeds the limit.
def setup_vessl_env_vars(cfg: DictDefault): | ||
# VESSL_RUN_INITIAL_CONFIG is a variable that contain path to | ||
# default credential inside a VESSL Run | ||
credential_path = os.environ.get("VESSL_RUN_INITIAL_CONFIG") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should not override cfg.vessl_credential_path
if it is already set
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if cfg.vessl_credential_path:
return
credential_path = os.environ.get("VESSL_RUN_INITIAL_CONFIG")
if credential_path:
cfg.vessl_credential_path = credential_path
logs: Dict[str, float], | ||
**kwargs # pylint: disable=unused-argument | ||
): | ||
if state.is_world_process_zero: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you explain where you copied this code from, please?
I am wondering why it is different to the following
https://github.com/vessl-ai/examples/blob/ebeae1c430509d619c380c56923c645cbd02f610/llama-factory/src/llmtuner/extras/callbacks.py#L162-L187
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I take it from wandb integration:
https://github.com/huggingface/transformers/blob/main/src/transformers/integrations/integration_utils.py#L824
The difference between world_process_zero
and local_process_zero
is explained here:
https://github.com/huggingface/transformers/blob/main/src/transformers/trainer_callback.py#L77
src/axolotl/utils/vessl_.py
Outdated
|
||
|
||
def setup_vessl_env_vars(cfg: DictDefault): | ||
# VESSL_RUN_INITIAL_CONFIG is a variable that contain path to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I cannot find any references explaining this variable. Can you attach a document pointing this variable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got it from container environment variables, will try to attach a screenshot
…o checkpoint-callback
Changes
VesslLogMetricsCallback
to push metrics of training to VESSL AI.Screenshot
https://screen.yanolja.in/sDrqmbTZGqipTMh4.png