Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[#4] Add VESSL Callback to Post Metrics to VESSL AI #6

Merged
merged 15 commits into from
Apr 23, 2024

Conversation

rifqiyan
Copy link
Collaborator

@rifqiyan rifqiyan commented Mar 27, 2024

Changes

  • Add VesslLogMetricsCallback to push metrics of training to VESSL AI.
  • Configure Axolotl to automatically enable the callback inside VESSL Run.

Screenshot

https://screen.yanolja.in/sDrqmbTZGqipTMh4.png

@rifqiyan rifqiyan self-assigned this Mar 27, 2024
# default credential inside a VESSL Run
credential_path = os.environ.get("VESSL_RUN_INITIAL_CONFIG")
if credential_path:
cfg.use_vessl = True

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In what case cfg.vessl_credential_path is not None and cfg.use_vessl is False?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both are None by default if not set to a value. I will check if adding the parameter to training yaml file may cause them to have a value.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. if you want to add configs, you need to touch this file: https://github.com/Y-IAB/axolotl/blob/main/src/axolotl/utils/config/models/input/v0_4_1/__init__.py
  2. do we need both values? if cfg.vessl_credential_path is set, can't we consider it that vessl is enabled?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. No, I don't want to add config. I just want it to be enabled automatically inside a VESSL Run.
  2. True, I will update it.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but you are using cfg here. for type checking, I think you need to add one

@@ -836,6 +836,13 @@ def get_callbacks(self) -> List[TrainerCallback]:
SaveAxolotlConfigtoWandBCallback(self.cfg.axolotl_config_path)
)

if self.cfg.use_vessl:
from axolotl.utils.callbacks.vessl_ import VesslLogMetricsCallback

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you tell me why you added a suffix _ to vessl_?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They added underscore too on mlflow (mlflow_) and wandb (wandb_) package, so I think it's the convention for external integration in this repo.

@@ -369,6 +370,8 @@ def load_cfg(config: Union[str, Path] = Path("examples/"), **kwargs):

setup_mlflow_env_vars(cfg)

setup_vessl_env_vars(cfg)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please note that usually, it is better you check the condition from where you call the function when the function does nothing when the condition is not met.
LGTM for now to keep it consistent since line 371 does the same as yours.

@@ -9,6 +9,6 @@ generated-members=numpy.*, torch.*


[pylint.messages_control]
disable=missing-function-docstring, line-too-long, import-error,
disable=missing-function-docstring, line-too-long, import-error, too-many-lines,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added too-many-lines because data.py is already almost hitting the 1000 lines limit, and with puree dataset logic it exceeds the limit.

def setup_vessl_env_vars(cfg: DictDefault):
# VESSL_RUN_INITIAL_CONFIG is a variable that contain path to
# default credential inside a VESSL Run
credential_path = os.environ.get("VESSL_RUN_INITIAL_CONFIG")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should not override cfg.vessl_credential_path if it is already set

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if cfg.vessl_credential_path:
    return
credential_path = os.environ.get("VESSL_RUN_INITIAL_CONFIG")
if credential_path:
    cfg.vessl_credential_path = credential_path

logs: Dict[str, float],
**kwargs # pylint: disable=unused-argument
):
if state.is_world_process_zero:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you explain where you copied this code from, please?
I am wondering why it is different to the following
https://github.com/vessl-ai/examples/blob/ebeae1c430509d619c380c56923c645cbd02f610/llama-factory/src/llmtuner/extras/callbacks.py#L162-L187

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.



def setup_vessl_env_vars(cfg: DictDefault):
# VESSL_RUN_INITIAL_CONFIG is a variable that contain path to

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cannot find any references explaining this variable. Can you attach a document pointing this variable?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got it from container environment variables, will try to attach a screenshot

@rifqiyan rifqiyan merged commit 85564fb into main Apr 23, 2024
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants