Update
yirenlu92 committed Sep 23, 2024
1 parent 7d70c41 commit d40e717
Showing 1 changed file with 3 additions and 5 deletions.
8 changes: 3 additions & 5 deletions 06_gpu_and_ml/long-training/long-training.py
@@ -92,9 +92,6 @@ def train_model(data_dir, checkpoint_dir, resume_from_checkpoint=None):
 # Next, we define the training function running on Modal infrastructure. Note that this function has the volume mounted on it.
 # The training function checks in the volume for an existing latest checkpoint file, and resumes training off that checkpoint if it finds it.
 # The `timeout` parameter in the `@app.function` decorator is set to 30 seconds for demonstration purposes. In a real scenario, you'd set this to a larger value (e.g., several hours) based on your needs.
-# If the function times out, or if the job is [preempted](/docs/guide/preemption#preemption), the main loop below will catch the exception and attempt to resume training from the last checkpoint.
-
-
 @app.function(
     image=image,
     # mounts=[train_script_mount],
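
The comments in this hunk describe looking in the mounted volume for the latest checkpoint and resuming from it. The diff does not show how the example picks that file, so the following is only a minimal sketch of one way to do it. The helper name `find_latest_checkpoint`, the `*.ckpt` pattern, and the choice of "most recently modified" as the meaning of "latest" are all assumptions, not taken from the repository.

```python
from pathlib import Path
from typing import Optional


def find_latest_checkpoint(checkpoint_dir: str, pattern: str = "*.ckpt") -> Optional[Path]:
    """Return the most recently modified checkpoint file, or None if there is none.

    Hypothetical helper: the file pattern and the mtime-based ordering are
    assumptions made for illustration.
    """
    root = Path(checkpoint_dir)
    if not root.is_dir():
        # No checkpoint directory yet, so there is nothing to resume from.
        return None
    candidates = [p for p in root.glob(pattern) if p.is_file()]
    if not candidates:
        return None
    # Pick the file with the newest modification time.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

A caller could pass the result as `resume_from_checkpoint` to `train_model` and start from scratch when it is `None`.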
@@ -130,9 +127,10 @@ def train():
 # ## Run the model
 #
 # We define a [`local_entrypoint`](https://modal.com/docs/guide/apps#entrypoints-for-ephemeral-apps)
-# to run the training. The entrypoint handles job preemptions and timeouts, and keeps attempting to run the training until it is able to keeps attempting to run the training until it is able to complete successfully.
+# to run the training.
+# If the function times out, or if the job is [preempted](/docs/guide/preemption#preemption), the loop will catch the exception and attempt to resume training from the last checkpoint.

-# You can run this locally with `modal run long-training.long-training --detach`
+# You can run this locally with `modal run 06_gpu_and_ml.long-training.long-training --detach`
 # This runs the code in detached mode, allowing it to continue running even if you close your terminal or computer. This is important since training jobs can be long.


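
The comment added in this hunk describes a loop that catches a timeout or preemption and retries from the last checkpoint. The entrypoint body is not part of the diff, so the sketch below only illustrates the general pattern in plain Python, without calling any Modal API. `TransientFailure`, `run_until_complete`, and `make_fake_trainer` are hypothetical names; the real example would catch whatever exceptions Modal raises for timeouts and preemptions.

```python
import time
from typing import Callable


class TransientFailure(Exception):
    """Hypothetical stand-in for a timeout or preemption error."""


def run_until_complete(train_once: Callable[[], int],
                       max_attempts: int = 5,
                       backoff_seconds: float = 0.0) -> int:
    """Call train_once repeatedly until it returns without being interrupted."""
    for attempt in range(1, max_attempts + 1):
        try:
            return train_once()
        except TransientFailure as exc:
            # The next call is expected to pick up from the saved checkpoint.
            print(f"attempt {attempt} interrupted ({exc}); retrying")
            time.sleep(backoff_seconds)
    raise RuntimeError(f"training did not complete after {max_attempts} attempts")


def make_fake_trainer(total_epochs: int, epochs_per_run: int) -> Callable[[], int]:
    """Build a simulated training function that is interrupted every few epochs."""
    state = {"epoch": 0}  # plays the role of the checkpoint stored on the volume

    def train_once() -> int:
        start = state["epoch"]  # resume from the last saved epoch
        for epoch in range(start, total_epochs):
            if epoch - start >= epochs_per_run:
                raise TransientFailure(f"interrupted at epoch {epoch}")
            state["epoch"] = epoch + 1  # "save a checkpoint" after each epoch
        return state["epoch"]

    return train_once
```

Capping the number of attempts is a design choice made here so that a job that can never finish within its timeout fails loudly instead of retrying forever; the original example may behave differently.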