Support S3 URI as Live.dir to store DVCLive data in cloud storage #676
Comments
I worked around this with a small helper.

Note that I'm using a little helper from my own open source lib here for syncing a local temp folder with S3. You could probably use PyTorch Lightning's fsspec abstractions instead.
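As a rough illustration of the fsspec route, here is a minimal sketch of such a sync step (the bucket, prefix, and function name are made-up placeholders, not the commenter's actual library; it assumes `s3fs` is installed so fsspec can handle the `s3` protocol):

```python
import fsspec  # s3fs must be installed for the "s3" protocol


def sync_dvclive_to_s3(local_dir: str, s3_uri: str) -> None:
    """Mirror a local DVCLive output folder to an S3 prefix."""
    fs = fsspec.filesystem("s3")
    # Recursively upload the folder contents under the given prefix.
    fs.put(local_dir, s3_uri, recursive=True)


# e.g. call this after each epoch or at the end of training:
# sync_dvclive_to_s3("dvclive", "s3://my-bucket/experiments/dvclive/")
```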
@aschuh-hf I thought there was a GH issue for this already, but I can't seem to find it. Do you plan to change the path for each experiment you run?
Yes. In this case, where experiments are executed as batch jobs, potentially at the same time (parallel experiment runs), in a cluster environment such as Ray, with results stored in cloud storage such as S3, I have to use a separate output path for each experiment. On the local machine, DVC is creating and managing the temp folders for me. My output paths are derived as follows:

```bash
version="$(date -u +%Y-%m-%d)-${DVC_EXP_BASELINE_REV:0:8}-${DVC_EXP_NAME}"
s3_dvc_dir="s3://${s3_bucket}/${s3_prefix}dvc/${version}/"
s3_log_dir="s3://${s3_bucket}/${s3_prefix}logs/${version}/"
s3_out_dir="s3://${s3_bucket}/${s3_prefix}state/"  # excl. version subdir because Ray Train appends it
```

The script then uses these paths. The following is run in a loop as long as the Ray job is not in a terminal state.
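A minimal sketch of what such a poll-and-sync loop could look like, assuming the job was submitted through Ray's Jobs API and the AWS CLI is available; the dashboard address, job id, paths, and interval below are placeholders, not the actual script:

```python
import subprocess
import time

from ray.job_submission import JobSubmissionClient, JobStatus

client = JobSubmissionClient("http://127.0.0.1:8265")  # Ray dashboard address (placeholder)
job_id = "raysubmit_example"                           # placeholder job id
s3_dvc_dir = "s3://my-bucket/prefix/dvc/2024-01-01-abcd1234-exp-name/"  # placeholder
local_dvclive_dir = "dvclive"

TERMINAL = {JobStatus.SUCCEEDED, JobStatus.FAILED, JobStatus.STOPPED}

while True:
    # Pull the latest DVCLive output from S3 into the local repo.
    subprocess.run(["aws", "s3", "sync", s3_dvc_dir, local_dvclive_dir], check=True)
    if client.get_job_status(job_id) in TERMINAL:
        break
    time.sleep(60)  # poll interval
```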
Nice @aschuh-hf! I was looking a bit into #237, and this looks like the same idea that's needed there. It's also related to the discussion in #638. It would be great to converge on a smoother experience here that doesn't require adding a helper to sync to and from cloud storage to track jobs run on ray or in other remote/distributed compute scenarios.
An integration with Ray Tune (and Ray Train) sounds fantastic. When using Ray Tune, I currently disable DVCLive because Tune already provides me with an experiment analysis object which basically lets me compare trials, e.g., as a Pandas dataframe. But it would indeed be better and more convenient if those trial results were nicely integrated with individual DVC experiment runs instead (across …).

The reason I chose not to write the DVCLive output to the trial working directory, and thus have to upload the artifacts to remote storage manually in the DVCLiveLogger, is that the trial folder names are more cryptic and I would need to use Ray to parse the experiment JSON files to figure out under which S3 prefix the results are stored, even though with Ray Train I only have one trial folder. Choosing my own upload path made it easier to download the data using …
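As an illustration of that choice, here is a small sketch of deriving a predictable upload prefix from DVC's experiment environment variables instead of relying on Ray's auto-generated trial directory names (the bucket and prefix are placeholders):

```python
import datetime
import os

# DVC sets these during `dvc exp run`; fall back to placeholders when run standalone.
exp_name = os.environ.get("DVC_EXP_NAME", "manual")
baseline_rev = os.environ.get("DVC_EXP_BASELINE_REV", "00000000")[:8]
version = f"{datetime.datetime.now(datetime.timezone.utc):%Y-%m-%d}-{baseline_rev}-{exp_name}"

# A human-readable prefix that the driver can reconstruct without parsing
# Ray's experiment JSON files to find the cryptic trial folder name.
s3_dvclive_prefix = f"s3://my-bucket/prefix/dvc/{version}/"
```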
Not having to use my own helper would be great! Another aspect is the handling of AWS credentials in this case for long-running training jobs. I haven't looked into how to enable the script itself to update credentials before they expire (so if DVC can handle that for me, even better), but basically I need the local …

A smoother integration may be to execute …

In case of DVC Studio, I imagine the train job would directly submit the data to the running Studio server. It's just that DVC CLI and the VS Code Extension have to actively obtain this data differently, I suppose.

Since I first set this up, we also changed our Ray cluster IAM roles to allow me access to the GitHub repository. Do you think it would be better to exchange intermediate data with DVC CLI / VS Code Extension via the GitHub repository? I'm just not sure whether it would be good to push so many intermediate experiment results as commits to the remote repository (unless this is actually not an issue at all, or the previous commit would be replaced, in the end leaving just one commit for each run).
Sorry for the long delay here @aschuh-hf! We are still thinking about this one but got caught up with other priorities.
Makes sense. I think it may be better for us to start with Ray Train here since it's simpler than managing results from many experiments at once. Have you considered using EFS to share the repo across the cluster so you can actually write dvclive output there? Otherwise, I think it should be possible to support S3 paths in dvclive, and we could document how to have your stage download those to the repo when training finishes. It doesn't feel like the cleanest solution to me but it could at least unblock you. We could also provide real-time updates via Studio. Not sure if that interests you, but it could be helpful for the general public to not have to write a helper to intermittently download results. cc @mnrozhkov
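As a rough sketch of the "download when training finishes" idea, the last step of the stage could pull the DVCLive output from S3 back into the repo, e.g. with fsspec (the URI and local directory are placeholders):

```python
import fsspec  # s3fs provides the "s3" protocol

# After training completes, copy the DVCLive output from S3 into the repo
# so local `dvc exp` tooling and plots can see it.
fs = fsspec.filesystem("s3")
fs.get("s3://my-bucket/prefix/dvc/my-experiment/", "dvclive", recursive=True)
```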
I am using Ray Jobs to execute training on EC2 worker nodes forming an auto-scaling Ray Cluster. I would like to save DVCLive output in a persistent remote storage location (AWS S3). If the Ray Job ran `dvc exp run`, the output could be saved at the end of training to the Git repo via `dvc exp push`. But if a failure occurred or training were interrupted, the output up to that point would only exist on the worker node, which will get terminated.

Further, in my setup, I am not actually running `dvc exp run` as the Ray Job command. Instead, I am running `dvc exp run --queue && dvc queue start` (using the CLI or VS Code Extension) on my local machine. The DVC queue task is a custom script which submits the training job and syncs the intermediate outputs from remote storage in S3 to the local machine at regular intervals until the job is in a terminal state. The advantage of doing this is that I can use local `dvc exp` commands as if the training tasks were running locally, e.g., to follow progress, compare live plots, etc. My custom script thus takes care of downloading the training job output from S3 to the local machine. PyTorch Lightning TensorBoard logs can be written directly to S3, as supported by `lightning.pytorch.loggers.tensorboard.TensorBoardLogger`, where `save_dir` can be a URI, and Ray Tune / Train state is uploaded by Ray by specifying a `storage_path` URI.

With Ray Train <2.5, I was using `RunConfig(local_dir=...)` alongside `SyncConfig(upload_dir=...)` to have Ray upload the DVCLive output located in `local_dir` from the local directory to the remote storage. This is deprecated since Ray 2.5. When the job is running on the Ray cluster head node, the `dvclive` output folder is still synced correctly, but when run on worker nodes it is not (even though it exists on the local EC2 instance drive, which I checked).

I could likely fix this by adjusting my Ray Train script to either save the DVCLive output to a folder inside the Ray `session.get_trial_dir()` or take care of uploading the DVCLive output myself.

However, ideally, DVCLive itself (or at least `dvclive.lightning.DVCLiveLogger`) would support URIs (S3 in my case) as an output `dir` for logged data, just as the `TensorBoardLogger` of PyTorch Lightning does (thanks to URI support in `torch.utils.tensorboard.SummaryWriter`).
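To make the request concrete, here is a sketch of the difference between what works today and what is being asked for (bucket and paths are placeholders, and it assumes `DVCLiveLogger` forwards `dir` to `Live`; the S3 `dir` is the requested behavior, not something that currently works):

```python
from lightning.pytorch.loggers.tensorboard import TensorBoardLogger
from dvclive.lightning import DVCLiveLogger

# Works today: SummaryWriter accepts URIs, so TensorBoard logs can go straight to S3.
tb_logger = TensorBoardLogger(save_dir="s3://my-bucket/experiments/tensorboard/")

# Requested: DVCLive accepting an S3 URI as its output directory
# (not supported yet; today `dir` must be a local path).
live_logger = DVCLiveLogger(dir="s3://my-bucket/experiments/dvclive/")
```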