dsync not updating times correctly #560
Changed the subject since the root problem here is the timestamp info not being preserved. Here is a more comprehensive example. There is an odd mix of correct and incorrect timestamps in the destination.
Hi, @dvickernasa. Great to hear mpiFileUtils has been helpful. Thanks. dsync should be setting the atime and mtime on the destination files to match the source files. However, it does that as the very last step, after all files have been copied. I wonder whether the job might have been interrupted or hit a fatal error before it could complete that step. Do you see any such errors or early termination in your job output log? Tip: when copying a large number of files, I recommend using the --batch-files option, as in:
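For illustration, the invocation would look something like this (paths are placeholders; the batch size of 100k is the value discussed later in this thread):

```
# illustrative only: copy in batches of 100,000 files,
# so timestamps are set at the end of each batch
dsync --batch-files 100000 /path/to/source /path/to/destination
```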
This changes dsync to copy files in batches rather than all files at once. It sets the timestamps at the completion of each batch. If the job is interrupted, one can then execute the same dsync command again. dsync skips copying any files that were completed in earlier runs. By default, it avoids copying any file whose destination exists and has a matching file type, size, and mtime.
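Since the default skip check compares file type, size, and mtime, those three fields can be spot-checked by hand on a source/destination pair, for example (placeholder paths):

```
# print type, size in bytes, mtime (epoch seconds), and name for both copies
stat -c '%F %s %Y %n' /path/to/source/file /path/to/destination/file
```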
That is possible. Here is the end of the output from the dsync run:
I'll start using the --batch-files option. I'm attempting to recover from this using an rsync command to update the timestamps.
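One plausible form of such an rsync command (placeholder paths): with --size-only, rsync should treat same-size files as up to date and skip the data transfer, while -a still corrects the mtime on files whose times differ.

```
# hypothetical sketch: repair timestamps without re-copying file data
rsync -a --size-only /path/to/source/ /path/to/destination/
```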
But it's going fairly slow. Would just repeating the dsync fix the timestamps?
Actually, the data I copied was done with two separate dsync commands. Any recommendations for fixing the timestamps would be appreciated. The rsync command I posted above should do it, but it's slow. I'm thinking dsync won't work, given what the output contains.
dsync-ccdev.out.2023-09-29.txt
I've been doing some tests this afternoon using a smaller dataset. Basically, I did this:
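Roughly, the sequence was something like this (placeholder paths):

```
# hypothetical reconstruction of the test
dsync /path/to/source /path/to/destination          # initial copy
dsync /path/to/source /path/to/destination          # second pass; should copy nothing
rsync -avn /path/to/source/ /path/to/destination/   # dry run: would rsync re-send anything?
```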
Even though no actual data was transferred for the 2nd dsync (I think; the 2nd transfer went fast), the timestamps in the destination got updated to their source timestamps. I used rsync after the 2nd dsync to see if it found any files it wanted to transfer again and it did not, indicating that the size and times of all files did indeed get updated. I've restarted the dsyncs of my original, large datasets, so we'll see what happens.
Ok, thanks for the update. I'm still puzzled as to why the timestamps aren't being set correctly the first time. How many nodes and processes are you using?
Does it do a better job at setting timestamps the first time if you add the --batch-files option?
@adammoody is there a reason not to set the timestamps on a file immediately after it is finished processing? That would avoid the risk of missing them being set, and also reduces overhead on the filesystem because it doesn't need to do another set of filename lookups on all of the files. It is understandable that the timestamps on the directories should be deferred until all of the entries in the directory are updated (and also to avoid setting the directory timestamp millions of times for a large directory), but that is not the case for regular files. At worst, if a large file is being written by multiple MPI ranks concurrently, the timestamp might be set multiple times, but that is fine given it only happens for very large files, and one of the threads will be the last one writing and will reset the timestamp when it is finished.
I'm using 16 cores on a single workstation for these transfers.
In my small tests, I wasn't using --batch-files. The 2nd dsync for the large dataset just finished OK. The intent here was just to correct the times. For some reason, this happened:
The source data is almost completely static, so I'm not sure why dsync decided it needed to delete and recopy about half of the data. Perhaps the incorrect dates from the first run threw it off? It's hard to tell for sure, but it looked like there were far fewer than 50% of the files with wrong dates in the destination from the 1st dsync. Any thoughts on this mass deletion would be welcome.
Regarding all of those deletes, I think that second dsync detected the mismatch in mtime between the source and destination files. It then assumes the files are different, deletes the destination files, and re-copies those files from the source. It would obviously have been smoother if the first dsync had properly set the mtimes. From your output logs, I don't see the types of errors that I'd expect to see if the job had been interrupted. It prints the message that it has successfully updated all timestamps, so it looks like a bug to me that it somehow failed. One guess is that perhaps a trailing write came in to the file after the timestamp update had been processed on the metadata server. That shouldn't be happening, but maybe it is, though I don't think that would happen for a single node run. I'll also run some tests to see if I can reproduce that locally.
Thank you for following along and for your suggestions, @adilger. Yes, we currently set the timestamps once at the end to handle files that are split among multiple ranks. We don't have communication to know when all ranks are done with a given file, so instead we wait until all ranks are done with all files. The --batch-files option limits that wait to one batch at a time.

Picking the best batch size is not well defined. It effectively acts like a checkpoint operation, so it makes sense to pick a batch size based on the expected runtime and failure rate. It also depends on the amount of data and the file sizes. I suggested 100k here since there were 4M files in total, so 100k just seemed like a decent checkpoint frequency. Pretty arbitrary...

I like the idea of setting the timestamp multiple times from multiple ranks. That does add extra load on the metadata server, but that's probably not a huge hit. However, a remaining challenge is to figure out how to handle setting other properties like ownership, permission bits, and extended attributes. We set those at the end, and then we set the timestamps as the very last step. This ensures that the mtime on the destination only matches the source after the file has been fully copied (all data and attributes). I know some properties can't be set too early, since, for example, some of them can disable write access to the destination file. And I think we'd need to consider the case where dsync might be interrupted and leave a destination file with the correct size and mtime before all ranks had actually completed their writes. (The file size can be correct since the last bytes in the file might actually be written first.) We could perhaps write to a temporary file and rename it to handle that, but then I think we run into a similar problem of knowing when it's time to do the rename. Do you have more insight into those?
Given this, I think @adilger's suggestion of writing the times right after the file write makes a lot of sense if it's not too much of a performance hit. If dsync's behavior is to delete the file and start over again when the times don't match, it would be very beneficial to get the times updated ASAP, so that if the transfer is interrupted, the next dsync can avoid as much re-transfer as possible. One other possibility is implementing the equivalent of rsync's handling of partially transferred files.
One option for detecting partial file transfer would be to check the blocks allocated to the file against the source file. If they are grossly smaller on the target, then it is clearly not finished copying the interior. Alternatively (and this might be better for performance reasons as well), you could make the work distribution function smarter and have only a single rank per OST object, copying the objects linearly to avoid holes. With modern Lustre, large O_DIRECT writes are much faster than buffered writes (20 GB/s vs. 2 GB/s). Also, I imagine you only split a file into multiple parts to copy in parallel once it is over 1 GiB in allocated space, so the cost of doing an MPI communication to the other ranks is pretty tiny compared to the file data moved, and making the client do multiple locks/opens for these files is probably more work in total.
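As a rough illustration of the allocated-blocks idea (placeholder paths): a partially copied or hole-filled destination reports far fewer allocated blocks than the source even when the apparent sizes already match.

```
# apparent size (%s, bytes) vs. allocated blocks (%b, 512-byte units)
stat -c 'size=%s blocks=%b %n' /path/to/source/file /path/to/destination/file
```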
Sorry, I know this isn't directly related to this issue, but I have a question about the progress output.
Note that the percentage progresses along fairly linearly, then suddenly jumps a lot and finishes much earlier than predicted. I see this happen a lot. Do you know what's going on there? My 2nd dsync of the large dataset finished and I have an rsync running to verify it doesn't find anything else to transfer. It's going to take a while for that to finish, but it's been running most of the day and it hasn't found anything yet, which is a good sign. My original rsync verification started finding files it wanted to transfer almost right away.
The missing progress messages are a known problem. This is also noted in #530.

These progress messages are implemented using a tree-based communication pattern across ranks using point-to-point messages. Every 10 seconds or so, rank 0 initiates a message down the tree to request progress. Each intermediate rank receives that request message from its parent and forwards it down to its children. When the request message reaches the leaves, ranks respond and send their current progress status back up the tree. Intermediate ranks receive that data, merge in their own progress status, and respond back up the tree. Rank 0 receives the messages from its children, merges its data, and prints the final result. These reductions are asynchronous. Ranks periodically check for and respond to these messages as they come in.

I think what probably happens for these progress messages to get stalled is that some rank (any rank) gets stuck in an I/O call. Because it is blocked in the I/O call, it doesn't respond to MPI messages, which then blocks the progress tree. If that's what's happening, one solution would be to move the I/O calls into a pthread or maybe use async I/O routines so that the process can still make MPI calls while waiting on its current I/O operation to complete.
Is it possible to set a timeout on the MPI broadcast that doesn't also result in a total MPI failure? Since the status updates are somewhat advisory, unlike an HPC calculation, getting a partial status is probably fine.
Yes, I think timeouts on the reduction operation could also work. I think we'd need to pair that with adding sequence numbers to the messages or internal counters so that a parent can distinguish late-arriving responses from its children. It might also get strange, since the progress "sum" would be missing any results from the subtree rooted at the blocked rank. The reported progress would appear to bounce up and down during the run if such stalls are frequent, as random ranks come and go. That might be countered by caching the last reported value from each child. I'd have to check whether that makes sense for the various reduction ops... There may be another advantage to moving the I/O to a thread: for algorithms that use libcircle for work stealing, work on blocked ranks could be moved to other ranks faster. Having said that, changing all of the tools to move I/O to a separate thread is a bigger overhaul.
Oh, wait. I'm describing the point-to-point reduction in libcircle: https://github.com/hpc/libcircle/blob/406a0464795ef5db783971e578a798707ee79dc8/libcircle/token.c#L97 There is a separate progress reduction algorithm in mpiFileUtils, and I forgot that it uses non-blocking MPI collectives for that: https://github.com/hpc/mpifileutils/blob/main/src/common/mfu_progress.c That'd probably first need to be switched over to a point-to-point method (like in libcircle) to add timeouts.
Since this is getting a bit involved, let's migrate the discussion about missing progress messages to issue #530.
I've got a few more datapoints on setting the times. First of all, the rsync verification of our 141 TB, 2.7M file dataset is still going. It's been running since 10/5. We closed off the permissions of the top level directory to ensure people can't access the data. So far, rsync has found about a couple dozen files that it wanted to re-transfer. Over the weekend, we had another few sets of data we needed to transfer, each set containing about 250 GB and 1.7M files. Using a single node, we can transfer one set in less than an hour using dsync. However, we found more indications of timestamps getting set incorrectly. This time we kept re-running dsync; after 2 or 3 runs of dsync finding new files to transfer, we got to the point where dsync itself no longer found new files to transfer. This was still not using --batch-files.
To be sure, your latest transfer of 1.7M files saw mismatched timestamps, and this only ran dsync using a single node, right? That info helps to eliminate one theory I had. And this is a Lustre-to-Lustre transfer, right? Did you see any errors reported in the output of that first transfer?
Correct - single node and Lustre to Lustre.
Sorry, to answer your questions about errors: yes, there were some errors. They did a couple of intermediate transfers while the datasets were in use and being modified, in an effort to reduce the time it took to transfer the data once the runs were completed.
Although I found this odd (the source data was smaller, not larger), we didn't spend any time digging into it. Once the jobs were done and the data was static, the transfer showed just a couple of errors.
Which is kind of strange, since the source and destination directories were both owned by the person doing the transfer. One more dsync run got all the files and finished without errors.
FYI, I started another transfer on Thursday. This one was quite large with respect to the number of files (44.7M).
I used --batch-files 100000 for this one.
The scratch filesystem is NFS, so this is an NFS-to-Lustre transfer. Unfortunately, it looks like this one died sometime on Saturday.
I'm not sure why it died yet - I'll be digging into that some more. However, after poking around in the partially transferred directory structure, it looks like all directories are owned by root, or at least most of them. Here is an example of the top-level directory.
Is this expected?
A few updates. The first dsync run of this latest transfer died due to an OOM event on the machine I was running on. I think it was dsync itself that caused the OOM, since this was a pretty isolated machine and nothing else was running there, but I'm not 100% sure of that. I didn't mention it before, but this is also a single node run. The dsync has been restarted and, unfortunately, I'm having similar issues: dsync decided to delete a lot of what it already transferred.
I'm attaching the full output for both the original run and the restart for this latest transfer.
dsync-em1.out.2023-10-16-09-50.txt
Ugh, that's frustrating. I suspect that the timestamps for the files on that first run are still not being set correctly. It looks like the first run successfully completed a number of 100k file-transfer batches before it died (193 batches to be precise).
So at most it should have had to delete about 100k files, which would be the files just from the batch being worked on when the run died. Instead, it thought 11 million files needed to be replaced. It seems like it's either missing setting the timestamps on some files, or perhaps it's hitting an error and not reporting it. I'll review the code again with an eye for those kinds of problems. You might as well drop that for now.

Having root ownership left on the directories is expected, because the directory permissions are not updated until all files have been transferred. Since the first run didn't complete, it didn't get a chance to change the ownership on the directories. However, it should have changed ownership and set timestamps on all of those previously copied files. It'd be helpful if we could find a reproducer where you don't have to transfer a ton of data. You mentioned some smaller transfers worked on the first attempt. Do you have any idea of how large you have to get before you see things break?
No problem and I will drop this for future runs.
Yes, I agree. I will see if I can find a small-ish dataset to reproduce this.
This will produce some long output, but let's add some debug statements to print info when dsync sets the timestamp on a file. Can you modify the code to add that?
Then execute a dsync, and if you find any files have mismatching timestamps, look for those same paths in the output. Let's verify that a line is actually printed and that the input timestamps are correct. BTW, it seems like it's not too hard to find files with mismatched timestamps, but if you need to scan a lot of files manually, we can probably automate that. No need to copy the actual output here, but let me know what you find.
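As one hypothetical way to scan two trees for mtime mismatches with standard GNU find/stat (placeholder paths):

```
# list files whose mtime differs between source and destination
SRC=/path/to/source DST=/path/to/destination
cd "$SRC" &&
find . -type f -printf '%T@ %p\n' | while read -r mtime path; do
  dst=$(stat -c '%Y' "$DST/$path" 2>/dev/null)
  [ "${mtime%.*}" != "$dst" ] && echo "mtime mismatch: $path"
done
```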
@dvickernasa, have you had any transfers reproduce the Lustre timestamp problem?
Sorry, this fell off my radar. The "production" transfers I needed to do are done, and I didn't get back to my testing. I'm restarting a transfer as a test. I'm going to recreate the original transfer this time, including the original options I used (most notably without --batch-files).
OK, I think we caught a few files this time. FYI, the source was closed off to regular users so I'm certain it was static.
Here is the end of the first transfer:
And the output from immediately running the second dsync:
One thing stood out to me. The output said it was deleting and re-transferring 3 files, but the subsequent output shows 6 files. I picked one of those files and grepped it out of the output.
Even with compression, the output is too large to attach in one piece, so I split it.
dsync-testing.2023-11-03T13-27-57.out.splitaa.gz
Thanks for running these additional tests. I think there are just 3 files, but it prints two debug messages for each file. Looking at the output from the first dsync, I see these lines for one of those files:
And then for that file in the second dsync, I see these 4 lines:
So for this particular file, it looks like dsync succeeded each time that it set the timestamps. The mtime values did not change between the two runs. I do see that the access time changed between the two runs. Given that the mtime is the same, I'm not sure why dsync decided to delete and re-copy the file. Perhaps the file size was different between the two? Given that the mtime didn't change on the source file, and knowing that the file system was locked down, I really doubt the size of the source file changed. Maybe dsync never fully copied the file to the destination in the first run, even though it thought it had? It would have been a good idea to print the file size, too, so that I could verify. There might be a couple of other checks, like whether the file type came up different somehow. I'll look to add that.
@dvickernasa, I've updated the debugtimestamps branch. I merged a commit which required a force push, so hopefully that doesn't trip you up. https://github.com/hpc/mpifileutils/tree/debugtimestamps Since the mtime handling in the output of your last test looked to be correct, I've disabled the messages that print getting/setting timestamps on each file so that it's not as noisy. I added new messages to print what difference dsync finds between the source/destination when it decides to replace a destination file. If you have a chance, would you please kick off a repeat of that test? I realize that takes a while to run, so I appreciate your time. Thanks.
Oh, this branch does include the new option.
I've compiled the update and I'll repeat the test. I'm not going to use the new option for now, just to be consistent. I'll email the output for the latest results when it's done.
Thanks, @dvickernasa. In this latest test, the second dsync copied 3 files. The new "DIFF" debug statements show that it detected a difference in the size/mtime check:
Here the file size values match, but the mtime values are different. The destination mtimes are all much more recent than the source file mtimes, and the destination mtime values look like they probably match up with the time at which the dsync was running. From your previous test, we saw that dsync had correctly set the mtime on all files, including the 3 files that were copied by the second run. Unfortunately, I can't verify whether the mtime values had been set correctly in this test, because I dropped those messages. I suspect that to be the case again, but I'd like to add those messages back and run once more to be certain. Would you mind pulling those changes and running once more? Thanks!
With this next test, the goal is to look for messages that verify that the timestamps are being set correctly and then must somehow be getting changed again behind the scenes. I'm back to wondering if this is due to outstanding writes that are buffered on the server and then getting applied after the timestamps have been set. Those "late writes" would change the mtime to the time the last write is applied. dsync has an option that may be relevant here.
For this test, you don't need to send the full output to me. You can grep for "DIFF" in the second run to find any files that were re-copied. Even one file should be sufficient. Then also grep for the source and destination path of that file in both the first and second run output. We should only need those lines. Assuming that behaves as expected, I'd like to run one more test where we modify dsync to call fsync() before setting the timestamps. Thanks again for your help with this, @dvickernasa.
@adammoody, just following along at home, and wanted to note that once the data has been flushed from the clients to the servers, Lustre will generally write it to storage in order, so a later "sync" will ensure all earlier writes are also committed to storage. The "guaranteed" way to ensure all of the data is flushed is to call the appropriate sync function on each file. Since these APIs could trigger sync writes for every file accessed (if they have dirty data), it would be best to call it on a file only when all of the writers are done, just before the timestamp is reset. That should avoid too much overhead.
Thanks, @adilger. I was hoping you might still be following this thread. The global barrier should ensure that all ranks have completed their writes by that point. It sounds like I'll need the per-file sync call. For a large file that gets split across clients, I presume a single call to it is sufficient?
@adammoody, the latest test completed. It didn't re-transfer files on the 2nd dsync this time, so I'm not sure how useful the results are, but I'll send them to you anyway. I'm pulling the latest changes and repeating the test now.
Thanks, @dvickernasa. Right, unfortunately it looks like this latest run didn't trigger the problem. Hopefully, the next run will catch it. Thanks as always for your help. When you run using that latest branch, it will be quite verbose again as it prints a message whenever getting/setting timestamps.
Sorry for the delay. I was at SC23 all last week and didn't get back to this. Looks like my latest transfer with the updated code caught one file.
@adammoody, I'll email you the latest output.
Thanks again, @dvickernasa. No problem, and I hope you had a good time at SC. Yes, this set of runs captured good evidence of what we were looking for. Copying bits from your email, here is the note about the second run that finds a difference in mtime:
The source file's mtime does not match the destination's. However, for this file, I see that the output from the first run reports that it had set the mtime to the correct value.
Let me update the branch to enable the fsync calls.
@dvickernasa, I've pushed a commit to the debugtimestamps branch. When you get a chance, would you please pull this change, rebuild, and try another test? This will add a lot of fsync calls, which may slow things down. Given that this looks to be a race condition, the timing change alone might hide the problem. It'll be hard to trust that it's really gone without running more tests or running larger. On the other hand, if we still see the problem show up with the fsync calls in place, that will tell us something as well. I think what we're seeing is that some outstanding writes are being buffered by a server and applied after the timestamp. I'm looking for different ways to force those writes to finish.
I'm re-running my test with the updated code. It should be finished in the morning. Since I've transferred this same set of files a few times now, we should get a good idea of whether the extra fsyncs slow things down much.
@adilger, assuming this is what is happening, I see the following in the notes section of the relevant man page.
Do you know how Lustre treats this? From your earlier comments, it sounds like it may be more along the lines of the POSIX behavior.
@adammoody, I don't think that's the case.
The updated test ran this morning. The 2nd transfer did not re-transfer anything, so that's encouraging. I'll be out the rest of the week for Thanksgiving, but I'm going to put this in a loop and run it several times while I'm gone, so we'll have some more data next week. Also, here is some data on how the extra syncs affected the transfer times. There is a lot of variation in the runs, but it would appear that there isn't a huge effect. Alternatively, it might be nice to add some (potentially just debug) timing statements to directly calculate the overhead of calling the sync.
That's great. Thanks, @dvickernasa. Have a Happy Thanksgiving!
Hi @dvickernasa, hope you had a good Thanksgiving. Is the destination file system built with OSTs backed by ZFS, and if so, is "options osd_zfs osd_object_sync_delay_us=<some_positive_number>" set on those servers? If so, I might have an explanation for what we're seeing (although I'm not sure). Thanks!
@adammoody, I got 6 more iterations on this done over the break. It looks like all of them found more data to copy on the 2nd transfer.
I'll send you some more details via email. @ofaaland, for the testing I've been describing in the last several posts (and most of the other examples in this issue), the source LFS does have ZFS OSTs. The destination LFS is ldiskfs. Even so, we are setting the parameter you mentioned to zero.
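Presumably along the lines of the module option @ofaaland quoted, set to zero:

```
# e.g. in a modprobe.d config on the ZFS OSS nodes (exact file/location assumed)
options osd_zfs osd_object_sync_delay_us=0
```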
Hi @dvickernasa, since the destination LFS is ldiskfs, that rules out my theory. Back to the drawing board for us. Thank you.
Thanks for that additional test output, @dvickernasa. This shows that the problem is still there even if we sync each file. I've pushed a couple more commits to the branch that add further calls to flush the data. When you have a chance, would you please pull, rebuild, and submit a new round of tests?
I've started my test with the updated code.
@adammoody, we talked a while ago about adding the change discussed earlier in this thread.
Oh, right. That slipped my mind. I'll work on that.
The latest test finished (sorry for the delay). This run found 9 files to re-transfer.
Thanks, @dvickernasa. Wow, this problem is stubborn. So it would seem the extra syncs are not sufficient. Based on the data from the latest run that you emailed to me separately, I don't see any of the new messages I would have expected. Even more puzzling is that several of the files that have incorrect mtimes are of size=0; there should not be any buffered write data to flush for a zero-length file. BTW, I am working to have the change we discussed added.
Hello,
I recently discovered mpifileutils and they are awesome. I moved about 200 TB in 4 million files between two different Lustre filesystems over the weekend. Thank you for your work on these tools.
After doing an initial data move, my intent was to keep the data in sync for a period of time using rsync. With the bulk of the data transferred, this should go fast, but it would also catch any file types that dsync doesn't handle (hard links, etc.). Here is the script I'm using.
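A minimal stand-in for such a script, assuming placeholder paths, might look like:

```
#!/bin/bash
# keep the destination in sync: -a preserves attributes, -H preserves hard links,
# -S handles sparse files; -v for a log of what changed
rsync -aHSv /path/to/source/ /path/to/destination/
```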
The problem I'm running into is that rsync is copying files that dsync already copied. Here is one example of a file that rsync is trying to transfer again.
Source:
Destination:
The file size is correct, but it looks like the date on the destination is wrong - this is the date that the file got created. After poking around in the destination more, it looks like this is prevalent - dates were not preserved. This will trip up rsync's default algorithm of checking by date and size. I could switch to rsync's checksum mode to get around this, but that will be quite a bit more time consuming for this dataset.
Is this a bug with dsync, or am I possibly doing something wrong? I'm using version 0.11.1 of mpifileutils.