Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data loss occured while copying data between Azure File shares using ServiceSideAsyncCopy #260

Open
pawelpabich opened this issue Mar 12, 2021 · 5 comments

Comments

@pawelpabich
Copy link
Contributor

pawelpabich commented Mar 12, 2021

Which service(blob, file) does this issue concern?

Azure Files

Which version of the SDK was used?

2.0.1

On which platform were you using? (.Net Framework version or .Net Core version, and OS version)

Linux (AKS)

How can the problem be reproduced? It'd be better if the code caused the problem can be shared.

We transfered files between 2 file shares located on 2 different Azure Storage accounts in West US 2.

What problem was encountered?

Data loss. Some files were created with length 0.

Have you found a mitigation/solution?

We run custom code that compares the content of the source and destination shares. The code compares the size of the files in both locations(unfortunately MD5 property is not populated), deletes the ones that failed to transfer, and retries the copy operation. AssertNoTransferFailures method contains this code.

Code

 public class TransferManagerContext
    {
        public static async Task CopyDirectory(ILogger logger, CloudFileDirectory sourceFileDirectory, CloudFileDirectory destFileDirectory)
        {
            ShareStats total = null;
            if (!sourceFileDirectory.Share.IsSnapshot)
            {
                total = await GetTotalSize(sourceFileDirectory);
            }

            var totalString = total != null ? $"{total.UsageInBytes.Bytes().ToString("0.00")}" : string.Empty;

            TransferManager.Configurations.ParallelOperations = 128;
            TransferManager.Configurations.MaxListingConcurrency = 16;

            var lastNumberOfFilesTransferred = 0L;
            var lastLogged = DateTime.UtcNow;
            var start = DateTime.UtcNow;

            var context = new DirectoryTransferContext
            {
                ProgressHandler = new Progress<TransferStatus>(progress =>
                {
                    var now = DateTime.UtcNow;

                    if (progress.NumberOfFilesTransferred - lastNumberOfFilesTransferred <= 500 && now - lastLogged < TimeSpan.FromMinutes(1))
                    {
                        return;
                    }

                    lastNumberOfFilesTransferred = progress.NumberOfFilesTransferred;
                    lastLogged = now;

                    var bytesPerSecond = progress.BytesTransferred / (now - start).TotalSeconds;
                    logger.Information($"Transferred {progress.BytesTransferred.Bytes().ToString("0.00")} / {totalString} at avg {bytesPerSecond.Bytes().ToString("0.00")} / sec. {progress.NumberOfFilesTransferred} files transferred so far.");
                }),
            };

            // Does not copy empty folders
            var transferStatus = await TransferManager.CopyDirectoryAsync(sourceFileDirectory, destFileDirectory, CopyMethod.ServiceSideAsyncCopy, new CopyDirectoryOptions { Recursive = true }, context, CancellationToken.None);
            
            await AssertNoTransferFailures(sourceFileDirectory, destFileDirectory, logger);
            if (transferStatus.NumberOfFilesFailed > 0) throw new Exception($"{transferStatus.NumberOfFilesFailed} failed to transfer.");

            logger.Information("Copy complete. It took {ShareCopyTotalTime} to transfer {ShareSize} GB", DateTime.UtcNow - start, total?.UsageInBytes.Bytes().Gigabytes.ToString("0.00"));
        }

        static async Task AssertNoTransferFailures(CloudFileDirectory sourceFileDirectory, CloudFileDirectory destFileDirectory, ILogger logger)
        {
            var filesToVerifyInSource = await GetFilesToVerify(sourceFileDirectory, logger);
            var filesToVerifyInDestination = await GetFilesToVerify(destFileDirectory, logger);
            
            logger.Information("There are {FilesInSource} files in the source and {FilesInDestination} files in the destination.", filesToVerifyInSource.Length, filesToVerifyInSource.Length);

            var failures = filesToVerifyInSource
                .Union(filesToVerifyInDestination)
                .GroupBy(ftv => ftv.Uri.AbsolutePath)
                .Select(ftv =>
                {
                    var files = ftv.ToArray();

                    if (files.Length < 1 || files.Length > 2) throw new Exception($"{ftv.Key} is present unexpected number of times ({files.Length}) in the dataset.");

                    var first = files[0];
                    if (files.Length == 1) return $"{ftv.Key} exists only in {first.Uri}";

                    var second = files[1];
                    if (files.Length == 2 && first.Length != second.Length)
                    {
                        logger.Warning("Deleting {PathToFile} because it failed to transfer correctly. Its size is {ActualSize} instead of expected {ExpectedSize}", second.Uri, second.Length, first.Length);
                        new CloudFile(second.Uri, destFileDirectory.ServiceClient.Credentials).Delete();
                        return $"{ftv.Key} has {first.Length} size in {first.Uri} and {second.Length} size in {second.Uri}";
                    }

                    return null;
                })
                .Where(failure => failure != null)
                .ToArray();

            if (failures.Any())
            {
                logger.Error("Detected {FailureCount} transfer failures Details: {Failures}", failures.Length, failures);
                throw new Exception($"Detected {failures.Length} transfer failures Details: {string.Join(" , ", failures)}");
            }
        }

        static async Task<FileToVerify[]> GetFilesToVerify(CloudFileDirectory parent, ILogger logger)
        {
            logger.Information("Getting list of files from {Path} for the verification process.", parent.Uri.AbsoluteUri);
            
            var children = parent.ListFilesAndDirectories().ToArray();
            var files = children.OfType<CloudFile>().ToArray();
            var directories = children.OfType<CloudFileDirectory>().ToArray();

            Task.WaitAll(files.Select(f => f.FetchAttributesAsync()).ToArray());
            
            // Unfortunately, CloudFile.Properties.ContentMD5 is not calculated by Azure Files and return null here.  
            var filesToVerify = files.Select(f => new FileToVerify(f.Uri, f.Properties.Length));
            var filesToVerifyInDirectories = directories.Select(dir => GetFilesToVerify(dir, logger)).ToArray();

            Task.WaitAll(filesToVerifyInDirectories);

            return filesToVerify.Union(filesToVerifyInDirectories.Select(p => p.Result).SelectMany(paths => paths)).ToArray();
        }

        static async Task<ShareStats> GetTotalSize(IListFileItem root)
        {
            var total = await root.Share.GetStatsAsync();

            return total;
        }
    }

    public class FileToVerify
    {
        public Uri Uri { get; }
        public long Length { get; }
        public FileToVerify(Uri uri, long length)
        {
            Uri = uri;
            Length = length;
        }
    }

File Shares

image

@amnguye
Copy link
Member

amnguye commented Mar 24, 2021

I'm looking to reproducing your error currently. Does this consistently occur when using 2.0.1 and not the previous version (I saw your other issue where you were asking if another File Share problem was fixed)

@amnguye
Copy link
Member

amnguye commented Mar 24, 2021

Also another question I have is how large is the file share you are attempting to copy (e.g. how many files, directories and avg size of those files). Does this only happen with a large file share or does can it be replicated with a smaller file. Are the files that aren't copied random (aka copied with 0 length) or does is consistently do it to the same files.

Looks like from your code snippet, that the TransferStatus is getting returned by the CopyDirectory method, and that it is returning the number of files that were failed to transfer. However you're doing this check after your personal validation, so my question is, is the TransferStatus returning back the number of failed files for you? Because if anything this check should be before your validation because DMLib would have informed you that there were some files that it was unable to copy.

@pawelpabich
Copy link
Contributor Author

pawelpabich commented Apr 6, 2021

Sorry, for the delay. I somehow missed your message.

I'm looking to reproducing your error currently. Does this consistently occur when using 2.0.1 and not the previous version (I saw your other issue where you were asking if another File Share problem was fixed)

We've seen this problem occur also with 2.0.0. I'm not 100% but I think we also found one case of corruption with a version < 2.0.0

@pawelpabich
Copy link
Contributor Author

Also another question I have is how large is the file share you are attempting to copy (e.g. how many files, directories and avg size of those files). Does this only happen with a large file share or does can it be replicated with a smaller file. Are the files that aren't copied random (aka copied with 0 length) or does is consistently do it to the same files.

The shares have from hundreds to a couple of thousands of files. The size of the files varies from 1k to a couple of megs.

We run the move tasks as part of our E2E Test and this problem occurs daily. We run around 10 builds a day so not every build suffers from it.

image

Looks like from your code snippet, that the TransferStatus is getting returned by the CopyDirectory method, and that it is returning the number of files that were failed to transfer. However you're doing this check after your personal validation, so my question is, is the TransferStatus returning back the number of failed files for you? Because if anything this check should be before your validation because DMLib would have informed you that there were some files that it was unable to copy.

TransferStatus.NumberOfFilesFailed returned 0 in every case when we detected corruption. We've moved it right to the end because a) it doesn't report errors when we expect it b) we can't really recover based on 1 number.

@pawelpabich
Copy link
Contributor Author

@amnguye

Any movement on this front? Can I help in any way?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants