Handling of archives (tar, zip) not well defined #160

ptgolden · 2024-11-08T14:28:44Z

Koza supports reading data from two archive formats: zip and tar. However, an archive is treated differently depending on which setting references it.

If a zip archive is listed in a transform's files setting, then only the first file in the archive's file list is read. That read is streamed (i.e. the archive is not extracted to disk first):

koza/src/koza/io/utils.py

Lines 62 to 65 in 8a3bab9

    
           if is_zipfile(resource): 
        
               with ZipFile(resource, 'r') as zip_file: 
        
                   file = TextIOWrapper(zip_file.open(zip_file.namelist()[0], 'r'))  # , encoding='utf-8') 
        
                   # file = zip_file.read(zip_file.namelist()[0], 'r').decode('utf-8')
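
To make the first-file-only behavior concrete, here is a minimal, self-contained sketch that mirrors the snippet above against an in-memory zip. The helper names `read_first_member` and `build_demo_zip` are made up for illustration and are not part of Koza.

```python
import io
from zipfile import ZipFile


def build_demo_zip() -> bytes:
    """Build an in-memory zip containing two small CSV members."""
    buffer = io.BytesIO()
    with ZipFile(buffer, "w") as zip_file:
        zip_file.writestr("a.csv", "id,name\n1,alpha\n")
        zip_file.writestr("b.csv", "id,name\n2,beta\n")
    return buffer.getvalue()


def read_first_member(zip_bytes: bytes) -> str:
    """Hypothetical helper: stream only the first listed member, as in the snippet above."""
    with ZipFile(io.BytesIO(zip_bytes), "r") as zip_file:
        first_name = zip_file.namelist()[0]
        # The member is decompressed on the fly; nothing is written to disk.
        with io.TextIOWrapper(zip_file.open(first_name, "r"), encoding="utf-8") as stream:
            return stream.read()


if __name__ == "__main__":
    # Only the rows from a.csv are printed; b.csv is never opened.
    print(read_first_member(build_demo_zip()))
```

Nothing in this code path warns that `b.csv` was skipped.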

If a zip or tar archive is listed in a transform's file_archive setting, it is first extracted to disk, and then every file in the archive is added to the files setting:

koza/src/koza/model/config/source_config.py

Lines 185 to 199 in 8a3bab9

    
           def extract_archive(self): 
        
               archive_path = Path(self.file_archive).parent  # .absolute() 
        
               if self.file_archive.endswith(".tar.gz") or self.file_archive.endswith(".tar"): 
        
                   with tarfile.open(self.file_archive) as archive: 
        
                       archive.extractall(archive_path) 
        
               elif self.file_archive.endswith(".zip"): 
        
                   with zipfile.ZipFile(self.file_archive, "r") as archive: 
        
                       archive.extractall(archive_path) 
        
               else: 
        
                   raise ValueError("Error extracting archive. Supported archive types: .tar.gz, .zip") 
        
               if self.files: 
        
                   files = [os.path.join(archive_path, file) for file in self.files] 
        
               else: 
        
                   files = [os.path.join(archive_path, file) for file in os.listdir(archive_path)] 
        
               return files

There are a few issues here.

1. In the zip-in-files case, files beyond the first one are silently ignored.
2. In the archive-in-file_archive case, it is a waste of CPU time and disk space to extract an archive when its contents could be streamed instead.
3. Also in the archive-in-file_archive case, reading will fail if the archive contains files in a format different from the one that was declared. For example, this configuration will fail or (worse) silently read garbage data if data.zip contains the files data.csv and README.txt:

       file_archive: data.zip
       format: csv

ptgolden · 2024-11-08T14:34:12Z

Related: #124

A possible solution might deal with compression transparently (easy! This would just mirror the behavior in the zip-in-files case and add similar behavior for tar files).
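
As a rough sketch of what transparent handling could look like, the generator below streams every regular file out of a zip or tar archive without extracting anything to disk. `iter_archive_members` is a hypothetical name, not an existing Koza function, and UTF-8 text content is assumed.

```python
import io
import tarfile
import zipfile
from typing import Iterator, TextIO, Tuple


def iter_archive_members(path: str) -> Iterator[Tuple[str, TextIO]]:
    """Yield (member name, text stream) for each regular file in a zip or tar archive.

    Hypothetical helper, not part of Koza. Each stream is only valid until the
    generator is advanced to the next member.
    """
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path, "r") as archive:
            for info in archive.infolist():
                if info.is_dir():
                    continue  # skip directory entries
                with io.TextIOWrapper(archive.open(info, "r"), encoding="utf-8") as stream:
                    yield info.filename, stream
    elif tarfile.is_tarfile(path):
        # Mode "r:*" lets tarfile detect gzip, bz2, or xz compression on its own.
        with tarfile.open(path, "r:*") as archive:
            for member in archive:
                if not member.isfile():
                    continue  # skip directories, links, and special files
                raw = archive.extractfile(member)
                if raw is None:
                    continue
                with io.TextIOWrapper(raw, encoding="utf-8") as stream:
                    yield member.name, stream
    else:
        raise ValueError(f"Unsupported archive type: {path}")
```

A caller would consume each stream inside the loop, for example `for name, stream in iter_archive_members("data.zip"): rows = stream.read()`. Detecting the archive type from its content rather than its extension is a design choice in this sketch; the existing `extract_archive` method dispatches on the file name instead.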

Dealing with the third issue above would only be possible by detecting the types of the files contained within the archive, probably through their file extensions. The easier thing to do might be to document that, when reading from an archive, every file in that archive is expected to be in the declared format.
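
For illustration only, extension-based detection could be as small as the filter below. The `FORMAT_EXTENSIONS` mapping and the `members_matching_format` helper are assumptions made up for this sketch; they do not reflect Koza's actual format handling.

```python
from pathlib import PurePosixPath
from typing import Dict, Iterable, List, Set

# Hypothetical mapping from a declared format to acceptable file extensions.
FORMAT_EXTENSIONS: Dict[str, Set[str]] = {
    "csv": {".csv", ".tsv"},
    "json": {".json"},
    "jsonl": {".jsonl"},
    "yaml": {".yaml", ".yml"},
}


def members_matching_format(names: Iterable[str], declared_format: str) -> List[str]:
    """Keep only archive member names whose extension fits the declared format."""
    allowed = FORMAT_EXTENSIONS[declared_format]
    # Archive member names use forward slashes, hence PurePosixPath.
    return [name for name in names if PurePosixPath(name).suffix.lower() in allowed]


if __name__ == "__main__":
    print(members_matching_format(["data.csv", "README.txt"], "csv"))
```

With the data.zip example from the issue, this keeps data.csv and drops README.txt. A filter like this would still miss files whose extension does not reflect their content, which is part of why documenting the expectation may be the simpler route.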
