Koza supports reading data from two archive formats: zip and tar. However, archives are handled differently depending on the setting in which they are listed.
If a zip archive is listed in a transform's files setting, then only the first file in the archive's file list is read. That read is streaming (i.e. the archive is not extracted to disk first).
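For illustration, the zip-in-files behavior described above can be sketched with the standard library alone. This is not Koza's actual code; the function name `stream_first_member` and the archive contents are hypothetical, and UTF-8 text is assumed.

```python
import io
import zipfile


def stream_first_member(zip_source):
    """Yield text lines from the first member of a zip archive, without extracting it to disk."""
    with zipfile.ZipFile(zip_source) as archive:
        first_name = archive.namelist()[0]  # members after the first are never looked at
        with archive.open(first_name) as raw:
            # TextIOWrapper decodes lazily, so data is decompressed as lines are requested
            for line in io.TextIOWrapper(raw, encoding="utf-8"):
                yield line.rstrip("\n")


# Build a small in-memory archive with two members (hypothetical contents)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("data.csv", "id,name\n1,a\n")
    archive.writestr("more.csv", "id,name\n2,b\n")
buffer.seek(0)

first_lines = list(stream_first_member(buffer))
```

Here `more.csv` never contributes a line, which is the "silently ignored" behavior the first issue describes.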
If a zip or tar archive is listed in a transform's file_archive setting, it is first extracted, and then every file in the archive is added to the files setting.
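A rough sketch of the extract-then-enumerate approach, again using only the standard library. The helper `extract_and_list` is a hypothetical name and the sketch covers zip only; Koza's real implementation may differ in detail.

```python
import io
import tempfile
import zipfile
from pathlib import Path


def extract_and_list(zip_source, target_dir):
    """Extract every member to target_dir and return the extracted file names."""
    with zipfile.ZipFile(zip_source) as archive:
        archive.extractall(target_dir)  # the whole archive is written to disk
    root = Path(target_dir)
    return sorted(str(p.relative_to(root)) for p in root.rglob("*") if p.is_file())


# In-memory archive holding a data file and an unrelated text file (hypothetical contents)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("data.csv", "id,name\n1,a\n")
    archive.writestr("README.txt", "Example data\n")
buffer.seek(0)

with tempfile.TemporaryDirectory() as tmp:
    extracted = extract_and_list(buffer, tmp)
```

Every extracted file ends up in the resulting list, regardless of its type.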
There are a couple of issues here:
1. In the zip-in-files case, files beyond the first one are silently ignored.
2. In the archive-in-file_archive case, extracting the archive wastes CPU time and disk space when its contents could be streamed instead.
3. Also in the archive-in-file_archive case, reading will fail if the archive contains files whose type differs from the declared format. For example, the following configuration will fail or (worse) silently read garbage data if data.zip contains the files data.csv and README.txt:
file_archive: data.zip
format: csv
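To see why the "silently read garbage" outcome is plausible, consider what a CSV parser does with free-form text. The sketch below uses Python's standard `csv` module as a stand-in for whatever reader Koza applies (an assumption), and the README contents are made up.

```python
import csv
import io

# Hypothetical README.txt contents, parsed as if the file were CSV
readme_text = "Koza example data\nSee data.csv for the records, columns: id, name\n"

# DictReader takes the first line as the header and raises no error
rows = list(csv.DictReader(io.StringIO(readme_text)))
```

The parser returns one "record" whose single column is named after the README's first line, with the surplus fields collected under the `None` key. Nothing signals that the input was never CSV.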
A possible solution would be to handle compression transparently. This should be easy: it would mirror the streaming behavior of the zip-in-files case and add similar behavior for tar files.
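As a minimal sketch of what similar behavior for tar files could look like, the standard `tarfile` module can read members in place without extracting them to disk. The function name `stream_tar_members` is hypothetical, UTF-8 text is assumed, and this is not a proposal for the exact Koza implementation.

```python
import io
import tarfile


def stream_tar_members(tar_source):
    """Yield (member_name, line) pairs for each regular file, without extracting to disk."""
    # mode "r:*" lets tarfile detect gzip/bz2/xz compression transparently
    with tarfile.open(fileobj=tar_source, mode="r:*") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links, and other special entries
            raw = archive.extractfile(member)
            for line in io.TextIOWrapper(raw, encoding="utf-8"):
                yield member.name, line.rstrip("\n")


# Build a gzip-compressed tar archive in memory (hypothetical contents)
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
    payload = b"id,name\n1,a\n"
    info = tarfile.TarInfo(name="data.csv")
    info.size = len(payload)
    archive.addfile(info, io.BytesIO(payload))
buffer.seek(0)

tar_lines = list(stream_tar_members(buffer))
```

Unlike the zip-in-files case, this sketch iterates over every member, so nothing is dropped silently.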
Dealing with the third issue above would only be possible by detecting the types of the files contained within the archive, probably through their file extensions. The easier option might be to document that, when reading from an archive, every file it contains is expected to be in the declared format.
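Extension-based detection could be sketched roughly as follows. The `FORMAT_EXTENSIONS` mapping and the function `members_matching_format` are hypothetical and only illustrate the idea; the set of formats and extensions Koza would need to recognize is not specified here.

```python
import io
import zipfile
from pathlib import PurePosixPath

# Hypothetical mapping from a declared format to the extensions accepted for it
FORMAT_EXTENSIONS = {
    "csv": {".csv"},
    "json": {".json"},
}


def members_matching_format(zip_source, declared_format):
    """Return the archive members whose extension matches the declared format."""
    allowed = FORMAT_EXTENSIONS[declared_format]
    with zipfile.ZipFile(zip_source) as archive:
        return [
            name
            for name in archive.namelist()
            if not name.endswith("/")  # skip directory entries
            and PurePosixPath(name).suffix.lower() in allowed
        ]


# Archive from the example above: one data file and one README (hypothetical contents)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("data.csv", "id,name\n1,a\n")
    archive.writestr("README.txt", "Example data\n")
buffer.seek(0)

csv_members = members_matching_format(buffer, "csv")
```

A filter like this would skip `README.txt`, at the cost of trusting file extensions, which is why documenting the expectation may still be the simpler route.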
Code referenced above: the streaming zip read is in koza/src/koza/io/utils.py (lines 62 to 65 in 8a3bab9), and the archive extraction is in koza/src/koza/model/config/source_config.py (lines 185 to 199 in 8a3bab9).