-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
My initial attempt to decompress bdz-file and pass it on #367
base: master
Are you sure you want to change the base?
Conversation
# Trying to decompress bdz-file. If it works, saves a temporary bdo-file which will be used for reading later | ||
# zip_name is the addition to the file name for the new bdo-file, compared to the original: nothing in case | ||
# original is already bdo, and "_TEMP.bdo" in case original was a bdz. | ||
with open(self.filename) as f: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should tell python that it is a binary stream opened for reading. Try open(filename, "rb")
.
utf-8 is an encoding used for text file. I guess it will be fixed if you open the bdz file as a binary stream instead. See my comment in the code.
Personally, I think I would prefer to pass the unzipped file object around instead of writing the unzipped file to disk.
However: However, pymchelper is reading these files to merge sequentially, not multithreaded (please correct me if I am wrong), so therefore I think it would be best to minimize disk IO. But Leszek can probably give better advice. |
Thanks Niels, that´s very good advice! |
I have spent a few more hours on this now, though have not yet come very far. I can read the file now when specifying that it is binary. I am thinking how to pass the resulting object on further to be handled properly by the other functions. There is not so much help on Stackoverflow for this method either. And maybe it cannot even be used at all like I have in mind. |
# magic number was introduced together with first token-based BDO file format (BDO2016) | ||
# presence of magic number means we could have BDO2016 or BDO2019 format | ||
if file_has_sh_magic_number(self.filename): | ||
if file_has_sh_magic_number(bdo): | ||
reader = SHReaderBDO2019 | ||
|
||
# format tag specifying binary standard was introduced in SH12A v0.7.4-dev on 07.06.2019 (commit 6eddf98) | ||
file_format = read_token(self.filename, SHBDOTagID.format) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you need to change the read_token() function to accept the bdo object instead of self.filename...
and basically you have to do this with every function which used to touch the original file. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that is (one of the few parts :), which is) clear to me. But I thought this would have other effects further down in the code where it expects the filename, and not the object, for example in the np.fromfile method. But I´ll try some more later, maybe it is more easy than I think
I tried some more now, but could not get it to work. I do not understand how to get the np.fromfile method to work properly with the bdo-object being passed. Maybe there is some easy way that I am missing. |
@fkalholm thanks to your attempt I've realized that in fact the issue is a bit more complicated than I thought at the beginning. First of all I agree with Niels that saving temporary unpacked file is not a good option. You have correctly figured out that there are two methods in numpy which are helpful in reading binary data:
All parsers of SHIELD-HIT12A formats were implemented using the file pointer logic. After reading a token from file, the pointer moves forward, see i.e. I would like still to keep this functionality which avoids loading full unpacked files into memory. My proposed solution is following. Let us create an abstraction layer (a function or class) which holds one of two objects:
Such abstraction should be used in all places where currently we have calls to |
1f5e842
to
1e9911d
Compare
I have spent some time on this (not resulted in much code). My idea to fix the support the bdz files is to:
In my submitted code, I have only tried to implement 1-2 (though not yet changed how the string with the filename is passed on)
The problems I have:
I cannot decompress the bdz-file using zlib.decompress method. It gives me the error:
"UnicodeDecodeError: 'utf-8' codec can't decode byte 0xab in position 2: invalid start byte"
As the file type is quite unique, I did not find much googling for the solution
Before I go further trying to implement it in detail, do you think this approach in general is good? It may be "un-clean" to pass around the string. (Please be direct and honest)