My initial attempt to decompress bdz-file and pass it on #367

fkalholm · 2020-06-23T09:28:52Z

I have spent some time on this (not resulted in much code). My idea to fix the support the bdz files is to:

Try to decompress the current file using zlib.decompress. If it works, save a temporary bdo-file
Create a string variable (called zip_name here), which is passed on to the next functions along with self.filename. This is empty if original file is bdo (so name does not change), otherwise the "_TEMP.bdo" is added so it instead points to the saved temporary bdo-file.
After all is done, delete the temporary bdo-file

In my submitted code, I have only tried to implement 1-2 (though not yet changed how the string with the filename is passed on)

The problems I have:

I cannot decompress the bdz-file using zlib.decompress method. It gives me the error:
"UnicodeDecodeError: 'utf-8' codec can't decode byte 0xab in position 2: invalid start byte"
As the file type is quite unique, I did not find much googling for the solution
Before I go further trying to implement it in detail, do you think this approach in general is good? It may be "un-clean" to pass around the string. (Please be direct and honest)

nbassler · 2020-06-23T22:46:47Z

pymchelper/readers/shieldhit/general.py

+        # Trying to decompress bdz-file. If it works, saves a temporary bdo-file which will be used for reading later
+        # zip_name is the addition to the file name for the new bdo-file, compared to the original: nothing in case
+        # original is already bdo, and "_TEMP.bdo" in case original was a bdz.
+        with open(self.filename) as f:


You should tell python that it is a binary stream opened for reading. Try open(filename, "rb").

nbassler · 2020-06-23T23:04:07Z

I cannot decompress the bdz-file using zlib.decompress method. It gives me the error:
"UnicodeDecodeError: 'utf-8' codec can't decode byte 0xab in position 2: invalid start byte"

utf-8 is an encoding used for text file. I guess it will be fixed if you open the bdz file as a binary stream instead. See my comment in the code.
https://docs.python.org/3/library/functions.html?highlight=open#open

As the file type is quite unique, I did not find much googling for the solution
2. Before I go further trying to implement it in detail, do you think this approach in general is good? It may be "un-clean" to pass around the string. (Please be direct and honest)

Personally, I think I would prefer to pass the unzipped file object around instead of writing the unzipped file to disk.
Reasons:

bdz was introduced in order to avoid having large files on disk. However now you create large files again.
disk IO is much slower than memory IO
how does this scale if you have 10000 bdz files you wish to merge?

However:
SHIELD-HIT12A does what you do here (but inverse). it makes a temporary bdo file which is then read, compressed and written as bdz.
The reason for this was to keep the memory foot-print lean on the cluster. SH12A is supposed to run on many nodes, and in this situation there is not much RAM available per thread. Since writing to disk is a relatively rare event, the loss in speed easily tolerated.

However, pymchelper is reading these files to merge sequentially, not multithreaded (please correct me if I am wrong), so therefore I think it would be best to minimize disk IO.

But Leszek can probably give better advice.

fkalholm · 2020-06-24T07:15:02Z

Thanks Niels, that´s very good advice!

fkalholm · 2020-06-24T08:45:13Z

I have spent a few more hours on this now, though have not yet come very far. I can read the file now when specifying that it is binary. I am thinking how to pass the resulting object on further to be handled properly by the other functions.
If the object is passed rather than the file name, the np.fromfile method needs to be replaced with something else. I tried looking for an analogous method which reads from an object rather than a file. The np.frombuffer method might be a choice, but I have not got it to work.
When I try to use it and replace the file handle with the object, I get
ValueError: buffer is smaller than requested size

There is not so much help on Stackoverflow for this method either. And maybe it cannot even be used at all like I have in mind.
I am probably learning from this, but I don´t feel I work efficiently since I get stuck for long on probably quite simple steps many times. Maybe that is how it is for everyone in the beginning, or maybe I should take some more Python classes

nbassler · 2020-06-24T14:15:03Z

pymchelper/readers/shieldhit/general.py

        # magic number was introduced together with first token-based BDO file format (BDO2016)
        # presence of magic number means we could have BDO2016 or BDO2019 format
-        if file_has_sh_magic_number(self.filename):
+        if file_has_sh_magic_number(bdo):
            reader = SHReaderBDO2019

            # format tag specifying binary standard was introduced in SH12A v0.7.4-dev on  07.06.2019 (commit 6eddf98)
            file_format = read_token(self.filename, SHBDOTagID.format)


you need to change the read_token() function to accept the bdo object instead of self.filename...

and basically you have to do this with every function which used to touch the original file. :)

Yes, that is (one of the few parts :), which is) clear to me. But I thought this would have other effects further down in the code where it expects the filename, and not the object, for example in the np.fromfile method. But I´ll try some more later, maybe it is more easy than I think

fkalholm · 2020-06-25T07:55:38Z

I tried some more now, but could not get it to work. I do not understand how to get the np.fromfile method to work properly with the bdo-object being passed. Maybe there is some easy way that I am missing.

grzanka · 2020-06-26T20:10:54Z

@fkalholm thanks to your attempt I've realized that in fact the issue is a bit more complicated than I thought at the beginning.

First of all I agree with Niels that saving temporary unpacked file is not a good option.

You have correctly figured out that there are two methods in numpy which are helpful in reading binary data:

np.fromfile - accepts open (!) file object of path to the file. Note that file object has a inside a pointer which moves forward as you read the file. Every call to np.fromfile is moving the pointer forward by so many bytes as the function has read. This means that call to np.fromfile is modifying file object (but not file content, just the pointer).
np.frombuffer - accepts a buffer object (which is like an array). Note that buffers in python do not have such a pointer as file objects. Every subsequent call to np.frombuffer is not changing the buffer.

All parsers of SHIELD-HIT12A formats were implemented using the file pointer logic. After reading a token from file, the pointer moves forward, see i.e. read_next_token method:
https://github.com/DataMedSci/pymchelper/blob/v1.1.0/pymchelper/readers/shieldhit/reader_base.py#L155
This solution is nice as we are avoid loading full file content into memory.

I would like still to keep this functionality which avoids loading full unpacked files into memory.

My proposed solution is following. Let us create an abstraction layer (a function or class) which holds one of two objects:

buffer with contents of unzipped file (a lot of RAM may be needed here as this is a full file loaded into memory) and our own pointer containing location what are we reading at the moment
open file object in case we have uncompressed BDO file
This abstraction should also have a method of reading, with similar arguments as np.fromfile and it should do one of two things:
call np.frombuffer on a respective part of file in case we hold a buffer
call np.fromfile on a respective part of file in case we hold an open file object

Such abstraction should be used in all places where currently we have calls to np.fromfile

fkalholm requested review from grzanka and nbassler June 23, 2020 09:29

nbassler reviewed Jun 23, 2020

View reviewed changes

nbassler reviewed Jun 24, 2020

View reviewed changes

fkalholm added 4 commits March 26, 2021 21:33

My initial attempt to decompress bdz-file and pass it on

2b4879e

changed file reading

66804cd

trying to pass object on

2863b56

passed the bdo object on to functions

1e9911d

grzanka force-pushed the fix/365-read_bdz_files branch from 1f5e842 to 1e9911d Compare March 26, 2021 20:33

reviewpad bot mentioned this pull request Mar 28, 2023

feat: reviewpad onboarding #616

Merged

9 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

My initial attempt to decompress bdz-file and pass it on #367

My initial attempt to decompress bdz-file and pass it on #367

fkalholm commented Jun 23, 2020

nbassler Jun 23, 2020

nbassler commented Jun 23, 2020

fkalholm commented Jun 24, 2020

fkalholm commented Jun 24, 2020

nbassler Jun 24, 2020

fkalholm Jun 24, 2020

fkalholm commented Jun 25, 2020

grzanka commented Jun 26, 2020

My initial attempt to decompress bdz-file and pass it on #367

Are you sure you want to change the base?

My initial attempt to decompress bdz-file and pass it on #367

Conversation

fkalholm commented Jun 23, 2020

nbassler Jun 23, 2020

Choose a reason for hiding this comment

nbassler commented Jun 23, 2020

fkalholm commented Jun 24, 2020

fkalholm commented Jun 24, 2020

nbassler Jun 24, 2020

Choose a reason for hiding this comment

fkalholm Jun 24, 2020

Choose a reason for hiding this comment

fkalholm commented Jun 25, 2020

grzanka commented Jun 26, 2020