Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

My initial attempt to decompress bdz-file and pass it on #367

Draft
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

fkalholm
Copy link
Contributor

I have spent some time on this (not resulted in much code). My idea to fix the support the bdz files is to:

  1. Try to decompress the current file using zlib.decompress. If it works, save a temporary bdo-file
  2. Create a string variable (called zip_name here), which is passed on to the next functions along with self.filename. This is empty if original file is bdo (so name does not change), otherwise the "_TEMP.bdo" is added so it instead points to the saved temporary bdo-file.
  3. After all is done, delete the temporary bdo-file

In my submitted code, I have only tried to implement 1-2 (though not yet changed how the string with the filename is passed on)

The problems I have:

  1. I cannot decompress the bdz-file using zlib.decompress method. It gives me the error:
    "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xab in position 2: invalid start byte"
    As the file type is quite unique, I did not find much googling for the solution

  2. Before I go further trying to implement it in detail, do you think this approach in general is good? It may be "un-clean" to pass around the string. (Please be direct and honest)

@fkalholm fkalholm requested review from grzanka and nbassler June 23, 2020 09:29
# Trying to decompress bdz-file. If it works, saves a temporary bdo-file which will be used for reading later
# zip_name is the addition to the file name for the new bdo-file, compared to the original: nothing in case
# original is already bdo, and "_TEMP.bdo" in case original was a bdz.
with open(self.filename) as f:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should tell python that it is a binary stream opened for reading. Try open(filename, "rb").

@nbassler
Copy link
Member

  1. I cannot decompress the bdz-file using zlib.decompress method. It gives me the error:
    "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xab in position 2: invalid start byte"

utf-8 is an encoding used for text file. I guess it will be fixed if you open the bdz file as a binary stream instead. See my comment in the code.
https://docs.python.org/3/library/functions.html?highlight=open#open

As the file type is quite unique, I did not find much googling for the solution
2. Before I go further trying to implement it in detail, do you think this approach in general is good? It may be "un-clean" to pass around the string. (Please be direct and honest)

Personally, I think I would prefer to pass the unzipped file object around instead of writing the unzipped file to disk.
Reasons:

  • bdz was introduced in order to avoid having large files on disk. However now you create large files again.
  • disk IO is much slower than memory IO
  • how does this scale if you have 10000 bdz files you wish to merge?

However:
SHIELD-HIT12A does what you do here (but inverse). it makes a temporary bdo file which is then read, compressed and written as bdz.
The reason for this was to keep the memory foot-print lean on the cluster. SH12A is supposed to run on many nodes, and in this situation there is not much RAM available per thread. Since writing to disk is a relatively rare event, the loss in speed easily tolerated.

However, pymchelper is reading these files to merge sequentially, not multithreaded (please correct me if I am wrong), so therefore I think it would be best to minimize disk IO.

But Leszek can probably give better advice.

@fkalholm
Copy link
Contributor Author

Thanks Niels, that´s very good advice!

@fkalholm
Copy link
Contributor Author

I have spent a few more hours on this now, though have not yet come very far. I can read the file now when specifying that it is binary. I am thinking how to pass the resulting object on further to be handled properly by the other functions.
If the object is passed rather than the file name, the np.fromfile method needs to be replaced with something else. I tried looking for an analogous method which reads from an object rather than a file. The np.frombuffer method might be a choice, but I have not got it to work.
When I try to use it and replace the file handle with the object, I get
ValueError: buffer is smaller than requested size

There is not so much help on Stackoverflow for this method either. And maybe it cannot even be used at all like I have in mind.
I am probably learning from this, but I don´t feel I work efficiently since I get stuck for long on probably quite simple steps many times. Maybe that is how it is for everyone in the beginning, or maybe I should take some more Python classes

# magic number was introduced together with first token-based BDO file format (BDO2016)
# presence of magic number means we could have BDO2016 or BDO2019 format
if file_has_sh_magic_number(self.filename):
if file_has_sh_magic_number(bdo):
reader = SHReaderBDO2019

# format tag specifying binary standard was introduced in SH12A v0.7.4-dev on 07.06.2019 (commit 6eddf98)
file_format = read_token(self.filename, SHBDOTagID.format)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need to change the read_token() function to accept the bdo object instead of self.filename...

and basically you have to do this with every function which used to touch the original file. :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is (one of the few parts :), which is) clear to me. But I thought this would have other effects further down in the code where it expects the filename, and not the object, for example in the np.fromfile method. But I´ll try some more later, maybe it is more easy than I think

@fkalholm
Copy link
Contributor Author

I tried some more now, but could not get it to work. I do not understand how to get the np.fromfile method to work properly with the bdo-object being passed. Maybe there is some easy way that I am missing.

@grzanka
Copy link
Contributor

grzanka commented Jun 26, 2020

@fkalholm thanks to your attempt I've realized that in fact the issue is a bit more complicated than I thought at the beginning.

First of all I agree with Niels that saving temporary unpacked file is not a good option.

You have correctly figured out that there are two methods in numpy which are helpful in reading binary data:

  • np.fromfile - accepts open (!) file object of path to the file. Note that file object has a inside a pointer which moves forward as you read the file. Every call to np.fromfile is moving the pointer forward by so many bytes as the function has read. This means that call to np.fromfile is modifying file object (but not file content, just the pointer).
  • np.frombuffer - accepts a buffer object (which is like an array). Note that buffers in python do not have such a pointer as file objects. Every subsequent call to np.frombuffer is not changing the buffer.

All parsers of SHIELD-HIT12A formats were implemented using the file pointer logic. After reading a token from file, the pointer moves forward, see i.e. read_next_token method:
https://github.com/DataMedSci/pymchelper/blob/v1.1.0/pymchelper/readers/shieldhit/reader_base.py#L155
This solution is nice as we are avoid loading full file content into memory.

I would like still to keep this functionality which avoids loading full unpacked files into memory.

My proposed solution is following. Let us create an abstraction layer (a function or class) which holds one of two objects:

  • buffer with contents of unzipped file (a lot of RAM may be needed here as this is a full file loaded into memory) and our own pointer containing location what are we reading at the moment
  • open file object in case we have uncompressed BDO file
    This abstraction should also have a method of reading, with similar arguments as np.fromfile and it should do one of two things:
  • call np.frombuffer on a respective part of file in case we hold a buffer
  • call np.fromfile on a respective part of file in case we hold an open file object

Such abstraction should be used in all places where currently we have calls to np.fromfile

@grzanka grzanka force-pushed the fix/365-read_bdz_files branch from 1f5e842 to 1e9911d Compare March 26, 2021 20:33
@reviewpad reviewpad bot mentioned this pull request Mar 28, 2023
9 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants