-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add read_single_spectrum function #2052
Conversation
coming up for air after EDR... I really like the spirit of this PR and it would be a welcome new feature. However, I'm concerned about the maintenance fragility of the amount of replicated code between the previous def read_spectra(infile, single=False, targetids=None):
...
if targetids is not None:
targetids = np.atleast_1d(targetids)
file_targetids = hdus['FIBERMAP'].read(columns='TARGETID')
rows = np.where(np.isin(file_targetids, targetids))[0]
else:
rows = None
...
flux = hdus['FLUX_B'].read(rows=rows) # for example; actual code has to deal with B/R/Z
... Skipping various HDUs could also be implemented as an opt-out argument to Do you think that would work? I think we should allow the requested targetids to be a superset of what is actually in the file, which is convenient for looping over files and stacking the results (i.e. I have N>>1 TARGETIDs that I know are in these M>>1 files, but I don't need to track the details of which ones are in which files). A detail to consider is what should happen if none of the requested targetid are in the file. Most consistent would be to return a Spectrum object with 0 rows, but that could be a bit surprising if you are asking for one target in one file and it isn't there but you still get something back. |
For me, this function doesn't work unless I modify line 526 such that
targetrow is defined to be the [0][0] element of the rhs of the equation.
ML
…On Wed, Jun 21, 2023 at 2:59 PM Stephen Bailey ***@***.***> wrote:
coming up for air after EDR...
I really like the spirit of this PR and it would be a welcome new feature.
However, I'm concerned about the maintenance fragility of the amount of
replicated code between the previous read_spectra and the new
read_single_spectrum. Let's try to implement this as additional args to
read_spectra itself while also adding support to read more than one
TARGETID but not all of them. i.e. something like:
def read_spectra(infile, single=False, targetids=None):
...
if targetids is not None:
targetids = np.atleast_1d(targetids)
file_targetids = hdus['FIBERMAP'].read(columns='TARGETID')
rows = np.where(np.isin(file_targetids, targetids))[0]
else:
rows = None
...
flux = hdus['FLUX_B'].read(rows=rows) # for example; actual code has to deal with B/R/Z
...
Skipping various HDUs could also be implemented as an opt-out argument to
read_spectra rather than an opt-in argument to a different function.
Do you think that would work?
I think we should allow the requested targetids to be a superset of what
is actually in the file, which is convenient for looping over files and
stacking the results (i.e. I have N>>1 TARGETIDs that I know are in these
M>>1 files, but I don't need to track the details of which ones are in
which files). A detail to consider is what should happen if none of the
requested targetid are in the file. Most consistent would be to return a
Spectrum object with 0 rows, but that could be a bit surprising if you are
asking for one target in one file and it isn't there but you still get
something back.
—
Reply to this email directly, view it on GitHub
<#2052 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AEWOPQXWXDNCOP4MWTDQVHLXMNVEBANCNFSM6AAAAAAYNN4CFY>
.
You are receiving this because you are subscribed to this thread.Message
ID: ***@***.***>
--
Martin Landriau
Lawrence Berkeley National Laboratory
1 Cyclotron Road
Berkeley, CA 94720
USA
Kitt Peak National Observatory
AZ-386
Tucson, AZ 85364
USA
|
@sbailey implementing it as extra arguments into the I only allowed reading a single spectrum for the specific reason that the If we can fix the issue above then allowing a superset of TARGETIDs and only retrieving the spectra in the file might be an option. @mlandriau Thanks, I'll fix that! |
@dirkscholte thanks for the explanation; I hadn't realized that fitsio only supports slices but not non-contiguous lists of row indices for images (it supports both for tables). The best workaround I've found is: fp = fitsio.FITS(filename)
b_flux = np.vstack([fp['B_FLUX'][i,:] for i in rows]) I haven't rigorously tested the timing, but it looks like this is equal or faster than reading the full HDU if you are reading up to about 20% of the total data. After that it is faster to read the full HDU then sub-select, e.g. fp = fitsio.FITS(filename)
b_flux = fp['B_FLUX'].read()[rows] Let's aim for a solution that
|
@sbailey I need to do some testing of the read time for different amounts of spectra but I think this solution might come closer to what you are looking for. This should do the following:
I implemented selecting a subset from the ImageHDUs differently. I decided to use I just spotted need to update the docstring a bit still. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Functionally, this works great, thanks. I put in some stylistic comments for consideration about what we want to user-facing API to be, e.g. the output order of the targets if the input list is in a different order than what appears in the file, and whether you have to always opt-in + opt-out of every HDU, or only out-out the ones you don't want.
If you are on vacation or otherwise too busy with other tasks, I'm willing to implement these changes myself, but I wanted to give an opportunity for you to at least comment on them in case you disagree. Thoughts?
@sbailey I agree with all these comments to improve the usability! If you're willing to take over that'd be amazing! Otherwise, I will get to it but it might take a little while until I get round to it. |
Thanks @dirkscholte @araichoor @moustakas . I implemented my review suggestions, following some guidance/warnings from @araichoor and @moustakas . I added a bunch of unit tests to check the various cases of filtering by targetid (including missing or duplicated), skipping HDUs, and selecting subsets of columns. High level "proof is in the pudding": from desispec.io import read_spectra
filename = 'spectra-0-100-thru20210505.fits.gz'
spectra = read_spectra(filename)
targetids = np.unique(np.array(spectra.fibermap['TARGETID'])[0:3])
subset = read_spectra(
filename,
targetids=targetids,
skip_hdus=['RESOLUTION',],
select_columns=dict(FIBERMAP=['TARGETID', 'FIBER'])
)
print(f"Requested targetids={targetids}")
print(f"targetids in subset={np.array(subset.fibermap['TARGETID'])}")
print("repeated targetids are good, because input had 2x entries per targetid")
print(f"Resolution skipped:", subset.R is None)
print(f"fibermap columns {subset.fibermap.colnames}") Results:
That being said, the performance of reading just a few targets from a large compressed file is terrible, so I'm going to poke at that a bit more and compare to Dirk's original version to see if I broke something... |
Although astropy provides the ability to read subsets of image rows via
so I switched the code to use only fitsio for reading spectra. Comparing the readtime (seconds) with current main vs. this PR for reading spectra and coadd files:
i.e. for reading the full file it is about the same as current main, reading one target is slightly faster on a compressed spectra file, and significantly faster on an uncompressed coadd file. |
and one more feature: I added a I think this is ready to merge, but I'll let it sit for a little bit longer in case others want to look. Thanks for @dirkscholte for initiating this PR with the subselect-while-reading feature that multiple people have wanted (myself included!) |
@sbailey Thanks for the time you spent making this as nice as it is now!! ☀️ |
A function to read a single spectrum from a fits file instead of reading in all the spectra in the file. As requested by @sbailey .
This function also adds the option to either read in or not read in the
"FIBERMAP", "EXP_FIBERMAP", "SCORES", "EXTRA_CATALOG", "MASK", "RESOLUTION"
HDUs.I have currently not included the option to read in a subset of spectra from each file. I did this because this is incompatible with another option I would rather include: reading a subset of columns from for example the fibermap. This makes it possible to minimise the amount of data read in to the bare minimum needed for any purpose.
This should minimise the amount of data read from the disk and reduce the read time when reading many spectra from separate files.