add read_single_spectrum function #2052

dirkscholte · 2023-05-24T14:01:35Z

A function to read a single spectrum from a fits file instead of reading in all the spectra in the file. As requested by @sbailey .

This function also adds the option to either read in or not read in the "FIBERMAP", "EXP_FIBERMAP", "SCORES", "EXTRA_CATALOG", "MASK", "RESOLUTION" HDUs.

I have currently not included the option to read in a subset of spectra from each file. I did this because this is incompatible with another option I would rather include: reading a subset of columns from for example the fibermap. This makes it possible to minimise the amount of data read in to the bare minimum needed for any purpose.

This should minimise the amount of data read from the disk and reduce the read time when reading many spectra from separate files.

sbailey · 2023-06-21T21:59:14Z

coming up for air after EDR...

I really like the spirit of this PR and it would be a welcome new feature. However, I'm concerned about the maintenance fragility of the amount of replicated code between the previous read_spectra and the new read_single_spectrum. Let's try to implement this as additional args to read_spectra itself while also adding support to read more than one TARGETID but not all of them. i.e. something like:

def read_spectra(infile, single=False, targetids=None):
    ...
    if targetids is not None:
        targetids = np.atleast_1d(targetids)
        file_targetids = hdus['FIBERMAP'].read(columns='TARGETID')
        rows = np.where(np.isin(file_targetids, targetids))[0]
    else:
        rows = None

    ...
    flux = hdus['FLUX_B'].read(rows=rows)  # for example; actual code has to deal with B/R/Z
    ...

Skipping various HDUs could also be implemented as an opt-out argument to read_spectra rather than an opt-in argument to a different function.

Do you think that would work?

I think we should allow the requested targetids to be a superset of what is actually in the file, which is convenient for looping over files and stacking the results (i.e. I have N>>1 TARGETIDs that I know are in these M>>1 files, but I don't need to track the details of which ones are in which files). A detail to consider is what should happen if none of the requested targetid are in the file. Most consistent would be to return a Spectrum object with 0 rows, but that could be a bit surprising if you are asking for one target in one file and it isn't there but you still get something back.

mlandriau · 2023-06-27T05:04:48Z

For me, this function doesn't work unless I modify line 526 such that targetrow is defined to be the [0][0] element of the rhs of the equation. ML

…

On Wed, Jun 21, 2023 at 2:59 PM Stephen Bailey ***@***.***> wrote: coming up for air after EDR... I really like the spirit of this PR and it would be a welcome new feature. However, I'm concerned about the maintenance fragility of the amount of replicated code between the previous read_spectra and the new read_single_spectrum. Let's try to implement this as additional args to read_spectra itself while also adding support to read more than one TARGETID but not all of them. i.e. something like: def read_spectra(infile, single=False, targetids=None): ... if targetids is not None: targetids = np.atleast_1d(targetids) file_targetids = hdus['FIBERMAP'].read(columns='TARGETID') rows = np.where(np.isin(file_targetids, targetids))[0] else: rows = None ... flux = hdus['FLUX_B'].read(rows=rows) # for example; actual code has to deal with B/R/Z ... Skipping various HDUs could also be implemented as an opt-out argument to read_spectra rather than an opt-in argument to a different function. Do you think that would work? I think we should allow the requested targetids to be a superset of what is actually in the file, which is convenient for looping over files and stacking the results (i.e. I have N>>1 TARGETIDs that I know are in these M>>1 files, but I don't need to track the details of which ones are in which files). A detail to consider is what should happen if none of the requested targetid are in the file. Most consistent would be to return a Spectrum object with 0 rows, but that could be a bit surprising if you are asking for one target in one file and it isn't there but you still get something back. — Reply to this email directly, view it on GitHub <#2052 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AEWOPQXWXDNCOP4MWTDQVHLXMNVEBANCNFSM6AAAAAAYNN4CFY> . You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

-- Martin Landriau Lawrence Berkeley National Laboratory 1 Cyclotron Road Berkeley, CA 94720 USA Kitt Peak National Observatory AZ-386 Tucson, AZ 85364 USA

dirkscholte · 2023-06-30T08:34:41Z

@sbailey implementing it as extra arguments into the read_spectra function should be fine.

I only allowed reading a single spectrum for the specific reason that the fitsio module only allows me to select a slice of consecutive rows in ImageHDUs, not a list of rows single rows. astropy Tables do allow this functionality but they then don't allow a sub-selection of columns in TableHDUs.

If we can fix the issue above then allowing a superset of TARGETIDs and only retrieving the spectra in the file might be an option.

@mlandriau Thanks, I'll fix that!

sbailey · 2023-06-30T16:19:48Z

@dirkscholte thanks for the explanation; I hadn't realized that fitsio only supports slices but not non-contiguous lists of row indices for images (it supports both for tables). The best workaround I've found is:

fp = fitsio.FITS(filename)
b_flux = np.vstack([fp['B_FLUX'][i,:] for i in rows])

I haven't rigorously tested the timing, but it looks like this is equal or faster than reading the full HDU if you are reading up to about 20% of the total data. After that it is faster to read the full HDU then sub-select, e.g.

fp = fitsio.FITS(filename)
b_flux = fp['B_FLUX'].read()[rows]

Let's aim for a solution that

syntactically supports reading multiple targets in a single call
isn't dramatically slower than your solution when requesting a single target
isn't dramatically slower than current read_spectra for reading the entire file if no targets are specified

dirkscholte · 2023-07-04T08:05:03Z

@sbailey I need to do some testing of the read time for different amounts of spectra but I think this solution might come closer to what you are looking for.

This should do the following:

Read any sub-selection of targets from file
Allow a superset or TARGETIDs to be inputted and only returning the set in the file. If none --> empty Spectra object.
Skipping HDUs is now an opt out argument, there is another opt out argument for skipping columns in the TableHDUs, default all columns are returned.

I implemented selecting a subset from the ImageHDUs differently. I decided to use astropy.io.fits to read sections of those files instead of fitsio, whilst still using fitsio for the TableHDUs. I think this should allow both selecting a subset of rows in ImageHDUs and Columns in TableHDUs without compromising on speed by looping.

I just spotted need to update the docstring a bit still.

sbailey

Functionally, this works great, thanks. I put in some stylistic comments for consideration about what we want to user-facing API to be, e.g. the output order of the targets if the input list is in a different order than what appears in the file, and whether you have to always opt-in + opt-out of every HDU, or only out-out the ones you don't want.

If you are on vacation or otherwise too busy with other tasks, I'm willing to implement these changes myself, but I wanted to give an opportunity for you to at least comment on them in case you disagree. Thoughts?

py/desispec/io/spectra.py

dirkscholte · 2023-08-08T12:34:47Z

@sbailey I agree with all these comments to improve the usability! If you're willing to take over that'd be amazing! Otherwise, I will get to it but it might take a little while until I get round to it.

sbailey · 2023-08-10T23:08:10Z

Thanks @dirkscholte @araichoor @moustakas . I implemented my review suggestions, following some guidance/warnings from @araichoor and @moustakas . I added a bunch of unit tests to check the various cases of filtering by targetid (including missing or duplicated), skipping HDUs, and selecting subsets of columns.

High level "proof is in the pudding":

from desispec.io import read_spectra
filename = 'spectra-0-100-thru20210505.fits.gz'
spectra = read_spectra(filename)
targetids = np.unique(np.array(spectra.fibermap['TARGETID'])[0:3])

subset = read_spectra(
        filename,
        targetids=targetids,
        skip_hdus=['RESOLUTION',],
        select_columns=dict(FIBERMAP=['TARGETID', 'FIBER'])
)
print(f"Requested targetids={targetids}")
print(f"targetids in subset={np.array(subset.fibermap['TARGETID'])}")
print("repeated targetids are good, because input had 2x entries per targetid")
print(f"Resolution skipped:", subset.R is None)
print(f"fibermap columns {subset.fibermap.colnames}")

Results:

Requested targetids=[ 39633393167631849  39633393167633604 616094145466860255]
targetids in subset=[ 39633393167631849  39633393167631849  39633393167633604
  39633393167633604 616094145466860255 616094145466860255]
repeated targetids are good, because input had 2x entries per targetid
Resolution skipped: True
fibermap columns ['TARGETID', 'FIBER']

That being said, the performance of reading just a few targets from a large compressed file is terrible, so I'm going to poke at that a bit more and compare to Dirk's original version to see if I broke something...

sbailey · 2023-08-11T18:27:32Z

Although astropy provides the ability to read subsets of image rows via ahdus[h].section[rows, :], its performance is absolutely terrible on gzip compressed files (like un-coadded spectra), and slightly worse that fitsio reading the full HDU and post-facto subselecting rows even on uncompressed files. e.g. a test of reading B_FLUX, R_FLUX, and Z_FLUX from compressed spectra vs. uncompressed coadds, and either reading the entire HDU or subselecting to 5% of the rows:

                       -- read time --
filetype       rows    fitsio   astropy
compressed     all      1.981    3.470
compressed     subset   1.964   42.773
uncompressed   all      0.006    0.016
uncompressed   subset   0.006    0.016

so I switched the code to use only fitsio for reading spectra. Comparing the readtime (seconds) with current main vs. this PR for reading spectra and coadd files:

                      main   PR   PR(onetarget)
compressed spectra     2.6   2.5  2.3
uncompressed coadd     0.3   0.3  0.1

i.e. for reading the full file it is about the same as current main, reading one target is slightly faster on a compressed spectra file, and significantly faster on an uncompressed coadd file.

sbailey · 2023-08-11T18:39:25Z

and one more feature: I added a read_spectra(filename, rows=...) option to directly specify which rows to read instead of targetids. This is useful for tile-based coadd files where the row corresponds to the fiber number.

I think this is ready to merge, but I'll let it sit for a little bit longer in case others want to look. Thanks for @dirkscholte for initiating this PR with the subselect-while-reading feature that multiple people have wanted (myself included!)

dirkscholte · 2023-08-15T15:53:35Z

@sbailey Thanks for the time you spent making this as nice as it is now!! ☀️

add read_single_spectrum function

ec0504e

dirkscholte requested a review from sbailey May 24, 2023 14:02

dirkscholte self-assigned this May 24, 2023

dirkscholte added the enhancement label May 24, 2023

dirkscholte marked this pull request as ready for review May 24, 2023 15:38

dirkscholte added 2 commits July 3, 2023 13:42

allow read_spectra to read subset

6cdcd26

tiny bug fix

3354bce

update docstring read_spectra

72754f5

sbailey reviewed Aug 3, 2023

View reviewed changes

sbailey added 2 commits August 10, 2023 15:44

update read_spectra options; preserve requested targetid order

36d09ec

add more read_spectra tests

d891ce1

spectra read only with fitsio

34ab1aa

read_spectra(filename, rows=...) option

629bbce

sbailey merged commit d401035 into main Aug 16, 2023
24 checks passed

sbailey deleted the read_few_spectra branch August 16, 2023 17:05

moustakas mentioned this pull request Oct 9, 2023

read_spectra with subset of targetids returns an incomplete exp_fibermap #2125

Closed

sbailey mentioned this pull request Nov 3, 2023

fix read_spectra skipping exp_fibermap #2137

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add read_single_spectrum function #2052

add read_single_spectrum function #2052

dirkscholte commented May 24, 2023 •

edited

Loading

sbailey commented Jun 21, 2023

mlandriau commented Jun 27, 2023 via email

dirkscholte commented Jun 30, 2023

sbailey commented Jun 30, 2023

dirkscholte commented Jul 4, 2023

sbailey left a comment

dirkscholte commented Aug 8, 2023

sbailey commented Aug 10, 2023

sbailey commented Aug 11, 2023

sbailey commented Aug 11, 2023

dirkscholte commented Aug 15, 2023

add read_single_spectrum function #2052

add read_single_spectrum function #2052

Conversation

dirkscholte commented May 24, 2023 • edited Loading

sbailey commented Jun 21, 2023

mlandriau commented Jun 27, 2023 via email

dirkscholte commented Jun 30, 2023

sbailey commented Jun 30, 2023

dirkscholte commented Jul 4, 2023

sbailey left a comment

Choose a reason for hiding this comment

dirkscholte commented Aug 8, 2023

sbailey commented Aug 10, 2023

sbailey commented Aug 11, 2023

sbailey commented Aug 11, 2023

dirkscholte commented Aug 15, 2023

dirkscholte commented May 24, 2023 •

edited

Loading