Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-34875: switch most DataFrame connections to ArrowAstropy #1010

Open
wants to merge 18 commits into
base: main
Choose a base branch
from

Conversation

TallJimbo
Copy link
Member

No description provided.

Copy link
Contributor

@erykoff erykoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good; a bunch of small comments.

import esutil
import hpgeom as hpg
import numpy as np
import pandas as pd
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This warms my heart.


table = df[persist_columns][goodSrc.selected].to_records()
table = tbl[persist_columns][goodSrc.selected].as_array().view(np.recarray)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why we need the .view(np.recarray) here? Or why (equivalently) np.asarray(tbl[persist_columns][goodSrc.selected]) doesn't work.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was being cautious because the old code (I think) was making actual np.recarray instances, not just np.ndarray instances with structured dtypes, and I didn't know if any recarray functionality was actually needed. But unit tests pass with what you've suggested, so I'll switch to that.

import functools
import pandas as pd
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😄

import numbers
import os

import numpy as np
import pandas as pd
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh... sigh.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it'd have been fantastic to be able to drop it from this file, but that's a long ways away. Just have to settle for not pretending it's part of the standard library anymore.

df = catalog.asAstropy().to_pandas().set_index("id", drop=True)
df["visit"] = visit
tbl = catalog.asAstropy()
tbl.add_index("id")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes me nervous. Why do we need the index here? We aren't persisting it... Also astropy indexes are wonky and I don't want to rely on them since the support seems nonexistent.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I hadn't realized I'd already committed this before our Slack discussion about the Astropy indexes being inadvisable. I don't think it's actually needed; I'll drop it and rerun ci_*.

outputCatalog = pd.DataFrame(data=visitEntries)
outputCatalog.set_index("visitId", inplace=True, verify_integrity=True)
outputCatalog = astropy.table.Table(rows=visitEntries)
outputCatalog.add_index("visitId")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, the astropy table index thing.

import numpy as np
import pandas as pd
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

df.set_index('sourceId', inplace=True)
data_refs.append(lsst.pipe.base.InMemoryDatasetHandle(df, storageClass="DataFrame"))
tbl = astropy.table.Table(table)
handles.append(lsst.pipe.base.InMemoryDatasetHandle(tbl, storageClass="DataFrame"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be DataFrame or ArrowAstropy?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. But I guess nothing actually cares, given what the test actually does with the handle.

tables.append(df.to_records())
for handle in self.handles:
tbl = handle.get()
tables.append(tbl.as_array().view(np.recarray))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again np.asarray(tbl)?

tables.append(df.to_records())
for handle in self.handles:
tbl = handle.get()
tables.append(tbl.as_array().view(np.recarray))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also here.

This will only change the output dataset type definitions in
repositories where they are not already registered.  It should be
backwards compatible (both in reading inputs originally written as
DataFrame, and in downstream tasks reading its outputs) due to storage
class conversions.
Some of these previously returned a DataFrame directly while others
returned a Struct.
This does not include TransformForcedSourceTable, as its string
columns make it a little trickier.

The inputs and calculations are still Pandas; we just convert to
ArrowAstropy at the end.  This gets the default dataset type
definitions in shape with minimal effort.

There is a temporary exception or TransformObjectTable when
multiLevelOutput=True - that still uses DataFrame - but this option is
now depecated (it wasn't being used in production already).
We can't do the same for Object and ForcedSource (at least not easily)
because those use Pandas MultiIndexes.
Use "handle" as an abbreviation for DeferredDatasetHandle (or
InMemoryDatasetHandle) rather than "ref", which used to mean DataRef
in Gen2 to (analogous) but suggests DatasetRef in Gen3 (not analogous).
Since the internals are already all using structured numpy arrays,
it's easy to fully remove Pandas here.
We're already setting these in transfromSourceTable from the data ID,
overriding whatever was there in the pre-transform table, and this lets
transformSourceTable run on calibrateImage's already-calibrated Astropy
output, which lacks them.

Note that I'm making minor adjustments to Source.yaml instead of
switching to the existing configuration for initial_stars because I
don't want to reconfigure downstream analysis tasks (which require
many measurement columns that calibrateImage doesn't run by default)
on this ticket.  I expect the final configuration to be somewhere in
between.
Now that WriteRecalibratedImageTask is also producing ArrowAstropy
outputs, their ID columns look the same as the outputs of
CalibrateImageTask again, and PreSource.yaml and Source.yaml are
(up to some typo fixes in the latter) identical.
Prior to this ticket, there was no detector ID column (just the
manged ccdVisitId), which seems like an odd oversight.  Prior to this
commit, the detector ID was just called "id", since that's what made
sense in the per-visit input cataogs, but it doesn't make sense here.
Using this for MakeCcdVisitTableTask is trickier because we can't
ask an ExposureCatalog how many rows it has before loading it.  If
that's needed, we can extend this code to do it on another branch.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants