DM-34875: switch most DataFrame connections to ArrowAstropy #1010

TallJimbo · 2024-11-21T16:53:58Z

No description provided.

erykoff

Overall looks good; a bunch of small comments.

erykoff · 2024-11-21T17:57:54Z

python/lsst/pipe/tasks/isolatedStarAssociation.py

 import esutil
 import hpgeom as hpg
 import numpy as np
-import pandas as pd


This warms my heart.

erykoff · 2024-11-21T18:05:31Z

python/lsst/pipe/tasks/isolatedStarAssociation.py


-            table = df[persist_columns][goodSrc.selected].to_records()
+            table = tbl[persist_columns][goodSrc.selected].as_array().view(np.recarray)


I'm not sure why we need the .view(np.recarray) here? Or why (equivalently) np.asarray(tbl[persist_columns][goodSrc.selected]) doesn't work.

I was being cautious because the old code (I think) was making actual np.recarray instances, not just np.ndarray instances with structured dtypes, and I didn't know if any recarray functionality was actually needed. But unit tests pass with what you've suggested, so I'll switch to that.
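For reference, the distinction under discussion can be sketched with plain NumPy. This is an illustration only, not code from the PR: the array and its field names (`sourceId`, `ra`) are made up, and it assumes that `Table.as_array()` and `np.asarray(tbl)` both yield a structured `np.ndarray` for an unmasked table.

```python
import numpy as np

# Hypothetical stand-in for the structured array obtained from a table.
structured = np.zeros(3, dtype=[("sourceId", "i8"), ("ra", "f8")])
structured["sourceId"] = [10, 11, 12]

# A recarray view shares the same memory but adds attribute-style field access.
rec = structured.view(np.recarray)

by_key = structured["sourceId"]  # key access works on both kinds of array
by_attr = rec.sourceId           # attribute access works only on the recarray view

# Plain structured arrays do not expose fields as attributes.
has_attr_access = hasattr(structured, "sourceId")
```

If nothing downstream uses attribute-style access, the plain structured array should be enough, which is consistent with the unit tests passing after the switch.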

erykoff · 2024-11-21T18:05:48Z

python/lsst/pipe/tasks/postprocess.py

 import functools
-import pandas as pd


erykoff · 2024-11-21T18:06:00Z

python/lsst/pipe/tasks/postprocess.py

 import numbers
 import os

+import numpy as np
+import pandas as pd


Oh... sigh.

Yeah, it'd have been fantastic to be able to drop it from this file, but that's a long ways away. Just have to settle for not pretending it's part of the standard library anymore.

erykoff · 2024-11-21T18:07:25Z

python/lsst/pipe/tasks/postprocess.py

-        df = catalog.asAstropy().to_pandas().set_index("id", drop=True)
-        df["visit"] = visit
+        tbl = catalog.asAstropy()
+        tbl.add_index("id")


This makes me nervous. Why do we need the index here? We aren't persisting it... Also astropy indexes are wonky and I don't want to rely on them since the support seems nonexistent.

Ah, I hadn't realized I'd already committed this before our Slack discussion about the Astropy indexes being inadvisable. I don't think it's actually needed; I'll drop it and rerun ci_*.

erykoff · 2024-11-21T18:13:05Z

python/lsst/pipe/tasks/postprocess.py

-        outputCatalog = pd.DataFrame(data=visitEntries)
-        outputCatalog.set_index("visitId", inplace=True, verify_integrity=True)
+        outputCatalog = astropy.table.Table(rows=visitEntries)
+        outputCatalog.add_index("visitId")


Again, the astropy table index thing.
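One thing the old `set_index(..., verify_integrity=True)` call did was fail on duplicate `visitId` values. If that check is still wanted without a table index, a rough NumPy sketch might look like the following; the helper name `check_unique` is hypothetical and not part of this PR.

```python
import numpy as np


def check_unique(values, name="visitId"):
    """Raise ValueError if ``values`` contains duplicates (hypothetical helper)."""
    values = np.asarray(values)
    unique, counts = np.unique(values, return_counts=True)
    duplicates = unique[counts > 1]
    if duplicates.size:
        raise ValueError(f"Duplicate {name} values: {duplicates.tolist()}")


check_unique([1001, 1002, 1003])  # no duplicates, so this passes silently
```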

erykoff · 2024-11-21T18:13:31Z

tests/test_isolatedStarAssociation.py

 import numpy as np
-import pandas as pd


erykoff · 2024-11-21T18:14:03Z

tests/test_isolatedStarAssociation.py

-                df.set_index('sourceId', inplace=True)
-                data_refs.append(lsst.pipe.base.InMemoryDatasetHandle(df, storageClass="DataFrame"))
+                tbl = astropy.table.Table(table)
+                handles.append(lsst.pipe.base.InMemoryDatasetHandle(tbl, storageClass="DataFrame"))


Should this be DataFrame or ArrowAstropy?

Fixed. But I guess nothing actually cares, given what the test actually does with the handle.

erykoff · 2024-11-21T18:14:24Z

tests/test_isolatedStarAssociation.py

-            tables.append(df.to_records())
+        for handle in self.handles:
+            tbl = handle.get()
+            tables.append(tbl.as_array().view(np.recarray))


Again np.asarray(tbl)?

erykoff · 2024-11-21T18:14:33Z

tests/test_isolatedStarAssociation.py

-            tables.append(df.to_records())
+        for handle in self.handles:
+            tbl = handle.get()
+            tables.append(tbl.as_array().view(np.recarray))


This will only change the output dataset type definitions in repositories where they are not already registered. It should be backwards compatible (both in reading inputs originally written as DataFrame, and in downstream tasks reading its outputs) due to storage class conversions.

This does not include TransformForcedSourceTable, as its string columns make it a little trickier. The inputs and calculations are still Pandas; we just convert to ArrowAstropy at the end. This gets the default dataset type definitions in shape with minimal effort. There is a temporary exception for TransformObjectTable when multiLevelOutput=True - that still uses DataFrame - but this option is now deprecated (it wasn't being used in production already).

We're already setting these in transformSourceTable from the data ID, overriding whatever was there in the pre-transform table, and this lets transformSourceTable run on calibrateImage's already-calibrated Astropy output, which lacks them. Note that I'm making minor adjustments to Source.yaml instead of switching to the existing configuration for initial_stars because I don't want to reconfigure downstream analysis tasks (which require many measurement columns that calibrateImage doesn't run by default) on this ticket. I expect the final configuration to be somewhere in between.

Prior to this ticket, there was no detector ID column (just the mangled ccdVisitId), which seems like an odd oversight. Prior to this commit, the detector ID was just called "id", since that's what made sense in the per-visit input catalogs, but it doesn't make sense here.

erykoff approved these changes Nov 21, 2024

TallJimbo force-pushed the tickets/DM-34875 branch from 81a3750 to f924f94 Compare November 26, 2024 15:09

TallJimbo added 15 commits December 13, 2024 10:56

Make all Transform*Table run methods return a Struct.

cd7e946

Some of these previously returned a DataFrame directly while others returned a Struct.

Switch MakeVisitTable and MakeCcdVisitTable outputs to ArrowAstropy.

caf4bc9

Convert Write[Recalibrated]SourceTable output to ArrowAstropy.

8a91114

We can't do the same for Object and ForcedSource (at least not easily) because those use Pandas MultiIndexes.

Drop Gen2 terminology in isolatedStarAssociation.

bb34e45

Use "handle" as an abbreviation for DeferredDatasetHandle (or InMemoryDatasetHandle) rather than "ref", which used to mean DataRef in Gen2 (analogous) but suggests DatasetRef in Gen3 (not analogous).

Switch isolatedStarAssociation to use ArrowAstropy.

eb57397

Since the internals are already all using structured numpy arrays, it's easy to fully remove Pandas here.

Fix comment typos.

293b685

Don't expect Pandas indexes in Source functors.

4480c29

Use np.asarray and don't use np.recarray in isolatedStarAssociation.

e82c2ed

Drop unnecessary Astropy table indexes.

4d4243f

Fix (unused) storage class in test.

8be7d9e

Remove PreSource.yaml workaround.

91da38a

Now that WriteRecalibratedImageTask is also producing ArrowAstropy outputs, their ID columns look the same as the outputs of CalibrateImageTask again, and PreSource.yaml and Source.yaml are (up to some typo fixes in the latter) identical.

TallJimbo force-pushed the tickets/DM-34875 branch from 1618995 to b0b3e71 Compare December 13, 2024 15:56

TallJimbo added 2 commits December 13, 2024 12:53

Convert finalizeCharacterization to ArrowAstropy.

433d2c8

Convert measureCoaddSources to ArrowAstropy.

77c6147

TallJimbo force-pushed the tickets/DM-34875 branch from b0b3e71 to b9a7f93 Compare December 13, 2024 17:53

Add memory-efficient stack for consolidate tasks.

acbd837

Using this for MakeCcdVisitTableTask is trickier because we can't ask an ExposureCatalog how many rows it has before loading it. If that's needed, we can extend this code to do it on another branch.
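The general idea, as I read this commit message, is to learn each input's row count up front, allocate the output once, and copy inputs in one at a time so that only one input is held in memory alongside the output. The sketch below illustrates that with plain structured NumPy arrays; `stack_preallocated` and the `(n_rows, loader)` pairs are hypothetical stand-ins, not the code added in this commit.

```python
import numpy as np


def stack_preallocated(inputs):
    """Stack structured arrays given (n_rows, loader) pairs.

    Each loader is a zero-argument callable returning a structured array
    with n_rows rows; both are hypothetical stand-ins for deferred handles.
    """
    inputs = list(inputs)
    total = sum(n_rows for n_rows, _ in inputs)
    result = None
    start = 0
    for n_rows, loader in inputs:
        chunk = loader()  # only one input is loaded at a time
        if result is None:
            # Allocate the full output once, using the first chunk's dtype.
            result = np.zeros(total, dtype=chunk.dtype)
        result[start:start + n_rows] = chunk
        start += n_rows
        del chunk  # drop the reference before loading the next input
    return result


dtype = [("detector", "i4"), ("flux", "f8")]
a = np.array([(1, 0.5), (1, 1.5)], dtype=dtype)
b = np.array([(2, 2.5)], dtype=dtype)
stacked = stack_preallocated([(len(a), lambda: a), (len(b), lambda: b)])
```

The approach depends on knowing every row count before any data is loaded, which is the part that is not available for an ExposureCatalog.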

TallJimbo force-pushed the tickets/DM-34875 branch from b9a7f93 to acbd837 Compare December 13, 2024 17:59
