Reduce scope of sync_tables #254

dougbrn · 2023-10-04T19:40:47Z

Addresses some first steps of #245

Removes the nobs columns from the critical columns; they are no longer generated from source in _generate_object_table()
sync_tables now only aligns indices and no longer generates nobs information
Adds the notion of temporary columns, which are removed when syncs occur
Updates prune() to use an existing nobs column if provided, but otherwise calls calc_nobs() as part of the operation
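
As a rough illustration of the last item, the fallback logic can be sketched in plain Python. This is not TAPE's implementation: the `count_obs` and `prune_ids` helpers, the dictionary-based tables, and the default threshold are all assumptions made for the example.

```python
from collections import Counter


def count_obs(source_ids):
    """Count source rows per object id (hypothetical stand-in for calc_nobs())."""
    return Counter(source_ids)


def prune_ids(object_table, source_ids, threshold=2, col_name=None):
    """Return the ids that have at least `threshold` observations.

    object_table maps a column name to a {object id: value} dict.
    If `col_name` names an existing column it is used as-is;
    otherwise the counts are computed from the source ids.
    """
    if col_name is not None and col_name in object_table:
        nobs = object_table[col_name]
    else:
        nobs = count_obs(source_ids)
    return sorted(obj_id for obj_id, n in nobs.items() if n >= threshold)


source_ids = [1, 1, 1, 2, 3, 3]
object_table = {"my_nobs": {1: 5, 2: 5, 3: 0}}

computed = prune_ids(object_table, source_ids)                      # counts from source
provided = prune_ids(object_table, source_ids, col_name="my_nobs")  # uses the column
```

Here `computed` keeps the ids with two or more source rows, while `provided` trusts the user-supplied column even where it disagrees with the source.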

review-notebook-app · 2023-10-04T19:40:51Z

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

codecov · 2023-10-04T19:43:46Z

Codecov Report

All modified lines are covered by tests ✅

Comparison is base (0f962c5) 92.70% compared to head (551af41) 93.60%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #254      +/-   ##
==========================================
+ Coverage   92.70%   93.60%   +0.90%     
==========================================
  Files          22       22              
  Lines        1151     1142       -9     
==========================================
+ Hits         1067     1069       +2     
+ Misses         84       73      -11

Files	Coverage Δ
src/tape/ensemble.py	`92.00% <100.00%> (+2.12%)`	⬆️
src/tape/utils/column_mapper/column_mapper.py	`90.00% <ø> (ø)`


dougbrn · 2023-10-04T21:34:12Z

src/tape/ensemble.py

            )  # the pivot_table call makes each band_count a column of the id_col row

+            # repartition the result to align with object
+            if self._object.known_divisions:
+                band_counts = band_counts.reset_index().set_index(self._id_col)  # ugly, but need this


I had to do this to resolve issues with assign not knowing divisions, and I'm really not happy about it. If anyone has ideas on how to more elegantly generate division information, let me know.

Could you explain why we need it here? I thought groupby would keep the partitions; isn't that true?

The grouping is done on the source table, so the source partitions are preserved, and the divisions of the resulting dataframe are unknown unless they were known for source. The assign call then seems to have problems when the divisions for object are known but the divisions for the new series being assigned are not. That said, it might resolve more cleanly if I removed the division information from object; I might give that a go.

The latest commit changes this to just drop divisions from object so it doesn't try to use them as part of the assign

hombit

Looks good, thank you!

hombit · 2023-10-05T12:08:24Z

src/tape/ensemble.py

            # short-hand for calculating nobs_total
            band_counts["total"] = band_counts[list(band_counts.columns)].sum(axis=1)

            bands = band_counts.columns.values
            self._object = self._object.assign(**{label + "_" + band: band_counts[band] for band in bands})

+            if temporary:
+                self._object_temp.extend([label + "_" + band for band in bands])


Redundant square brackets

I think this is actually needed as a generator for a list comprehension

I don't think so.

Practice:

In [1]: a = [-1]
In [2]: a.extend(i**2 for i in range(5))
In [3]: a
Out[3]: [-1, 0, 1, 4, 9, 16]

Theory:

list.extend() takes any iterable, not necessarily a list or even a collection. That means it is not necessary to create a new list; passing a generator is enough.

The generator expression syntax is (expr for item in iter), with round brackets. However, if a generator expression is the only argument of a function call, these brackets may be omitted.

I think that leaving out the square brackets is the more Pythonic way of doing this, but of course it is a matter of taste.

Fair enough, seems to pass unit tests. New commit drops the brackets
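
For reference, the point about list.extend() in this thread can be checked with a small standalone sketch. The `squares` helper and the variable names are made up for the illustration and do not come from the PR.

```python
def squares(n):
    """Yield the first n squares lazily."""
    for i in range(n):
        yield i**2


with_list = [-1]
with_list.extend([i**2 for i in range(5)])  # builds a throwaway list first

with_genexp = [-1]
with_genexp.extend(i**2 for i in range(5))  # sole argument: extra parentheses optional

with_generator = [-1]
with_generator.extend(squares(5))           # any iterable works

# With a second argument, the generator expression needs its own parentheses.
total = sum((i**2 for i in range(5)), 10)
```

All three lists end up identical; the bracketless form just avoids materialising an intermediate list.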

hombit · 2023-10-05T12:15:03Z

src/tape/ensemble.py

            )  # the pivot_table call makes each band_count a column of the id_col row

+            # repartition the result to align with object
+            if self._object.known_divisions:
+                band_counts = band_counts.reset_index().set_index(self._id_col)  # ugly, but need this


Could you explain why we need it here? I thought groupby would keep the partitions; isn't that true?

hombit · 2023-10-05T20:07:54Z

src/tape/ensemble.py

Could we also have a temporary argument for assign? Then users could add their own temporary columns.

Yeah I think this makes a lot of sense!
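
A minimal sketch of what such a keyword could look like, assuming temporary columns are tracked by name and dropped on sync as the PR description states. `MiniEnsemble` is a toy stand-in written for this example, not the TAPE class, and the exact `assign` signature is an assumption.

```python
class MiniEnsemble:
    """Toy stand-in for an ensemble; not the TAPE class."""

    def __init__(self, object_cols):
        self.object_cols = dict(object_cols)  # column name -> list of values
        self._object_temp = []                # names of temporary columns

    def assign(self, temporary=False, **columns):
        """Add columns and remember them as temporary if requested."""
        self.object_cols.update(columns)
        if temporary:
            self._object_temp.extend(name for name in columns)
        return self

    def sync_tables(self):
        """Drop temporary columns; index alignment is omitted in this sketch."""
        for name in self._object_temp:
            self.object_cols.pop(name, None)
        self._object_temp = []
        return self


ens = MiniEnsemble({"ra": [10.0, 20.0]})
ens.assign(nobs_total=[3, 1], temporary=True)  # removed at the next sync
ens.assign(flag=[True, False])                 # kept
cols_before = sorted(ens.object_cols)
ens.sync_tables()
cols_after = sorted(ens.object_cols)
```

Only the column flagged as temporary disappears after the sync; regular columns are untouched.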

reduce scope of sync_tables

3a6e3fb

dougbrn added 2 commits October 4, 2023 14:04

address divisions issue

4049e03

add temporary cols test

074cf3d

dougbrn commented Oct 4, 2023

improve coverage

6488344

dougbrn marked this pull request as ready for review October 4, 2023 22:12

dougbrn requested a review from hombit October 4, 2023 22:17

hombit approved these changes Oct 5, 2023

dougbrn added 4 commits October 5, 2023 14:32

add temporary kwarg to assign

5a5f7a1

add temporary kwarg to assign

7f7167d

drop divisions

e6b6d38

drop brackets

551af41

dougbrn merged commit 5a93408 into main Oct 6, 2023
9 checks passed

wenneman mentioned this pull request Oct 6, 2023

Unit test failing after PR #256 #263

Closed

dougbrn mentioned this pull request Oct 6, 2023

Improvements to Table Syncing #245

Open

dougbrn deleted the lean_sync branch December 11, 2023 19:27
