Reduce scope of sync_tables #254

Merged
merged 8 commits into from
Oct 6, 2023

Conversation

@dougbrn dougbrn commented Oct 4, 2023

Addresses some first steps of #245

  • Removes nobs columns from critical columns, no longer generates them from source in _generate_object_table()
  • Sync_tables now just aligns indices, no generation of nobs information
  • Adds the notion of temporary columns, which are removed when syncs occur
  • Updates prune() to use an existing nobs column if provided, but otherwise calls calc_nobs() as part of the operation
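To make the temporary-column idea in the list above concrete, here is a minimal sketch in plain Python. It is not the TAPE implementation: `TinyEnsemble`, its dict-backed `object` table, and the `temporary` keyword on `assign` are hypothetical stand-ins; only the `_object_temp` and `sync_tables` names are taken from this PR.

```python
class TinyEnsemble:
    """Hypothetical stand-in for an ensemble; columns live in a plain dict."""

    def __init__(self, object_cols):
        self.object = dict(object_cols)
        self._object_temp = []  # names of columns to drop at the next sync

    def assign(self, temporary=False, **cols):
        """Add columns; optionally mark them as temporary."""
        self.object.update(cols)
        if temporary:
            # Iterating a dict yields its keys, i.e. the new column names.
            self._object_temp.extend(cols)

    def sync_tables(self):
        """Drop every temporary column (index alignment is omitted here)."""
        for name in self._object_temp:
            self.object.pop(name, None)
        self._object_temp = []


ens = TinyEnsemble({"ra": [10.0, 20.0]})
ens.assign(temporary=True, nobs_total=[3, 2])  # removed at the next sync
ens.assign(mean_flux=[1.5, 4.0])               # kept
ens.sync_tables()
print(sorted(ens.object))
```

The point of the design is that derived columns which may go stale after a sync never outlive it, while ordinary columns are untouched.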

codecov bot commented Oct 4, 2023

Codecov Report

All modified lines are covered by tests ✅

Comparison is base (0f962c5) 92.70% compared to head (551af41) 93.60%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #254      +/-   ##
==========================================
+ Coverage   92.70%   93.60%   +0.90%     
==========================================
  Files          22       22              
  Lines        1151     1142       -9     
==========================================
+ Hits         1067     1069       +2     
+ Misses         84       73      -11     
Files                                           Coverage Δ
src/tape/ensemble.py                            92.00% <100.00%> (+2.12%) ⬆️
src/tape/utils/column_mapper/column_mapper.py   90.00% <ø> (ø)


) # the pivot_table call makes each band_count a column of the id_col row

# repartition the result to align with object
if self._object.known_divisions:
    band_counts = band_counts.reset_index().set_index(self._id_col) # ugly, but need this
Collaborator Author

I had to do this to resolve issues with assign not knowing divisions, and I'm really not happy about it. If anyone has ideas on how to more elegantly generate division information, let me know.

Contributor

Could you explain why we need this here? I thought groupby would preserve the partitions; is that not the case?

Collaborator Author

The grouping is done on the source table, so the source partitions are preserved, and the divisions of the resulting dataframe are known only if they were known for source. The problem arises when the divisions for object are known but the divisions for the new series to assign are not: the assign call then seems to have issues. That said, I wonder whether it would resolve more cleanly if I removed the division information from object instead. I might give that a go.

Collaborator Author

The latest commit changes this to just drop the divisions from object, so assign no longer tries to use them.

@dougbrn dougbrn marked this pull request as ready for review October 4, 2023 22:12
@dougbrn dougbrn requested a review from hombit October 4, 2023 22:17
@hombit hombit left a comment

Looks good, thank you!

# short-hand for calculating nobs_total
band_counts["total"] = band_counts[list(band_counts.columns)].sum(axis=1)

bands = band_counts.columns.values
self._object = self._object.assign(**{label + "_" + band: band_counts[band] for band in bands})

if temporary:
    self._object_temp.extend([label + "_" + band for band in bands])
Contributor

Redundant square brackets

Collaborator Author

I think this is actually needed as a generator for a list comprehension

Contributor

I don't think so.

Practice:

In [1]: a = [-1]
In [2]: a.extend(i**2 for i in range(5))
In [3]: a
Out[3]: [-1, 0, 1, 4, 9, 16]

Theory:

  • list.extend() takes any iterable, not necessarily a list or even a collection. That means it is not necessary to create a new list; passing a generator is enough.
  • The syntax for a generator expression is (expr for item in iterable), with round brackets. However, if a generator expression is the only argument of a function call, these brackets may be omitted.

I think that leaving out the square brackets is the more Pythonic way of doing things, but of course it is a matter of taste.
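The bracket rule described above can be checked with the column names from this PR. This is only an illustrative sketch: `temp_cols` and `total_chars` are hypothetical names, and the `sum` call is there solely to show the case where a second argument forces the parentheses back in.

```python
label = "nobs"
bands = ["g", "r", "total"]

# Sole argument: the generator expression needs no extra brackets.
temp_cols = []
temp_cols.extend(label + "_" + band for band in bands)

# With a second argument, the generator expression must be parenthesized;
# writing sum(len(c) for c in temp_cols, 0) would be a SyntaxError.
total_chars = sum((len(c) for c in temp_cols), 0)

print(temp_cols, total_chars)
```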

Collaborator Author

Fair enough, seems to pass unit tests. New commit drops the brackets
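The snippet under review in this thread adds a total column to the per-band counts and then assigns one prefixed column per band to the object table. A small pandas sketch of that pattern follows. The PR itself operates on Dask dataframes; the table contents, the `flux` column, and the use of `pivot_table` with `aggfunc="count"` are illustrative assumptions, not the code in `ensemble.py`.

```python
import pandas as pd

# Hypothetical source table: one row per observation.
source = pd.DataFrame(
    {
        "id": [1, 1, 1, 2, 2],
        "band": ["g", "r", "g", "r", "r"],
        "flux": [1.0, 2.0, 3.0, 4.0, 5.0],
    }
)
# Hypothetical object table: one row per id.
obj = pd.DataFrame({"ra": [10.0, 20.0]}, index=pd.Index([1, 2], name="id"))

# One row per id, one column per band, each cell an observation count.
band_counts = source.pivot_table(
    index="id", columns="band", values="flux", aggfunc="count", fill_value=0
)

# Short-hand for the total number of observations per id.
band_counts["total"] = band_counts[list(band_counts.columns)].sum(axis=1)

# Assign one prefixed column per band; assign aligns on the shared index.
label = "nobs"
bands = band_counts.columns.values
obj = obj.assign(**{label + "_" + band: band_counts[band] for band in bands})

print(obj)
```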

) # the pivot_table call makes each band_count a column of the id_col row

# repartition the result to align with object
if self._object.known_divisions:
    band_counts = band_counts.reset_index().set_index(self._id_col) # ugly, but need this
Contributor

Could we also have temporary argument for assign? So users could add their own temporary columns.

Collaborator Author

Yeah I think this makes a lot of sense!

@dougbrn dougbrn merged commit 5a93408 into main Oct 6, 2023
9 checks passed
@dougbrn dougbrn deleted the lean_sync branch December 11, 2023 19:27