Reduce scope of sync_tables #254
Codecov Report
All modified lines are covered by tests ✅
@@            Coverage Diff             @@
##             main     #254      +/-   ##
==========================================
+ Coverage   92.70%   93.60%   +0.90%
==========================================
  Files          22       22
  Lines        1151     1142       -9
==========================================
+ Hits         1067     1069       +2
+ Misses         84       73      -11
src/tape/ensemble.py
Outdated
)  # the pivot_table call makes each band_count a column of the id_col row

# repartition the result to align with object
if self._object.known_divisions:
    band_counts = band_counts.reset_index().set_index(self._id_col)  # ugly, but need this
I had to do this to resolve issues with assign not knowing divisions, and I'm really not happy about it. If anyone has ideas on how to more elegantly generate division information, let me know.
Could you explain why we need it here? I thought that groupby would keep the partitions, isn't that true?
The grouping is done on the source table, so the source partitions are preserved, and the divisions of the resulting dataframe are not known unless they were known for source. This leads to a problem where the assign call seems to have issues when the divisions for object are known but the divisions for the new pandas series being assigned are not. That said, I wonder if it would resolve better if I removed the division information from object; I might give that a go.
The latest commit changes this to just drop divisions from object, so it doesn't try to use them as part of the assign.
Looks good, thank you!
src/tape/ensemble.py
Outdated
# short-hand for calculating nobs_total
band_counts["total"] = band_counts[list(band_counts.columns)].sum(axis=1)

bands = band_counts.columns.values
self._object = self._object.assign(**{label + "_" + band: band_counts[band] for band in bands})

if temporary:
    self._object_temp.extend([label + "_" + band for band in bands])
Redundant square brackets
I think this is actually needed as a generator for a list comprehension
I don't think so.
Practice:
In [1]: a = [-1]
In [2]: a.extend(i**2 for i in range(5))
In [3]: a
Out[3]: [-1, 0, 1, 4, 9, 16]
Theory:
list.extend() takes any iterable, not necessarily a list or even a collection. That means it is not necessary to create a new list; it is enough to pass a generator.
In-line generator syntax is (expr for item in iter), with round brackets. However, if an in-line generator is the only argument of a function, the brackets may be omitted.
I think that omitting the square brackets is the more Pythonic way of doing things, but of course it is a matter of taste.
Fair enough, seems to pass unit tests. New commit drops the brackets
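As a small illustration of the point settled in this thread, the sketch below applies the bracket-free form to the column-name pattern from the diff. The names tracked, label and bands are stand-ins for self._object_temp, label and band_counts.columns.values, and the helper function is hypothetical, not part of the PR.

```python
def extend_with_band_columns(tracked, label, bands):
    """Append one "<label>_<band>" name per band to the tracked list, in place.

    Hypothetical helper for illustration only; the PR does this inline.
    """
    # list.extend accepts any iterable, so the generator expression is passed
    # directly and no intermediate list is built.
    tracked.extend(label + "_" + band for band in bands)
    return tracked


tracked = ["existing_col"]      # stand-in for self._object_temp
bands = ["g", "r", "total"]     # stand-in for band_counts.columns.values

with_generator = extend_with_band_columns(list(tracked), "nobs", bands)

# The bracketed form from the original diff gives the same result.
with_list = list(tracked)
with_list.extend(["nobs" + "_" + band for band in bands])

print(with_generator)
```

Both forms leave the list in the same state; the generator form only avoids constructing a throwaway list.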
src/tape/ensemble.py
Outdated
)  # the pivot_table call makes each band_count a column of the id_col row

# repartition the result to align with object
if self._object.known_divisions:
    band_counts = band_counts.reset_index().set_index(self._id_col)  # ugly, but need this
Could we also have a temporary argument for assign, so users could add their own temporary columns?
Yeah I think this makes a lot of sense!
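One way the suggestion above could look is sketched here with a toy table class. Everything in it is an assumption made for illustration: ToyTable, temp_columns and drop_temporary are hypothetical names, plain dicts stand in for the dataframe, and this is not how the project implements or plans to implement the feature.

```python
class ToyTable:
    """Toy stand-in for the ensemble's object table (hypothetical, not the real API)."""

    def __init__(self, columns=None):
        self.columns = dict(columns or {})
        self.temp_columns = []  # plays the role of self._object_temp

    def assign(self, temporary=False, **kwargs):
        """Add or overwrite columns; remember their names if flagged temporary."""
        self.columns.update(kwargs)
        if temporary:
            for name in kwargs:
                if name not in self.temp_columns:
                    self.temp_columns.append(name)
        return self

    def drop_temporary(self):
        """Remove every column that was assigned with temporary=True."""
        for name in self.temp_columns:
            self.columns.pop(name, None)
        self.temp_columns.clear()
        return self


table = ToyTable({"id": [1, 2]})
table.assign(nobs_total=[5, 7])                  # permanent column
table.assign(temporary=True, scratch=[0.1, 0.2])  # user-defined temporary column
tracked_before_drop = list(table.temp_columns)
table.drop_temporary()
print(sorted(table.columns))
```

The design choice in this sketch is that the flag lives on assign itself, so user-added temporary columns go through the same bookkeeping as the ones created internally.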
Addresses some first steps of #245
_generate_object_table()
calc_nobs()
as part of the operation