Intelligent Sampler and Assembler #17
Conversation
@jpn--
Feedback from me for your consideration (fyi @kulshresthaa, @JoeJimFlood)
----------
orig_indiv : path-like
    Trips table from "original" model run, should be comprehensive simulation
    of all individual trips for all synthetic households.
The method is requesting CSV files with certain fields of certain data types. Is it worth including these references in the method docs?
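For instance, the expected schema could be spelled out directly in the numpydoc parameter block. A minimal sketch (the function name is hypothetical, and the column list is illustrative, borrowing only hh_id and trip_mode, which appear elsewhere in this diff):

```python
def assemble_trip_tables(orig_indiv):
    """Combine trip tables from the original and RSM model runs.

    Parameters
    ----------
    orig_indiv : path-like
        Trips table from "original" model run, a CSV of all individual
        trips for all synthetic households.  Expected columns (dtypes):
        ``hh_id`` (int), ``trip_mode`` (int).
    """
```

Listing the required columns and dtypes here would also make the asserts suggested below self-documenting.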
jnt_trips_rsm = pd.read_csv(rsm_joint)

# convert to rsm trips
logger.info("convert to common table platform")
An assert here checking for the existence of the expected column headers would be useful.
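A minimal sketch of such a check, in pure Python over the column names; REQUIRED_COLS lists only the two columns visible elsewhere in this diff and would need extending to the real schema:

```python
# columns the downstream aggregation relies on -- extend as needed
REQUIRED_COLS = {"hh_id", "trip_mode"}

def assert_expected_columns(columns, table_name):
    """Fail fast with a readable message if a just-read table lacks columns."""
    missing = REQUIRED_COLS - set(columns)
    assert not missing, f"{table_name} is missing expected columns: {sorted(missing)}"
```

It could be called right after each read, e.g. `assert_expected_columns(jnt_trips_rsm.columns, "rsm_joint")`.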
# convert to rsm trips
logger.info("convert to common table platform")
rsm_trips = _merge_joint_and_indiv_trips(ind_trips_rsm, jnt_trips_rsm)
original_trips = _merge_joint_and_indiv_trips(ind_trips_full, jnt_trips_full)
I have a personal preference for using suffixes to identify datatypes, e.g., rsm_trips_df. This makes it easier for me to read the code and follow what is happening. If you agree, can we modify?
_agg_by_hhid_and_tripmode(original_trips_that_were_resimulated, "n_trips_orig"),
_agg_by_hhid_and_tripmode(rsm_trips, "n_trips_rsm"),
on=["hh_id", "trip_mode"],
how="outer",
Why is this an outer join? Are we anticipating that we have RSM trips that are not in the original model?
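A toy illustration of why the join type matters (a pure-Python stand-in for the pandas merge; the key values are made up): an outer join keeps (hh_id, trip_mode) pairs that appear on only one side, with NaN counts on the other, while an inner join would silently drop them.

```python
# (hh_id, trip_mode) keys present on each side -- illustrative values only
orig_keys = {(1, 1), (2, 2)}
rsm_keys = {(1, 1), (3, 3)}

outer_keys = orig_keys | rsm_keys  # unmatched rows kept (with NaN on one side)
inner_keys = orig_keys & rsm_keys  # mismatches silently dropped
```

If one-sided rows are possible, passing indicator=True to pd.merge would flag which side each row came from, so the mismatches can at least be logged.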
# aggregating by Home zone
hh_rsm = pd.read_csv(households)
hh_id_col_names = ["hhid", "hh_id", "household_id"]
for hhid in hh_id_col_names:
for col_name in hh_id_col_names: is easier to follow.
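The loop could also be pulled into a small helper that returns whichever household-id column is present. A sketch (the helper name is hypothetical; the candidate list comes from the diff above):

```python
def find_hh_id_column(columns, candidates=("hhid", "hh_id", "household_id")):
    """Return the first household-id column name present in `columns`."""
    for col_name in candidates:
        if col_name in columns:
            return col_name
    raise KeyError(f"no household-id column found among {candidates}")
```

The caller can then rename in one step, e.g. `hh_rsm.rename(columns={find_hh_id_column(hh_rsm.columns): "hh_id"})`.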
study_area=None,
input_household="households.csv",
input_person="persons.csv",
taz_crosswalk="taz_crosswalk.csv",
It's not obvious to me what taz_crosswalk is. Is this from the original model to the rsm? Can we use a more logical name? It's also not in the in-line method documentation below.
logger.warning(f"missing curr_iter_access_df from {curr_iter_access}")
if prev_iter_access_df is None:
    logger.warning(f"missing prev_iter_access_df from {prev_iter_access}")
    # true for first iteraion
Why give a warning if this is true for the first iteration? Would it be better to add a first_iteration flag to the method to avoid giving a confusing warning?
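A sketch of the flag-based guard (the function name is hypothetical): the warning only fires when a missing accessibility table is actually unexpected.

```python
def should_warn_missing_access(access_df, first_iteration):
    """Warn about a missing accessibility table only after the first
    iteration, where its absence is expected rather than an error."""
    return access_df is None and not first_iteration
```

The call site would then read `if should_warn_missing_access(prev_iter_access_df, first_iteration): logger.warning(...)`.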
taz_hh = input_household_df.groupby(["taz"]).size().rename("n_hh").to_frame()

if curr_iter_access_df is None or prev_iter_access_df is None:
Largely a stylistic preference, but I'd rather see these chunks as methods than long if-else blocks. If this, go see this method. If not, go see this method. It makes it easier (for me) to follow the broader logic of the method and then dive into the details of how the pieces are implemented.
    curr_iter_access_df.index.isin(taz_hh.index)
].copy()

# compare accessibility columns
Again, prefer this be a method to keep the flow easier to read.
    axis=1
)

# TODO: potentially adjust this later after we figure out a better approach
This seems as good a time as any to discuss the approach we want to use for the MVP. Can you please confirm what we are doing now? My guess is below.
- Sum the errors in accessibility across the accessibility columns
- Normalize the errors within the range of the input sampling rate
- Ensure that the outcome is within the sampling bounds

Summing the errors across the categories makes me uneasy, as they are on different scales. Can we instead normalize each one and take the max across the accessibility categories as the measure?
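A sketch of the normalize-then-max alternative (pure Python, hypothetical names): each accessibility column's absolute errors are scaled by that column's own maximum, so every column lands on [0, 1], and the per-zone measure is the worst normalized error rather than a sum of mixed scales.

```python
def per_zone_max_normalized_error(error_columns):
    """error_columns: list of per-zone absolute-error lists, one list per
    accessibility column.  Returns one blended error per zone, in [0, 1]."""
    normalized = []
    for errs in error_columns:
        hi = max(errs) or 1.0  # avoid divide-by-zero for an all-zero column
        normalized.append([e / hi for e in errs])
    # worst normalized error across columns, for each zone
    return [max(zone_errs) for zone_errs in zip(*normalized)]
```

The blended value could then be mapped linearly into the configured sampling-rate range and clipped to the bounds, replacing the summed-error step.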
This is a cleaner PR to make review simpler.
Closes #16