Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Intelligent Sampler and Assembler #17

Closed
wants to merge 8 commits into from
Closed

Intelligent Sampler and Assembler #17

wants to merge 8 commits into from

Conversation

jpn--
Copy link
Collaborator

@jpn-- jpn-- commented Oct 11, 2022

This is a cleaner PR to make review simpler.

Closes #16

@DavidOry
Copy link
Collaborator

  • Add pytest infrastructure for unit testing
  • Convert the notebooks to tests to ensure that the resources repository and docker environment remain up to date

Copy link
Collaborator

@DavidOry DavidOry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jpn--
Feedback from me for your consideration (fyi @kulshresthaa, @JoeJimFlood)

----------
orig_indiv : path-like
Trips table from "original" model run, should be comprehensive simulation
of all individual trips for all synthetic households.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method is requesting CSV files with certain fields of certain data types. Is it worth including these references in the method docs?

jnt_trips_rsm = pd.read_csv(rsm_joint)

# convert to rsm trips
logger.info("convert to common table platform")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An assert here checking for the existence of the expected column headers would be useful.

# convert to rsm trips
logger.info("convert to common table platform")
rsm_trips = _merge_joint_and_indiv_trips(ind_trips_rsm, jnt_trips_rsm)
original_trips = _merge_joint_and_indiv_trips(ind_trips_full, jnt_trips_full)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a personal preference for using suffixes to identify datatypes, e.g., rsm_trips_df. This makes it easier for me to read the code and follow what is happening. If you agree, can we modify?

_agg_by_hhid_and_tripmode(original_trips_that_were_resimulated, "n_trips_orig"),
_agg_by_hhid_and_tripmode(rsm_trips, "n_trips_rsm"),
on=["hh_id", "trip_mode"],
how="outer",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this an outer join? Are we anticipating that we have RSM trips that are not in the original model?

# aggregating by Home zone
hh_rsm = pd.read_csv(households)
hh_id_col_names = ["hhid", "hh_id", "household_id"]
for hhid in hh_id_col_names:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for col_name in hh_id_col_names: is easier to follow.

study_area=None,
input_household="households.csv",
input_person="persons.csv",
taz_crosswalk="taz_crosswalk.csv",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not obvious to me what taz_crosswalk is. Is this from the original model to the rsm? Can we use a more logical name? It's also not in the in-line method documentation below.

logger.warning(f"missing curr_iter_access_df from {curr_iter_access}")
if prev_iter_access_df is None:
logger.warning(f"missing prev_iter_access_df from {prev_iter_access}")
# true for first iteraion
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why give a warning if this is true for the first iteration? Would it be better to add a first_iteration flag to the method to avoid giving a confusing warning?

taz_hh = input_household_df.groupby(["taz"]).size().rename("n_hh").to_frame()

if curr_iter_access_df is None or prev_iter_access_df is None:

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Largely a stylistic preference, but I'd rather see these chunks as methods than long if-else blocks. If this, go see this method. If not, go see this method. It makes it easier (for me) to follow the broader logic of the method and then dive into the details of how the pieces are implemented.

curr_iter_access_df.index.isin(taz_hh.index)
].copy()

# compare accessibility columns
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, prefer this be a method to keep the flow easier to read.

axis=1
)

# TODO: potentially adjust this later after we figure out a better approach
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems as a good a time as any to discuss the approach we want to use at the MVP. Can you please confirm what we are doing now? My guess is below.

  1. Sum the errors in accessibility across the accessibility columns
  2. Normalize the errors within the range of the input sampling rate
  3. Insure that the outcome is within the sampling bounds

Summing the errors across the categories makes me uneasy, as they are on different scales. Can we instead normalize each one and take the max across the accessibility categories as the measure?

@AshishKuls AshishKuls closed this Jan 20, 2023
@AshishKuls AshishKuls deleted the sampler-assembler branch January 20, 2023 18:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PRODUCT 2B: Intelligent Assembler PRODUCT 2A: Intelligent Sampler
3 participants