Intelligent Sampler and Assembler #17

jpn-- · 2022-10-11T21:26:21Z

This is a cleaner PR to make review simpler.

Closes #16

DavidOry · 2022-10-28T17:22:23Z

Add pytest infrastructure for unit testing
Convert the notebooks to tests to ensure that the resources repository and docker environment remain up to date

DavidOry

@jpn--
Feedback from me for your consideration (fyi @kulshresthaa, @JoeJimFlood)

DavidOry · 2022-10-28T17:25:09Z

sandag_rsm/assembler.py

+    ----------
+    orig_indiv : path-like
+        Trips table from "original" model run, should be comprehensive simulation
+        of all individual trips for all synthetic households.


The method is requesting CSV files with certain fields of certain data types. Is it worth including these references in the method docs?
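A minimal sketch of what such documentation could look like, assuming a numpy-style docstring like the one in the diff. Only the column names hh_id and trip_mode appear elsewhere in this PR (as merge keys); the dtypes, the TRIP_DTYPES constant, and the read_trips helper are hypothetical and shown only to illustrate the idea.

```python
import io

import pandas as pd

# Assumed dtypes for illustration; the real files may carry more columns and other types.
TRIP_DTYPES = {"hh_id": "int64", "trip_mode": "int64"}


def read_trips(orig_indiv):
    """Read an individual trips table.

    Parameters
    ----------
    orig_indiv : path-like or file-like
        CSV of individual trips from the "original" model run.
        Expected columns (assumed here):

        - hh_id (int): household identifier
        - trip_mode (int): trip mode code
    """
    # usecols makes pandas raise a ValueError if a documented column is absent
    return pd.read_csv(orig_indiv, usecols=list(TRIP_DTYPES), dtype=TRIP_DTYPES)


# In-memory CSV standing in for a real trips file
trips_df = read_trips(io.StringIO("hh_id,trip_mode,extra\n1,2,x\n"))
```

Keeping the documented columns and the dtypes passed to read_csv in one place means the docs and the reader are less likely to drift apart.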

DavidOry · 2022-10-28T17:26:32Z

sandag_rsm/assembler.py

+    jnt_trips_rsm = pd.read_csv(rsm_joint)
+
+    # convert to rsm trips
+    logger.info("convert to common table platform")


An assert here checking for the existence of the expected column headers would be useful.
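Something along these lines, as a sketch only. The REQUIRED_TRIP_COLUMNS set and the check_required_columns helper are hypothetical names, and the required columns shown are just the two merge keys visible in this PR. It raises a ValueError instead of using a bare assert, because assert statements are skipped when Python runs with the -O flag.

```python
import io

import pandas as pd

# Hypothetical list of required columns; the real list depends on the trip files.
REQUIRED_TRIP_COLUMNS = {"hh_id", "trip_mode"}


def check_required_columns(df, required, table_name):
    """Raise a ValueError naming any required columns that are absent from df."""
    missing = sorted(set(required) - set(df.columns))
    if missing:
        raise ValueError(f"{table_name} is missing expected columns: {missing}")
    return df


# In-memory CSV standing in for the rsm_joint file
csv_text = "hh_id,trip_mode\n1,3\n2,1\n"
jnt_trips_rsm = pd.read_csv(io.StringIO(csv_text))
check_required_columns(jnt_trips_rsm, REQUIRED_TRIP_COLUMNS, "rsm_joint")
```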

DavidOry · 2022-10-28T17:27:19Z

sandag_rsm/assembler.py

+    # convert to rsm trips
+    logger.info("convert to common table platform")
+    rsm_trips = _merge_joint_and_indiv_trips(ind_trips_rsm, jnt_trips_rsm)
+    original_trips = _merge_joint_and_indiv_trips(ind_trips_full, jnt_trips_full)


I have a personal preference for using suffixes to identify datatypes, e.g., rsm_trips_df. This makes it easier for me to read the code and follow what is happening. If you agree, can we modify?

DavidOry · 2022-10-28T17:29:14Z

sandag_rsm/assembler.py

+        _agg_by_hhid_and_tripmode(original_trips_that_were_resimulated, "n_trips_orig"),
+        _agg_by_hhid_and_tripmode(rsm_trips, "n_trips_rsm"),
+        on=["hh_id", "trip_mode"],
+        how="outer",


Why is this an outer join? Are we anticipating that we have RSM trips that are not in the original model?
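One way to check this empirically is pandas' indicator option on merge, which labels each row as left_only, right_only, or both. The sketch below uses made-up toy tables with the same key and count column names as the diff; it is not the assembler's data.

```python
import pandas as pd

# Toy aggregated trip counts (made-up values)
orig = pd.DataFrame(
    {"hh_id": [1, 1, 2], "trip_mode": [1, 3, 1], "n_trips_orig": [2, 1, 4]}
)
rsm = pd.DataFrame(
    {"hh_id": [1, 2, 2], "trip_mode": [1, 1, 5], "n_trips_rsm": [2, 3, 1]}
)

merged = pd.merge(
    orig, rsm, on=["hh_id", "trip_mode"], how="outer", indicator=True
)

# Household/mode pairs that exist in only one of the two tables
only_rsm = merged[merged["_merge"] == "right_only"]
only_orig = merged[merged["_merge"] == "left_only"]
print(f"{len(only_rsm)} pairs only in RSM, {len(only_orig)} only in original")
```

If only_rsm is always empty on real runs, a left join would be enough; if not, the outer join is what keeps those rows.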

DavidOry · 2022-10-28T17:31:04Z

sandag_rsm/assembler.py

+    # aggregating by Home zone
+    hh_rsm = pd.read_csv(households)
+    hh_id_col_names = ["hhid", "hh_id", "household_id"]
+    for hhid in hh_id_col_names:


`for col_name in hh_id_col_names:` is easier to follow.
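For illustration, a sketch with the renamed loop variable. The quoted diff does not show the loop body, so this assumes the loop is looking for whichever household id column is present; find_hh_id_column is a hypothetical helper name.

```python
import pandas as pd


def find_hh_id_column(df, candidates=("hhid", "hh_id", "household_id")):
    """Return the first candidate column name that exists in df."""
    for col_name in candidates:
        if col_name in df.columns:
            return col_name
    raise KeyError(f"none of {list(candidates)} found in table columns")


# Toy households table (made-up values)
hh_rsm = pd.DataFrame({"household_id": [1, 2], "taz": [10, 11]})
hh_id_col = find_hh_id_column(hh_rsm)
```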

DavidOry · 2022-10-28T17:42:50Z

sandag_rsm/sampler.py

+    study_area=None,
+    input_household="households.csv",
+    input_person="persons.csv",
+    taz_crosswalk="taz_crosswalk.csv",


It's not obvious to me what taz_crosswalk is. Is this from the original model to the rsm? Can we use a more logical name? It's also not in the in-line method documentation below.

DavidOry · 2022-10-28T17:44:53Z

sandag_rsm/sampler.py

+            logger.warning(f"missing curr_iter_access_df from {curr_iter_access}")
+        if prev_iter_access_df is None:
+            logger.warning(f"missing prev_iter_access_df from {prev_iter_access}")
+        # true for first iteraion


Why give a warning if this is true for the first iteration? Would it be better to add a first_iteration flag to the method to avoid giving a confusing warning?
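A sketch of what I have in mind, assuming a first_iteration keyword is added. The check_access_inputs helper and the logger name are hypothetical; only the variable names and warning text come from the diff.

```python
import logging

logger = logging.getLogger("sampler_sketch")


def check_access_inputs(curr_iter_access_df, prev_iter_access_df, first_iteration=False):
    """Return True when both accessibility tables are available."""
    if curr_iter_access_df is not None and prev_iter_access_df is not None:
        return True
    if first_iteration:
        # Expected on the first iteration, so this is informational only
        logger.info("first iteration: no accessibility comparison is possible")
    else:
        # Unexpected on later iterations, so this deserves a warning
        if curr_iter_access_df is None:
            logger.warning("missing curr_iter_access_df")
        if prev_iter_access_df is None:
            logger.warning("missing prev_iter_access_df")
    return False
```

With this shape, a warning in the log always means something is actually wrong.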

DavidOry · 2022-10-28T17:50:21Z

sandag_rsm/sampler.py

+    taz_hh = input_household_df.groupby(["taz"]).size().rename("n_hh").to_frame()
+
+    if curr_iter_access_df is None or prev_iter_access_df is None:
+


Largely a stylistic preference, but I'd rather see these chunks as methods than long if-else blocks. If this, go see this method. If not, go see this method. It makes it easier (for me) to follow the broader logic of the method and then dive into the details of how the pieces are implemented.
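Roughly the shape I mean, as a sketch. All three function names are hypothetical, and the body of _rates_from_access_change is stand-in logic only (it is not what the PR does); the point is that the top-level function reads as a short dispatch.

```python
import pandas as pd


def _uniform_rates(taz_hh, default_rate):
    """Branch taken when either accessibility table is missing."""
    return pd.Series(default_rate, index=taz_hh.index, name="sampling_rate")


def _rates_from_access_change(taz_hh, curr_df, prev_df, default_rate):
    """Branch taken when both tables exist (stand-in logic only)."""
    curr = curr_df.reindex(taz_hh.index)
    prev = prev_df.reindex(taz_hh.index)
    # Flag zones where any accessibility column changed between iterations
    changed = (curr - prev).abs().max(axis=1) > 0
    rates = pd.Series(default_rate, index=taz_hh.index, name="sampling_rate")
    rates[changed] = 1.0
    return rates


def sampling_rates(taz_hh, curr_df=None, prev_df=None, default_rate=0.25):
    """Top-level flow: pick a branch, leave the details to the helpers."""
    if curr_df is None or prev_df is None:
        return _uniform_rates(taz_hh, default_rate)
    return _rates_from_access_change(taz_hh, curr_df, prev_df, default_rate)
```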

DavidOry · 2022-10-28T17:52:14Z

sandag_rsm/sampler.py

+            curr_iter_access_df.index.isin(taz_hh.index)
+        ].copy()
+
+        # compare accessibility columns


Again, I'd prefer this to be a method, to keep the flow easier to read.

DavidOry · 2022-10-28T17:57:19Z

sandag_rsm/sampler.py

+            axis=1
+        )
+
+        # TODO: potentially adjust this later after we figure out a better approach


This seems as good a time as any to discuss the approach we want to use for the MVP. Can you please confirm what we are doing now? My guess is below.

1. Sum the errors in accessibility across the accessibility columns.
2. Normalize the errors within the range of the input sampling rate.
3. Ensure that the outcome is within the sampling bounds.

Summing the errors across the categories makes me uneasy, as they are on different scales. Can we instead normalize each one and take the max across the accessibility categories as the measure?
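To make the suggestion concrete, here is a sketch under my own assumptions: "error" is the absolute change between iterations, each column is normalized by its own maximum change, and the resulting 0 to 1 score is mapped linearly onto the sampling bounds. The function name, column names, and min_rate/max_rate parameters are hypothetical.

```python
import pandas as pd


def rates_from_max_normalized_change(curr_df, prev_df, min_rate, max_rate):
    """Per-zone sampling rate from the largest normalized accessibility change."""
    diff = (curr_df - prev_df).abs()
    col_max = diff.max(axis=0)
    # Scale each column to [0, 1]; a column with no change anywhere stays at 0
    normalized = diff.div(col_max.where(col_max > 0, 1.0), axis=1)
    # Take the worst category per zone instead of summing across categories
    score = normalized.max(axis=1)
    rates = min_rate + score * (max_rate - min_rate)
    return rates.clip(lower=min_rate, upper=max_rate).rename("sampling_rate")


# Toy accessibility tables on very different scales (made-up values)
curr = pd.DataFrame(
    {"acc_a": [10.0, 12.0, 10.0], "acc_b": [1000.0, 1000.0, 1500.0]}, index=[1, 2, 3]
)
prev = pd.DataFrame(
    {"acc_a": [10.0, 10.0, 11.0], "acc_b": [1000.0, 1000.0, 1000.0]}, index=[1, 2, 3]
)
rates = rates_from_max_normalized_change(curr, prev, min_rate=0.25, max_rate=1.0)
```

In this toy case a raw sum of errors would be 0, 2, and 501 for the three zones, so acc_b would dominate. With per-column normalization, zone 2 (largest change in acc_a) and zone 3 (largest change in acc_b) both reach the upper bound.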

jpn-- added 8 commits October 11, 2022 15:45

sampler and assembler

01943e6

demo notebooks

0fc939e

sampler cleanup

8b2e7ee

clean up assembler

4d37802

add load data function

23202e5

test data loader

5fc4822

Merge branch 'data-loader' into sampler-assembler

4e52ea5

update notebooks to use get_test_file

b58f2ef

DavidOry reviewed Oct 28, 2022

This was linked to issues Nov 9, 2022

PRODUCT 2A: Intelligent Sampler #13

Open

PRODUCT 2B: Intelligent Assembler #14

Open

AshishKuls closed this Jan 20, 2023

AshishKuls deleted the sampler-assembler branch January 20, 2023 18:34
