
PopulationSim Input Validation #187

Open
dhensle opened this issue Oct 25, 2024 · 4 comments
dhensle commented Oct 25, 2024

Issue Background
Several usage-related questions have come up where inconsistent results are generated due to improper input data or configuration. The issue is particularly acute in multiprocessing, where results can differ between multiprocessing and single-threaded runs. Often the cause is a tiny configuration mistake that is especially difficult to trace because no error is raised and bad results are silently produced.

A relatively simple example is a sub-geography that overlaps two super-geographies. This may not cause an error in a single-threaded run, even though the results are inaccurate. The problem could be caught with an assert statement when the data is loaded into PopulationSim.

Another example: after parallel processing completes, tables produced by slicing are not automatically coalesced; the user must list the tables to coalesce in the settings file, which is error prone. It may be possible to perform this step automatically in the software without requiring user-specified settings. Failing that, the software could at least check that the coalesced tables are consistent.
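To make the overlap check concrete, here is a minimal sketch of such a load-time assert using pandas. The cross-walk table and the `TAZ`/`PUMA` column names are hypothetical examples; in a real PopulationSim setup the geography names come from the user's settings.

```python
import pandas as pd

def assert_unique_parent(crosswalk, child, parent):
    """Raise if any child geography is assigned to more than one parent
    geography -- the mistake that silently corrupts results when the data
    is sliced for multiprocessing."""
    parents_per_child = crosswalk.groupby(child)[parent].nunique()
    bad = parents_per_child[parents_per_child > 1]
    assert bad.empty, (
        f"{child} zones mapped to multiple {parent} zones: {sorted(bad.index)}"
    )

# Hypothetical cross-walk: TAZ 3 overlaps PUMAs 100 and 200.
xwalk = pd.DataFrame({"TAZ": [1, 2, 3, 3], "PUMA": [100, 100, 100, 200]})
try:
    assert_unique_parent(xwalk, "TAZ", "PUMA")
except AssertionError as err:
    print(err)  # TAZ zones mapped to multiple PUMA zones: [3]
```

A check like this costs one groupby at load time and turns a silent multiprocessing discrepancy into an immediate, named error.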

Solution
There are a number of other data checks that users should be aware of but that could not be performed automatically given the flexibility of the software. These include checks that household and person controls are consistent across geographies, that the data contain no missing values, and that the controls are consistent with the incidence table specified for the seed data. This is very much in line with the input checker developed for ActivitySim. Additionally, an FAQ should be added to the PopulationSim wiki with a list of common data and control file checks for new users.
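As an illustration of what two of those checks might look like (this is a sketch, not PopulationSim code; the control tables and column names `num_hh`/`num_persons` are invented for the example):

```python
import pandas as pd

def validate_controls(region_controls, sub_controls, control_cols, tol=0):
    """Collect (rather than raise on) control-file problems: missing
    values, and sub-geography totals that disagree with region totals."""
    problems = []
    for name, df in (("region", region_controls), ("sub-geography", sub_controls)):
        na_cols = [c for c in control_cols if df[c].isna().any()]
        if na_cols:
            problems.append(f"missing values in {name} controls: {na_cols}")
    for col in control_cols:
        diff = abs(region_controls[col].sum() - sub_controls[col].sum())
        if diff > tol:
            problems.append(f"{col}: totals differ across geographies by {diff}")
    return problems

region = pd.DataFrame({"num_hh": [1000], "num_persons": [2500]})
sub = pd.DataFrame({"num_hh": [400, 590], "num_persons": [1000, 1500]})
print(validate_controls(region, sub, ["num_hh", "num_persons"]))
# ['num_hh: totals differ across geographies by 10']
```

Returning a list of problems instead of raising on the first one lets the software report all input issues in a single run, which matters for users iterating on large control files.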

@dhensle dhensle changed the title Input Validation PopulationSim Input Validation Oct 25, 2024
@DavidOry

I would suggest that using a data model to first describe the inputs and then using that data model to validate the inputs is a better way forward than simply checking the inputs. Checking inputs strikes me as fairly low value in the absence of defining the inputs. And defining the inputs in software is far better than defining inputs in a document, like a Wiki.

@bettinardi

This sounds a little bit like semantics - data model vs. input validation. I think the key here is that PopSim assumes a high degree of input consistency, and it would be helpful to have some process to enforce that up front, before users run into cryptic code error messages.

DavidOry commented Oct 30, 2024

> This sounds a little bit like semantics - data model vs. input validation. I think the key here is that PopSim assumes a high degree of input consistency, and it would be helpful to have some process to enforce that up front, before users run into cryptic code error messages.

In my experience, PopulationSim is challenging for new users for at least the following reasons:

  1. Variables used in the seed file are not well defined, and those that come from the PUMS often have cryptic names, e.g., `HHT`. This is particularly challenging for international users not familiar with the PUMS.
  2. Most agencies use a script outside of PopulationSim to convert the PUMS seed files into the PopulationSim input seed files, which makes it hard for other agencies to follow the examples (e.g., how exactly is `employed` calculated?).
  3. The business rules for the geographies are not intuitive.
  4. The software generates a lot of unhelpful output, and not much helpful output when errors are encountered, as @bettinardi notes.
  5. The output from PopulationSim cannot be fed directly into ActivitySim (e.g., where is `person_type` calculated?).

A data model can help by doing the following (I've highlighted in bold italic text the functions of the input checker):

  • Acknowledge, in code, the existence of seed source files (PUMS data for US applications) with a known schema (i.e., variable names and definitions).
  • Define the schema for the input seed files and acknowledge, in code, that these are derivatives of the seed source files, with known and documented variable transformations.
  • Define the schema and the business rules of the geographies input file, with the business rules being the number of geographies and the one-to-many parent-to-child relationships.
  • Define the schema and the business rules of the controls input file, with the business rules being the relationship of the controls to the input seed files and the geographies.
  • Define the schema of the output person and household files.
  • Acknowledge, in code, the existence of separate ActivitySim input person and household files, and acknowledge, again in code, that these are derived from the PopulationSim output files with known and documented variable transformations.
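To illustrate the geographies piece of this idea (not an existing PopulationSim API; the class, method, and column names here are all hypothetical), a data model can both document the hierarchy and carry its one-to-many business rule as code:

```python
from dataclasses import dataclass

import pandas as pd

@dataclass
class GeographySchema:
    """Describes, in code, the geography hierarchy: an ordered list of
    levels from parent to child, e.g. ["REGION", "PUMA", "TAZ"]."""
    levels: list

    def validate(self, crosswalk):
        """Return a list of business-rule violations in the cross-walk:
        missing geography columns, or a child zone with multiple parents."""
        missing = [c for c in self.levels if c not in crosswalk.columns]
        if missing:
            return [f"cross-walk is missing geography columns: {missing}"]
        errors = []
        for parent, child in zip(self.levels, self.levels[1:]):
            n_parents = crosswalk.groupby(child)[parent].nunique()
            bad = sorted(n_parents[n_parents > 1].index)
            if bad:
                errors.append(f"{child} zones with multiple {parent} parents: {bad}")
        return errors

schema = GeographySchema(levels=["PUMA", "TAZ"])
xwalk = pd.DataFrame({"PUMA": [100, 100, 200], "TAZ": [1, 2, 2]})
print(schema.validate(xwalk))
# ['TAZ zones with multiple PUMA parents: [2]']
```

The point of the sketch is that the schema object serves double duty: it is the in-code description of the system (the ordered levels) and the enforcement point for the business rules, rather than a standalone checker divorced from any definition of the inputs.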

A data model approach is therefore different, both philosophically and practically, from an input-checking approach. The former attempts to describe the system; the latter describes only the business rules of the system. In my view, it's important to put the business rules in the context of the system description. As a user, I would certainly appreciate being told at the outset that I've violated the software's business rules, but I would also be irritated that there's no description of how the system works.

Because PopulationSim's design seeks to accommodate maximum flexibility (i.e., the seed and controls can contain any variables), the data model must be specific to the examples. New users could then read these examples to understand how the system works and make their own versions (though in most ABM use cases the version distributed with the code would suffice). The business rules could live in utility methods used by the data models or, given PopulationSim's modest runtime, be enforced via better error handling (perhaps as part of a broader effort to improve the software's communication back to the user).

@bettinardi

@DavidOry makes an undeniably well-thought-out argument/proposal (as is his norm). When I said this was semantics, I meant around the goal of validating inputs. I think David's thoughts expand the scope a little bit (beyond just validating inputs), but likely in the correct way. If the scope can allow for the expansion, I fully support what David is laying out: a better way (more efficient over the life of the project) to achieve this is through a data model example. I support closing this issue and reopening it as a PopSim data model issue, with the foundation of the approach and its benefits captured in David's notes.

@jpn-- jpn-- moved this to Punt in Phase 10A Nov 5, 2024