
The workflow and package paradigm #2

Open
jdhoffa opened this issue Jul 26, 2023 · 24 comments


jdhoffa commented Jul 26, 2023

Packages are used to keep track of modular code that we intend to re-use multiple times, and want to have unit-tested and portable.

A template PACTA R package repository can be found here: https://github.com/RMI-PACTA/pacta.r.package

Workflows are much less structured, and are used to keep track of any kind of data analysis or data science type pipelines.

A template PACTA workflow repository can be found here (with some useful guidance on how to conduct reproducible data science): https://github.com/RMI-PACTA/workflow.template.pacta

I made both of these template repos with little to NO input from anyone, and they include all of my bull-headed assumptions.

It would be cool to align on and document our ideas of useful practices for each type of repo.

@AlexAxthelm

Is the general idea that most of our repos would fall into one of these categories?


jdhoffa commented Jul 27, 2023

It is! Or at least, most repos that are directly or tangentially related to "data analysis" and "tools built on data analysis".

As usual, we should not feel restricted to these categories, or try to force things into them when they don't fit; rather, we should use them as a common basis for the styles of tasks we find ourselves repeating.

@jdhoffa jdhoffa transferred this issue from RMI-PACTA/practices.DEPRECATED Jan 24, 2024

jdhoffa commented Jan 25, 2024

Consider adding a build.* type repository. FYI. @cjyetman

@jdhoffa jdhoffa added the naming label Jan 25, 2024
@AlexAxthelm

What would be in the build.* repos?


cjyetman commented Jan 29, 2024

my thoughts on this were:

  • pacta.* - primary output is an R package
  • workflow.* - primary output is files, plots, data, etc. (irrelevant if an ephemeral Docker container is used to generate that output)
  • build.* - primary output is a Docker image (obviously, that Docker image will be used somewhere else at some time by someone, probably to output some other files, but that's not the primary goal of the repo)

edit: build.* could probably also contain anything that builds a self-contained app or software
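To make the proposed taxonomy concrete, a hypothetical sketch (repo names invented for illustration, not actual RMI-PACTA repos):

```
pacta.portfolio.analysis/    # R package: exported functions, unit tests, docs
workflow.data.preparation/   # scripts + config that read files, call package
                             # functions, and write output files
build.pacta.runner/          # Dockerfile (+ CI) whose primary artifact is a
                             # reusable Docker image
```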

@AlexAxthelm

AlexAxthelm commented Jan 29, 2024

At first thought, this seems like an anti-pattern, since any build steps should live close to the code they're building (in-repo, implemented by a CI/CD pipeline). The workflow.* repos seem to be the right place to collect multiple dependent repos into a "thing that does something" which would be built in-repo (https://github.com/RMI-PACTA/workflow.factset being an example of that), but I may be missing a use case.


jdhoffa commented Jan 29, 2024

Tech Review and/ or Dev Hangout topic.

I had a similar "I'm not sure about this" feeling initially, but CJ and I discussed it and I tend to keep wobbling back and forth on the topic to be honest. Would be good to have a focused discussion on it I think.

@cjyetman

I'd like to align on this soon because the two "workflows" I maintain, workflow.transition.monitor and workflow.data.preparation, are very different animals with very different goals.


jdhoffa commented Jan 31, 2024

I would suggest that we follow a similar process to how we collectively brainstormed this document: #12

That is:

  • Collective brainstorming in a Dev Hangout. Discussion of all the nitty gritty details (ideally I will record a transcript of the meeting)
  • I will summarize the transcript of that conversation into a .md document and submit it as a PR into this repo, requesting all devs to review
  • Iterate until we are happy and merge

@jdhoffa jdhoffa self-assigned this Jan 31, 2024

jdhoffa commented Jan 31, 2024

If we are happy with this approach, and we don't have a topic for this week's Dev Hangout, I would be happy to lead it.

I was waffling over whether this should be a Dev Hangout thing or a Tech Review thing, but it feels more "Dev Hangouts" oriented, since it's about process rather than a technical implementation decision and doesn't really touch actual code yet.


jdhoffa commented Feb 1, 2024

Oh jeez how am I ever going to summarize that DevHangouts discussion XD


jdhoffa commented Feb 1, 2024

Well here goes:

First Draft

Prefixes for Repositories

There seemed to be some agreement that using the pacta.* prefix is useful for R packages that generally export functions and/or data frames, with final outputs that are generally data frames and/or plots. pacta.* is likely not the BEST prefix; perhaps something like rpkg or rmipkg would be more useful. Actual prefix TBD.

However, there was much less agreement on what the workflow.* prefix defines. Broadly, it seems these repos should deal more with file reading, I/O, and processes, calling package functions with various parameters. Whether they should or shouldn't include a Dockerfile, a build/ directory, or other build steps (or whether this should be abstracted away) was hotly debated and not agreed on.

Clarity of Purpose

@cjyetman finds that the prefix system adds clarity to the purpose and conditions for creating repositories. It helps him more clearly understand and decide when it is a good time to split off a new repo.

Implementation Across the Team

The system of prefixes has started spreading across the team, with Jacob and others creating workflow repos. Especially as the team builds more products, it can be useful to have a prefix to quickly assess what the goals of the repository may be.

build.* Repos

@cjyetman expressed a desire for a new prefix, build, to differentiate between repos that output a collection of files and those that primarily create Docker images.

Prefix Definitions and Implications

There's debate on what the prefix implies, like file restrictions, language use, and what can or cannot be included in a repo. No total agreement here yet.

Maintainership and Scope

@cjyetman (and I imagine also @MonikaFu and @jacobvjk) are more comfortable working on R code and maintaining R packages without needing to focus on Docker or other environments. He prefers to develop in RStudio, which has implications for the repositories he maintains. The primary implication is that they MUST preserve an RStudio-centric API (this does not necessarily preclude building an additional API on top, e.g. through a Dockerfile, but crucially the RStudio API must be supported).

Constraints of Docker/VM Implementation

There are challenges in maintaining workflows compatible with both local environments and Docker/VM environments. The main challenge is that the people working on the repo must commit to doing it! This relates to #12, but I think it is clear that the key developer responsible for deciding which API is supported is the maintainer.

Collaboration and Decision-Making

The team discussed the need for collaboration when making changes to ensure they don't disrupt other users' workflows or the integrity of the process, especially in the context of supported primary ways of engaging with a repo (be it through RStudio, Docker/ Azure, or either/ both). @jdhoffa comments here: I think the main API of the repo should be decided, wherever possible, by the user or use-case and not the developer. Our goal is to build useful products, not necessarily to build products that we like developing (but whenever possible, we should try to achieve both).

Methodological Decisions in Workflows

Workflow repos may contain significant methodological decisions that impact how data is processed or handled. However, this isn't ideal. R packages have useful documentation systems built in, which makes them an ideal place for transparently documenting methodology. Wherever possible, key decisions should be extracted FROM a workflow.* repository TO a pacta.* repository (note: final prefix names still TBD).

Operational Reality

The discussion recognizes the operational reality that some workflows cannot be feasibly run locally due to resource constraints and must be run on virtual machines or Azure, due to the local resources available to most developers (AKA @cjyetman has a lot of RAM).

Desired Structure for Workflow Repos

Ideally, workflow repos should handle inputs and outputs and orchestrate steps, calling on R package functions rather than embedding methodological decision-making.
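That separation of concerns can be sketched as follows (package name, function name, and config fields here are hypothetical, not an actual PACTA API):

```r
# Hypothetical workflow script: it only orchestrates I/O and parameter
# passing; the methodological logic lives in the package function.
config <- yaml::read_yaml("config.yml")

input_data <- readr::read_csv(config$input_file)

# All methodology is encapsulated in the (hypothetical) package function;
# the workflow merely supplies parameters taken from the config.
result <- pacta.some.package::run_analysis(
  input_data,
  scenario = config$scenario
)

readr::write_csv(result, config$output_file)
```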

Role of Docker Images

A debated topic

  • @cjyetman sees having a separate build.* repository and process as beneficial, since for some repositories a Docker image is the primary end goal/output (e.g. workflow.transition.monitor)
  • @AlexAxthelm and @jdhoffa see Docker images as just a distribution mechanism of a workflow repo, and not separate at all. In that sense, they are more an indication of the lifecycle of the repository (e.g. that it is closer to stable than experimental) rather than a separate repo entirely.

@RMI-PACTA/developers no urgency to this at all, but please review this comment when you get a chance, see if I've misrepresented anything, and comment as you see fit.


cjyetman commented Feb 1, 2024

> The discussion recognizes the operational reality that some workflows cannot be feasibly run locally due to resource constraints and must be run on virtual machines or Azure (for some developers)

I think this unnecessarily implies that "running locally" somehow imposes resource constraints that don't otherwise exist, which seems to imply that committing to the ability to run locally is a unique burden, which I don't think is true. I remember a time when we didn't have VMs with adequate memory to run data.prep, which iirc eventually led to the dataprepbigmem VM for instance.


jdhoffa commented Feb 1, 2024

> > The discussion recognizes the operational reality that some workflows cannot be feasibly run locally due to resource constraints and must be run on virtual machines or Azure (for some developers)
>
> I think this unnecessarily implies that "running locally" somehow imposes resource constraints that don't otherwise exist, which seems to imply that committing to the ability to run locally is a unique burden, which I don't think is true. I remember a time when we didn't have VMs with adequate memory to run data.prep, which iirc eventually led to the dataprepbigmem VM for instance.

Comment updated for clarity


cjyetman commented Feb 1, 2024

One important note about prefixes for R package repos... while it's (probably) technically possible to have a repo name that differs from the R package's release name, I think that's a complication we should try to avoid. That means choosing the repo name has consequences for the final R package's name, so we may need to carve out a prefix for R packages just so that we can maintain consistency in the R package names that get released to CRAN, R-universe, or wherever.


cjyetman commented Feb 1, 2024

> Ideally, workflow repos should handle inputs and outputs and orchestrate steps, calling on R package functions rather than embedding methodological decision-making.

I think this is mildly dicey... sometimes the inputs or parameters chosen are themselves methodological choices, e.g. a function that sets a parameterized threshold for when to include or exclude something. Not sure how to get around that.


jdhoffa commented Feb 1, 2024

@cjyetman has brought up an important social point: at times, he has felt forced to change something fundamental about one of his workflow.* repos because someone (likely me) had strong opinions about it, especially regarding the build process. Often those changes came from strong opinions but weren't actually "required" or "urgent". This blurs the line between suggestion and requirement, because it can make a maintainer feel compelled to develop differently or abandon an API they are more comfortable with.

Personally, I @jdhoffa recognize and take responsibility for my role in that. I attribute it to two main things:

  • On one hand, I think that's a negative social implication of working entirely remotely, with little actual interaction. We end up working in bubbles, convince ourselves that we have the best and only opinion on something, and start to push it as how everything needs to be done.
  • I also think it may partially be a learned behavior from years of "waiting" for things to change in the PACTA code without it happening, so I went too far in the other direction and pushed too hard for things to change.

I think that defining repo maintainers and tech reviews can, to some extent, help the "symptoms" of the above, but solving the actual foundation of the problem requires not just processes, but improving the social atmosphere. In any case, I think that it's learned behavior that can be unlearned if we all put in an effort, and something we can try to codify in this repo.

Relates to:
#4 and #5

@AlexAxthelm

> pacta.* prefix for R packages that generally export functions or data-frames and have final outputs that are generally data-frames and/ or plots was useful

I think a simple way to phrase this is "R objects in, R objects out." There is space for packages that are meant to deal with a specific class/type of file (pacta.portfolio.import comes to mind), but generally I'm thinking that this:

> Broadly, it seems that [workflow.*] should deal more with file reading, I/O

is not what they should be worrying about.


jdhoffa commented Feb 1, 2024

Thanks @AlexAxthelm, indeed that is a very concise description! I like it.


cjyetman commented Feb 7, 2024

An in-depth, relevant discussion was had here: RMI-PACTA/workflow.transition.monitor#135

@AlexAxthelm

From @cjyetman in that thread: RMI-PACTA/workflow.transition.monitor#135 (comment)

> An additional characteristic that is becoming more clear to me now is...
>
> A "workflow" repo may create an ephemeral Docker instance in order to complete a task.
>
> A "build" repo should create a Docker image that's intended to be used in the future, and possibly in different contexts and/or with different parameters.
>
> An ephemeral Docker instance has less security concerns because it's only ever intended to run once on the machine it was built. A "workflow" repo is for a user that maybe wants to experiment with parameters, but it essentially defines a specific "workflow" for doing a specific task, and therefore it could hypothetically be used in a "build" repo which either defines the parameters (while the "workflow" defines the process) or defines an interface to specify the parameters.

I see this definition of an ephemeral Docker instance as missing the point of docker-based development. But I love seeing the distinction that a workflow can be run locally for development/experimentation.

The thing I see missing in the Docker part of the comment is a distinction between images and containers. Images (the template, defined in the Dockerfile) are not supposed to be ephemeral, or designed to run on only one machine. Containers (images that are actually being executed) are ephemeral (and are expected to be pruned frequently). The biggest value of Docker-based development is a level of certainty that executing the same image on different machines (regardless of the host's environment), with access to the same explicit properties (volume mounts, .env files, etc) results in the same results, so if we're designing images to be run on a single machine, then there's no advantage to encapsulating it, and we could just run on bare metal.


Overall the 2 traits of Docker I see our team as getting value from are:

  1. Distribution: being able to define an execution environment for PACTA, and then run that on "not our machines" (CTM, Azure runners, etc.). Containers can run and produce the same results.
  2. "Freezing our code in amber 🦟 ": Each Docker image we produce encapsulates an entire environment (ignoring access to data) that allows us to reconstruct (and diagnose) a given analysis in the future
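A rough sketch of those two traits in practice (image tag, registry URL, and mount paths are all hypothetical):

```shell
# Distribution: build the image once from the repo's Dockerfile; the
# resulting image pins the full execution environment.
docker build -t pacta-workflow:2024-02 .

# Run ephemeral containers from that same image on any host; inputs and
# outputs are supplied at run time via volume mounts and an env file.
docker run --rm \
  -v "$(pwd)/inputs:/inputs:ro" \
  -v "$(pwd)/outputs:/outputs" \
  --env-file .env \
  pacta-workflow:2024-02

# "Freezing in amber": pushing the tagged image preserves the exact
# environment, so a past analysis can be reconstructed later.
docker push example.registry.io/pacta-workflow:2024-02
```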


cjyetman commented Feb 8, 2024

I'm using "instance" and "container" interchangeably, possibly incorrectly.

So when I say ephemeral Docker instance, both "ephemeral" and "Docker" are being used as adjectives to describe "instance", or possibly in more appropriate language "container".


jdhoffa commented Feb 8, 2024

got it.


jdhoffa commented Feb 8, 2024

I have some further thinking on this, as per a call that I just had with @AlexAxthelm. None of it is ground-breaking, just continually refining our thinking here.
(All of the below operates under the assumption that pacta.* and workflow.* may not be the optimal naming convention, but I will use them since that is what we currently have.)

In an ideal world

pacta.* repositories

would ALWAYS be R packages. These R packages would only include functions that usually handle R objects such as:

  • basic data types (numeric, integer, boolean, etc.)
  • vectors
  • lists
  • data-frames
  • ggplot objects, and the like

These packages would VERY RARELY (read: never, unless there is an excellent reason to do so) include functions called for their side effects (including file I/O).

  • The functions therein would include arguments with sensible defaults and good documentation of the different options there.
  • Crucially: A Dockerfile would NOT be needed or important here.
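A minimal sketch of that style (the function and argument names are hypothetical, not an existing pacta.* API):

```r
#' Filter rows by a production threshold
#'
#' "R objects in, R objects out": a pure function with no file I/O or
#' other side effects, a sensible documented default, and a data frame
#' as both input and output.
#'
#' @param data A data frame with a numeric `production` column.
#' @param threshold Minimum production to keep. Defaults to 0.
#' @return The filtered data frame.
filter_by_production <- function(data, threshold = 0) {
  stopifnot(is.data.frame(data), is.numeric(threshold))
  data[data$production >= threshold, , drop = FALSE]
}
```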

workflow.* repositories

These would handle all the rest, including:

  • config files (which serve primarily to feed the parameters of the above-defined R functions), their documentation, and values
  • file I/O
  • Dockerfiles (if necessary) to ensure stability and reproducibility at run-time, especially when run-time involves serving multiple use-cases (and especially when the main outputs of those use-cases are NOT typical R objects, e.g. interactive reports)
  • Potentially other types of code, including not-R code
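A config file of the kind described above might look like this (field names are hypothetical; the point is that each value maps onto an argument of a pacta.* function or a run-time mount):

```yaml
# Hypothetical workflow.* config: run-time inputs and parameters live
# here, not in the package code.
input_dir: /mnt/inputs      # supplied via volume mount at run time
output_dir: /mnt/outputs
parameters:
  scenario: "1.5C"          # passed through to a pacta.* function argument
  start_year: 2023
```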

On dockerfiles in more detail

I agree with @AlexAxthelm that there should be no distinction between a build and a workflow repository. We should rather strive for the Docker image to be built by whatever workflow.* repo needs it. The goal should always be that the image is re-usable (e.g. it should probably only change when one of the pacta.* packages is updated) and that the inputs (function arguments, file I/O, etc.) are defined by a config file and volume mount at run-time.

Relating to our maintainer discussion, the handling of HOW the config, file I/O, etc. is served should be the responsibility of the maintainer of that repo.
