The workflow and package paradigm #2
Comments
is the general idea that most of our repos would fall into one of these categories? |
It is! Or at least, most repos that are directly or tangentially related to "data analysis" and "tools built on data analysis". As usual, we should not feel restricted to these categories, or try to force things into them that don't fit, but rather use them as a common basis for the styles of tasks that we find ourselves repeating.
Consider adding a |
What would be in the |
my thoughts on this were:
edit: |
At first thought, this seems like an anti-pattern, since any build steps should live close to the code they're building (in-repo, implemented by a CI/CD pipeline). The |
Tech Review and/ or Dev Hangout topic. I had a similar "I'm not sure about this" feeling initially, but CJ and I discussed it and I tend to keep wobbling back and forth on the topic to be honest. Would be good to have a focused discussion on it I think. |
I'd like to align on this soon because the two "workflows" I maintain, workflow.transition.monitor and workflow.data.preparation, are very different animals with very different goals. |
I would suggest that we follow a similar process to how we collectively brainstormed this document: #12 That is:
If we are happy with this approach, and we don't have a topic for this week's Dev Hangouts, I would be happy to lead it. I was waffling over whether this should be a Dev Hangout thing or a Tech Review thing, but it feels more "dev hangouts" oriented since it's more about process than a technical implementation decision, and it doesn't really touch actual code yet.
Oh jeez how am I ever going to summarize that DevHangouts discussion XD |
Well here goes:
First Draft
Prefixes for Repositories
There seemed to be some agreement that using the
Clarity of Purpose
@cjyetman finds that the prefix system adds clarity to the purpose and conditions for creating repositories. It helps him more clearly understand and decide when it is a good time to split off a new repo.
Implementation Across the Team
The system of prefixes has started spreading across the team, with Jacob and others creating workflow repos. Especially as the team builds more products, it can be useful to have a prefix to quickly assess what the goals of the repository may be.
I think this unnecessarily implies that "running locally" somehow imposes resource constraints that don't otherwise exist, which seems to imply that committing to the ability to run locally is a unique burden, which I don't think is true. I remember a time when we didn't have VMs with adequate memory to run data.prep, which iirc eventually led to the dataprepbigmem VM for instance. |
Comment updated for clarity |
One important note about prefixes for R package repo... while it's (probably) technically possible to have a repo name that is different than the R package's release name, I think that's a complication we should try to avoid, which means choosing the repo name has consequences on what the final R package's name is... so we may need to carve out a prefix for R packages just so that we can maintain consistency in the R package names that get released to CRAN, R-universe, or whatever. |
I think this is mildly dicey... sometimes the input or parameters chosen themselves are methodological choices, e.g. a function that sets a parameterized threshold of when to include or exclude something. Not sure how to get around that. |
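To make the concern concrete, here is a minimal sketch, written in Python purely for illustration (the real packages are R). The function `filter_by_share`, its `min_share` argument, and the sample numbers are all hypothetical and not taken from any PACTA repo. It only shows that the same function returns different results depending on the threshold the caller passes, which is why the parameter value is itself a methodological choice.

```python
def filter_by_share(holdings, min_share):
    """Keep holdings whose share of the total is at least `min_share`.

    `holdings` maps a name to a numeric value. `min_share` is a fraction
    between 0 and 1; choosing it decides what is included or excluded.
    """
    total = sum(holdings.values())
    if total == 0:
        # Nothing to compare against, so nothing can meet the threshold.
        return {}
    return {
        name: value
        for name, value in holdings.items()
        if value / total >= min_share
    }


# Hypothetical input data, for illustration only.
holdings = {"power": 60.0, "steel": 30.0, "cement": 10.0}

# Same code, two thresholds, two different sets of included holdings.
print(filter_by_share(holdings, min_share=0.05))
print(filter_by_share(holdings, min_share=0.25))
```

With the lower threshold all three entries are kept; with the higher one `cement` drops out, even though no code in the function changed.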
@cjyetman has brought up an important social point: at times he has felt forced to change something fundamental about one of his
Personally, I (@jdhoffa) recognize and take responsibility for my role in that. I attribute it to two main things:
I think that defining repo maintainers and tech reviews can, to some extent, help the "symptoms" of the above, but solving the actual foundation of the problem requires not just processes, but improving the social atmosphere. In any case, I think that it's learned behavior that can be unlearned if we all put in an effort, and something we can try to codify in this repo. |
I think a simple way to phrase this is "R objects in, R objects out." There is space for packages that are meant to deal with a specific class/type of file (
is not what they should be worrying about. |
Thanks @AlexAxthelm, indeed that is a very concise description! I like it. |
An in-depth, relevant discussion was had here: RMI-PACTA/workflow.transition.monitor#135
From @cjyetman in that thread: RMI-PACTA/workflow.transition.monitor#135 (comment)
I see this definition of an ephemeral Docker instance as missing the point of docker-based development. But I love seeing the distinction that a
The thing I see missing in the Docker part of the comment is a distinction between images and containers. Images (the template, defined in the Dockerfile) are not supposed to be ephemeral, or designed to run on only one machine. Containers (images that are actually being executed) are ephemeral (and are expected to be pruned frequently). The biggest value of Docker-based development is a level of certainty that executing the same image on different machines (regardless of the host's environment), with access to the same explicit properties (volume mounts,
Overall the 2 traits of Docker I see our team as getting value from are:
I'm using "instance" and "container" interchangeably, possibly incorrectly. So when I say ephemeral Docker instance, both "ephemeral" and "Docker" are being used as adjectives to describe "instance", or possibly in more appropriate language "container". |
got it. |
I have some further thinking on this, as per a call that I just had with @AlexAxthelm. None of it is ground-breaking, just continually refining our thinking here:
In an ideal world
pacta.* repositories
would ALWAYS be R packages. These R packages would only include functions, that usually handle R-objects such as:
These packages would VERY RARELY (read: never, unless there is an excellent reason to do so) include functions called for their side effects (including file I/O).
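The split can be sketched in a few lines. This is a hypothetical illustration written in Python purely to show the shape (the real packages are R); the names `summarise_exposure` and `run_workflow`, and the `sector`/`value` columns, are invented for this example. The package-style function takes in-memory objects and returns in-memory objects, while the workflow-style wrapper owns the reading and writing.

```python
import csv
import io


def summarise_exposure(rows):
    """Package-style function: plain objects in, plain objects out.

    Takes a list of dicts and returns a dict of totals per sector.
    It never touches the file system, so it is easy to unit-test.
    """
    totals = {}
    for row in rows:
        sector = row["sector"]
        totals[sector] = totals.get(sector, 0.0) + float(row["value"])
    return totals


def run_workflow(in_stream, out_stream):
    """Workflow-style wrapper: owns all I/O and other side effects."""
    rows = list(csv.DictReader(in_stream))
    totals = summarise_exposure(rows)
    writer = csv.writer(out_stream)
    writer.writerow(["sector", "total"])
    for sector, total in sorted(totals.items()):
        writer.writerow([sector, total])


if __name__ == "__main__":
    # In-memory streams stand in for real input and output files.
    source = io.StringIO("sector,value\npower,1.5\nsteel,2\npower,0.5\n")
    sink = io.StringIO()
    run_workflow(source, sink)
    print(sink.getvalue())
```

Because `summarise_exposure` has no side effects, it can be tested with a literal list of dicts, and the wrapper stays thin enough that swapping the input or output location does not touch the analysis code.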
workflow.* repositories
These would handle all the rest, including:
On dockerfiles in more detail
I agree with @AlexAxthelm that there should be no distinction between a
Relating to our
Packages are used to keep track of modular code that we intend to re-use multiple times, and want to have unit-tested and portable.
A template PACTA R package repository can be found here: https://github.com/RMI-PACTA/pacta.r.package
Workflows are much less structured, and are used to keep track of any kind of data analysis or data science type pipelines.
A template PACTA workflow repository can be found here (with some useful guidance on how to conduct reproducible data science): https://github.com/RMI-PACTA/workflow.template.pacta
I made both of these template repos with little to NO input from anyone, and they include all of my bull-headed assumptions.
It would be cool to align and document our ideas of useful practices for either type of repo