How to encourage drive-by commits? #361

vincentarelbundock · 2020-12-10T14:51:20Z

vincentarelbundock
Dec 10, 2020
Collaborator

Thanks for the super nice suite of packages. Very powerful and flexible, and the model coverage is simply outstanding! I also looove the lack of dependencies. Awesome work!

This morning, I tried to put together a PR to support lmtest::coeftest objects. Honestly, I found it quite hard to do. I spent several hours trying to understand the internals, and I'm still not done.

I'm now opening this Issue to start a brainstorm about ways to make drive-by commits easier. The intent is not to criticize the package design, but simply a recognition that developer burnout is endemic in open source software development, and that there is often a trade-off between optimizing for code re-use and optimizing for ease of contributions.

Here are some of the (mental) steps I went through:

OK, looks like I need to create a bunch of methods: ci.coeftest, p_value.coeftest, standard_error.coeftest, degrees_of_freedom.coef. Seems doable.
What's .data_frame? Why do I need to learn this layer of abstraction just to remove row.names?
In my .data_frame call, should I use .remove_backticks_from_string, .remove_backticks_from_parameter_names, .clean_parameter_names, or something else?
Should I use model_parameters.default? Dig in and ask: what is .model_parameters_generic? That goes one level deeper to .extract_parameters_generic. I don't understand what this function does, so I just use .model_parameters_generic in my model_parameters.coeftest method.
Arrgh! Doesn't work! The error message is uninformative: something to do with $something doesn't exist. Have to use traceback() to finally realize that .extract_parameters_generic calls insight::get_statistic.
Clone the insight package and define a new method get_statistic.coeftest

At this point, I have modified 5 files across two repositories, pored over traceback() call stacks, dug through 3-level abstraction layers of extraction functions, just to get model_parameters.coeftest working.

And I still haven't figured out how to get model_performance.coeftest to work. I'm not 100% sure, but I believe it won't be sufficient to make a PR to the performance repo, since I'd have to create a model_info.coeftest method.

In contrast, I copy a MWE broom-style extractor below. This minimal example also gets me glance, which I still haven't reached this morning.

I understand that there are benefits to a more complex design, because it gives you lots of stuff "for free". But I still think this example illustrates how hard it can be for newbies to contribute to a project where methods are split over so many repos and files, and that requires layers of extractor functions.

Again, this is not meant as a criticism of your design. The packages are awesome and super impressive!

My question is whether there are ways we could make it easier for newbies like me to help out.

library(lmtest)
library(generics)
mod <- coeftest(lm(hp ~ mpg, mtcars))

tidy.coeftest <- function(x, conf.int=FALSE, conf.level=.95, ...) {
  out <- data.frame(
    term = row.names(x),
    estimate = x[, 1],
    std.error = x[, 2],
    statistic = x[, 3],
    p.value = x[, 4])
  if (isTRUE(conf.int)) {
    # use the requested level and read the bounds from the confint() result
    tmp <- confint(x, level = conf.level)
    out$conf.low <- tmp[, 1]
    out$conf.high <- tmp[, 2]
  }
  row.names(out) <- NULL
  out
}

glance.coeftest <- function(x, ...) {
  out <- data.frame(
    nobs = nobs(x),
    logLik = as.numeric(logLik(x))
  )
  out
}

DominiqueMakowski · 2020-12-10T15:08:42Z

DominiqueMakowski
Dec 10, 2020
Maintainer

(Thanks a ton for opening this ~~issue~~ discussion 😁 we have been working on these packages for some time now and their complexity has definitely grown beyond our initial expectations, which in turn resulted in a somewhat more opaque internal organization for potential new contributors. So thanks for this opportunity for us to clarify, explain and maybe rethink some of our internal cuisine)

*passes the mic to @strengejacke and fades away*

1 reply

DominiqueMakowski Dec 11, 2020
Maintainer

@strengejacke maybe we could think of a vignette or a wikipage (in easystats/easystats?) with an example of the support for a new arbitrary class (e.g., "easyregression")?

model_regression <- function(data, formula, ...){
  m <- lm(formula, data=data, ...)
  class(m) <- c("easyregression", class(m))
  m
}

model <- model_regression(iris, Sepal.Length ~ Sepal.Width)
class(model)

In it we could explain that there are essentially two core functions that need to work, parameters::parameters() and performance::performance() (and eventually effectsize::effectsize()), and that for this the class needs to be supported in insight first. Then we detail all the methods that need to be added?

strengejacke · 2020-12-11T07:55:44Z

strengejacke
Dec 11, 2020
Maintainer

I think one disadvantage of our API design is that 1) we did not have in mind how the packages would evolve and how many functions / packages we now actually have; and 2) as a result, we did not initially plan a "classic" OOP style that makes it easy for other developers to write their own methods, or in particular to contribute to our packages.

When we planned our packages, we had several ideas in mind and several things to consider:

The typical find/get pairs are located in insight, our most basic package. Thus, where it makes sense to have a find/get-method, we decided to put it in insight, like find_predictors() and get_predictors().
Some functions in insight already relate to model parameters, like find_parameters() and get_parameters(). We decided to keep them there because these are "essential" functions that other easystats packages use a lot. Our plan was that "basic" functions should also go into insight.
As for parameters, we thought about what a user might need: sometimes just the CI or SE, sometimes the "full" parameter table. Since both model_parameters() and ci() calculate CIs, we thought about how best to implement this. Either model_parameters() extracts everything and ci() just calls model_parameters() and picks the CI columns, or the other way round. We decided on the latter: single components are retrieved by ci(), standard_error() etc., and model_parameters() just calls all these functions.
Mainly, our intention was to have insight as a package more for developers, while all other packages are dedicated to users. This probably led to the situation you described above, i.e. it is quite complicated to understand where to add what when implementing support for new model classes...
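The layering described in the third point can be sketched in base R. The names se_sketch, ci_sketch and params_sketch are hypothetical and are not the functions in parameters; the sketch only assumes a model with coef(), vcov() and df.residual() methods. The table function does no computation of its own and only assembles what the single-component extractors return.

```r
# Hypothetical single-component extractors (not the parameters API).
se_sketch <- function(model) {
  sqrt(diag(vcov(model)))
}

ci_sketch <- function(model, level = 0.95) {
  est <- coef(model)
  se <- se_sketch(model)
  # Wald-type interval based on the t distribution
  crit <- qt(1 - (1 - level) / 2, df = df.residual(model))
  data.frame(
    CI_low = unname(est - crit * se),
    CI_high = unname(est + crit * se)
  )
}

# The "full table" function just calls the extractors and binds the results.
params_sketch <- function(model, level = 0.95) {
  out <- data.frame(
    Parameter = names(coef(model)),
    Coefficient = unname(coef(model)),
    SE = unname(se_sketch(model))
  )
  out <- cbind(out, ci_sketch(model, level = level))
  row.names(out) <- NULL
  out
}

model <- lm(hp ~ mpg, data = mtcars)
tab <- params_sketch(model)
tab
```

With this split, a user who only needs standard errors calls se_sketch() directly, and anything that improves one extractor shows up in the table without touching params_sketch().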

However, there are also some advantages when going this way. For instance, if you want standardized estimates, or robust vcov-estimation for robust standard errors etc., you don't have to add much stuff to make it work. There is one central function for computing robust vcov's, and ci() mostly prepares the call to ci_wald() (so CIs are also based on robust vcov) etc.

Hence, once these functions are available:

In insight

get_varcov() (which works by default, when the model has a vcov() method)
get_statistic() (which works by default, when the model has a "classic" summary() output)
get_parameters() (which works by default, when the model has a "classic" summary() or coef() method)
get_data() (which works by default, when the model has a model.frame() method, or a $call object to grep the environment)
n_obs() (which works by default, when the model has a "classic" nobs() method)

in parameters

standard_error() (which works by default, when the model has a "classic" summary() output)
ci() (which also works by default when there is a standard_error() method)
p_value() (which works by default, when the model has a "classic" summary() output)
degrees_of_freedom() (which works by default, when the model has a df.residual() method, else Inf (for z/unknown) or n-k is returned)

... all these "extras" like standardizing or robust SE work.

In total, yes, these are 9 functions to add. In theory, there could be none to add, if all models had consistent methods / design. So the effort of adding new models lies somewhere in between. However, once all these methods exist, functions like model_parameters(), simulate_parameters() or simulate_model(), effectsize::standardize() resp. effectsize::standardized_parameters(), robust SE or standardizing via model_parameters() etc. automatically work. You don't have to modify additional stuff to have these features for new types of models.
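To illustrate how defaults built on standard accessors give a new class some features "for free", here is a rough base-R sketch. Everything ending in _toy, as well as the "toyfit" class, is made up for this example and does not reflect how insight or parameters are implemented. The assumption is only that a central Wald interval accepts any covariance matrix, so a different (for example robust) matrix can be swapped in without class-specific code.

```r
# Hypothetical generics; the defaults lean on base accessors.
get_varcov_toy <- function(model, ...) UseMethod("get_varcov_toy")
get_varcov_toy.default <- function(model, ...) vcov(model)

dof_toy <- function(model, ...) UseMethod("dof_toy")
dof_toy.default <- function(model, ...) {
  # fall back to Inf (z-based) when no residual df is available
  df <- tryCatch(df.residual(model), error = function(e) NULL)
  if (is.null(df) || anyNA(df)) Inf else df
}

# One central Wald interval; any covariance matrix can be passed in.
ci_wald_toy <- function(model, level = 0.95,
                        vcov_matrix = get_varcov_toy(model)) {
  est <- coef(model)
  se <- sqrt(diag(vcov_matrix))
  crit <- qt(1 - (1 - level) / 2, df = dof_toy(model))
  data.frame(
    Parameter = names(est),
    CI_low = unname(est - crit * se),
    CI_high = unname(est + crit * se)
  )
}

# A made-up class that only offers coefficients and a vcov() method.
fit <- structure(
  list(coefficients = c(a = 1, b = 2), V = diag(c(0.25, 1))),
  class = "toyfit"
)
vcov.toyfit <- function(object, ...) object$V

ci_default <- ci_wald_toy(fit)                       # no toyfit-specific CI method
ci_wider <- ci_wald_toy(fit, vcov_matrix = 4 * fit$V) # same code, other vcov
```

The only class-specific code here is vcov.toyfit(); the interval logic and the degrees-of-freedom fallback are shared by every class that reaches the defaults.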

Hence, there are pros and cons regarding our approach, and it's difficult to find the best way for the API design, especially since we made some decisions in the past, where we could not foresee where our ecosystem is going...

1 reply

mattansb Dec 11, 2020
Maintainer

This is a really great answer Daniel!
This is also true for effectsize, where, for the most part, if something is supported by parameters, it can be used with the default methods (e.g., standardize(), standardized_parameters(), eta_squared(), ...).

strengejacke · 2020-12-11T08:00:24Z

strengejacke
Dec 11, 2020
Maintainer

model_performance.coeftest

model_performance() now has a default method, so it should always try to return as many indices as possible. However, coeftest objects don't store much information, so there is not much to return in model_performance(), I think...

0 replies

vincentarelbundock · 2020-12-11T13:50:48Z

vincentarelbundock
Dec 11, 2020
Collaborator Author

Thanks for the useful and detailed answer. I understand that your design allows you to get a lot of free stuff. That's a testament to how well you have designed the functions built on top of parameters et al.

I think the main barrier to entry for me stemmed from a combination of (a) the decision to use separate methods instead of parameters(model)$standard_error, and (b) the decision to spread those methods across many different files and packages.

Given that most standard_error and p_value methods seem to call summary(model) under-the-hood (thereby building the whole table every time), there might be a performance penalty. But that's a minor concern.
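The concern about rebuilding the table can be made concrete with a toy counter. This is only an illustration in base R with a made-up class ("toy_counted") and made-up extractor names; it says nothing about how large the cost is for real models in parameters.

```r
counter <- new.env()
counter$n <- 0

toy <- structure(
  list(est = c(a = 1, b = 2), se = c(a = 0.5, b = 1)),
  class = "toy_counted"
)

# summary() method that records every time the table is rebuilt
summary.toy_counted <- function(object, ...) {
  counter$n <- counter$n + 1
  cbind(
    Estimate = object$est,
    SE = object$se,
    p = 2 * pnorm(-abs(object$est / object$se))
  )
}

# Style 1: separate extractors, each one calls summary() again
se_via_summary <- function(model) summary(model)[, "SE"]
p_via_summary <- function(model) summary(model)[, "p"]

se_separate <- se_via_summary(toy)
p_separate <- p_via_summary(toy)
calls_separate <- counter$n

# Style 2: build the table once, then pick columns
counter$n <- 0
tab <- summary(toy)
se_once <- tab[, "SE"]
p_once <- tab[, "p"]
calls_once <- counter$n

c(separate = calls_separate, once = calls_once)
```

Both styles return the same numbers; the difference is only how often the summary table is computed, which matters mainly when summary() itself is expensive.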

The most difficult part, for me, was the decision to split all those methods across different files and packages. If all the *.fixest methods had been in a single file, I would have just copied it over to coeftest.R and made some tweaks to get support. So the issue is a combination of that initial (and likely irreversible) design decision and of repo organization (lots of work to change, I suppose).

In any case, I realize that this is just me raising minor critiques about a fantastically useful project that volunteers have sunk hundreds of hours into. Mostly, I'm just really appreciative of all the work you've put in this. Awesome stuff!

14 replies

DominiqueMakowski Dec 12, 2020
Maintainer

and you're right that this is actually how it's currently done in report, which is the latest (re)organized package, and it seems to work quite nicely. The problem is just that it's a hassle to reorganize 😅

strengejacke Dec 12, 2020
Maintainer

(I think @vincentarelbundock will do quite some work for us here)

😬

vincentarelbundock Dec 12, 2020
Collaborator Author

Yeah, I'm willing to do a fair amount of work on this. The problem is my available free time is quite unpredictable; it all depends on teaching, research, family, etc. So I'm not 100% sure how far I can go in the very very short term. Might have to do it in a few batches.

What naming convention do you prefer? Is methods_* OK? (I don't have a strong feeling.)

mattansb Dec 12, 2020
Maintainer

I think for packages with multiple methods (insight, parameters, report, maybe performance?), a by-class/model file structure makes sense... But I don't think this makes much sense in effectsize or bayestestR, where most functions rely for the most part (99%?) on those method-heavy packages. So there it should be a by-method file structure.

strengejacke Dec 12, 2020
Maintainer

@vincentarelbundock I'd say methods_xyz.R is ok. You may file a PR from your branch, and this refactoring can indeed happen bit by bit.

vincentarelbundock · 2020-12-12T14:47:27Z

vincentarelbundock
Dec 12, 2020
Collaborator Author

Quick question: Does this need to be exported explicitly, or will the default just kick in automatically?

p_value.lm <- p_value.default

5 replies

vincentarelbundock Dec 12, 2020
Collaborator Author

Also, do you want me to combine se_ml1.R et al. with standard_error.R since the latter will be much smaller?

strengejacke Dec 12, 2020
Maintainer

It requires a #' @export if you meant that?

strengejacke Dec 12, 2020
Maintainer

The special SE functions can remain as they are for now, since they only refer to a limited set of models. We may still refactor those methods later, if necessary or better.

vincentarelbundock Dec 12, 2020
Collaborator Author

No, I meant that if a method specific to model class XYZ is not defined, then parameters would automatically try the default. If that's the case, then we don't need to export or assign that explicitly.

strengejacke Dec 12, 2020
Maintainer

Ah, yes! No extra methods, actually you just need to copy/paste the pieces, and probably not remove/add anything, so no "code breaking" changes.
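The fallback asked about here is standard S3 dispatch, and a small base-R sketch shows it. The generic p_value_sketch is hypothetical and stands in for p_value(); the default simply assumes a "classic" summary() whose coefficient matrix has the p-value in the fourth column, as is the case for lm.

```r
# Hypothetical generic with only a default method.
p_value_sketch <- function(model, ...) UseMethod("p_value_sketch")
p_value_sketch.default <- function(model, ...) {
  coef(summary(model))[, 4]
}

# Same toy class as in the "easyregression" example earlier in the thread.
model_regression <- function(data, formula, ...) {
  m <- lm(formula, data = data, ...)
  class(m) <- c("easyregression", class(m))
  m
}
model <- model_regression(iris, Sepal.Length ~ Sepal.Width)

# No class-specific methods are defined ...
has_lm_method <- exists("p_value_sketch.lm")
has_easy_method <- exists("p_value_sketch.easyregression")

# ... yet the call works: dispatch tries "easyregression", then "lm",
# and finally falls back to the default.
p <- p_value_sketch(model)
p
```

So an assignment like p_value.lm <- p_value.default adds nothing to dispatch. Inside a package, the methods that do exist still have to be registered in NAMESPACE, which is what the roxygen2 #' @export tag generates.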
