11. Do we need to match SAS numerically when using a different language? #11
Replies: 5 comments · 7 replies
-
No, in my opinion, we do not need to match one implementation against another to establish that a 'newly used' implementation is correct relative to an established and trusted one. Instead, we first need to understand how those implementations differ - the underlying decisions made by the developer who implemented the algorithm - to see whether any differences are in fact expected. SAS vs R default rounding is a fundamental example (https://psiaims.github.io/CAMIS/Comp/r-sas_rounding.html). Both implementations are correct relative to their documentation, but each implements a different rounding rule. [Did you know there were different ways to round??? I've been living a lie since the 4th grade!!] PHUSE/PSI supports the CAMIS working group project (https://psiaims.github.io/CAMIS/), which documents comparisons such as:
- SAS vs R rounding
- SAS vs R summary statistics
- SAS vs R survival analysis

There are other examples, and contributions are welcome.
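To make the rounding difference concrete, here is a minimal R sketch. R's `round()` follows the IEC 60559 "round half to even" rule, while SAS's ROUND() rounds halves away from zero. The `round_half_up()` helper below is only an illustrative, simplified stand-in for the SAS-style behaviour (it is not the CAMIS reference code, and floating-point representation can still produce surprises in edge cases such as 2.675 at two decimals).

```r
# R default: round half to even ("banker's rounding")
round(c(0.5, 1.5, 2.5, 3.5))
#> [1] 0 2 2 4

# Simplified sketch of SAS-style "round half away from zero"
# (illustrative only, not a reference implementation)
round_half_up <- function(x, digits = 0) {
  scale <- 10^digits
  sign(x) * floor(abs(x) * scale + 0.5) / scale
}

round_half_up(c(0.5, 1.5, 2.5, 3.5))
#> [1] 1 2 3 4
```

Both sets of results are "correct" against their own documentation; the disagreement is purely a difference in the documented rounding rule.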
-
This is a nice blog post on rounding, and I gave my general thoughts on this topic in its conclusion: https://pharmaverse.github.io/blog/posts/2023-07-24_rounding/rounding.html. I think that when working in a hybrid setup and comparing two different languages as part of our QC strategy, we really need to change our mindset away from trying to get a 100% perfect match, and instead look at whether differences can be explained and accepted.
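As a rough illustration of that mindset shift, here is a minimal R sketch of a tolerance-based QC comparison rather than an exact-match check. The values and the 1e-6 tolerance are made up for illustration; in practice the tolerance and the handling of flagged differences would be pre-specified in the QC plan.

```r
# Illustrative double-programming comparison: flag differences that exceed
# a pre-agreed tolerance instead of requiring byte-identical results.
production_res <- c(mean = 12.3456789, sd = 4.5678901)  # e.g. from R
qc_res         <- c(mean = 12.3456791, sd = 4.5678899)  # e.g. from SAS

tolerance <- 1e-6                     # hypothetical, pre-specified value
abs_diff  <- abs(production_res - qc_res)

data.frame(
  statistic  = names(production_res),
  production = production_res,
  qc         = qc_res,
  abs_diff   = abs_diff,
  within_tol = abs_diff <= tolerance  # FALSE rows need an explanation
)
```

Anything flagged as outside tolerance then becomes a question of "can we explain and accept this difference?" rather than an automatic failure.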
-
No. The main impact of using multiple languages will be that statisticians need to think more carefully about the wording in protocols and SAPs. Today, much of the language used is biased towards SAS (e.g. options are not mentioned when SAS defaults are not changed). What needs to happen is that everything is detailed more explicitly. This could include the rounding method used in summary displays.
-
Why do we need to compare to SAS at all? If the SAP has sufficient detail on methods (as @andyofsmeg has discussed), and if we follow those options and settings in our submission, then we've done enough. The BIG question is what happens if a regulatory reviewer then reanalyses in SAS and finds a (meaningful) difference. Who "owns" the reconciliation: sponsor or regulator? If the sponsor did what they detailed, then reconciliation is a grey area, and it would be easy to argue that the sponsor carried out the analysis as specified. Would anyone go back through old submissions and reanalyse with non-SAS software to look for discrepancies? The point others have made about understanding HOW and WHY things are different is well taken, and of course any differences should be statistically justified. But comparing results and trying to force a match is an exercise that will take huge amounts of time and resource for little appreciable gain, in my opinion.
-
@MichaelRimler Let me add my few cents on this topic. I write from the perspective of an employee responsible for setting up a trusted numerical environment for my employer (a small, 100% R-based CRO) and facing many issues that I need to address ad hoc to complete statistical analyses. So: if 90% of the industry uses SAS, then we are almost guaranteed to be asked, sooner or later, about discrepancies whenever our client / sponsor / statistical reviewer validates our results with SAS. Personally, I cannot count all the times (more than 50 requests over the last 8 years) I was asked about this. If we are asked by the sponsor, reviewer, client, whoever, what caused the discrepancies, there are three ways to react:
Unfortunately, finding this out can sometimes be very difficult, because the formulas actually used are rarely documented and often only a reference to some book is provided. So one has to find the book (Google Books, or buy it, assuming it is still available on the market!), then check the code and pray that the author didn't use some numerical optimization that "blurs" the formula behind smart tricks to improve performance.

Some people take it personally: "I love R, so how can somebody even dare to undermine the results!" But I always suggest putting oneself in others' shoes: people who trust software that has been the actual industry standard for decades (whether you like it or not) have the natural right to worry about discrepancies and to expect explanations. And I would expect the same, knowing full well how inconsistent R can be even internally, not to mention with other software. I have already paid painfully for that in a case where there was a real error.

If SAS gives you, for example, a sample size of N=50 and R gives N=53, which one is valid? If there is a common formula for this, how can the two systems differ? And no, don't escape with "it's almost the same"; really, that is not an answer. I can understand the "rounding issue" or some "super-cool adjustment that no other competitor knows and has", but this can be a matter of suspicion. Of course, systems can and DO differ in their calculations, and not only by rounding, which is cited so commonly yet seems to be the least serious problem:
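As a minimal sketch of how such a gap can arise (the N=50 vs N=53 figures above are only an illustration and are not reproduced here), consider a hypothetical two-sample t-test sample-size calculation in R; the design inputs below are made up.

```r
# Hypothetical two-sample t-test scenario (inputs are illustrative only,
# not the design behind the N=50 vs N=53 example above).
ss <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80,
                   type = "two.sample", alternative = "two.sided")
ss$n          # fractional n per group, roughly 63.8

# Whether this is reported as ceiling(ss$n), round(ss$n), or recomputed
# under a normal approximation or with a continuity correction (for
# proportions) is exactly the kind of undocumented choice that makes two
# systems disagree on "the" sample size.
ceiling(ss$n)
#> [1] 64
```

The point is not which answer is right, but that the final integer depends on choices (exact vs approximate distribution, rounding direction, corrections) that each tool may make differently and document sparsely.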
Another issue: when using weighted GEE, it is widely reported that the SAS implementation does not agree with the one in geepack and a few other packages for covariance structures other than "independence" (I was told that glmtoolbox, the most mature GEE implementation, finally does this well, but I still need to confirm it). Let's remember this affects the estimation of the model coefficients, and therefore also the final inference. This only scratches the surface of the list of problems that may cause discrepancies; I also address it in my presentation given at the R/Pharma 2020 conference: https://www.researchgate.net/publication/345778861_Numerical_validation_as_a_critical_aspect_in_bringing_R_to_the_Clinical_Research

And so far I haven't mentioned the most important source of discrepancies: actual errors ("bugs in the code" or a wrong procedure). It is my weekly routine (we have scheduled time for it) to check the GitHub "Issues" pages, not only for new pending issues but also for ones that were already closed, since closed issues are hidden/filtered out by default and we might not see them at first, even though they could have affected our work in the past. I remember a few times when, months or years after finalizing an analysis, I found a reported issue showing that it had used a buggy procedure. You can imagine my fear and sleepless nights as I hurriedly replicated the analyses to see whether they were affected. In a few cases they were; luckily for me, the conclusions stayed the same. But it was truly embarrassing to go back to the client and tell them what had happened. They were reassured that nothing went wrong, but I well remember the question: "what if it happens in the future and WILL change the decisions?"

These differences matter especially in RCTs with clear thresholds, when the analysis gives a result right on the boundary. If R claims p=0.031 and SAS claims p=0.064 for a formal phase 3 study, which one will you take, and how will you convince the regulator it is valid? This is not something to be taken lightly; it is a very practical issue. What will you tell the sponsor or a statistical reviewer if their calculations lead to a different conclusion?

Now we are approaching the most important question: should we worry about all these problems? There are several aspects to consider, but ALL of them start from a single, common, most important question: "do we precisely KNOW what caused the difference?" If this is some common procedure with well-known formulas, no kind of discrepancy is justified; it should be just like 2+2. OK, let's be less "principled" and say that small discrepancies at the 3rd or further decimal digit, which can come from any of the above issues even if the formula was fully correct, may be acceptable. We also don't have to worry if the discrepancy is noticeable but we DO know what caused it AND we know the approach was valid AND we can justify the choice (which is a separate, definitely non-trivial case). Still, if the two results lead to opposite conclusions it is a problem, but that is rather a matter for a fervent debate and one's ability to convince everyone around them of their position, which is beyond the scope of this discussion.

What can we do to improve the process?