11. Do we need to match SAS numerically when using a different language? #11
Replies: 5 comments · 7 replies
-
No, in my opinion, we do not need to match one implementation against another to establish that a 'newly used' implementation is correct relative to an established and trusted one. Instead, we first need to understand how those implementations differ - the underlying decisions made by the developer who implemented the algorithm - to see whether any differences are in fact expected. SAS vs R default rounding is a fundamental example (https://psiaims.github.io/CAMIS/Comp/r-sas_rounding.html). Both implementations are correct relative to their documentation, but each implements a different rounding rule. [Did you know there were different ways to round??? I've been living a lie since the 4th grade!!] PHUSE/PSI supports the CAMIS working group project (https://psiaims.github.io/CAMIS/), which documents comparisons such as:
- SAS vs R rounding
- SAS vs R summary statistics
- SAS vs R survival analysis

There are other examples, and contributions are welcome.
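To make the rounding difference concrete, here is a minimal R sketch. R's `round()` follows the IEC 60559 "round half to even" rule, while SAS's ROUND() rounds halves away from zero. The `round_half_up()` helper below is only an illustrative, simplified stand-in for the SAS-style behaviour (it is not the CAMIS reference code, and floating-point representation can still produce surprises in edge cases such as 2.675 at two decimals).

```r
# R default: round half to even ("banker's rounding")
round(c(0.5, 1.5, 2.5, 3.5))
#> [1] 0 2 2 4

# Simplified sketch of SAS-style "round half away from zero"
# (illustrative only, not a reference implementation)
round_half_up <- function(x, digits = 0) {
  scale <- 10^digits
  sign(x) * floor(abs(x) * scale + 0.5) / scale
}

round_half_up(c(0.5, 1.5, 2.5, 3.5))
#> [1] 1 2 3 4
```

Both sets of results are "correct" against their own documentation; the disagreement is purely a difference in the documented rounding rule.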
-
This is a nice blog post on rounding, and I gave my general thoughts on this topic in its conclusion: https://pharmaverse.github.io/blog/posts/2023-07-24_rounding/rounding.html. I think that when working in a hybrid setup and comparing two different languages as part of our QC strategy, we really need to change our mindset away from trying to get a 100% perfect match, and instead look at whether differences can be explained and accepted.
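As a rough illustration of that mindset shift, here is a minimal R sketch of a tolerance-based QC comparison rather than an exact-match check. The values and the 1e-6 tolerance are made up for illustration; in practice the tolerance and the handling of flagged differences would be pre-specified in the QC plan.

```r
# Illustrative double-programming comparison: flag differences that exceed
# a pre-agreed tolerance instead of requiring byte-identical results.
production_res <- c(mean = 12.3456789, sd = 4.5678901)  # e.g. from R
qc_res         <- c(mean = 12.3456791, sd = 4.5678899)  # e.g. from SAS

tolerance <- 1e-6                     # hypothetical, pre-specified value
abs_diff  <- abs(production_res - qc_res)

data.frame(
  statistic  = names(production_res),
  production = production_res,
  qc         = qc_res,
  abs_diff   = abs_diff,
  within_tol = abs_diff <= tolerance  # FALSE rows need an explanation
)
```

Anything flagged as outside tolerance then becomes a question of "can we explain and accept this difference?" rather than an automatic failure.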
-
No. The main impact of using multiple languages will be that statisticians need to think more carefully about the wording in protocols and SAPs. Today, much of the language used is biased towards SAS (e.g. options are not mentioned when SAS defaults are not changed). What needs to happen is that everything is detailed more explicitly. This could include the rounding method used in summary displays.
-
Why do we need to compare to SAS at all? If the SAP has sufficient detail on methods (as @andyofsmeg has discussed), and if we follow those options and settings in our submission, then we've done enough. The BIG question is what happens if a regulatory reviewer then reanalyses in SAS and finds a (meaningful) difference. Who "owns" the reconciliation: sponsor or regulator? If the sponsor did what they detailed, then reconciliation is a grey area, and it would be easy to argue that the sponsor carried out the analysis as specified. Would anyone go back through old submissions and reanalyse with non-SAS software to look for discrepancies? The point others have made about understanding HOW and WHY things are different is well taken, and of course any differences should be statistically justified. But comparing results and trying to force a match is an exercise that will take huge amounts of time and resource for little appreciable gain, in my opinion.
-
@MichaelRimler Let me add my few cents on this topic. I write from the perspective of an employee responsible for setting up a trusted numerical environment for my employer (a small, 100% R-based CRO) and facing many issues that I need to address ad hoc to complete statistical analyses. So: if 90% of the industry uses SAS, then we are almost guaranteed to be asked, sooner or later, about discrepancies whenever our client / sponsor / statistical reviewer validates our results with SAS. Personally, I cannot count all the times (more than 50 requests over the last 8 years) I was asked about this. If we are asked by the sponsor, reviewer, client, whoever, what caused the discrepancies, there are three ways to react:
Unfortunately, finding this out can sometimes be very difficult, because the formulas actually used are rarely documented and often only a reference to some book is provided. So one has to find the book (Google Books, or buy it, assuming it is still available on the market!), then check the code and pray that the author didn't use some numerical optimization that "blurs" the formula behind smart tricks to improve performance.

Some people take it personally: "I love R, so how can somebody even dare to undermine the results!" But I always suggest putting oneself in others' shoes: people who trust software that has been the actual industry standard for decades (whether you like it or not) have the natural right to worry about discrepancies and to expect explanations. And I would expect the same, knowing full well how inconsistent R can be even internally, not to mention with other software. I have already paid painfully for that in a case where there was a real error.

If SAS gives you, for example, a sample size of N=50 and R gives N=53, which one is valid? If there is a common formula for this, how can the two systems differ? And no, don't escape with "it's almost the same"; really, that is not an answer. I can understand the "rounding issue" or some "super-cool adjustment that no other competitor knows and has", but this can be a matter of suspicion. Of course, systems can and DO differ in their calculations, and not only by rounding, which is cited so commonly yet seems to be the least serious problem:
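As a minimal sketch of how such a gap can arise (the N=50 vs N=53 figures above are only an illustration and are not reproduced here), consider a hypothetical two-sample t-test sample-size calculation in R; the design inputs below are made up.

```r
# Hypothetical two-sample t-test scenario (inputs are illustrative only,
# not the design behind the N=50 vs N=53 example above).
ss <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80,
                   type = "two.sample", alternative = "two.sided")
ss$n          # fractional n per group, roughly 63.8

# Whether this is reported as ceiling(ss$n), round(ss$n), or recomputed
# under a normal approximation or with a continuity correction (for
# proportions) is exactly the kind of undocumented choice that makes two
# systems disagree on "the" sample size.
ceiling(ss$n)
#> [1] 64
```

The point is not which answer is right, but that the final integer depends on choices (exact vs approximate distribution, rounding direction, corrections) that each tool may make differently and document sparsely.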
Another issue: when using weighted GEE, it is widely reported that the SAS implementation does not agree with the one in geepack and a few other packages for covariance structures other than "independence" (I was told that glmtoolbox, the most mature GEE implementation, finally does this well, but I still need to confirm it). Let's remember this affects the estimation of the model coefficients, and therefore also the final inference. This only scratches the surface of the list of problems that may cause discrepancies; I also address it in my presentation given at the R/Pharma 2020 conference: https://www.researchgate.net/publication/345778861_Numerical_validation_as_a_critical_aspect_in_bringing_R_to_the_Clinical_Research

And so far I haven't mentioned the most important source of discrepancies: actual errors ("bugs in the code" or a wrong procedure). It is my weekly routine (we have scheduled time for it) to check the GitHub "Issues" pages, not only for new pending issues but also for ones that were already closed, since closed issues are hidden/filtered out by default and we might not see them at first, even though they could have affected our work in the past. I remember a few times when, months or years after finalizing an analysis, I found a reported issue showing that it had used a buggy procedure. You can imagine my fear and sleepless nights as I hurriedly replicated the analyses to see whether they were affected. In a few cases they were; luckily for me, the conclusions stayed the same. But it was truly embarrassing to go back to the client and tell them what had happened. They were reassured that nothing went wrong, but I well remember the question: "what if it happens in the future and WILL change the decisions?"

These differences matter especially in RCTs with clear thresholds, when the analysis gives a result right on the boundary. If R claims p=0.031 and SAS claims p=0.064 for a formal phase 3 study, which one will you take, and how will you convince the regulator it is valid? This is not something to be taken lightly; it is a very practical issue. What will you tell the sponsor or a statistical reviewer if their calculations lead to a different conclusion?

Now we are approaching the most important question: should we worry about all these problems? There are several aspects to consider, but ALL of them start from a single, common, most important question: "do we precisely KNOW what caused the difference?" If this is some common procedure with well-known formulas, no kind of discrepancy is justified; it should be just like 2+2. OK, let's be less "principled" and say that small discrepancies at the 3rd or further decimal digit, which can come from any of the above issues even if the formula was fully correct, may be acceptable. We also don't have to worry if the discrepancy is noticeable but we DO know what caused it AND we know the approach was valid AND we can justify the choice (which is a separate, definitely non-trivial case). Still, if the two results lead to opposite conclusions it is a problem, but that is rather a matter for a fervent debate and one's ability to convince everyone around them of their position, which is beyond the scope of this discussion.

What can we do to improve the process?