diff --git a/match.qmd b/match.qmd
index 35ed154..fed9441 100644
--- a/match.qmd
+++ b/match.qmd
@@ -12,18 +12,21 @@ title: "Numerical Matching"
 
 ## Historical Background
 
-The statistical analysis system (SAS), was developed by a collective of eight USA southern state universities in the late 1960's. The company SAS institute Inc. was founded in 1976, and their first product release of Base SAS, consisted of approximately 300,000 lines of code^1^. During the first 30 years of existence, SAS software became renowned in the highly regulated medical research industry, for its well documented methodology and its high quality, reproducible, robust and reliable analysis implementation. This made it the number one data analysis tool in the pharmaceutical industry and a gold standard for regulatory submissions around the world.
+The statistical analysis system [SAS](https://www.sas.com/en_us/home.html) was developed by a collective of eight USA southern state universities in the late 1960's. The SAS Institute Inc. was founded in 1976, and their first product release of Base SAS, consisted of approximately 300,000 lines of code^1^. During the first 30 years of existence, SAS software became renowned in the highly regulated medical research industry, for its well documented methodology and its high quality, reproducible, robust and reliable analysis implementation. This made it the number one data analysis tool in the pharmaceutical industry and a gold standard for regulatory submissions around the world.
 
-R has available since 1993 with packages being added continuously, providing a continual development cycle and user-led package development^2^. This open-source language has a package repository called CRAN (Comprehensive R Archive Network) with over 21000 packages in it. The momentum and adoption of R in Pharma is growing, likely due to:
+R has been available since 1993 with packages being added continuously, providing a continual development cycle and user-led package development^2^. This open-source language has a package repository called [CRAN](The%20Comprehensive%20R%20Archive%20Network%20(r-project.org)) (Comprehensive R Archive Network) with over 21000 packages in it. The momentum and adoption of R in Pharma is growing, likely due to:
 
-1) the wide range of additional functionality compared to SAS, which can lead to better efficiency,
-2) the large quantity of open source development of packages, which is often conducted through github repositories. In recent times, the [pharmaverse](https://pharmaverse.org/e2eclinical/)^3^ project demonstrates multiple packages designed specifically to resolve the needs of conducting analysis in medical research with a focus on the regulatory needs.
+1) it being a common tool used by students, academics and other industries,
+2) it's ability to produce interactive reporting and graphics
+3) the large quantity of open source development of packages, which is often conducted through github repositories. Recently, the [pharmaverse](https://pharmaverse.org/e2eclinical/)^3^ has been created which pulls together multiple packages designed specifically to resolve the needs of conducting analysis in medical research with a focus on the regulatory needs.
+
+**It is commonly found that if the user does not change the default options in SAS and R for a particular analysis, then the results that are output are different! But why is this ...?!**
 
 SAS and R would have experienced entirely different challenges during their early development periods. For SAS, latency (speed to do computations) was very low compared to modern computers. The time for a computer to implement a complex statistical analysis (or even a simple one) would have initially been very long! This would have made numerical approximations (faster algorithms) very attractive in order to reduce computational time required to conduct analysis. As computational power improved, the speed to do analysis rapidly improved. The need to simplify an analysis to an approximation or more simple algorithm was less important. Hence SAS increased its functionality adding more methods. However, due to its rigorous reproducibility and backwards compatibility commitments, the 'default' method remained as the original, and new methods (which were often more complex) were added as options to the original SAS procedures.
 
 With R being developed in the 1990's, it didn't have such a restriction on speed of computation. This means that the methods that R defaults to, are often the ones that were most commonly used when that package was developed, or that were documented in the literature with better performance compared to an older methodology.
 
-In conclusion, if you write code in SAS and R without specifying fully your analysis using the optional parameters or knowing your default options, it is very likely that the analysis they conduct could be different. In many cases, adding detail to your code (Specifying all options clearly), specifies exactly the analysis you want SAS or R to do, which then ensures that SAS and R are applying the same methods and you get the same results. However, there are still cases where some analyses are only available in SAS and some only in R.
+**In conclusion,** if you write code in SAS and R without specifying fully your analysis using the optional parameters or knowing your default options, it is very likely that the analysis they conduct could be different. In many cases, adding detail to your code (Specifying all options clearly), specifies exactly the analysis you want SAS or R to do, which then ensures that SAS and R are applying the same methods and you get the same results. However, there are still cases where some analyses are only available in SAS and some only in R.
 
 The [PHUSE CAMIS Project](https://psiaims.github.io/CAMIS/)^4^ compares default analysis methods in SAS, R and Python, documents which options need to be specified in order to obtain a reproduction of the same analysis and identifies cases where software can not replicate the same analysis.
 
@@ -39,7 +42,7 @@ If results did differ by a clinically important amount, then it would be very im
 
 ### In Practice !
 
-Despite the above, in practice, the medical research industry is governed by strict processes and Standard operative procedures (SOPs) to ensure quality of statistical analysis. For this reason, many companies apply a double programming approach to ensure 100% independent replication of results. This requires a full identical match to be obtained when two programmers do the same work independently. In addition, if the initial work is done by a Contract Research Organization (CRO), working with a pharmaceutical company, the analysis may be programmed a third time to replicate results in the two different companies systems. Finally, when results are submitted to the regulators, they will also program the results and attempt replication.
+Despite the above, in practice, the medical research industry is governed by strict processes and Standard Operative Procedures (SOPs) to ensure quality of statistical analysis. For this reason, many companies apply a double programming approach to ensure 100% independent replication of results. This requires a full identical match to be obtained when two programmers do the same work independently. In addition, if the initial work is done by a Contract Research Organization (CRO), working with a pharmaceutical company, the analysis may be programmed a third time to replicate results in the two different companies systems. Finally, when results are submitted to the regulators, they will also program the results and attempt replication.
 
 Therefore, in all these cases if a full identical match cannot be obtained, this introduces uncertainty, apprehension and nervousness about the results being presented. Are they correct ? ! Why don't we get the same results doing the same analysis in another language / software.