ENH improve speed of fitting Cox model by relying on the fast skglm solver #1531

Open
mathurinm opened this issue Jun 7, 2023 · 6 comments

Comments

@mathurinm
Contributor

I am one of the developers of skglm, a Python package that extends scikit-learn for Generalized Linear Models by providing more functionality and faster solvers.
We have recently worked on a solver for the Cox estimator, for which lifelines provides a reference implementation.

Preliminary results indicate speedups of up to 500x when using skglm.

[Figure: timing comparison between skglm and lifelines]

Here is a notebook to illustrate the performance and to showcase the scikit-learn-like API of skglm. Also, here are the results of a complete benchmark with Benchopt and the link to the benchmark repo to reproduce it.

In addition, some skglm features might be useful to the users of lifelines:

  • support for design matrices with more columns than rows (may cause issues in lifelines)
  • support for sparse design matrices (currently not supported in lifelines)
  • immediate extension to other penalizers such as weighted L1, non-convex regularizers, group Lasso penalty, etc.

Based on this, we'd like to discuss the potential integration of the skglm solver into lifelines for fitting the Cox estimator.

A noteworthy point is that skglm relies heavily on numba JIT compilation, which may introduce a slight overhead during the initial model fit. However, this overhead is offset by the benefits, namely handling datasets with thousands of features and samples within a reasonable time.

We'd be happy to have your feedback on this.

Also pinging @Badr-MOUFAD @PABannier @QB3

@CamDavidsonPilon
Owner

Wow, that's very impressive! One thing I think you should try is to bin the times into buckets (tied times are common in survival datasets, since we often round to months, days, hours, etc.). The Cox model works by sorting times, but when there are ties, it has to use a technique to handle them. There are a few techniques for handling ties: random, Efron, Breslow, and exact (the most accurate, but slowest). Lifelines uses Efron's method, as its accuracy-to-speed tradeoff is good.
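
To make the tie-handling point concrete, here is a minimal pure-Python sketch of the Cox partial log-likelihood under the Breslow and Efron conventions, plus a helper that bins times into buckets. The names `bin_times` and `partial_log_likelihood` are hypothetical and are not part of lifelines or skglm. The risk set is recomputed for each event time for readability, so this version is quadratic and is meant only as an illustration.

```python
import math
from collections import defaultdict


def bin_times(times, width):
    """Round each time up to the next multiple of `width` (creates ties)."""
    return [math.ceil(t / width) * width for t in times]


def partial_log_likelihood(times, events, eta, ties="efron"):
    """Cox partial log-likelihood for linear predictors eta (one per subject).

    Illustrative only: `ties` is either "efron" or "breslow".
    """
    # Group the subjects with an observed event by their event time.
    deaths = defaultdict(list)
    for i, (t, e) in enumerate(zip(times, events)):
        if e:
            deaths[t].append(i)

    total = 0.0
    for t, tied in deaths.items():
        d = len(tied)
        # Risk set: every subject still under observation at time t.
        risk_sum = sum(math.exp(eta[j]) for j in range(len(times)) if times[j] >= t)
        tied_sum = sum(math.exp(eta[i]) for i in tied)
        total += sum(eta[i] for i in tied)
        for l in range(d):
            # Breslow keeps the full risk set for each tied event.
            # Efron removes a fraction l/d of the tied subjects' weight.
            frac = l / d if ties == "efron" else 0.0
            total -= math.log(risk_sum - frac * tied_sum)
    return total
```

When no two events share a time, `d` is 1 everywhere and both conventions give the same value; they differ only once binning or rounding introduces ties.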

@mathurinm
Contributor Author

Indeed, we are working on adding support for Efron handling of ties here: scikit-learn-contrib/skglm#159. It should be merged shortly.

@CamDavidsonPilon
Owner

Very exciting work, team!

@mathurinm
Contributor Author

@Badr-MOUFAD has just added support for the Efron handling of ties here: scikit-learn-contrib/skglm#159
Benchmark results are the same.

@CamDavidsonPilon
Owner

I'm impressed. I'm going to have to try this library locally.

Is the following (mostly) correct?

One significant speedup comes from using an approximation to the Hessian. This approximation is valid, and it can be shown that using it still converges to the same solution (perhaps with more iterations, but the cost savings remain).
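
As a toy illustration of that claim (not the Cox loss, and not skglm code), the sketch below minimizes the 1-D function f(x) = log(1 + e^x) - 0.3x twice: once with exact Newton steps and once with the curvature replaced by its global upper bound of 1/4. The function, the starting point, and the name `minimize` are assumptions chosen for the example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def minimize(use_upper_bound, x=0.0, tol=1e-12, max_iter=100):
    """Newton-type iterations on f(x) = log(1 + exp(x)) - 0.3 * x.

    f'(x) = sigmoid(x) - 0.3 and f''(x) = sigmoid(x) * (1 - sigmoid(x)) <= 1/4.
    Returns the final iterate and the number of steps taken.
    """
    for n_iter in range(max_iter):
        grad = sigmoid(x) - 0.3
        if abs(grad) < tol:
            return x, n_iter
        # Either the exact second derivative or its constant upper bound.
        curvature = 0.25 if use_upper_bound else sigmoid(x) * (1.0 - sigmoid(x))
        x -= grad / curvature
    return x, max_iter


x_exact, steps_exact = minimize(use_upper_bound=False)
x_bound, steps_bound = minimize(use_upper_bound=True)
```

Both runs stop at the same minimizer, log(0.3 / 0.7). The upper-bound variant needs more steps, but each step avoids computing the exact curvature, which is the trade-off described above.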

@Badr-MOUFAD
Contributor

Thank you again for your interest!
Here are the key improvement factors:

  • Leveraging the sparse nature of the solution with the state-of-the-art working set strategy detailed in our NeurIPS 2022 paper (Algo 1 and 2)
  • Use of a proximal Newton solver with a diagonal upper bound on the Hessian, resulting in a linear computational and memory cost (skglm tutorial equation 6)
  • Efficient implementation of the Cox datafit, which achieves a linear cost of evaluating its value, gradient, and Hessian (skglm Cox implementation)
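
The last two points can be sketched in a few lines of pure Python. This is an illustrative version under stated assumptions, not skglm's actual implementation: subjects are assumed to be pre-sorted by increasing time with no tied times, the Breslow form of the likelihood is used without any 1/n scaling, and `cox_datafit` is a hypothetical name. Everything is computed with respect to the linear predictor eta = Xw, using one backward and one forward cumulative sum. The diagonal bound comes from dropping the positive semidefinite term that the exact Hessian subtracts.

```python
import math


def cox_datafit(eta, events):
    """Breslow Cox negative log-likelihood as a function of eta.

    Assumes subjects are sorted by increasing time and times are distinct.
    Returns (value, gradient, diagonal Hessian upper bound), each in O(n).
    """
    n = len(eta)
    exp_eta = [math.exp(v) for v in eta]

    # Backward cumulative sum: risk[i] = sum of exp(eta[j]) over j >= i,
    # i.e. over every subject still at risk at the i-th time.
    risk = [0.0] * n
    acc = 0.0
    for i in range(n - 1, -1, -1):
        acc += exp_eta[i]
        risk[i] = acc

    value = sum(e * (math.log(r) - v) for e, r, v in zip(events, risk, eta))

    # Forward cumulative sum: acc = sum of events[i] / risk[i] over i <= k.
    grad = [0.0] * n
    hess_bound = [0.0] * n
    acc = 0.0
    for k in range(n):
        acc += events[k] / risk[k]
        hess_bound[k] = exp_eta[k] * acc  # diagonal majorizer of the Hessian
        grad[k] = hess_bound[k] - events[k]
    return value, grad, hess_bound
```

The gradient with respect to the coefficients w is then X^T applied to `grad`, and a proximal Newton step only needs `hess_bound` as per-sample weights, so no n-by-n matrix is ever formed.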

We are happy to discuss options for integrating skglm into lifelines.


3 participants