Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correlated rank similarity metric #59

Open
wants to merge 8 commits into
base: develop-paper
Choose a base branch
from

Conversation

gmingas
Copy link
Contributor

@gmingas gmingas commented Apr 6, 2021

This PR adds support for Jenning's and Sebastian's correlated rank similarity metric.

Changes:

  • Adds three new methods to the RankingSimilarity class of rbo.py. These implement the correlated rank metric, its extrapolated version and the LP solver.
  • Modifies feature_importance.py to calculate the metric and also adds more complete RBO calculation (all types of RBO apart from uneven extrapolation) when comparing orig vs. rlds, orig vs rand and orig vs lower.
  • Adds calculation of correlation matrix in feature_importance.py
  • Adds pulp to required libraries which will require a rebuild of the Docker image.

WIP:

  • The solver is too slow at the moment when testing it on the framingham dataset and I am trying to figure out why. It is possible that the LP problem has too many variables.<\s>

@gmingas gmingas force-pushed the feature/correlation-similarity branch from aae547f to b04dfae Compare April 8, 2021 16:15
@gmingas gmingas changed the title WIP: Correlated rank similarity metric Correlated rank similarity metric Apr 8, 2021
@gmingas
Copy link
Contributor Author

gmingas commented Apr 8, 2021

@OscartGiles This PR adds pulp to the required libraries list. Can you check if I have made all the necessary changes in the code for that to work? And is it easy to update the environment we use in Azure to support this?

@OscartGiles
Copy link
Contributor

@gmingas - Just fixing the tests now but looks good. To get onto the VMs I can either redeploy the VM(s) or we can just install manually for now. Should be added next time a VM is deployed.

@gmingas
Copy link
Contributor Author

gmingas commented Apr 12, 2021

Thanks! Yes, I did add it manually to do the weekend runs. No need to redeploy now, we can wait until the next time they are deployed.

@OscartGiles
Copy link
Contributor

The Test pipeline run now fails because it tries to run the household_poverty stuff but doesn't have the data.
Can we either grab the data in the makefile or tell it not to run the household poverty stuff when it runs the pipeline?

@gmingas
Copy link
Contributor Author

gmingas commented Apr 12, 2021

We can grab the data in the makefile using the Kaggle API but that would require adding an authentication token to the repo. I don't know if it is possible to do this is a secure way, probably not from a quick search but maybe someone has done this before?

The other option is to remove the household cleaning code from the makefile and run it manually whenever we need it after the data are added manually too.

@OscartGiles
Copy link
Contributor

OscartGiles commented Apr 12, 2021

We could set it as an environment variable (can save it as a secret on github for use in the CI pipeline). But then we also need to make sure it is an environment variable on all our VMs and make it clear in the README that you need a kaggle API token.
Save you manually downloading it on the VMs though.

For ref https://github.com/Kaggle/kaggle-api#api-credentials

…ifications

Feature/dataset modifications
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants