Polaris Recipes

The Polaris datasets and benchmarks recipes.

This repository is a central hub for storing, organizing, and collaborating on the notebooks used to curate data and design the benchmarks listed on the Polaris Hub. The data was curated with the Auroris package.

Datasets 101 - Basic Checks

A little effort spent on dataset curation can go a long way toward improving your model's performance in real-world settings. Below, we outline some high-level checks that you should apply to any dataset you work with in drug discovery.

Step 1 - Check that the dataset is representative of applications in real-world drug discovery

Creators of the dataset must be able to explain the data generation process and describe the specific applications of this dataset in drug discovery.
Take, for example, the FreeSolv dataset in MoleculeNet, mentioned in Pat Walters's blog. Although the dataset was designed to evaluate molecular dynamics methods, it has turned into a generic property prediction task for the free energy of solvation. However, this quantity used in isolation is not particularly useful.

Step 2 - Check that the dataset stems from a consistent, original source

Creators of the dataset must share references to the original source of the data. If data is aggregated from multiple sources or preprocessed in some way, this process needs to be transparent and the rationale should be well documented. Blindly combining datasets can introduce significant noise.
Some examples that violate this rule include datasets like tdcommons/solubility-aqsoldb and tdcommons/bbb-martins. In both cases, data has been collected from multiple sources yet there are no references to primary literature.
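As a rough illustration of why blind aggregation is risky, the sketch below measures how much sources disagree on compounds they share before any merging happens. It is a minimal example under stated assumptions, not part of this repository: the record layout `(compound_id, source, value)`, the function name `cross_source_disagreement`, the sample values, and the 1.0 threshold are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def cross_source_disagreement(records):
    """Return {compound_id: spread} for compounds reported by 2+ sources.

    `records` is an iterable of (compound_id, source, value) tuples
    (a hypothetical layout). The spread is the difference between the
    largest and smallest per-source mean value for that compound.
    """
    per_compound = defaultdict(lambda: defaultdict(list))
    for compound_id, source, value in records:
        per_compound[compound_id][source].append(value)

    spreads = {}
    for compound_id, by_source in per_compound.items():
        if len(by_source) < 2:
            continue  # only one source, nothing to compare
        source_means = [mean(values) for values in by_source.values()]
        spreads[compound_id] = max(source_means) - min(source_means)
    return spreads


# Made-up values for illustration only.
records = [
    ("mol-A", "paper_1", -2.0),
    ("mol-A", "paper_2", -2.5),
    ("mol-B", "paper_1", -4.0),
    ("mol-B", "paper_2", -1.0),
    ("mol-C", "paper_1", -3.5),
]
spreads = cross_source_disagreement(records)

# The threshold is arbitrary here; pick one that reflects the assay's noise.
flagged = sorted(cid for cid, spread in spreads.items() if spread > 1.0)
```

Compounds in `flagged` are candidates for a closer look at the primary literature (different assay conditions, units, or transcription errors) before the sources are combined.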

Step 3 - Check that the dataset does not contain obvious errors or ambiguous data

Creators of the dataset should check for obvious duplicates, invalid data, or ambiguous data. You should also visualize the data distributions to highlight potential outliers.
For example, tdcommons/bbb-martins violates this rule as it contains many duplicate structures.
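The checks in this step can be sketched in a few lines of plain Python. This is a minimal example under stated assumptions, not the Auroris workflow: the function name `basic_checks`, the row layout `(structure_key, value)`, and the sample rows are hypothetical, and the structure key is assumed to be already canonical (for example a canonical SMILES or InChIKey produced by a cheminformatics toolkit). A simple 1.5 x IQR rule stands in for visual inspection of the distribution.

```python
import math
from collections import defaultdict
from statistics import quantiles


def basic_checks(rows):
    """Flag invalid rows, duplicates, ambiguous labels, and outliers.

    `rows` is a list of (structure_key, value) tuples (hypothetical layout);
    structure_key is assumed to be canonical already.
    """
    invalid_rows = []
    by_key = defaultdict(list)
    for idx, (key, value) in enumerate(rows):
        # Invalid: missing structure, non-numeric value, or NaN.
        if not key or not isinstance(value, (int, float)) or math.isnan(value):
            invalid_rows.append(idx)
            continue
        by_key[key].append(value)

    # Duplicates: the same structure appears more than once.
    duplicate_keys = sorted(k for k, v in by_key.items() if len(v) > 1)
    # Ambiguous: a duplicated structure with conflicting values.
    ambiguous_keys = sorted(k for k, v in by_key.items() if len(set(v)) > 1)

    # Outliers: values outside 1.5 * IQR of the valid measurements.
    values = [v for vals in by_key.values() for v in vals]
    outlier_keys = []
    if len(values) >= 2:
        q1, _, q3 = quantiles(values, n=4)
        low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
        outlier_keys = sorted(
            k for k, vals in by_key.items()
            if any(v < low or v > high for v in vals)
        )

    return {
        "invalid_rows": invalid_rows,
        "duplicate_keys": duplicate_keys,
        "ambiguous_keys": ambiguous_keys,
        "outlier_keys": outlier_keys,
    }


# Made-up rows for illustration only.
rows = [
    ("CCO", 1.0),
    ("CCO", 1.0),            # exact duplicate
    ("c1ccccc1", 2.0),
    ("c1ccccc1", 0.0),       # same structure, conflicting value
    ("CCN", 1.5),
    ("CCC", 1.2),
    ("", 1.0),               # missing structure
    ("CCCl", float("nan")),  # missing measurement
    ("CCBr", 25.0),          # suspiciously large value
]
report = basic_checks(rows)
```

Flagged entries are prompts for manual review rather than automatic removal: an outlier may be a genuine measurement, and conflicting duplicates may reflect different experimental conditions.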

Example notebooks

If you're looking for examples of curated datasets, we recommend checking out the notebooks for the adme-fang and pkis2 datasets.

