diff --git a/README.md b/README.md
index 4471e86..fadbe91 100644
--- a/README.md
+++ b/README.md
@@ -2,17 +2,18 @@
[![Build Status](https://dev.azure.com/cwzou/mongo-rdkit/_apis/build/status/rdkit.mongo-rdkit?branchName=master)](https://dev.azure.com/cwzou/mongo-rdkit/_build/latest?definitionId=1&branchName=master)
Mongo-rdkit is an integration between MongoDB,
-a NoSQL database platform, and RDKit, a collection of chemoinformatics and machine-learning software.
+a NoSQL database platform, and RDKit, a collection of cheminformatics and machine-learning software.
This package contains tools to create and manipulate a chemically-intelligent database, as well as
methods for high-performance searches on the database that leverage native MongoDB features.
Useful links:
* [BSD License](https://github.com/rdkit/mongo-rdkit/blob/master/LICENSE) - a business friendly license for open-source.
-* [Jupyter Notebooks](https://github.com/rdkit/mongo-rdkit/tree/master/docs) - resources for getting started.
+* [Jupyter Notebooks](https://github.com/rdkit/mongo-rdkit/tree/master/docs) - walkthroughs for main functionality.
+* [Testing Guide](https://github.com/rdkit/mongo-rdkit/blob/master/docs/testing.md) - walkthrough of running `mongordkit` tests.
## Documentation
Jupyter Notebooks and resources for getting started in the [docs](https://github.com/rdkit/mongo-rdkit/tree/master/docs)
-folder on GitHub
+folder on GitHub.
## Installation
As the package is not yet configured with a `setup.py` file or published on PyPI, these are working install instructions.
@@ -43,9 +44,43 @@ echo $PYTHONPATH
You can now `import mongordkit` in your Python interpreter or run all tests using the `pytest` command.
### Windows:
+Similarly, ensure that `conda` has been added to `PATH`.
+Clone the repository into your desired directory and navigate into it.
+Create a conda environment called `mongo_rdkit` that includes the dependencies:
+```
+conda env create --quiet --force --file env.yml
+```
+Activate this conda environment:
+```
+call activate mongo_rdkit
+```
+Check that you are able to import mongordkit:
+```
+python -c "import mongordkit"
+```
+If this fails, you may need to add the current directory manually to `PYTHONPATH`:
+```
+set PYTHONPATH=%PYTHONPATH%;C:.
+```
+You can now use `mongordkit` in your interpreter and run tests using `python -m pytest`.
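If the import check is easier to run from a script than from the shell, the following is a minimal sketch of the same check in Python. The helper name `is_importable` is made up for this example and is not part of `mongordkit`; it relies only on the standard library's `importlib.util.find_spec`.

```python
import importlib.util


def is_importable(module_name):
    """Return True if Python can locate a top-level module on the current sys.path."""
    # find_spec returns None when a top-level module cannot be found,
    # and it does not execute the module's code.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "mongordkit" is the package name used throughout this README.
    if is_importable("mongordkit"):
        print("mongordkit is importable")
    else:
        print("mongordkit not found; check PYTHONPATH")
```

If the second message is printed, the repository directory is probably missing from `PYTHONPATH`.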
+## Package Contents
+### Modules
+`mongordkit` contains two main modules, each of which contains a variety of importable methods and classes.
+`Database` contains functionality for writing and registering data. `Search` contains functionality for setting up and performing
+substructure and similarity search. Detailed walkthroughs can be found in the notebooks, listed below.
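
For readers new to similarity search, the sketch below shows the Tanimoto coefficient computed on fingerprints stored as lists of "on" bit indices. It is an illustration only and is not taken from the `mongordkit.Search` code; the function name `tanimoto` and the example bit lists are hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as iterables of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        # Two empty fingerprints: define the similarity as 0.0 to avoid dividing by zero.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical bit lists: 3 shared bits, 5 distinct bits in total.
query = [33, 56, 84, 130]
candidate = [56, 84, 130, 313]
print(tanimoto(query, candidate))  # 0.6
```

A similarity search then keeps only the candidates whose coefficient meets a chosen threshold.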
+
+### Notebooks
+- **Creating and Writing to MongoDB**: documentation and demos for creating and modifying mongo-rdkit databases.
+- **Similarity and Substructure Search**: documentation and demos for similarity and substructure search.
+- **Similarity Benchmarking**: documentation for reproducing similarity benchmarking.
+- **Substructure Benchmarking**: documentation for reproducing substructure benchmarking.
+### Configuration
+- **azure_pipelines.yml**: CI/CD pipeline configurations.
+- **conftest.py**: `pytest` configurations.
+- **env.yml**: required dependencies.
## License
Code released under the BSD License.
diff --git a/docs/notebooks/.ipynb_checkpoints/.ipynb-checkpoint b/docs/notebooks/.ipynb_checkpoints/.ipynb-checkpoint
deleted file mode 100644
index 80af95b..0000000
--- a/docs/notebooks/.ipynb_checkpoints/.ipynb-checkpoint
+++ /dev/null
@@ -1,92 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Creating and Writing to MongoDB\n",
- "\n",
- "Methods that directly modify MongoDB database instances are included in the `mongordkit.Database` module.\n",
- "\n",
- "\n",
- "\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 9,
- "metadata": {},
- "outputs": [],
- "source": [
- "from mongordkit.Database import *"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Creating Databases\n",
- "Users can opt to bring their own database instances, but `Database.create` provides a variety of ways to create a `mongordkit`-compatible MongoDB instance:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 13,
- "metadata": {},
- "outputs": [
- {
- "ename": "DuplicateKeyError",
- "evalue": "E11000 duplicate key error collection: MyDatabase.registration index: _id_ dup key: { _id: ObjectId('5f0f64b2eaae47671ad2fb9d') }",
- "output_type": "error",
- "traceback": [
- "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
- "\u001b[0;31mDuplicateKeyError\u001b[0m Traceback (most recent call last)",
- "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdb\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mDatabase\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcreate\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcreateFromHostPort\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'MyDatabase'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mhost\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mport\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2\u001b[0m \u001b[0mdb\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mDatabase\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcreate\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcreateFromURI\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'MyDatabase'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0murl\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
- "\u001b[0;32m~/Desktop/mongo-rdkit/mongordkit/Database/create.py\u001b[0m in \u001b[0;36mcreateFromHostPort\u001b[0;34m(dbname, host, port)\u001b[0m\n\u001b[1;32m 17\u001b[0m \u001b[0mdb\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mclient\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mdbname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 18\u001b[0m \u001b[0mcollection\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mdb\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'registration'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 19\u001b[0;31m \u001b[0mcollection\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0minsert_one\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mSTANDARD_SETTING\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 20\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mdb\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 21\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
- "\u001b[0;32m~/anaconda3/envs/py37_rdkit_beta/lib/python3.7/site-packages/pymongo/collection.py\u001b[0m in \u001b[0;36minsert_one\u001b[0;34m(self, document, bypass_document_validation, session)\u001b[0m\n\u001b[1;32m 696\u001b[0m \u001b[0mwrite_concern\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mwrite_concern\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 697\u001b[0m \u001b[0mbypass_doc_val\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mbypass_document_validation\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 698\u001b[0;31m session=session),\n\u001b[0m\u001b[1;32m 699\u001b[0m write_concern.acknowledged)\n\u001b[1;32m 700\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
- "\u001b[0;32m~/anaconda3/envs/py37_rdkit_beta/lib/python3.7/site-packages/pymongo/collection.py\u001b[0m in \u001b[0;36m_insert\u001b[0;34m(self, docs, ordered, check_keys, manipulate, write_concern, op_id, bypass_doc_val, session)\u001b[0m\n\u001b[1;32m 610\u001b[0m return self._insert_one(\n\u001b[1;32m 611\u001b[0m \u001b[0mdocs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mordered\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcheck_keys\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmanipulate\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mwrite_concern\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mop_id\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 612\u001b[0;31m bypass_doc_val, session)\n\u001b[0m\u001b[1;32m 613\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 614\u001b[0m \u001b[0mids\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
- "\u001b[0;32m~/anaconda3/envs/py37_rdkit_beta/lib/python3.7/site-packages/pymongo/collection.py\u001b[0m in \u001b[0;36m_insert_one\u001b[0;34m(self, doc, ordered, check_keys, manipulate, write_concern, op_id, bypass_doc_val, session)\u001b[0m\n\u001b[1;32m 598\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 599\u001b[0m self.__database.client._retryable_write(\n\u001b[0;32m--> 600\u001b[0;31m acknowledged, _insert_command, session)\n\u001b[0m\u001b[1;32m 601\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 602\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdoc\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mRawBSONDocument\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
- "\u001b[0;32m~/anaconda3/envs/py37_rdkit_beta/lib/python3.7/site-packages/pymongo/mongo_client.py\u001b[0m in \u001b[0;36m_retryable_write\u001b[0;34m(self, retryable, func, session)\u001b[0m\n\u001b[1;32m 1489\u001b[0m \u001b[0;34m\"\"\"Internal retryable write helper.\"\"\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1490\u001b[0m \u001b[0;32mwith\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_tmp_session\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msession\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0ms\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1491\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_retry_with_session\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mretryable\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfunc\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0ms\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1492\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1493\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_reset_server\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maddress\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
- "\u001b[0;32m~/anaconda3/envs/py37_rdkit_beta/lib/python3.7/site-packages/pymongo/mongo_client.py\u001b[0m in \u001b[0;36m_retry_with_session\u001b[0;34m(self, retryable, func, session, bulk)\u001b[0m\n\u001b[1;32m 1382\u001b[0m \u001b[0;32mraise\u001b[0m \u001b[0mlast_error\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1383\u001b[0m \u001b[0mretryable\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mFalse\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1384\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mfunc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msession\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0msock_info\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mretryable\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1385\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mServerSelectionTimeoutError\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1386\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mis_retrying\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
- "\u001b[0;32m~/anaconda3/envs/py37_rdkit_beta/lib/python3.7/site-packages/pymongo/collection.py\u001b[0m in \u001b[0;36m_insert_command\u001b[0;34m(session, sock_info, retryable_write)\u001b[0m\n\u001b[1;32m 595\u001b[0m retryable_write=retryable_write)\n\u001b[1;32m 596\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 597\u001b[0;31m \u001b[0m_check_write_command_response\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mresult\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 598\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 599\u001b[0m self.__database.client._retryable_write(\n",
- "\u001b[0;32m~/anaconda3/envs/py37_rdkit_beta/lib/python3.7/site-packages/pymongo/helpers.py\u001b[0m in \u001b[0;36m_check_write_command_response\u001b[0;34m(result)\u001b[0m\n\u001b[1;32m 219\u001b[0m \u001b[0mwrite_errors\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mresult\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"writeErrors\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 220\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mwrite_errors\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 221\u001b[0;31m \u001b[0m_raise_last_write_error\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mwrite_errors\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 222\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 223\u001b[0m \u001b[0merror\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mresult\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"writeConcernError\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
- "\u001b[0;32m~/anaconda3/envs/py37_rdkit_beta/lib/python3.7/site-packages/pymongo/helpers.py\u001b[0m in \u001b[0;36m_raise_last_write_error\u001b[0;34m(write_errors)\u001b[0m\n\u001b[1;32m 200\u001b[0m \u001b[0merror\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mwrite_errors\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m-\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 201\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0merror\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"code\"\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0;36m11000\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 202\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mDuplicateKeyError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0merror\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"errmsg\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m11000\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0merror\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 203\u001b[0m \u001b[0;32mraise\u001b[0m \u001b[0mWriteError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0merror\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"errmsg\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0merror\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"code\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0merror\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 204\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
- "\u001b[0;31mDuplicateKeyError\u001b[0m: E11000 duplicate key error collection: MyDatabase.registration index: _id_ dup key: { _id: ObjectId('5f0f64b2eaae47671ad2fb9d') }"
- ]
- }
- ],
- "source": [
- "db = Database.create.createFromHostPort('MyDatabase', host=None, port=None)\n",
- "db = Database.create.createFromURI('MyDatabase', url=None)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "py37_rdkit_beta",
- "language": "python",
- "name": "py37_rdkit_beta"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.7.7"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
diff --git a/docs/notebooks/.ipynb_checkpoints/Creating and Writing to MongoDB-checkpoint.ipynb b/docs/notebooks/.ipynb_checkpoints/Creating and Writing to MongoDB-checkpoint.ipynb
deleted file mode 100644
index 0dd6819..0000000
--- a/docs/notebooks/.ipynb_checkpoints/Creating and Writing to MongoDB-checkpoint.ipynb
+++ /dev/null
@@ -1,199 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Creating and Writing to MongoDB\n",
- "\n",
- "Last updated: 7/12/20\n",
- "\n",
- "Methods that directly modify MongoDB database instances are included in the `mongordkit.Database` module.\n",
- "\n",
- "\n",
- "\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "from mongordkit.Database import create, write, utils\n",
- "import pymongo"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Reset Cells\n",
- "Run the contents of this cell to reset the local MongoDB database used in this notebook."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "client = pymongo.MongoClient()\n",
- "print(client.list_database_names())\n",
- "client.drop_database('TestDatabase')\n",
- "print(client.list_database_names())"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Creating Databases\n",
- "Users can opt to bring their own database instances, but `Database.create` provides methods that will create ready-made MongoDB instances, defaulting to your local MongoDB:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Return a database using a host port, such as the local port:\n",
- "TestDB = create.createFromHostPort('TestDatabase', host='localhost', port=27017)\n",
- "\n",
- "# Return a database using a MongoDB URI, such as that provided by Atlas:\n",
- "TestDB = create.createFromURL('TestDatabase', url=None)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "These databases are created with a `registration` collection. The registration collection includes several documents that consist of common pre-made settings, with the default being `STANDARD_SETTING`. All settings are documented in `Database.utils`."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "print(utils.STANDARD_SETTING)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Writing to a Database\n",
- "`Database.write` provides write functionality. Its core method is `writeFromSDF`, which relies on rdkit's `ForwardSDMolSupplier` to write data from an SDF file into a specified database.\n",
- "\n",
- "For each molecule in the SDF, `writeFromSDF` inserts a document containing at the minimum a unique identifying index, that molecule's SMILES, a pickle of the molecule's rdmol, and a field that specifies the registration option used to store the molecule."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Write the contents of first_200.props.sdf, a test dataset, into the TestDatabase created above.\n",
- "# The index will default to the molecule's inchikey.\n",
- "# Returns the number of molecules successfully imported.\n",
- "write.writeFromSDF(TestDB, '../../data/test_data/first_200.props.sdf', 'test')"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The above call is the most basic version of `writeFromSDF`. For additional flexibility, `writeFromSDF` takes several optional arguments that allow users to specify how inbound molecules should be standardized, a field relating to the data's origin, customize the index, and change how many molecules are inserted into the database at a time. "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Write the contents of first_200.props.sdf, a test dataset, into the TestDatabase created above.\n",
- "# This write will use canonical SMILES as the identifying index and thus does not conflict with the above write. \n",
- "# If we had used inchikey again, the write would have imported 0 molecules.\n",
- "write.writeFromSDF(TestDB, '../../data/test_data/first_200.props.sdf', 'test', reg_option='standard_setting', index_option='canonical_smiles', chunk_size=100, limit=None)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "In order to maintain consistency, the registration options and index options are drawn from a set of predetermined options specified in `Database.utils`."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## `.create` Module Contents\n",
- "\n",
- "mongordkit.Database.create.**createFromHostPort**(database, host=None (*string*), port=None (*string*))\n",
- "\n",
- "mongordkit.Database.create.**createFromURL**(database, url=None (*string*))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## `.write` Module Contents"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "mongordkit.Database.write.**writeFromSDF**(database, source_sdf, source_name *(string)*, reg_option=\"standard_setting\", index_option=\"inchikey\", chunk_size=100, limit=None)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "As of 7/15/20, `writeFromSDF` supports the following registration options: \n",
- "* 'standard_setting'\n",
- "\n",
- "And the following index options: \n",
- "* 'inchikey'\n",
- "* 'canonical_smiles'\n",
- "* 'het_atom_tautomer'"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "py37_rdkit_beta",
- "language": "python",
- "name": "py37_rdkit_beta"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.7.7"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
diff --git a/docs/notebooks/.ipynb_checkpoints/Similarity Testing-checkpoint.ipynb b/docs/notebooks/.ipynb_checkpoints/Similarity Testing-checkpoint.ipynb
deleted file mode 100644
index c2a351f..0000000
--- a/docs/notebooks/.ipynb_checkpoints/Similarity Testing-checkpoint.ipynb
+++ /dev/null
@@ -1,244 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Similarity Search Benchmarking"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 1,
- "metadata": {},
- "outputs": [],
- "source": [
- "import mongordkit\n",
- "import time\n",
- "import pymongo\n",
- "import rdkit\n",
- "import matplotlib\n",
- "import matplotlib.pyplot as plt\n",
- "import numpy as np\n",
- "import pandas as pd\n",
- "from rdkit import Chem\n",
- "from statistics import mean\n",
- "import mongomock\n",
- "from rdkit.Chem import AllChem\n",
- "from mongordkit.Database import write\n",
- "from mongordkit.Search import similarity"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 2,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "populating mongodb collection with compounds from chembl...\n",
- "inserted chunk...\n",
- "inserted chunk...\n",
- "200 molecules successfully imported\n"
- ]
- },
- {
- "data": {
- "text/plain": [
- "200"
- ]
- },
- "execution_count": 2,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "#Create a mongomock database instance and write to it. \n",
- "client = mongomock.MongoClient()\n",
- "db = client.db\n",
- "\n",
- "#Write 200 molecules into the database\n",
- "write.writeFromSDF(db, '../../data/test_data/first_200.props.sdf', 'test')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "metadata": {},
- "outputs": [],
- "source": [
- "#Add Morgan fingerprints into the database\n",
- "similarity.addMorganFingerprints(db)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 4,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[[1.0, 'CC1=CC(=O)C=CC1=O']]"
- ]
- },
- "execution_count": 4,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "#Check that similarity search is working, at least for one molecule. \n",
- "doc = db.molecules.find_one()\n",
- "m = Chem.Mol(doc['rdmol'])\n",
- "results = similarity.similaritySearch(m, db, .8)\n",
- "results"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 14,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "populating mongodb collection with compounds from chembl...\n",
- "The specified setting does not exist. Will only insert default molecules\n",
- "inserted chunk...\n",
- "inserted chunk...\n",
- "1000 molecules successfully imported\n"
- ]
- },
- {
- "data": {
- "text/plain": [
- "1000"
- ]
- },
- "execution_count": 14,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "#Create a regular mongoDB database instance and write the first 1000 molecules to it. \n",
- "client = pymongo.MongoClient()\n",
- "db = client.db\n",
- "db.molecules.drop()\n",
- "db.mfp_counts.drop()\n",
- "write.writeFromSDF(db, '../../../chembl_27.sdf', 'test', reg_option='inchikey', index_option='inchikey', chunk_size=500, limit=500)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 26,
- "metadata": {},
- "outputs": [],
- "source": [
- "similarity.addMorganFingerprints(db)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Measuring performance for similarity threshold 0.7.\n"
- ]
- }
- ],
- "source": [
- "#Run benchmarks for similarity search with and without aggregation parameters. \n",
- "thresholds = [0.7, 0.75, 0.8, 0.85, 0.9, 0.95]\n",
- "times = []\n",
- "repetitions = 5\n",
- "for t in thresholds: \n",
- " print(\"Measuring performance for similarity threshold {}.\".format(t))\n",
- " temp_times = []\n",
- " for r in range(repetitions):\n",
- " start = time.time()\n",
- " for m in db.molecules.find():\n",
- " mol = Chem.Mol(m['rdmol'])\n",
- " similarity.similaritySearchAggregate(mol, db, t)\n",
- " print('working')\n",
- " end = time.time()\n",
- " temp_times.append(end - start)\n",
- " times.append([t, mean(temp_times)])\n",
- "print(times)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 25,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[]"
- ]
- },
- "execution_count": 25,
- "metadata": {},
- "output_type": "execute_result"
- },
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "x_list = [v[0] for v in times]\n",
- "y_list = [v[1]*1000 for v in times]\n",
- "plt.xlabel('thresholds')\n",
- "plt.ylabel('time (ms)')\n",
- "plt.plot(x_list, y_list)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "py37_rdkit_beta",
- "language": "python",
- "name": "py37_rdkit_beta"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.7.7"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
diff --git a/docs/notebooks/.ipynb_checkpoints/Similarity and Substructure Search-checkpoint.ipynb b/docs/notebooks/.ipynb_checkpoints/Similarity and Substructure Search-checkpoint.ipynb
deleted file mode 100644
index 9e5d205..0000000
--- a/docs/notebooks/.ipynb_checkpoints/Similarity and Substructure Search-checkpoint.ipynb
+++ /dev/null
@@ -1,318 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Similarity and Substructure Search\n",
- "\n",
- "Last updated: 7/11/20\n",
- "\n",
- "Methods for similarity and substructure search are included in the `mongordkit.Search` module."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 7,
- "metadata": {},
- "outputs": [],
- "source": [
- "from mongordkit.Search import similarity, substructure, utils\n",
- "from mongordkit.Database import create, write\n",
- "from rdkit import Chem\n",
- "import pymongo"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Reset Cells\n",
- "\n",
- "Run these cells to reset the local MongoDB instance used in this notebook."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "['TestDatabase', 'admin', 'config', 'db', 'local']\n",
- "['admin', 'config', 'db', 'local']\n"
- ]
- }
- ],
- "source": [
- "client = pymongo.MongoClient()\n",
- "print(client.list_database_names())\n",
- "client.drop_database('TestDatabase')\n",
- "print(client.list_database_names())"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Similarity Search\n",
- "\n",
- "`mongordkit.Search.similarity` works best on a database prepared by `mongordkit.Database.write`. Users can also use any database that has a `molecules` collection in which each document has the following fields:\n",
- "- `'rdmol': binary pickle object`\n",
- "- `'smiles': some SMILES string`"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Let's run through an example of similarity search. First, we'll have to set up our database:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 4,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "populating mongodb collection with compounds from chembl...\n",
- "200 molecules successfully imported\n"
- ]
- },
- {
- "data": {
- "text/plain": [
- "200"
- ]
- },
- "execution_count": 4,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "TestDB = create.createFromHostPort('TestDatabase', host='localhost', port=27017)\n",
- "write.writeFromSDF(TestDB, '../../data/test_data/first_200.props.sdf', 'test')"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "`similarity.SimSearchNaive` will directly loop through the database and display results. However, this implementation is extremely slow for any decently-sized database. Instead, `similarity` supports precalculating the following kinds of fingerprints for screening: \n",
- "- Morgan (length 1048)\n",
- "\n",
- "through `similarity.addMorganFingerprints`. For each document in a passed in database's `molecules` collection, this method creates a nested field that contains `{morgan_fp: {bits: }, {count: }}`. Note that `addMorganFingerprints` also creates indices on `morgan_fp[bits]` and `morgan_fp[count]` to speed search. "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 5,
- "metadata": {},
- "outputs": [],
- "source": [
- "similarity.addMorganFingerprints(TestDB, radius=2, length=1024)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 6,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "{'bits': [33,\n",
- " 56,\n",
- " 84,\n",
- " 130,\n",
- " 313,\n",
- " 314,\n",
- " 356,\n",
- " 547,\n",
- " 650,\n",
- " 698,\n",
- " 744,\n",
- " 747,\n",
- " 849,\n",
- " 853,\n",
- " 967],\n",
- " 'count': 15}"
- ]
- },
- "execution_count": 6,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "TestDB.molecules.find_one()['morgan_fp']"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "From here, we can directly perform similarity search. `similarity` provides two methods that take advantage of fingerprint screening: `similaritySearch` and `similaritySearchAggregate`. The latter shifts much of the computation into the MongoDB server by using an aggregation pipeline and may improve performance when working with performant or sharded MongoDB servers. "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 19,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "similaritySearch: [[0.35294117647058826, 'c1ccc(P(c2ccccc2)c2ccccc2)cc1'], [0.4117647058823529, 'Cc1ccc(S)cc1'], [0.35, 'CC(O)(c1ccccc1)c1ccccc1']]\n",
- "\n",
- "\n",
- "similaritySearchAggregate: [[0.35294117647058826, 'c1ccc(P(c2ccccc2)c2ccccc2)cc1'], [0.4117647058823529, 'Cc1ccc(S)cc1'], [0.35, 'CC(O)(c1ccccc1)c1ccccc1']]\n"
- ]
- }
- ],
- "source": [
- "q_mol = Chem.MolFromSmiles('Cc1ccccc1')\n",
- "\n",
- "# Perform a similarity search on TestDB for q_mol with a Tanimoto threshold of 0.4. \n",
- "results1 = similarity.similaritySearch(q_mol, TestDB, 0.35)\n",
- "\n",
- "# Do the same thing, but use the MongoDB Aggregation Pipeline. \n",
- "results2 = similarity.similaritySearchAggregate(q_mol, TestDB, 0.35)\n",
- "\n",
- "print('similaritySearch: {}'.format(results1))\n",
- "print('\\n')\n",
- "print('similaritySearchAggregate: {}'.format(results2))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Substructure Search\n",
- "\n",
- "Likewise, `mongordkit.Search.substructure` supports substructure search best on databases prepared by `write`. Database requirements are identical to those for similarity search: a `molecules` collection whose documents have `rdmol` and `smiles` fields. \n",
- "\n",
- "`substructure.SubSearchNaive` provides a fingerprint-less, slower implementation of substructure search suitable for very small databases:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 27,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "['c1ccc(-c2ccccc2OCCOc2ccccc2-c2ccccc2)cc1',\n",
- " 'COc1ccc(Cc2ccc(OC)cc2)cc1',\n",
- " 'COc1cc([N+](=O)[O-])c(N)c([N+](=O)[O-])c1',\n",
- " 'COc1ccc(/C=N/O)cc1',\n",
- " 'Cc1nc2ccccc2c(Oc2ccccc2)c1-c1ccccc1',\n",
- " 'O/N=C/c1ccc2c(c1)OCO2',\n",
- " 'COc1ccc(CC#N)cc1',\n",
- " 'COc1ccc(C(C)(C)C#N)cc1']"
- ]
- },
- "execution_count": 27,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "q_mol = Chem.MolFromSmiles('C1=CC=CC=C1OC')\n",
- "\n",
- "# Perform a substructure search for q_mol on TestDB. \n",
- "substructure.SubSearchNaive(q_mol, TestDB, chirality=False)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "By adding pattern fingerprints, which are optimized for substructure search, we can use `substructure.SubSearch`, which takes advantage of fingerprint screening to avoid as many expensive calls to `HasSubstructMatch` as possible. "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 1,
- "metadata": {},
- "outputs": [
- {
- "ename": "NameError",
- "evalue": "name 'substructure' is not defined",
- "output_type": "error",
- "traceback": [
- "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
- "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
- "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0msubstructure\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mAddPatternFingerprints\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mTestDB\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmolecules\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mTestDB\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmorgan_fp_counts\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlength\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2\u001b[0m \u001b[0msubstructure\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mSubSearch\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mq_mol\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mTestDB\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mchirality\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mFalse\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
- "\u001b[0;31mNameError\u001b[0m: name 'substructure' is not defined"
- ]
- }
- ],
- "source": [
- "substructure.AddPatternFingerprints(TestDB.molecules, TestDB.mfp_counts, length=None)\n",
- "substructure.SubSearch(q_mol, TestDB, chirality=False)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## `.similarity` Contents\n",
- "\n",
- "mongordkit.Search.similarity.**AddMorganFingerprints**(mol_collection (*MongoDB collection*), count_collection (*MongoDB collection*), radius=2 (*int: radius of Morgan fingerprint*), length=2048 (*int: length of Morgan fingerprint bit vector*)) --> None\n",
- "\n",
- "mongordkit.Search.similarity.**SimSearchNaive**(mol (*rdmol object*), mol_collection (*MongoDB collection*), threshold=0.8 (*Tanimoto threshold between 0 and 1, float*)) --> *list: results with format [tanimoto, smiles]*\n",
- "\n",
- "mongordkit.Search.similarity.**SimSearch**(mol (*rdmol object*), mol_collection (*MongoDB collection*), threshold=0.8 (*Tanimoto threshold between 0 and 1, float*)) --> *list: results with format [tanimoto, smiles]*\n",
- "\n",
- "mongordkit.Search.similarity.**SimSearchAggregate**(mol (*rdmol object*), mol_collection (*MongoDB collection*), threshold=0.8 (*Tanimoto threshold between 0 and 1, float*)) --> *list: results with format [tanimoto, smiles]*\n",
- "\n",
- "mongordkit.Search.**AddRandPermutations**(perm_collection (*MongoDB collection), len=2048, num=100) --> None\n",
- "\n",
- "mongordkit.Search.similarity.**SimSearchLSH**(mol (*rdmol object*), db (*MongoDB database containing hash collections*), mol_collection (*MongoDB collection*), perm_collection (*MongoDB collection*), threshold=0.8 (*Tanimoto threshold between 0 and 1, float*)) --> *list: results with format [tanimoto, smiles]*"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## `.substructure` Contents\n",
- "\n",
- "mongordkit.Search.substructure.**AddPatternFingerprints**(db, length=2048 (*int: length of Pattern fingerprint bit vector*)) --> None\n",
- "\n",
- "mongordkit.Search.similarity.**SubSearchNaive**(pattern (*rdmol object*), db, chirality=False (*boolean: include chirality in search or not*)) --> *list: results with format [smiles]*\n",
- "\n",
- "mongordkit.Search.similarity.**SubSearch**(pattern (*rdmol object*), db, chirality=False (*boolean: include chirality in search or not*)) --> *list: results with format [smiles]*"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "py37_rdkit_beta",
- "language": "python",
- "name": "py37_rdkit_beta"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.7.7"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
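Note on the similarity notebook removed above: it describes screening candidates with the stored `morgan_fp` `bits` and `count` fields before computing Tanimoto scores. The sketch below illustrates that idea in plain Python, assuming documents shaped like the `morgan_fp` output shown in the notebook. The function names (`max_possible_tanimoto`, `screened_search`), the in-memory `docs` list, and the placeholder `smiles` labels are hypothetical; this is not mongordkit's actual implementation, which runs against MongoDB.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two collections of on-bit indices."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def max_possible_tanimoto(query_count, doc_count):
    """Upper bound on Tanimoto given only the two on-bit counts.

    The intersection is at most the smaller count and the union is at
    least the larger count, so the score cannot exceed min / max.
    """
    larger = max(query_count, doc_count)
    return min(query_count, doc_count) / larger if larger else 0.0


def screened_search(query_bits, docs, threshold):
    """Return [tanimoto, smiles] pairs scoring at or above `threshold`.

    `docs` is a hypothetical in-memory stand-in for a `molecules`
    collection: each item has 'smiles' and 'morgan_fp' = {'bits', 'count'}.
    """
    query_count = len(set(query_bits))
    results = []
    for doc in docs:
        fp = doc['morgan_fp']
        # Cheap screen: skip documents whose count alone rules them out.
        if max_possible_tanimoto(query_count, fp['count']) < threshold:
            continue
        score = tanimoto(query_bits, fp['bits'])
        if score >= threshold:
            results.append([score, doc['smiles']])
    return results


# Placeholder documents; the 'smiles' values are labels, not real SMILES.
docs = [
    {'smiles': 'mol_a', 'morgan_fp': {'bits': [1, 2, 3, 4, 5], 'count': 5}},
    {'smiles': 'mol_b', 'morgan_fp': {'bits': [1], 'count': 1}},
    {'smiles': 'mol_c', 'morgan_fp': {'bits': [1, 2, 9, 10], 'count': 4}},
]
hits = screened_search([1, 2, 3, 4], docs, 0.5)
```

Here `mol_b` is rejected by the count screen alone, while `mol_c` passes the screen but fails the full comparison. In a database setting the count test can be expressed as an indexed range query, which is presumably why the notebook mentions indices on the `bits` and `count` fields.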
diff --git a/docs/notebooks/Creating and Writing to MongoDB.ipynb b/docs/notebooks/Creating and Writing to MongoDB.ipynb
index 91e557e..fdbf379 100644
--- a/docs/notebooks/Creating and Writing to MongoDB.ipynb
+++ b/docs/notebooks/Creating and Writing to MongoDB.ipynb
@@ -6,7 +6,7 @@
"source": [
"# Creating and Writing to MongoDB\n",
"\n",
- "Last updated: 7/12/20\n",
+ "Last updated: 8/10/20\n",
"\n",
"Methods that directly modify MongoDB database instances are included in the `mongordkit.Database` module.\n",
"\n",
@@ -16,11 +16,12 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
- "from mongordkit.Database import create, write, utils\n",
+ "from mongordkit.Database import create, write, utils, registration\n",
+ "from rdkit import Chem\n",
"import pymongo"
]
},
@@ -29,26 +30,28 @@
"metadata": {},
"source": [
"## Reset Cells\n",
- "Run the contents of this cell to reset the local MongoDB database used in this notebook."
+ "Run the contents of this cell to reset the local MongoDB database, `demo_db`, used in this notebook."
]
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"client = pymongo.MongoClient()\n",
- "print(client.list_database_names())\n",
- "client.drop_database('TestDatabase')\n",
- "print(client.list_database_names())"
+ "client.drop_database('demo_db')\n",
+ "demo_db = client.demo_db\n",
+ "\n",
+ "from rdkit import RDLogger  # needed to disable rdkit warnings\n",
+ "RDLogger.DisableLog('rdApp.*')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "## Creating Databases\n",
+ "## Creating Databases (DEPRECATED for now)\n",
"Users can opt to bring their own database instances, but `Database.create` provides methods that will create ready-made MongoDB instances, defaulting to your local MongoDB:"
]
},
@@ -58,11 +61,11 @@
"metadata": {},
"outputs": [],
"source": [
- "# Return a database using a host port, such as the local port:\n",
- "TestDB = create.createFromHostPort('TestDatabase', host='localhost', port=27017)\n",
+ "# # Return a database using a host port, such as the local port:\n",
+ "# db = create.createFromHostPort('demo_db', host='localhost', port=27017)\n",
"\n",
- "# Return a database using a MongoDB URI, such as that provided by Atlas:\n",
- "TestDB = create.createFromURL('TestDatabase', url=None)"
+ "# # Return a database using a MongoDB URI, such as that provided by Atlas:\n",
+ "# db = create.createFromURL('demo_db', url=None)"
]
},
{
@@ -78,7 +81,140 @@
"metadata": {},
"outputs": [],
"source": [
- "print(utils.STANDARD_SETTING)"
+ "# print(utils.STANDARD_SETTING)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Data Registration\n",
+ "`Database.registration` constructs document representations of molecules according to configurable schemes and handles data registration settings.\n",
+ "\n",
+ "It does this in two parts. First, it defines the global variable `HASH_FUNCTIONS` as a dictionary that maps hash function names to methods. It also defines the global variables `DEFAULT_SCHEME_NAME`, `DEFAULT_AUTHOR`, `DEFAULT_PREPROCESS`, and `DEFAULT_INDEX`, which are used in scheme creation and are thus defined for easy configuration. \n",
+ "\n",
+ "Second, the module defines the `MolDocScheme` class, whose instances store scheme information in instance variables and are passed into `.write` methods to specify the molecule document format. By default, a `MolDocScheme` includes a scheme name, an author, whether or not the molecule has been pre-processed, an index option, two hashes, fingerprints, and value fields. All of the information contained in a `MolDocScheme` object can be used directly to generate documents for molecules:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "{'rdmol': Binary(b'\\xef\\xbe\\xad\\xde\\x00\\x00\\x00\\x00\\x0b\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x07\\x00\\x00\\x00\\x07\\x00\\x00\\x00\\x80\\x01\\x06\\x00`\\x00\\x00\\x00\\x01\\x03\\x06@(\\x00\\x00\\x00\\x03\\x04\\x06@h\\x00\\x00\\x00\\x03\\x03\\x01\\x06@h\\x00\\x00\\x00\\x03\\x03\\x01\\x06@h\\x00\\x00\\x00\\x03\\x03\\x01\\x06@h\\x00\\x00\\x00\\x03\\x03\\x01\\x06@h\\x00\\x00\\x00\\x03\\x03\\x01\\x0b\\x00\\x01\\x00\\x01\\x02h\\x0c\\x02\\x03h\\x0c\\x03\\x04h\\x0c\\x04\\x05h\\x0c\\x05\\x06h\\x0c\\x06\\x01h\\x0c\\x14\\x01\\x06\\x01\\x06\\x05\\x04\\x03\\x02\\x17\\x00\\x00\\x00\\x00\\x16', 0),\n",
+ " 'index': 'YXFVVABEGXRONW-UHFFFAOYSA-N',\n",
+ " 'smiles': 'Cc1ccccc1',\n",
+ " 'scheme': 'default',\n",
+ " 'hashes': {'MolFormula': 'C7H8',\n",
+ " 'SmallWorldIndexBRL': 'B7R1L5',\n",
+ " 'AtomBondCounts': '7,7',\n",
+ " 'cx_smiles': 'Cc1ccccc1',\n",
+ " 'NetCharge': '0',\n",
+ " 'CanonicalSmiles': 'Cc1ccccc1',\n",
+ " 'inchikey_standard': 'YXFVVABEGXRONW-UHFFFAOYSA-N',\n",
+ " 'inchikey_KET_15T': 'YXFVVABEGXRONW-UHFFFAOYNA-N',\n",
+ " 'SmallWorldIndexBR': 'B7R1',\n",
+ " 'DegreeVector': '0,1,5,1',\n",
+ " 'ElementGraph': 'CC1CCCCC1',\n",
+ " 'HetAtomTautomer': 'C[C]1[CH][CH][CH][CH][CH]1_0_0',\n",
+ " 'inchi_standard': 'InChI=1S/C7H8/c1-7-5-3-2-4-6-7/h2-6H,1H3',\n",
+ " 'RedoxPair': 'C[C]1[CH][CH][CH][CH][CH]1',\n",
+ " 'AnonymousGraph': '**1*****1',\n",
+ " 'Mesomer': 'C[C]1[CH][CH][CH][CH][CH]1_0',\n",
+ " 'Regioisomer': '*C.c1ccccc1',\n",
+ " 'inchi_KET_15T': 'InChI=1/C7H8/c1-7-5-3-2-4-6-7/h2-6H,1H3',\n",
+ " 'MurckoScaffold': 'c1ccccc1',\n",
+ " 'ArthorSubstructureOrder': '00070007010007000000002a000000',\n",
+ " 'noiso_smiles': 'Cc1ccccc1',\n",
+ " 'ExtendedMurcko': '*c1ccccc1',\n",
+ " 'HetAtomProtomer': 'C[C]1[CH][CH][CH][CH][CH]1_0'},\n",
+ " 'fingerprints': {},\n",
+ " 'value_data': {}}"
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "rdmol = Chem.MolFromSmiles('Cc1ccccc1')\n",
+ "scheme = registration.MolDocScheme()\n",
+ "scheme.generate_mol_doc(rdmol)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The `MolDocScheme` class also defines a series of instance methods, such as `MolDocScheme.set_index` and `MolDocScheme.remove_field`, that can be used to modify document schemes:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "removed AnonymousGraph from scheme\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ "{'rdmol': Binary(b'\\xef\\xbe\\xad\\xde\\x00\\x00\\x00\\x00\\x0b\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x07\\x00\\x00\\x00\\x07\\x00\\x00\\x00\\x80\\x01\\x06\\x00`\\x00\\x00\\x00\\x01\\x03\\x06@(\\x00\\x00\\x00\\x03\\x04\\x06@h\\x00\\x00\\x00\\x03\\x03\\x01\\x06@h\\x00\\x00\\x00\\x03\\x03\\x01\\x06@h\\x00\\x00\\x00\\x03\\x03\\x01\\x06@h\\x00\\x00\\x00\\x03\\x03\\x01\\x06@h\\x00\\x00\\x00\\x03\\x03\\x01\\x0b\\x00\\x01\\x00\\x01\\x02h\\x0c\\x02\\x03h\\x0c\\x03\\x04h\\x0c\\x04\\x05h\\x0c\\x05\\x06h\\x0c\\x06\\x01h\\x0c\\x14\\x01\\x06\\x01\\x06\\x05\\x04\\x03\\x02\\x17\\x00\\x00\\x00\\x00\\x16', 0),\n",
+ " 'index': 'C7H8',\n",
+ " 'smiles': 'Cc1ccccc1',\n",
+ " 'scheme': 'default',\n",
+ " 'hashes': {'MolFormula': 'C7H8',\n",
+ " 'SmallWorldIndexBRL': 'B7R1L5',\n",
+ " 'AtomBondCounts': '7,7',\n",
+ " 'cx_smiles': 'Cc1ccccc1',\n",
+ " 'NetCharge': '0',\n",
+ " 'CanonicalSmiles': 'Cc1ccccc1',\n",
+ " 'inchikey_standard': 'YXFVVABEGXRONW-UHFFFAOYSA-N',\n",
+ " 'inchikey_KET_15T': 'YXFVVABEGXRONW-UHFFFAOYNA-N',\n",
+ " 'SmallWorldIndexBR': 'B7R1',\n",
+ " 'DegreeVector': '0,1,5,1',\n",
+ " 'ElementGraph': 'CC1CCCCC1',\n",
+ " 'HetAtomTautomer': 'C[C]1[CH][CH][CH][CH][CH]1_0_0',\n",
+ " 'inchi_standard': 'InChI=1S/C7H8/c1-7-5-3-2-4-6-7/h2-6H,1H3',\n",
+ " 'RedoxPair': 'C[C]1[CH][CH][CH][CH][CH]1',\n",
+ " 'Mesomer': 'C[C]1[CH][CH][CH][CH][CH]1_0',\n",
+ " 'Regioisomer': '*C.c1ccccc1',\n",
+ " 'inchi_KET_15T': 'InChI=1/C7H8/c1-7-5-3-2-4-6-7/h2-6H,1H3',\n",
+ " 'MurckoScaffold': 'c1ccccc1',\n",
+ " 'ArthorSubstructureOrder': '00070007010007000000002a000000',\n",
+ " 'noiso_smiles': 'Cc1ccccc1',\n",
+ " 'ExtendedMurcko': '*c1ccccc1',\n",
+ " 'HetAtomProtomer': 'C[C]1[CH][CH][CH][CH][CH]1_0'},\n",
+ " 'fingerprints': {},\n",
+ " 'value_data': {}}"
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "scheme.remove_field('AnonymousGraph')\n",
+ "scheme.add_hash_field('MolFormula')\n",
+ "scheme.set_index('MolFormula')\n",
+ "scheme.generate_mol_doc(rdmol)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Because `MolDocScheme` objects contain no functions, only references to functions, they can be pickled. In fact, the methods in `write` can save `MolDocScheme` objects so that custom schemes are retrievable for later use."
]
},
{
@@ -86,47 +222,288 @@
"metadata": {},
"source": [
"## Writing to a Database\n",
- "`Database.write` provides write functionality. Its core method is `writeFromSDF`, which relies on rdkit's `ForwardSDMolSupplier` to write data from an SDF file into a specified database.\n",
+ "`Database.write` provides write functionality. Its core method is `WriteFromSDF`, which relies on rdkit's `ForwardSDMolSupplier` to write data from an SDF file into a specified database.\n",
"\n",
- "For each molecule in the SDF, `writeFromSDF` inserts a document containing at the minimum a unique identifying index, that molecule's SMILES, a pickle of the molecule's rdmol, and a field that specifies the registration option used to store the molecule."
+ "For each molecule in the SDF, `WriteFromSDF` inserts a document whose fields are specified by the `MolDocScheme` object passed into the function (a scheme with default settings is created if the `scheme` argument is omitted)."
]
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 5,
"metadata": {},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "populating mongodb collection with compounds from SDF...\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "RDKit WARNING: [15:39:46] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:39:46] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:39:46] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:39:46] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:39:46] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:39:46] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:39:46] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:46] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:46] WARNING: Charges were rearranged; Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:46] WARNING: Charges were rearranged; Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:46] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:39:46] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:39:46] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:39:46] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:39:46] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:39:46] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:39:46] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:46] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:46] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:46] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:46] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:46] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:46] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:46] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:46] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:39:46] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:39:46] WARNING: Accepted unusual valence(s): Cu(4); Metal was disconnected\n",
+ "RDKit WARNING: [15:39:46] WARNING: Accepted unusual valence(s): Cu(4); Metal was disconnected\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Accepted unusual valence(s): Cu(4); Metal was disconnected; Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Accepted unusual valence(s): Cu(4); Metal was disconnected; Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:39:47] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:39:47] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:39:47] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Charges were rearranged; Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Charges were rearranged; Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:39:47] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:39:47] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:39:47] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "200 molecules successfully imported\n",
+ "0 duplicates skipped\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ "200"
+ ]
+ },
+ "execution_count": 5,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
"source": [
- "# Write the contents of first_200_props.sdf, a test dataset, into the TestDatabase created above. \n",
+ "# Write the contents of first_200.props.sdf, a test dataset, into the collection demo_db.molecules.\n",
"# The index will default to the molecule's inchikey.\n",
"# Return the number of molecules succesfully imported.\n",
- "write.writeFromSDF(TestDB, '../../data/test_data/first_200.props.sdf', 'test')"
+ "write.WriteFromSDF(demo_db.molecules, '../../data/test_data/first_200.props.sdf')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "The above call is the most basic version of `writeFromSDF`. For additional flexibility, `writeFromSDF` takes several optional arguments that allow users to specify how inbound molecules should be standardized, a field relating to the data's origin, customize the index, and change how many molecules are inserted into the database at a time. "
+ "The above call is the most basic version of `WriteFromSDF`. For additional flexibility, `WriteFromSDF` takes several optional arguments: a custom scheme object, a registration collection to write scheme objects to, the number of molecules inserted at a time (which can affect performance), and a limit on the number of molecules written."
]
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 6,
"metadata": {},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "populating mongodb collection with compounds from SDF...\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:39:47] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:39:47] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:39:47] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:39:47] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:39:47] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:47] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:48] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:39:48] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:39:48] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:48] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:48] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:48] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:48] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:48] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:48] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:48] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:50] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:39:50] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:39:50] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:39:50] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:39:50] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:39:50] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:39:50] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:50] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:50] WARNING: Charges were rearranged; Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:50] WARNING: Charges were rearranged; Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:50] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:39:50] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:39:50] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:39:50] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:39:50] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:39:50] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:39:50] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:50] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:50] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:50] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:50] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:50] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:50] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:50] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:50] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:39:50] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:39:50] WARNING: Accepted unusual valence(s): Cu(4); Metal was disconnected\n",
+ "RDKit WARNING: [15:39:50] WARNING: Accepted unusual valence(s): Cu(4); Metal was disconnected\n",
+ "RDKit WARNING: [15:39:50] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:50] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:50] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:50] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:50] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:50] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:50] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:50] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:39:50] WARNING: Omitted undefined stereo\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "100 molecules successfully imported\n",
+ "0 duplicates skipped\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ "100"
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
"source": [
- "# Write the contents of first_200_props.sdf, a test dataset, into the TestDatabase created above. \n",
+ "# Write the first 100 molecules of first_200.props.sdf, a test dataset, into demo_db.molecules.\n",
"# This write will use canonical SMILES as the identifying index and thus does not conflict with the above write. \n",
"# If we had used inchikey again, the write would have imported 0 molecules.\n",
- "write.writeFromSDF(TestDB, '../../data/test_data/first_200.props.sdf', 'test', reg_option='standard_setting', index_option='canonical_smiles', chunk_size=100, limit=None)"
+ "scheme = registration.MolDocScheme()\n",
+ "scheme.set_index('CanonicalSmiles')\n",
+ "write.WriteFromSDF(demo_db.molecules, '../../data/test_data/first_200.props.sdf', \n",
+ " scheme, reg_collection=demo_db.schema, chunk_size=50, limit=100)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "In order to maintain consistency, the registration options and index options are drawn from a set of predetermined options specified in `Database.utils`."
+ "If you aren't working with an SDF, `.write` also provides `WriteFromMolList`, which takes a Python list of rdmol objects in place of the SDF argument of `WriteFromSDF`."
]
},
{
@@ -144,27 +521,91 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "## `.write` Module Contents"
+ "## `.registration` Module Contents"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "{'AnonymousGraph': (rdmol, f=rdkit.Chem.rdMolHash.HashFunction.AnonymousGraph)>,\n",
+ " 'ElementGraph': (rdmol, f=rdkit.Chem.rdMolHash.HashFunction.ElementGraph)>,\n",
+ " 'CanonicalSmiles': (rdmol, f=rdkit.Chem.rdMolHash.HashFunction.CanonicalSmiles)>,\n",
+ " 'MurckoScaffold': (rdmol, f=rdkit.Chem.rdMolHash.HashFunction.MurckoScaffold)>,\n",
+ " 'ExtendedMurcko': (rdmol, f=rdkit.Chem.rdMolHash.HashFunction.ExtendedMurcko)>,\n",
+ " 'MolFormula': (rdmol, f=rdkit.Chem.rdMolHash.HashFunction.MolFormula)>,\n",
+ " 'AtomBondCounts': (rdmol, f=rdkit.Chem.rdMolHash.HashFunction.AtomBondCounts)>,\n",
+ " 'DegreeVector': (rdmol, f=rdkit.Chem.rdMolHash.HashFunction.DegreeVector)>,\n",
+ " 'Mesomer': (rdmol, f=rdkit.Chem.rdMolHash.HashFunction.Mesomer)>,\n",
+ " 'HetAtomTautomer': (rdmol, f=rdkit.Chem.rdMolHash.HashFunction.HetAtomTautomer)>,\n",
+ " 'HetAtomProtomer': (rdmol, f=rdkit.Chem.rdMolHash.HashFunction.HetAtomProtomer)>,\n",
+ " 'RedoxPair': (rdmol, f=rdkit.Chem.rdMolHash.HashFunction.RedoxPair)>,\n",
+ " 'Regioisomer': (rdmol, f=rdkit.Chem.rdMolHash.HashFunction.Regioisomer)>,\n",
+ " 'NetCharge': (rdmol, f=rdkit.Chem.rdMolHash.HashFunction.NetCharge)>,\n",
+ " 'SmallWorldIndexBR': (rdmol, f=rdkit.Chem.rdMolHash.HashFunction.SmallWorldIndexBR)>,\n",
+ " 'SmallWorldIndexBRL': (rdmol, f=rdkit.Chem.rdMolHash.HashFunction.SmallWorldIndexBRL)>,\n",
+ " 'ArthorSubstructureOrder': (rdmol, f=rdkit.Chem.rdMolHash.HashFunction.ArthorSubstructureOrder)>,\n",
+ " 'inchi_standard': ,\n",
+ " 'inchikey_standard': ,\n",
+ " 'inchi_KET_15T': (rdmol)>,\n",
+ " 'inchikey_KET_15T': (rdmol)>,\n",
+ " 'noiso_smiles': (rdmol)>,\n",
+ " 'cx_smiles': }"
+ ]
+ },
+ "execution_count": 7,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "registration.HASH_FUNCTIONS"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "mongordkit.Database.write.**writeFromSDF**(database, source_sdf, source_name *(string)*, reg_option=\"standard_setting\", index_option=\"inchikey\", chunk_size=100, limit=None) --> *int: number of molecules imported*"
+ "**Class** mongordkit.Database.registration.**MolDocScheme()**\n",
+ "\n",
+ "**Instance variables**:\n",
+ "```\n",
+ "self.scheme_name = DEFAULT_SCHEME_NAME\n",
+ "self.author = DEFAULT_AUTHOR\n",
+ "self.pre_processed = DEFAULT_PREPROCESS\n",
+ "self.index_option = DEFAULT_INDEX\n",
+ "self.hashes = set(HASH_FUNCTIONS.keys())\n",
+ "self.fingerprints = {}\n",
+ "self.value_fields = {}\n",
+ "```\n",
+ "**Instance methods**:\n",
+ "- set_index(self, new_index) --> *None*\n",
+ "- get_index_value(self, rdmol) --> *calculated index value*\n",
+ "- add_hash_field(self, field_name, field_method) --> *None*\n",
+ "- add_value_field(self, field_name, field_value) --> *None*\n",
+ "- add_all_hashes(self) --> *None*\n",
+ "- remove_field(self, field_name) --> *None*\n",
+ "- generate_mol_doc(self, rdmol) --> *Dict: document representing molecule according to scheme*"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## `.write` Module Contents"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "As of 7/15/20, `writeFromSDF` supports the following registration options: \n",
- "* 'standard_setting'\n",
+ "mongordkit.Database.write.**WriteFromSDF**(database, sdf, scheme=MolDocScheme(), reg_collection=None, chunk_size=100, limit=None, warnings=False) --> *int: number of molecules imported*. Set `warnings=True` to turn on rdkit warnings.\n",
"\n",
- "And the following index options: \n",
- "* 'inchikey'\n",
- "* 'canonical_smiles'\n",
- "* 'het_atom_tautomer'"
+ "mongordkit.Database.write.**WriteFromMolList**(database, list, scheme=MolDocScheme(), reg_collection=None, chunk_size=100, limit=None) --> *int: number of molecules imported*"
]
}
],
diff --git a/docs/notebooks/Explore LSH.ipynb b/docs/notebooks/Explore LSH.ipynb
deleted file mode 100644
index ed142b5..0000000
--- a/docs/notebooks/Explore LSH.ipynb
+++ /dev/null
@@ -1,284 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Exploring Locality Sensitive Hashing"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 56,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Imports\n",
- "\n",
- "import numpy as np\n",
- "from rdkit import Chem\n",
- "from rdkit.Chem import AllChem\n",
- "import sys\n",
- "import functools\n",
- "import mongomock"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 14,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Permutation function\n",
- "\n",
- "def get_permutations(len_permutations=2048, num_permutations=100):\n",
- " return map(lambda _: np.random.permutation(2048), range(num_permutations))"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 44,
- "metadata": {},
- "outputs": [],
- "source": [
- "permutations = get_permutations()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 45,
- "metadata": {},
- "outputs": [],
- "source": [
- "def get_min_hash(mol, permutations):\n",
- " qfp = list(AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048))\n",
- " min_hash = []\n",
- " for perm in permutations:\n",
- " for idx, i in enumerate(perm):\n",
- " if qfp_bits[i]:\n",
- " min_hash.append(idx)\n",
- " break \n",
- " return min_hash"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 46,
- "metadata": {},
- "outputs": [],
- "source": [
- "mol = Chem.MolFromSmiles('C1=CC=CC=C1OC')\n",
- "min_hash = get_min_hash(mol, permutations)\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 54,
- "metadata": {},
- "outputs": [],
- "source": [
- "def hash_to_buckets(min_hash, num_buckets=25, nBits=2048):\n",
- " if len(min_hash) % num_buckets:\n",
- " raise Exception('number of buckets must be divisiable by the hash length')\n",
- " buckets = []\n",
- " hash_per_bucket = int(len(min_hash) / num_buckets)\n",
- " num_bits = (nBits-1).bit_length()\n",
- "# if num_bits * hash_per_bucket > sys.maxint.bit_length():\n",
- "# raise Exception('numbers are too large to produce valid buckets')\n",
- " for b in range(num_buckets):\n",
- " buckets.append(functools.reduce(lambda x,y: (x << num_bits) + y, min_hash[b:(b + hash_per_bucket)]))\n",
- " return buckets"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 55,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[250056202389,\n",
- " 1941707205052,\n",
- " 782309908621,\n",
- " 1281762813978,\n",
- " 3814522409145,\n",
- " 1211290208280,\n",
- " 224114294943,\n",
- " 1589238888575,\n",
- " 206825584784,\n",
- " 1366332571687,\n",
- " 1091525753125,\n",
- " 1237114759205,\n",
- " 336236456125,\n",
- " 2517006411838,\n",
- " 318620430363,\n",
- " 1623757740190,\n",
- " 532689514536,\n",
- " 232591015954,\n",
- " 1357377474657,\n",
- " 343673079808,\n",
- " 155025670508,\n",
- " 833224401069,\n",
- " 1527081060,\n",
- " 3127462010894,\n",
- " 1486478143488]"
- ]
- },
- "execution_count": 55,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "hash_to_buckets(min_hash)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 57,
- "metadata": {},
- "outputs": [],
- "source": [
- "client = mongomock.MongoClient()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 58,
- "metadata": {},
- "outputs": [],
- "source": [
- "db = client.db"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 59,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[]"
- ]
- },
- "execution_count": 59,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "db.list_collection_names()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 60,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- ""
- ]
- },
- "execution_count": 60,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "db.molecules.insert_one({'_id': 1, 'molecule': 'boom'})"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 61,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "['molecules']"
- ]
- },
- "execution_count": 61,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "db.list_collection_names()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 62,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- ""
- ]
- },
- "execution_count": 62,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "db.molecules.find()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 63,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "{'_id': 1, 'molecule': 'boom'}"
- ]
- },
- "execution_count": 63,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "db.molecules.find_one()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "py37_rdkit_beta",
- "language": "python",
- "name": "py37_rdkit_beta"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.7.7"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
diff --git a/docs/notebooks/Exploring Multiprocessing.ipynb b/docs/notebooks/Exploring Multiprocessing.ipynb
deleted file mode 100644
index 300afb8..0000000
--- a/docs/notebooks/Exploring Multiprocessing.ipynb
+++ /dev/null
@@ -1,340 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "code",
- "execution_count": 5,
- "metadata": {},
- "outputs": [],
- "source": [
- "import pymongo\n",
- "import rdkit\n",
- "import math\n",
- "from rdkit import Chem\n",
- "from rdkit.Chem import AllChem\n",
- "from rdkit.Chem.rdmolops import PatternFingerprint\n",
- "from mongordkit.Database import write\n",
- "from mongordkit import Search"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 8,
- "metadata": {},
- "outputs": [],
- "source": [
- "client = pymongo.MongoClient()\n",
- "db = client['multip']\n",
- "molecules = db.molecules"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "write.writeFromSDF(db.molecules, '../../data/test_data/first_200.props.sdf', 'test')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 10,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[ObjectId('5f22bce9ad3e9621e4119c64'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c65'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c66'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c67'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c68'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c69'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c6a'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c6b'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c6c'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c6d'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c6e'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c6f'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c70'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c71'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c72'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c73'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c74'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c75'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c76'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c77'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c78'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c79'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c7a'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c7b'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c7c'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c7d'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c7e'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c7f'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c80'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c81'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c82'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c83'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c84'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c85'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c86'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c87'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c88'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c89'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c8a'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c8b'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c8c'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c8d'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c8e'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c8f'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c90'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c91'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c92'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c93'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c94'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c95'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c96'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c97'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c98'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c99'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c9a'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c9b'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c9c'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c9d'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c9e'),\n",
- " ObjectId('5f22bce9ad3e9621e4119c9f'),\n",
- " ObjectId('5f22bce9ad3e9621e4119ca0'),\n",
- " ObjectId('5f22bce9ad3e9621e4119ca1'),\n",
- " ObjectId('5f22bce9ad3e9621e4119ca2'),\n",
- " ObjectId('5f22bce9ad3e9621e4119ca3'),\n",
- " ObjectId('5f22bce9ad3e9621e4119ca4'),\n",
- " ObjectId('5f22bce9ad3e9621e4119ca5'),\n",
- " ObjectId('5f22bce9ad3e9621e4119ca6'),\n",
- " ObjectId('5f22bce9ad3e9621e4119ca7'),\n",
- " ObjectId('5f22bce9ad3e9621e4119ca8'),\n",
- " ObjectId('5f22bce9ad3e9621e4119ca9'),\n",
- " ObjectId('5f22bce9ad3e9621e4119caa'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cab'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cac'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cad'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cae'),\n",
- " ObjectId('5f22bce9ad3e9621e4119caf'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cb0'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cb1'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cb2'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cb3'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cb4'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cb5'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cb6'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cb7'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cb8'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cb9'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cba'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cbb'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cbc'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cbd'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cbe'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cbf'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cc0'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cc1'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cc2'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cc3'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cc4'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cc5'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cc6'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cc7'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cc8'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cc9'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cca'),\n",
- " ObjectId('5f22bce9ad3e9621e4119ccb'),\n",
- " ObjectId('5f22bce9ad3e9621e4119ccc'),\n",
- " ObjectId('5f22bce9ad3e9621e4119ccd'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cce'),\n",
- " ObjectId('5f22bce9ad3e9621e4119ccf'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cd0'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cd1'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cd2'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cd3'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cd4'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cd5'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cd6'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cd7'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cd8'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cd9'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cda'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cdb'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cdc'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cdd'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cde'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cdf'),\n",
- " ObjectId('5f22bce9ad3e9621e4119ce0'),\n",
- " ObjectId('5f22bce9ad3e9621e4119ce1'),\n",
- " ObjectId('5f22bce9ad3e9621e4119ce2'),\n",
- " ObjectId('5f22bce9ad3e9621e4119ce3'),\n",
- " ObjectId('5f22bce9ad3e9621e4119ce4'),\n",
- " ObjectId('5f22bce9ad3e9621e4119ce5'),\n",
- " ObjectId('5f22bce9ad3e9621e4119ce6'),\n",
- " ObjectId('5f22bce9ad3e9621e4119ce7'),\n",
- " ObjectId('5f22bce9ad3e9621e4119ce8'),\n",
- " ObjectId('5f22bce9ad3e9621e4119ce9'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cea'),\n",
- " ObjectId('5f22bce9ad3e9621e4119ceb'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cec'),\n",
- " ObjectId('5f22bce9ad3e9621e4119ced'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cee'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cef'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cf0'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cf1'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cf2'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cf3'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cf4'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cf5'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cf6'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cf7'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cf8'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cf9'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cfa'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cfb'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cfc'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cfd'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cfe'),\n",
- " ObjectId('5f22bce9ad3e9621e4119cff'),\n",
- " ObjectId('5f22bce9ad3e9621e4119d00'),\n",
- " ObjectId('5f22bce9ad3e9621e4119d01'),\n",
- " ObjectId('5f22bce9ad3e9621e4119d02'),\n",
- " ObjectId('5f22bce9ad3e9621e4119d03'),\n",
- " ObjectId('5f22bce9ad3e9621e4119d04'),\n",
- " ObjectId('5f22bce9ad3e9621e4119d05'),\n",
- " ObjectId('5f22bce9ad3e9621e4119d06'),\n",
- " ObjectId('5f22bce9ad3e9621e4119d07'),\n",
- " ObjectId('5f22bce9ad3e9621e4119d08'),\n",
- " ObjectId('5f22bce9ad3e9621e4119d09'),\n",
- " ObjectId('5f22bce9ad3e9621e4119d0a'),\n",
- " ObjectId('5f22bce9ad3e9621e4119d0b'),\n",
- " ObjectId('5f22bce9ad3e9621e4119d0c'),\n",
- " ObjectId('5f22bce9ad3e9621e4119d0d'),\n",
- " ObjectId('5f22bce9ad3e9621e4119d0e'),\n",
- " ObjectId('5f22bce9ad3e9621e4119d0f'),\n",
- " ObjectId('5f22bce9ad3e9621e4119d10'),\n",
- " ObjectId('5f22bce9ad3e9621e4119d11'),\n",
- " ObjectId('5f22bce9ad3e9621e4119d12'),\n",
- " ObjectId('5f22bce9ad3e9621e4119d13'),\n",
- " ObjectId('5f22bce9ad3e9621e4119d14'),\n",
- " ObjectId('5f22bce9ad3e9621e4119d15'),\n",
- " ObjectId('5f22bce9ad3e9621e4119d16'),\n",
- " ObjectId('5f22bce9ad3e9621e4119d17'),\n",
- " ObjectId('5f22bce9ad3e9621e4119d18'),\n",
- " ObjectId('5f22bce9ad3e9621e4119d19'),\n",
- " ObjectId('5f22bce9ad3e9621e4119d1a'),\n",
- " ObjectId('5f22bce9ad3e9621e4119d1b'),\n",
- " ObjectId('5f22bce9ad3e9621e4119d1c'),\n",
- " ObjectId('5f22bce9ad3e9621e4119d1d'),\n",
- " ObjectId('5f22bce9ad3e9621e4119d1e'),\n",
- " ObjectId('5f22bce9ad3e9621e4119d1f'),\n",
- " ObjectId('5f22bce9ad3e9621e4119d20'),\n",
- " ObjectId('5f22bce9ad3e9621e4119d21'),\n",
- " ObjectId('5f22bce9ad3e9621e4119d22'),\n",
- " ObjectId('5f22bce9ad3e9621e4119d23'),\n",
- " ObjectId('5f22bce9ad3e9621e4119d24'),\n",
- " ObjectId('5f22bce9ad3e9621e4119d25'),\n",
- " ObjectId('5f22bce9ad3e9621e4119d26'),\n",
- " ObjectId('5f22bce9ad3e9621e4119d27'),\n",
- " ObjectId('5f22bce9ad3e9621e4119d28'),\n",
- " ObjectId('5f22bce9ad3e9621e4119d29'),\n",
- " ObjectId('5f22bce9ad3e9621e4119d2a'),\n",
- " ObjectId('5f22bce9ad3e9621e4119d2b')]"
- ]
- },
- "execution_count": 10,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "document_ids = molecules.find().distinct('_id')\n",
- "document_ids"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 11,
- "metadata": {},
- "outputs": [],
- "source": [
- "def chunk(list, length):\n",
- " for i in range(0, len(l), length):\n",
- " yield l[i:i + n]"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 13,
- "metadata": {},
- "outputs": [],
- "source": [
- "def calculate(chunk, input):\n",
- " client = pymongo.MongoClient('localhost', 27017, maxPoolSize=1000)\n",
- " db = client['multip']\n",
- " collection = db['molecules']\n",
- " chunk_list = []\n",
- " for id in chunk: \n",
- " result = \n",
- " chunk_list.append()\n",
- " \n",
- " \n",
- "# # define client inside function\n",
- " \n",
- " \n",
- " \n",
- " \n",
- "# client = pymongo.MongoClient('localhost', 27017, maxPoolSize=10000)\n",
- "\n",
- "# db = client['multip']\n",
- "# collection = db['molecules']\n",
- "# chunk_result_list = []\n",
- "# # loop over the id's in the chunk and do the calculation with each\n",
- "# # my problem right now is that I want to be able to chunk the CURSOR.\n",
- "# for id in chunk:\n",
- "# #do the calculation with document collection.find_one(id) \n",
- "# chunk_result_list.append(result)\n",
- "# return chunk_result_list"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "There are two ways that we can split the processing here. \n",
- "\n",
- "When the cursor query is expensive, we want to split up the cursor. \n",
- "\n",
- "When the cursor query is not, we simply want to keep the cursor together, and parallelize the subsequent operations. Let's start with the latter. We've got a cursor object that is iterable. \n",
- "\n",
- "What you could do is iterate through the cursor and get all the object ids. Then you could find all of them again and \n",
- "For "
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "py37_rdkit_beta",
- "language": "python",
- "name": "py37_rdkit_beta"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.7.7"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
diff --git a/docs/notebooks/Similarity Benchmarking.ipynb b/docs/notebooks/Similarity Benchmarking.ipynb
new file mode 100644
index 0000000..701d377
--- /dev/null
+++ b/docs/notebooks/Similarity Benchmarking.ipynb
@@ -0,0 +1,441 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Similarity Search Benchmarking\n",
+ "\n",
+ "These benchmarks were originally run on an early 2015 MacBook Pro with a 2.7 GHz dual-core i5 processor and 8GB of memory. \n",
+ "\n",
+ "They make use of a ChEMBL_27 dataset. \n",
+ "## Setup Work\n",
+ "### Imports"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import mongordkit\n",
+ "import time\n",
+ "import pymongo\n",
+ "import rdkit\n",
+ "import matplotlib\n",
+ "import matplotlib.pyplot as plt\n",
+ "import numpy as np\n",
+ "import sys\n",
+ "import pandas as pd\n",
+ "from rdkit import Chem\n",
+ "from statistics import mean, median\n",
+ "import mongomock\n",
+ "from rdkit.Chem import AllChem\n",
+ "from mongordkit.Database import write\n",
+ "from mongordkit.Search import similarity\n",
+ "from mongordkit import Search"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Database Setup\n",
+ "Here we set up a database called `test` that will hold our molecules. We will construct a collection called `molecules_100K` to hold the first 100,000 molecules in the ChEMBL_27 dataset and a collection called `molecules_1M` to hold the first 1,000,000 molecules in the ChEMBL_27 dataset. If you have already run benchmarks from `mongo-rdkit` on your local MongoDB instance, these should have been set up already."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Initialize the client that will connect to the database.\n",
+ "client = pymongo.MongoClient()\n",
+ "db = client.test\n",
+ "chembl = '../../../chembl_27.sdf'\n",
+ "\n",
+ "# Disable rdkit warnings\n",
+ "rdkit.RDLogger.DisableLog('rdApp.*')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# If necessary, write the first 100,000 compounds to molecules_100K.\n",
+ "if db.molecules_100K.count_documents({}) != 100000:\n",
+ " write.WriteFromSDF(db.molecules_100K, chembl, chunk_size=1000, limit=100000)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# If necessary, write the first 1,000,000 compounds to molecules_1M.\n",
+ "if db.molecules_1M.count_documents({}) != 1000000:\n",
+ " write.WriteFromSDF(db.molecules_1M, chembl, chunk_size=1000, limit=1000000)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "In molecules_100K: 100000 documents\n",
+ "In molecules_1M: 629512 documents\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Let's ensure that there are actually 100,000 and 1M documents in these collections, respectively.\n",
+ "print(f\"In molecules_100K: {db.molecules_100K.count_documents({})} documents\")\n",
+ "print(f\"In molecules_1M: {db.molecules_1M.count_documents({})} documents\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Next, we have to prepare the database for search by adding in fingerprints and hash collections.\n",
+ "Search.PrepareForSearch(db, db.molecules_100K, db.molecules_100KCt, db.molecules_100KPm)\n",
+ "Search.PrepareForSearch(db, db.molecules_1M, db.molecules_1MCt, db.molecules_1MPm)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Query Set Setup\n",
+ "To benchmark, we'll use the first 200 compounds in ChEMBL. Let's get an rdmol for each of these and write them into a list. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "first_200 = []\n",
+ "for rdmol in Chem.SDMolSupplier('../../data/test_data/first_200.props.sdf'): \n",
+ " first_200.append(rdmol)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Benchmarks"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We will search each compound five times against the target database, taking the mean value as representative of that molecule. We'll then take the median and mean for all 200 compounds, repeating the entire process for thresholds 0.7, 0.75, 0.8, 0.85, 0.9, and 0.95. \n",
+ "\n",
+ "We will benchmark both the `SimSearchAggregate` and `SimSearchLSH` methods, keeping in mind that the LSH method does not return exact results. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "thresholds = [0.7, 0.75, 0.8, 0.85, 0.9, 0.95]\n",
+ "repetitions = 5"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### `SimSearchAggregate`"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Benchmark against the first 100,000 molecules in ChEMBL. \n",
+ "aggregate_means_100K = []\n",
+ "aggregate_medians_100K = []\n",
+ "\n",
+ "for t in thresholds: \n",
+ " print(f\"Measuring performance for similarity threshold {t}...\")\n",
+ " query_times = []\n",
+ " for rdmol in first_200:\n",
+ " temp_times = []\n",
+ " for r in range(repetitions):\n",
+ " start = time.time()\n",
+ " _ = similarity.SimSearchAggregate(rdmol, db.molecules_100K, db.molecules_100KCt, threshold=t)\n",
+ " end = time.time()\n",
+ " temp_times.append(end - start)\n",
+ " query_times.append(mean(temp_times))\n",
+ " aggregate_means_100K.append([t, mean(query_times)])\n",
+ " aggregate_medians_100K.append([t, median(query_times)])\n",
+ "\n",
+ "print(f\"Aggregate means: {aggregate_means_100K}\")\n",
+ "print(f\"Aggregate medians: {aggregate_medians_100K}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Before we take a look at the 1M molecule dataset, let's graph these times to get a better idea of how the time required for a similarity search grows as the similarity threshold is lowered: "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "x_list = [v[0] for v in aggregate_medians_100K]\n",
+ "y_list = [v[1] for v in aggregate_medians_100K]\n",
+ "plt.xlabel('thresholds')\n",
+ "plt.ylabel('time (s)')\n",
+ "plt.title('SimSearchAggregate medians / 100K dataset')\n",
+ "plt.plot(x_list, y_list)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "And here are the equivalent benchmarks against a million-molecule dataset:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Benchmark against the first 1M molecules in ChEMBL. \n",
+ "aggregate_means_1M = []\n",
+ "aggregate_medians_1M = []\n",
+ "\n",
+ "for t in thresholds: \n",
+ " print(f\"Measuring performance for similarity threshold {t}...\")\n",
+ " query_times = []\n",
+ " for rdmol in first_200:\n",
+ " temp_times = []\n",
+ " for r in range(repetitions):\n",
+ " start = time.time()\n",
+ " _ = similarity.SimSearchAggregate(rdmol, db.molecules_1M, db.molecules_1MCt, threshold=t)\n",
+ " end = time.time()\n",
+ " temp_times.append(end - start)\n",
+ " query_times.append(mean(temp_times))\n",
+ " aggregate_means_1M.append([t, mean(query_times)])\n",
+ " aggregate_medians_1M.append([t, median(query_times)])\n",
+ "\n",
+ "print(f\"Aggregate means: {aggregate_means_1M}\")\n",
+ "print(f\"Aggregate medians: {aggregate_medians_1M}\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "x_list = [v[0] for v in aggregate_medians_1M]\n",
+ "y_list = [v[1] for v in aggregate_medians_1M]\n",
+ "plt.xlabel('thresholds')\n",
+ "plt.ylabel('time (s)')\n",
+ "plt.title('SimSearchAggregate medians / 1M dataset')\n",
+ "plt.plot(x_list, y_list)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### `SimSearchLSH`"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We will benchmark speed for LSH in the same way as we did for normal similarity search. As noted by the original ChEMBL authors of this approach, however, LSH also introduces an element of inaccuracy. Thus, we will also include a section on comparing results of `SimSearchAggregate` and `SimSearchLSH`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Benchmark against the first 100,000 molecules in ChEMBL. \n",
+ "LSH_means_100K = []\n",
+ "LSH_medians_100K = []\n",
+ "\n",
+ "for t in thresholds: \n",
+ " print(f\"Measuring performance for similarity threshold {t}...\")\n",
+ " query_times = []\n",
+ " for rdmol in first_200:\n",
+ " temp_times = []\n",
+ " for r in range(repetitions):\n",
+ " start = time.time()\n",
+ " _ = similarity.SimSearchLSH(rdmol, db, db.molecules_100K, \n",
+ " db.molecules_100KP, db.molecules_100KCt, threshold=t)\n",
+ " end = time.time()\n",
+ " temp_times.append(end - start)\n",
+ " query_times.append(mean(temp_times))\n",
+ " LSH_means_100K.append([t, mean(query_times)])\n",
+ " LSH_medians_100K.append([t, median(query_times)])\n",
+ "\n",
+ "print(f\"LSH means: {LSH_means_100K}\")\n",
+ "print(f\"LSH medians: {LSH_medians_100K}\")\n",
+ "\n",
+ "x_list = [v[0] for v in LSH_medians_100K]\n",
+ "y_list = [v[1] for v in LSH_medians_100K]\n",
+ "plt.xlabel('thresholds')\n",
+ "plt.ylabel('time (s)')\n",
+ "plt.title('SimSearchLSH medians / 100K dataset')\n",
+ "plt.plot(x_list, y_list)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Benchmark against the first 1M molecules in ChEMBL. \n",
+ "LSH_means_1M = []\n",
+ "LSH_medians_1M = []\n",
+ "\n",
+ "for t in thresholds: \n",
+ " print(f\"Measuring performance for similarity threshold {t}...\")\n",
+ " query_times = []\n",
+ " for rdmol in first_200:\n",
+ " temp_times = []\n",
+ " for r in range(repetitions):\n",
+ " start = time.time()\n",
+ " _ = similarity.SimSearchLSH(rdmol, db, db.molecules_1M, \n",
+ " db.molecules_1MP, db.molecules_1MCt, threshold=t)\n",
+ " end = time.time()\n",
+ " temp_times.append(end - start)\n",
+ " query_times.append(mean(temp_times))\n",
+ " LSH_means_1M.append([t, mean(query_times)])\n",
+ " LSH_medians_1M.append([t, median(query_times)])\n",
+ "\n",
+ "print(f\"LSH means: {LSH_means_1M}\")\n",
+ "print(f\"LSH medians: {LSH_medians_1M}\")\n",
+ "\n",
+ "x_list = [v[0] for v in LSH_medians_1M]\n",
+ "y_list = [v[1] for v in LSH_medians_1M]\n",
+ "plt.xlabel('thresholds')\n",
+ "plt.ylabel('time (s)')\n",
+ "plt.title('SimSearchLSH medians / 1M dataset')\n",
+ "plt.plot(x_list, y_list)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To compare accuracy, we use the approach described in the ChEMBL blog post: the size of the symmetric difference between the two result sets, expressed as a percentage of the size of their union. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "results = []\n",
+ "\n",
+ "for t in thresholds: \n",
+ " print(f\"Measuring accuracy for similarity threshold {t}...\")\n",
+ " nmols_w_discrepancies = 0\n",
+ " discrepancies_per_mol = []\n",
+ " discrepancy_percent_per_mol = []\n",
+ " for rdmol in first_200:\n",
+ " sim_lsh = similarity.SimSearchLSH(rdmol, db, db.molecules_100K, \n",
+ " db.molecules_100KP, db.molecules_100KCt, threshold=t)\n",
+ " sim_agg = similarity.SimSearchAggregate(rdmol, db.molecules_100K, db.molecules_100KCt, threshold=t)\n",
+ " if sim_lsh: \n",
+ " set_lsh = set(result[1] for result in sim_lsh)\n",
+ " else:\n",
+ " set_lsh = set()\n",
+ " if sim_agg: \n",
+ " set_agg = set(result[1] for result in sim_agg)\n",
+ " else: \n",
+ " set_agg = set()\n",
+ " sym_set_diff = (set_lsh ^ set_agg)\n",
+ " discrepancies = len(sym_set_diff)\n",
+ " total = len(set_lsh | set_agg)\n",
+ " if discrepancies:\n",
+ " nmols_w_discrepancies += 1\n",
+ " discrepancies_per_mol.append(discrepancies)\n",
+ " discrepancy_percent_per_mol.append(discrepancies / total * 100)\n",
+ " results.append([t, f'nmols_w_discrepancies: {nmols_w_discrepancies}', \n",
+ " np.mean(discrepancies_per_mol), np.mean(discrepancy_percent_per_mol)])\n",
+ "print(results)\n",
+ "x_list = [v[0] for v in results]\n",
+ "y_list = [v[3] for v in results]\n",
+ "plt.xlabel('thresholds')\n",
+ "plt.ylabel('discrepancy percent per molecule')\n",
+ "plt.title('LSH Accuracy / 100K dataset')\n",
+ "plt.plot(x_list, y_list)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Discussion\n",
+ "These times are already reasonable for a similarity search. Note, however, that these benchmarks were run on a local MongoDB instance, so the client and the server shared one machine. A horizontally scaled MongoDB deployment could benefit more from the aggregation pipeline, which might speed up search further. \n",
+ "\n",
+ "Query time also increases greatly as the Tanimoto threshold decreases."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "py37_rdkit_beta",
+ "language": "python",
+ "name": "py37_rdkit_beta"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.7.7"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
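
Note on the LSH accuracy cell in the benchmarking notebook above: the metric it describes (symmetric set difference as a percentage of the union of the two result sets) can be sketched in isolation as follows. This is a minimal illustration, not code from `mongordkit`; the function name `discrepancy_percent` and the molecule identifiers are hypothetical, and returning 0.0 when both result sets are empty is an assumption made here to avoid dividing by zero.

```python
def discrepancy_percent(lsh_ids, exact_ids):
    """Symmetric difference of two result sets as a percentage of their union."""
    set_lsh, set_exact = set(lsh_ids), set(exact_ids)
    union = set_lsh | set_exact
    if not union:
        # Assumption: two empty result sets agree trivially.
        return 0.0
    return len(set_lsh ^ set_exact) / len(union) * 100


# Hypothetical identifiers, for illustration only.
exact = ["mol_a", "mol_b", "mol_c"]
approx = ["mol_a", "mol_b", "mol_d"]
print(discrepancy_percent(approx, exact))  # 2 differing ids out of 4 in the union
```

The notebook cell appends a percentage only for query molecules that show at least one discrepancy, so its reported mean is taken over those molecules rather than over all 200 queries.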
diff --git a/docs/notebooks/Similarity Testing.ipynb b/docs/notebooks/Similarity Testing.ipynb
deleted file mode 100644
index 986a6c0..0000000
--- a/docs/notebooks/Similarity Testing.ipynb
+++ /dev/null
@@ -1,532 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Similarity Search Benchmarking\n",
- "\n",
- "These benchmarks were originally run on an early 2015 MacBook Pro with a 2.7 GHz dual-core i5 processor and 8GB of memory. \n",
- "\n",
- "They make use of a ChEMBL_27 dataset. "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 1,
- "metadata": {},
- "outputs": [],
- "source": [
- "import mongordkit\n",
- "import time\n",
- "import pymongo\n",
- "import rdkit\n",
- "import matplotlib\n",
- "import matplotlib.pyplot as plt\n",
- "import numpy as np\n",
- "from os import sys\n",
- "import pandas as pd\n",
- "from rdkit import Chem\n",
- "from statistics import mean\n",
- "import mongomock\n",
- "from rdkit.Chem import AllChem\n",
- "from mongordkit.Database import write\n",
- "from mongordkit.Search import similarity"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 4,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "populating mongodb collection with compounds from chembl...\n",
- "199 molecules successfully imported\n"
- ]
- }
- ],
- "source": [
- "#Create a mongomock database instance and write to it. \n",
- "client = mongomock.MongoClient()\n",
- "db = client.db\n",
- "\n",
- "#Write 200 molecules into the database\n",
- "write.writeFromSDF(db.molecules, '../../data/test_data/first_200.props.sdf', 'test', chunk_size=100, limit=199)\n",
- "doc = db.molecules.find_one()\n",
- "m = Chem.Mol(doc['rdmol'])"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "metadata": {},
- "outputs": [],
- "source": [
- "#Add Morgan fingerprints into the database\n",
- "similarity.addMorganFingerprints(db)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 14,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[]"
- ]
- },
- "execution_count": 14,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "#Check that similarity search is working, at least for one molecule. \n",
- "doc = db.molecules.find_one()\n",
- "m = Chem.Mol(doc['rdmol'])\n",
- "results = similarity.similaritySearch(m, db, .8)\n",
- "results"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 4,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "populating mongodb collection with compounds from chembl...\n",
- "inserted chunk...\n",
- "inserted chunk...\n",
- "1000 molecules successfully imported\n"
- ]
- },
- {
- "data": {
- "text/plain": [
- "1000"
- ]
- },
- "execution_count": 4,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "#Create a regular mongoDB database instance and write the first 1000 molecules to it. \n",
- "client = pymongo.MongoClient()\n",
- "db = client.db\n",
- "db.molecules.drop()\n",
- "db.mfp_counts.drop()\n",
- "write.writeFromSDF(db, '../../../chembl_27.sdf', 'test', reg_option='standard_setting', index_option='inchikey', chunk_size=500, limit=500)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 9,
- "metadata": {},
- "outputs": [],
- "source": [
- "def calc_tanimoto(Na, Nb):\n",
- " Nab = len(set(Na).intersection((set(Nb))))\n",
- " return float(Nab) / (len(Na) + len(Nb) - Nab)\n",
- "\n",
- "def similarity_search_naive(query_mol, db, threshold): \n",
- " results = []\n",
- " qfp = list(AllChem.GetMorganFingerprintAsBitVect(query_mol, 2, nBits=1024).GetOnBits())\n",
- " for mol in db.molecules.find():\n",
- " mfp = list(AllChem.GetMorganFingerprintAsBitVect(Chem.Mol(mol['rdmol']), 2, nBits=1024).GetOnBits())\n",
- " tanimoto = calc_tanimoto(qfp, mfp)\n",
- " if calc_tanimoto(qfp, mfp) >= threshold:\n",
- " results.append([tanimoto, mol['smiles']])\n",
- " return results"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 6,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Measuring performance for similarity threshold 0.7.\n",
- "Measuring performance for similarity threshold 0.75.\n",
- "Measuring performance for similarity threshold 0.8.\n",
- "Measuring performance for similarity threshold 0.85.\n",
- "Measuring performance for similarity threshold 0.9.\n",
- "Measuring performance for similarity threshold 0.95.\n",
- "[[0.7, 3.236401987075806], [0.75, 2.964214563369751], [0.8, 2.850223159790039], [0.85, 2.716036558151245], [0.9, 2.50888934135437], [0.95, 2.7822859287261963]]\n",
- "Measuring performance for similarity threshold 0.7.\n"
- ]
- },
- {
- "ename": "NameError",
- "evalue": "name 'query_mol' is not defined",
- "output_type": "error",
- "traceback": [
- "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
- "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
- "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 23\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mm\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mdb\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmolecules\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfind\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 24\u001b[0m \u001b[0mmol\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mChem\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mMol\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mm\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'rdmol'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 25\u001b[0;31m \u001b[0m_\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0msimilarity_search_naive\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmol\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdb\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mt\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 26\u001b[0m \u001b[0mend\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mtime\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtime\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 27\u001b[0m \u001b[0mtemp_times\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mappend\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mend\u001b[0m \u001b[0;34m-\u001b[0m \u001b[0mstart\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
- "\u001b[0;32m\u001b[0m in \u001b[0;36msimilarity_search_naive\u001b[0;34m(mol, db, t)\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0msimilarity_search_naive\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmol\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdb\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mt\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 6\u001b[0m \u001b[0mresults\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 7\u001b[0;31m \u001b[0mqfp\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mlist\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mAllChem\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mGetMorganFingerprintAsBitVect\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mquery_mol\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m2\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnBits\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m1024\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mGetOnBits\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 8\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mmol\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mdb\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmolecules\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfind\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 9\u001b[0m \u001b[0mmfp\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mlist\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mAllChem\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mGetMorganFingerprintAsBitVect\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mChem\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mMol\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmol\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'rdmol'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0;36m2\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnBits\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m1024\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mGetOnBits\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
- "\u001b[0;31mNameError\u001b[0m: name 'query_mol' is not defined"
- ]
- }
- ],
- "source": [
- "#Run benchmarks for similarity search with and without aggregation parameters, then with LSH + aggregation. \n",
- "thresholds = [0.7, 0.75, 0.8, 0.85, 0.9, 0.95]\n",
- "times = []\n",
- "repetitions = 5\n",
- "for t in thresholds: \n",
- " print(\"Measuring performance for similarity threshold {}.\".format(t))\n",
- " temp_times = []\n",
- " for r in range(repetitions):\n",
- " start = time.time()\n",
- " for m in db.molecules.find():\n",
- " mol = Chem.Mol(m['rdmol'])\n",
- " _ = similarity.similaritySearch(mol, db, t)\n",
- " end = time.time()\n",
- " temp_times.append(end - start)\n",
- " times.append([t, mean(temp_times)])\n",
- "print(times)\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 17,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Measuring performance for similarity threshold 0.7.\n",
- "Measuring performance for similarity threshold 0.75.\n"
- ]
- },
- {
- "ename": "KeyboardInterrupt",
- "evalue": "",
- "output_type": "error",
- "traceback": [
- "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
- "\u001b[0;31mKeyboardInterrupt\u001b[0m Traceback (most recent call last)",
- "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 9\u001b[0m \u001b[0;32mbreak\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 10\u001b[0m \u001b[0mmol\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mChem\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mMol\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mm\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'rdmol'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 11\u001b[0;31m \u001b[0m_\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0msimilarity_search_naive\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmol\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdb\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mt\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 12\u001b[0m \u001b[0mcounter\u001b[0m \u001b[0;34m+=\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 13\u001b[0m \u001b[0mend\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mtime\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtime\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
- "\u001b[0;32m\u001b[0m in \u001b[0;36msimilarity_search_naive\u001b[0;34m(query_mol, db, threshold)\u001b[0m\n\u001b[1;32m 7\u001b[0m \u001b[0mqfp\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mlist\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mAllChem\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mGetMorganFingerprintAsBitVect\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mquery_mol\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m2\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnBits\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m1024\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mGetOnBits\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 8\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mmol\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mdb\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmolecules\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfind\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 9\u001b[0;31m \u001b[0mmfp\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mlist\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mAllChem\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mGetMorganFingerprintAsBitVect\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mChem\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mMol\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmol\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'rdmol'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m2\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnBits\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m1024\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mGetOnBits\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 10\u001b[0m \u001b[0mtanimoto\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcalc_tanimoto\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mqfp\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0mmfp\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 11\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mcalc_tanimoto\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mqfp\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmfp\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m>=\u001b[0m \u001b[0mthreshold\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
- "\u001b[0;31mKeyboardInterrupt\u001b[0m: "
- ]
- }
- ],
- "source": [
- "for t in thresholds: \n",
- " print(\"Measuring performance for similarity threshold {}.\".format(t))\n",
- " temp_times = []\n",
- " for r in range(5):\n",
- " start = time.time()\n",
- " counter = 0\n",
- " for m in db.molecules.find():\n",
- " if counter > 100: \n",
- " break\n",
- " mol = Chem.Mol(m['rdmol'])\n",
- " _ = similarity_search_naive(mol, db, t)\n",
- " counter += 1\n",
- " end = time.time()\n",
- " temp_times.append(end - start)\n",
- " times.append([t, mean(temp_times)])\n",
- "print(times)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 40,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[]"
- ]
- },
- "execution_count": 40,
- "metadata": {},
- "output_type": "execute_result"
- },
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "x_list = [v[0] for v in times]\n",
- "y_list = [v[1]*1000 for v in times]\n",
- "plt.xlabel('thresholds')\n",
- "plt.ylabel('time (ms)')\n",
- "plt.title('Without Aggregation')\n",
- "plt.plot(x_list, y_list)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 41,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Measuring performance for similarity threshold 0.7.\n",
- "Measuring performance for similarity threshold 0.75.\n",
- "Measuring performance for similarity threshold 0.8.\n",
- "Measuring performance for similarity threshold 0.85.\n",
- "Measuring performance for similarity threshold 0.9.\n",
- "Measuring performance for similarity threshold 0.95.\n",
- "[[0.7, 6.002911186218261], [0.75, 5.983159065246582], [0.8, 5.641262626647949], [0.85, 5.888340759277344], [0.9, 6.869273900985718], [0.95, 5.446581506729126]]\n"
- ]
- },
- {
- "data": {
- "text/plain": [
- "[]"
- ]
- },
- "execution_count": 41,
- "metadata": {},
- "output_type": "execute_result"
- },
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "thresholds = [0.7, 0.75, 0.8, 0.85, 0.9, 0.95]\n",
- "times = []\n",
- "repetitions = 5\n",
- "for t in thresholds: \n",
- " print(\"Measuring performance for similarity threshold {}.\".format(t))\n",
- " temp_times = []\n",
- " for r in range(repetitions):\n",
- " start = time.time()\n",
- " for m in db.molecules.find():\n",
- " mol = Chem.Mol(m['rdmol'])\n",
- " _ = similarity.similaritySearchAggregate(mol, db, t)\n",
- " end = time.time()\n",
- " temp_times.append(end - start)\n",
- " times.append([t, mean(temp_times)])\n",
- "print(times)\n",
- "x_list = [v[0] for v in times]\n",
- "y_list = [v[1]*1000 for v in times]\n",
- "plt.xlabel('thresholds')\n",
- "plt.ylabel('time (ms)')\n",
- "plt.title('With Aggregation')\n",
- "plt.plot(x_list, y_list)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 42,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "populating mongodb collection with compounds from chembl...\n",
- "The specified setting does not exist. Will only insert default molecules\n",
- "inserted chunk...\n",
- "inserted chunk...\n",
- "1000 molecules successfully imported\n"
- ]
- }
- ],
- "source": [
- "db.molecules.drop()\n",
- "db.mfp_counts.drop()\n",
- "write.writeFromSDF(db, '../../../chembl_27.sdf', 'test', reg_option='inchikey', index_option='inchikey', chunk_size=500, limit=500)\n",
- "similarity.addMorganFingerprints(db)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 47,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Measuring performance for similarity threshold 0.7.\n",
- "Measuring performance for similarity threshold 0.75.\n",
- "Measuring performance for similarity threshold 0.8.\n",
- "Measuring performance for similarity threshold 0.85.\n",
- "Measuring performance for similarity threshold 0.9.\n",
- "Measuring performance for similarity threshold 0.95.\n",
- "[[0.7, 0.0038346290588378907], [0.75, 0.003863954544067383], [0.8, 0.00497593879699707], [0.85, 0.00534672737121582], [0.9, 0.004187107086181641], [0.95, 0.006510639190673828]]\n"
- ]
- },
- {
- "data": {
- "text/plain": [
- "[]"
- ]
- },
- "execution_count": 47,
- "metadata": {},
- "output_type": "execute_result"
- },
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "#Compute benchmarks with a fingerprint counts collection.\n",
- "thresholds = [0.7, 0.75, 0.8, 0.85, 0.9, 0.95]\n",
- "times = []\n",
- "repetitions = 5\n",
- "for t in thresholds: \n",
- " print(\"Measuring performance for similarity threshold {}.\".format(t))\n",
- " temp_times = []\n",
- " for r in range(repetitions):\n",
- " start = time.time()\n",
- " for m in db.molecules.find():\n",
- " mol = Chem.Mol(m['rdmol'])\n",
- " _ = similarity.similaritySearch(mol, db, t)\n",
- " end = time.time()\n",
- " temp_times.append(end - start)\n",
- " times.append([t, mean(temp_times)])\n",
- "print(times)\n",
- "x_list = [v[0] for v in times]\n",
- "y_list = [v[1]*1000 for v in times]\n",
- "plt.xlabel('thresholds')\n",
- "plt.ylabel('time (ms)')\n",
- "plt.title('Without Aggregation')\n",
- "plt.plot(x_list, y_list)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 45,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Measuring performance for similarity threshold 0.7.\n",
- "Measuring performance for similarity threshold 0.75.\n",
- "Measuring performance for similarity threshold 0.8.\n",
- "Measuring performance for similarity threshold 0.85.\n",
- "Measuring performance for similarity threshold 0.9.\n",
- "Measuring performance for similarity threshold 0.95.\n",
- "[[0.7, 0.009987068176269532], [0.75, 0.0059474468231201175], [0.8, 0.005334234237670899], [0.85, 0.0038384437561035157], [0.9, 0.0040286540985107425], [0.95, 0.0037296295166015627]]\n"
- ]
- },
- {
- "data": {
- "text/plain": [
- "[]"
- ]
- },
- "execution_count": 45,
- "metadata": {},
- "output_type": "execute_result"
- },
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "times = []\n",
- "for t in thresholds: \n",
- " print(\"Measuring performance for similarity threshold {}.\".format(t))\n",
- " temp_times = []\n",
- " for r in range(repetitions):\n",
- " start = time.time()\n",
- " for m in db.molecules.find():\n",
- " mol = Chem.Mol(m['rdmol'])\n",
- " _ = similarity.similaritySearchAggregate(mol, db, t)\n",
- " end = time.time()\n",
- " temp_times.append(end - start)\n",
- " times.append([t, mean(temp_times)])\n",
- "print(times)\n",
- "x_list = [v[0] for v in times]\n",
- "y_list = [v[1]*1000 for v in times]\n",
- "plt.xlabel('thresholds')\n",
- "plt.ylabel('time (ms)')\n",
- "plt.title('Without Aggregation')\n",
- "plt.plot(x_list, y_list)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "py37_rdkit_beta",
- "language": "python",
- "name": "py37_rdkit_beta"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.7.7"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
diff --git a/docs/notebooks/Similarity and Substructure Search.ipynb b/docs/notebooks/Similarity and Substructure Search.ipynb
index c4263b5..33285bc 100644
--- a/docs/notebooks/Similarity and Substructure Search.ipynb
+++ b/docs/notebooks/Similarity and Substructure Search.ipynb
@@ -6,18 +6,19 @@
"source": [
"# Similarity and Substructure Search\n",
"\n",
- "Last updated: 7/27/20\n",
+ "Last updated: 8/11/20\n",
"\n",
"Methods for similarity and substructure search are included in the `mongordkit.Search` module."
]
},
{
"cell_type": "code",
- "execution_count": 7,
+ "execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from mongordkit.Search import similarity, substructure, utils\n",
+ "from mongordkit import Search\n",
"from mongordkit.Database import create, write\n",
"from rdkit import Chem\n",
"import pymongo"
@@ -29,28 +30,70 @@
"source": [
"## Reset Cells\n",
"\n",
- "Run these cells to reset the local MongoDB instance used in this notebook."
+ "Run these cells to reset the MongoDB database used in this notebook."
]
},
{
"cell_type": "code",
- "execution_count": 3,
+ "execution_count": 2,
"metadata": {},
+ "outputs": [],
+ "source": [
+ "client = pymongo.MongoClient()\n",
+ "client.drop_database('demo_db')\n",
+ "demo_db = client.demo_db\n",
+ "\n",
+ "# Disable rdkit warnings\n",
+ "from rdkit import RDLogger; RDLogger.DisableLog('rdApp.*')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Preparing for Search\n",
+ "Preparing the database for search requires adding several fingerprints and hashes. You can perform all of the setup required for similarity and substructure search by calling `Search.PrepareForSearch`. A typical workflow goes straight from the following two lines into search calls:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {
+ "scrolled": true
+ },
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
- "['TestDatabase', 'admin', 'config', 'db', 'local']\n",
- "['admin', 'config', 'db', 'local']\n"
+ "populating mongodb collection with compounds from SDF...\n",
+ "200 molecules successfully imported\n",
+ "0 duplicates skipped\n",
+ "Preparing database and collections for search...\n",
+ "Added pattern fps, morgan fps, and support for LSH.\n"
]
}
],
"source": [
- "client = pymongo.MongoClient()\n",
- "print(client.list_database_names())\n",
- "client.drop_database('TestDatabase')\n",
- "print(client.list_database_names())"
+ "write.WriteFromSDF(demo_db.molecules, '../../data/test_data/first_200.props.sdf')\n",
+ "Search.PrepareForSearch(demo_db, demo_db.molecules, demo_db.mfp_counts, demo_db.permutations)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "However, the rest of this notebook will explicitly note the addition of fingerprints and hashes in an effort to better communicate how the code actually works. Let's reset the database again so that we can insert the hashes step by step without any issues."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "client.drop_database('demo_db')\n",
+ "demo_db = client.demo_db"
]
},
{
@@ -59,29 +102,221 @@
"source": [
"## Similarity Search\n",
"\n",
- "`mongordkit.Search.similarity` supports similarity search best on a database prepared by `mongordkit.Database.write`. Users can also use any database that has a `molecules` collection where each document in that collection has the following fields:\n",
+ "`mongordkit.Search.similarity` works best on a MongoDB collection prepared by `mongordkit.Database.write`. For basic similarity search, users can also use any collection whose documents have the following fields:\n",
"- `'rdmol': binary pickle object`\n",
- "- `'smiles': some SMILES string`"
+ "- `'index': a unique identifier for each molecule`\n",
+ "- `'fingerprints': {a nested document that can be blank at the start}`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "Let's run through an example of similarity search. First, we'll have to set up our database:"
+ "Let's run through an example of similarity search. First, we'll write into the database 200 molecules from a data file included in the `mongordkit` package. We will use default write settings."
]
},
{
"cell_type": "code",
- "execution_count": 4,
- "metadata": {},
+ "execution_count": 5,
+ "metadata": {
+ "scrolled": true
+ },
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
- "populating mongodb collection with compounds from chembl...\n",
- "200 molecules successfully imported\n"
+ "populating mongodb collection with compounds from SDF...\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "RDKit WARNING: [15:43:23] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:23] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:23] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:23] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:23] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:23] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:23] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:23] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:23] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:23] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:23] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:23] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:23] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:23] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:23] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:23] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:23] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:23] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:23] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:23] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:23] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:23] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:23] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:23] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:23] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:23] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:23] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:23] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:23] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:23] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:23] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:43:23] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:43:23] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:43:23] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:43:23] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:43:23] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:43:23] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:23] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:23] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:23] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:23] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:23] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:23] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:43:23] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:43:23] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:23] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:23] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:23] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:23] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:23] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:23] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:23] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:38] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:43:38] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:43:38] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:43:38] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:43:38] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:43:38] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:43:38] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:38] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:38] WARNING: Charges were rearranged; Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:38] WARNING: Charges were rearranged; Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:38] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:43:38] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:43:38] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:43:38] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:43:38] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:43:38] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:43:38] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:38] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:38] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:38] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:38] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:38] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:38] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:38] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:38] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:43:38] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:43:38] WARNING: Accepted unusual valence(s): Cu(4); Metal was disconnected\n",
+ "RDKit WARNING: [15:43:38] WARNING: Accepted unusual valence(s): Cu(4); Metal was disconnected\n",
+ "RDKit WARNING: [15:43:38] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:38] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Accepted unusual valence(s): Cu(4); Metal was disconnected; Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Accepted unusual valence(s): Cu(4); Metal was disconnected; Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:43:39] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:43:39] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:43:39] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Charges were rearranged; Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Charges were rearranged; Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:43:39] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:43:39] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:43:39] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:39] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:40] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:43:40] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:43:40] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:43:40] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:43:40] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:43:40] WARNING: Charges were rearranged\n",
+ "RDKit WARNING: [15:43:40] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:40] WARNING: Omitted undefined stereo\n",
+ "RDKit WARNING: [15:43:40] WARNING: Omitted undefined stereo\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "200 molecules successfully imported\n",
+ "0 duplicates skipped\n"
]
},
{
@@ -90,90 +325,90 @@
"200"
]
},
- "execution_count": 4,
+ "execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
- "TestDB = create.createFromHostPort('TestDatabase', host='localhost', port=27017)\n",
- "write.writeFromSDF(TestDB, '../../data/test_data/first_200.props.sdf', 'test')"
+ "write.WriteFromSDF(demo_db.molecules, '../../data/test_data/first_200.props.sdf')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "`similarity.SimSearchNaive` will directly loop through the database and display results. However, this implementation is extremely slow for any decently-sized database. Instead, `similarity` supports precalculating the following kinds of fingerprints for screening: \n",
- "- Morgan (length 1048)\n",
+ "`similarity.SimSearchNaive` loops directly through the database and displays results, which is useful for verifying accuracy. However, this implementation is extremely slow for any reasonably sized database. Instead, `similarity` supports precalculating the following kinds of fingerprints for screening: \n",
+ "- Morgan (default radius 2, length 2048)\n",
"\n",
- "through `similarity.addMorganFingerprints`. For each document in a passed in database's `molecules` collection, this method creates a nested field that contains `{morgan_fp: {bits: }, {count: }}`. Note that `addMorganFingerprints` also creates indices on `morgan_fp[bits]` and `morgan_fp[count]` to speed search. "
+ "through `similarity.AddMorganFingerprints`. For each document in the passed-in collection, this method adds the nested field `{morgan_fp: {bits: }, {count: }}` to the document's `fingerprints` field. `AddMorganFingerprints` also creates indices on `morgan_fp[bits]` and `morgan_fp[count]` to speed up search. "
]
},
{
"cell_type": "code",
- "execution_count": 5,
+ "execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
- "similarity.addMorganFingerprints(TestDB, radius=2, length=1024)"
+ "similarity.AddMorganFingerprints(demo_db.molecules, demo_db.mfp_counts)"
]
},
{
"cell_type": "code",
- "execution_count": 6,
+ "execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
- "{'bits': [33,\n",
- " 56,\n",
- " 84,\n",
- " 130,\n",
- " 313,\n",
+ "{'bits': [84,\n",
" 314,\n",
" 356,\n",
" 547,\n",
" 650,\n",
- " 698,\n",
- " 744,\n",
" 747,\n",
- " 849,\n",
- " 853,\n",
- " 967],\n",
- " 'count': 15}"
+ " 967,\n",
+ " 1057,\n",
+ " 1080,\n",
+ " 1154,\n",
+ " 1337,\n",
+ " 1380,\n",
+ " 1722,\n",
+ " 1768,\n",
+ " 1873,\n",
+ " 1877],\n",
+ " 'count': 16}"
]
},
- "execution_count": 6,
+ "execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
- "TestDB.molecules.find_one()['morgan_fp']"
+ "demo_db.molecules.find_one()['fingerprints']['morgan_fp']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "From here, we can directly perform similarity search. `similarity` provides two methods that take advantage of fingerprint screening: `similaritySearch` and `similaritySearchAggregate`. The latter shifts much of the computation into the MongoDB server by using an aggregation pipeline and may improve performance when working with performant or sharded MongoDB servers. "
+ "From here, we can directly perform similarity search. `similarity` provides two methods that take advantage of fingerprint screening: `SimSearch` and `SimSearchAggregate`. The latter shifts much of the computation onto the MongoDB server by using an aggregation pipeline and can dramatically improve performance when working with sharded MongoDB servers."
]
},
{
"cell_type": "code",
- "execution_count": 19,
+ "execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
- "similaritySearch: [[0.35294117647058826, 'c1ccc(P(c2ccccc2)c2ccccc2)cc1'], [0.4117647058823529, 'Cc1ccc(S)cc1'], [0.35, 'CC(O)(c1ccccc1)c1ccccc1']]\n",
+ "similaritySearch: [[0.4117647058823529, 'WLHCBQAPPJAULW-UHFFFAOYSA-N']]\n",
"\n",
"\n",
- "similaritySearchAggregate: [[0.35294117647058826, 'c1ccc(P(c2ccccc2)c2ccccc2)cc1'], [0.4117647058823529, 'Cc1ccc(S)cc1'], [0.35, 'CC(O)(c1ccccc1)c1ccccc1']]\n"
+ "similaritySearchAggregate: [[0.4117647058823529, 'WLHCBQAPPJAULW-UHFFFAOYSA-N']]\n"
]
}
],
@@ -181,46 +416,108 @@
"q_mol = Chem.MolFromSmiles('Cc1ccccc1')\n",
"\n",
"# Perform a similarity search on TestDB for q_mol with a Tanimoto threshold of 0.4. \n",
- "results1 = similarity.similaritySearch(q_mol, TestDB, 0.35)\n",
+ "results1 = similarity.SimSearch(q_mol, demo_db.molecules, demo_db.mfp_counts, 0.4)\n",
"\n",
"# Do the same thing, but use the MongoDB Aggregation Pipeline. \n",
- "results2 = similarity.similaritySearchAggregate(q_mol, TestDB, 0.35)\n",
+ "results2 = similarity.SimSearchAggregate(q_mol, demo_db.molecules, demo_db.mfp_counts, 0.4)\n",
"\n",
"print('similaritySearch: {}'.format(results1))\n",
"print('\\n')\n",
"print('similaritySearchAggregate: {}'.format(results2))"
]
},
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Note that each result identifies the molecule only by its index, which in this case is the InChIKey; users can retrieve the full molecule document from the index with a quick query. This also makes it easier for users to retrieve molecules when indices represent multiple tautomers or isomers in the collection.\n",
+ "\n",
+ "`SimSearch` and `SimSearchAggregate` both make use of the conventional fingerprint screening method. `similarity` also supports searching using Locality Sensitive Hashing, as described by ChEMBL in an excellent [blog post](http://chembl.blogspot.com/2015/08/lsh-based-similarity-search-in-mongodb.html). The method here is called `SimSearchLSH` and requires a little more setup work:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Generate 100 different permutations of length 2048 and save them in demo_db.permutations as separate documents.\n",
+ "similarity.AddRandPermutations(demo_db.permutations)\n",
+ "\n",
+ "# Add locality-sensitive hash values to each document in demo_db.molecules by splitting the 100 different permutations\n",
+ "# in demo_db.permutations into 25 different buckets. \n",
+ "similarity.AddLocalityHashes(demo_db.molecules, demo_db.permutations, 25)\n",
+ "\n",
+ "# Create 25 different collections in demo_db, each storing a subset of hash values for molecules in demo_db.molecules.\n",
+ "similarity.AddHashCollections(demo_db, demo_db.molecules)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Now let's try a search using the query molecule from earlier:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "similaritySearchLSH: []\n"
+ ]
+ }
+ ],
+ "source": [
+ "q_mol = Chem.MolFromSmiles('Cc1ccccc1')\n",
+ "\n",
+ "results3 = similarity.SimSearchLSH(q_mol, demo_db, demo_db.molecules, \n",
+ " demo_db.permutations, demo_db.mfp_counts, threshold=0.8)\n",
+ "\n",
+ "print('similaritySearchLSH: {}'.format(results3))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The LSH algorithm relies on random permutations generated with the `numpy` module, so it yields non-deterministic results. LSH is well suited to *scanning* datasets (on large datasets it is faster than either fingerprint-screening search method), but it is less accurate than regular similarity search, especially below thresholds of 0.7. Specific notes on benchmarks can be found in \"Benchmarking Similarity Search.\""
+ ]
+ },
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Substructure Search\n",
"\n",
- "Likewise, `mongordkit.Search.substructure` supports substructure search best on databases prepared by `write`. Database requirements are identical to those for similarity search: a `molecules` collection whose documents have `rdmol` and `smiles` fields. \n",
+ "`mongordkit.Search.substructure` supports substructure search best on collections prepared by `write`. Requirements are identical to those for similarity search: a `molecules` collection whose documents have `rdmol` and `index` fields. \n",
"\n",
"`substructure.SubSearchNaive` provides a fingerprint-less, slower implementation of substructure search suitable for very small databases:"
]
},
{
"cell_type": "code",
- "execution_count": 27,
+ "execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
- "['c1ccc(-c2ccccc2OCCOc2ccccc2-c2ccccc2)cc1',\n",
- " 'COc1ccc(Cc2ccc(OC)cc2)cc1',\n",
- " 'COc1cc([N+](=O)[O-])c(N)c([N+](=O)[O-])c1',\n",
- " 'COc1ccc(/C=N/O)cc1',\n",
- " 'Cc1nc2ccccc2c(Oc2ccccc2)c1-c1ccccc1',\n",
- " 'O/N=C/c1ccc2c(c1)OCO2',\n",
- " 'COc1ccc(CC#N)cc1',\n",
- " 'COc1ccc(C(C)(C)C#N)cc1']"
+ "['RUTYZGCHBCCSKD-UHFFFAOYSA-N',\n",
+ " 'WECJUPODCKXNQK-UHFFFAOYSA-N',\n",
+ " 'GZZJZWYIOOPHOV-UHFFFAOYSA-N',\n",
+ " 'FXOSHPAYNZBSFO-RMKNXTFCSA-N',\n",
+ " 'KWLUBKHLCNCFQI-UHFFFAOYSA-N',\n",
+ " 'VDAJDWUTRXNYMU-RUDMXATFSA-N',\n",
+ " 'PACGLQCRGWFBJH-UHFFFAOYSA-N',\n",
+ " 'CDCRUVGWQJYTFO-UHFFFAOYSA-N']"
]
},
- "execution_count": 27,
+ "execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
@@ -229,7 +526,7 @@
"q_mol = Chem.MolFromSmiles('C1=CC=CC=C1OC')\n",
"\n",
"# Perform a substructure search for q_mol on TestDB. \n",
- "substructure.SubSearchNaive(q_mol, TestDB, chirality=False)"
+ "substructure.SubSearchNaive(q_mol, demo_db.molecules, chirality=False)"
]
},
{
@@ -241,31 +538,44 @@
},
{
"cell_type": "code",
- "execution_count": 1,
+ "execution_count": 13,
"metadata": {},
"outputs": [
{
- "ename": "NameError",
- "evalue": "name 'substructure' is not defined",
- "output_type": "error",
- "traceback": [
- "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
- "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
- "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0msubstructure\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mAddPatternFingerprints\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mTestDB\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmolecules\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mTestDB\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmorgan_fp_counts\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlength\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2\u001b[0m \u001b[0msubstructure\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mSubSearch\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mq_mol\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mTestDB\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mchirality\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mFalse\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
- "\u001b[0;31mNameError\u001b[0m: name 'substructure' is not defined"
- ]
+ "data": {
+ "text/plain": [
+ "['RUTYZGCHBCCSKD-UHFFFAOYSA-N',\n",
+ " 'WECJUPODCKXNQK-UHFFFAOYSA-N',\n",
+ " 'GZZJZWYIOOPHOV-UHFFFAOYSA-N',\n",
+ " 'FXOSHPAYNZBSFO-RMKNXTFCSA-N',\n",
+ " 'KWLUBKHLCNCFQI-UHFFFAOYSA-N',\n",
+ " 'VDAJDWUTRXNYMU-RUDMXATFSA-N',\n",
+ " 'PACGLQCRGWFBJH-UHFFFAOYSA-N',\n",
+ " 'CDCRUVGWQJYTFO-UHFFFAOYSA-N']"
+ ]
+ },
+ "execution_count": 13,
+ "metadata": {},
+ "output_type": "execute_result"
}
],
"source": [
- "substructure.AddPatternFingerprints(TestDB.molecules, TestDB.mfp_counts, length=None)\n",
- "substructure.SubSearch(q_mol, TestDB, chirality=False)"
+ "substructure.AddPatternFingerprints(demo_db.molecules)\n",
+ "substructure.SubSearch(q_mol, demo_db.molecules, chirality=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "## Substructure Searching using Locality Sensitive Hashing"
+ "## `.Search` contents"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "mongordkit.Search.**PrepareForSearch**(db (*MongoDB database for hash information*), mol_collection (*MongoDB collection*), count_collection (*MongoDB collection*), perm_collection (*MongoDB collection*)) --> None"
]
},
{
@@ -274,13 +584,22 @@
"source": [
"## `.similarity` Contents\n",
"\n",
+ "### Constants:\n",
+ "- DEFAULT_THRESHOLD = 0.8\n",
+ "- DEFAULT_MORGAN_RADIUS = 2\n",
+ "- DEFAULT_MORGAN_LENGTH = 2048\n",
+ "- DEFAULT_BIT_N = 2048\n",
+ "- DEFAULT_BUCKET_N = 25\n",
+ "- DEFAULT_PERM_LEN = 2048\n",
+ "- DEFAULT_PERM_N = 100\n",
+ "\n",
"mongordkit.Search.similarity.**AddMorganFingerprints**(mol_collection (*MongoDB collection*), count_collection (*MongoDB collection*), radius=2 (*int: radius of Morgan fingerprint*), length=2048 (*int: length of Morgan fingerprint bit vector*)) --> None\n",
"\n",
- "mongordkit.Search.similarity.**SimSearchNaive**(mol (*rdmol object*), mol_collection (*MongoDB collection*), threshold=0.8 (*Tanimoto threshold between 0 and 1, float*)) --> *list: results with format [tanimoto, smiles]*\n",
+ "mongordkit.Search.similarity.**SimSearchNaive**(mol (*rdmol object*), mol_collection (*MongoDB collection*), threshold=0.8 (*Tanimoto threshold between 0 and 1, float*)) --> *list: results with format [tanimoto, index]*\n",
"\n",
- "mongordkit.Search.similarity.**SimSearch**(mol (*rdmol object*), mol_collection (*MongoDB collection*), threshold=0.8 (*Tanimoto threshold between 0 and 1, float*)) --> *list: results with format [tanimoto, smiles]*\n",
+ "mongordkit.Search.similarity.**SimSearch**(mol (*rdmol object*), mol_collection (*MongoDB collection*), threshold=0.8 (*Tanimoto threshold between 0 and 1, float*)) --> *list: results with format [tanimoto, index]*\n",
"\n",
- "mongordkit.Search.similarity.**SimSearchAggregate**(mol (*rdmol object*), mol_collection (*MongoDB collection*), threshold=0.8 (*Tanimoto threshold between 0 and 1, float*)) --> *list: results with format [tanimoto, smiles]*\n",
+ "mongordkit.Search.similarity.**SimSearchAggregate**(mol (*rdmol object*), mol_collection (*MongoDB collection*), threshold=0.8 (*Tanimoto threshold between 0 and 1, float*)) --> *list: results with format [tanimoto, index]*\n",
"\n",
"mongordkit.Search.similarity.**AddRandPermutations**(perm_collection (*MongoDB collection*), len=2048 (*int: length corresponding to length of fingerprint bit vectors*), num=100 (*int: number of permutations*)) --> None\n",
"\n",
@@ -288,7 +607,7 @@
"\n",
"mongordkit.Search.similarity.**AddHashCollections**(db (*MongoDB database*), mol_collection (*MongoDB collection*)) --> None\n",
"\n",
- "mongordkit.Search.similarity.**SimSearchLSH**(mol (*rdmol object*), db (*MongoDB database containing hash collections*), mol_collection (*MongoDB collection*), perm_collection (*MongoDB collection*), threshold=0.8 (*Tanimoto threshold between 0 and 1, float*)) --> *list: results with format [tanimoto, smiles]*"
+ "mongordkit.Search.similarity.**SimSearchLSH**(mol (*rdmol object*), db (*MongoDB database containing hash collections*), mol_collection (*MongoDB collection*), perm_collection (*MongoDB collection*), count_collection (*MongoDB collection*), threshold=0.8 (*Tanimoto threshold between 0 and 1, float*)) --> *list: results with format [tanimoto, index]*"
]
},
{
@@ -297,7 +616,7 @@
"source": [
"## `.substructure` Contents\n",
"\n",
- "mongordkit.Search.substructure.**AddPatternFingerprints**(db, length=2048 (*int: length of Pattern fingerprint bit vector*)) --> None\n",
+ "mongordkit.Search.substructure.**AddPatternFingerprints**(mol_collection (*MongoDB collection*), length=2048 (*int: length of Pattern fingerprint bit vector*)) --> None\n",
"\n",
"mongordkit.Search.similarity.**SubSearchNaive**(pattern (*rdmol object*), db, chirality=False (*boolean: include chirality in search or not*)) --> *list: results with format [smiles]*\n",
"\n",
diff --git a/docs/notebooks/Substructure Benchmarking.ipynb b/docs/notebooks/Substructure Benchmarking.ipynb
index 26f2eb8..53a5ef7 100644
--- a/docs/notebooks/Substructure Benchmarking.ipynb
+++ b/docs/notebooks/Substructure Benchmarking.ipynb
@@ -15,7 +15,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "## Imports"
+ "## Setup Work\n",
+ "### Imports"
]
},
{
@@ -38,105 +39,98 @@
"import numpy as np\n",
"from os import sys\n",
"import pandas as pd\n",
- "from statistics import mean, median\n",
"from IPython.display import display, HTML\n",
"\n",
"from mongordkit.Database import write\n",
"from mongordkit.Search import similarity\n",
- "from mongordkit.Search import substructure"
+ "from mongordkit.Search import substructure\n",
+ "from mongordkit import Search"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "## Database Setup\n",
- "Here we set up a database called `test` that will hold our molecules. We will construct 1 collection called `molecules_100K` to hold the first 100,000 molecules in the ChEMBL_27 dataset and a collection called `molecules_1M` to hold the first 1,000,000 molecules in the ChEMBL_27 dataset."
+ "### Database Setup\n",
+ "Here we set up a database called `test` that will hold our molecules. We will construct a collection called `molecules_100K` to hold the first 100,000 molecules in the ChEMBL_27 dataset and a collection called `molecules_1M` to hold the first 1,000,000 molecules in the ChEMBL_27 dataset. If you have already run search or similarity benchmarks from `mongo-rdkit` on your local MongoDB instance, these collections should already be set up."
]
},
{
"cell_type": "code",
- "execution_count": 8,
+ "execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Initialize the client that will connect to the database.\n",
"client = pymongo.MongoClient()\n",
- "db = client.test"
+ "db = client.test\n",
+ "chembl = '../../../chembl_27.sdf'\n",
+ "\n",
+ "# Disable rdkit warnings\n",
+ "rdkit.RDLogger.DisableLog('rdApp.*')"
]
},
{
"cell_type": "code",
- "execution_count": 15,
+ "execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
- "populating mongodb collection with compounds from chembl...\n"
- ]
- },
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "RDKit WARNING: [15:22:20] Warning: conflicting stereochemistry at atom 11 ignored.\n",
- "RDKit WARNING: [15:45:12] Warning: conflicting stereochemistry at atom 14 ignored.\n",
- "RDKit WARNING: [16:15:11] Warning: conflicting stereochemistry at atom 10 ignored.\n",
- "RDKit WARNING: [16:15:11] Warning: conflicting stereochemistry at atom 10 ignored.\n",
- "RDKit WARNING: [16:15:40] Warning: conflicting stereochemistry at atom 10 ignored.\n",
- "RDKit WARNING: [16:15:40] Warning: conflicting stereochemistry at atom 10 ignored.\n",
- "RDKit WARNING: [16:26:44] Warning: conflicting stereochemistry at atom 6 ignored.\n",
- "RDKit WARNING: [16:26:44] Warning: conflicting stereochemistry at atom 6 ignored.\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "101001 molecules successfully imported\n"
+ "populating mongodb collection with compounds from SDF...\n",
+ "100000 molecules successfully imported\n",
+ "1 duplicates skipped\n"
]
},
{
"data": {
"text/plain": [
- "101001"
+ "100000"
]
},
- "execution_count": 15,
+ "execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
- "# Write the first 100,000 compounds to molecules_100K. \n",
- "write.writeFromSDF(db.molecules_100K, '../../../chembl_27.sdf', 'test', reg_option='standard_setting', \n",
- " index_option='inchikey', chunk_size=1000, limit=100000)"
+ "# If necessary, write the first 100,000 compounds to molecules_100K.\n",
+ "if db.molecules_100K.count_documents({}) != 100000:\n",
+ " write.WriteFromSDF(db.molecules_100K, chembl, chunk_size=1000, limit=100000)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "populating mongodb collection with compounds from SDF...\n"
+ ]
+ }
+ ],
"source": [
- "# Write the first 1,000,000 compounds to molecules_1M.\n",
- "write.writeFromSDF(db.molecules_1M, '../../../chembl27_sdf', 'test', reg_option='standard_setting', \n",
- " index_option='inchikey', chunk_size=1000, limit=1000000)"
+ "# If necessary, write the first 1,000,000 compounds to molecules_1M.\n",
+ "if db.molecules_1M.count_documents({}) != 1000000:\n",
+ " write.WriteFromSDF(db.molecules_1M, chembl, chunk_size=1000, limit=1000000)"
]
},
{
"cell_type": "code",
- "execution_count": 17,
+ "execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
- "In molecules_100K: 101000 documents\n",
- "In molecules_1M: 0 documents\n"
+ "In molecules_100K: 100000 documents\n",
+ "In molecules_1M: 180512 documents\n"
]
}
],
@@ -146,11 +140,22 @@
"print(f\"In molecules_1M: {db.molecules_1M.count_documents({})} documents\")"
]
},
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Next, we have to prepare all of the documents in our collections for search by adding in fingerprints.\n",
+ "substructure.AddPatternFingerprints(db.molecules_100K)\n",
+ "substructure.AddPatternFingerprints(db.molecules_1M)"
+ ]
+ },
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "## Query Set Setup\n",
+ "### Query Set Setup\n",
"For our queries, we'll use three sets of patterns identified by Greg Landrum in one of his [blog posts](http://rdkit.blogspot.com/2013/11/fingerprint-based-substructure.html) on substructure searching and discussed in this [mailing list](http://www.mail-archive.com/rdkit-discuss@lists.sourceforge.net/msg02066.html) and this [presentation](http://www.hinxton.wellcome.ac.uk/advancedcourses/MIOSS%20Greg%20Landrum.pdf). They are: \n",
"- Fragments: 500 diverse molecules taken from the ZINC Fragments set\n",
"- Leads: 500 diverse molecules taken from the ZINC Lead-like set\n",
@@ -159,7 +164,7 @@
},
{
"cell_type": "code",
- "execution_count": 16,
+ "execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
@@ -184,12 +189,12 @@
"### Naive Substructure Search\n",
"`substructure.SubSearchNaive` is a search that simply loops through the dataset and checks for a substructure match on each molecule. This method is not directly benchmarked here because searching through a single molecule takes upward of 5 seconds; this means that it is far too slow to feel directly interactive.\n",
"### Substructure Search with Fingerprint Screening\n",
- "Instead, we will benchmark the standard `SubSearch`, which makes use of fingerprint screening to dramatically increase efficiency. First, we want to see what kinds of times we are dealing with. For each of our query sets, we will search all of their elements against `molecules_100K` and `molecules_1M`, then return the median and mean query times in seconds. "
+ "Instead, we will benchmark the standard `SubSearch`, which makes use of fingerprint screening to dramatically increase efficiency. For each of our query sets, we will search all of their elements against `molecules_100K` and `molecules_1M`, then return the median and mean query times in seconds. "
]
},
{
"cell_type": "code",
- "execution_count": 18,
+ "execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
@@ -199,34 +204,118 @@
" start = time.time()\n",
" substructure.SubSearch(pattern, dataset)\n",
" end = time.time()\n",
- " results.append(end - start)"
+ " results.append(end - start)\n",
+ " return results"
]
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 12,
"metadata": {},
- "outputs": [],
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<table>\n",
+ "  <thead>\n",
+ "    <tr>\n",
+ "      <th></th>\n",
+ "      <th>mean</th>\n",
+ "      <th>median</th>\n",
+ "    </tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr>\n",
+ "      <th>fragments</th>\n",
+ "      <td>0.062740</td>\n",
+ "      <td>0.062074</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>leads</th>\n",
+ "      <td>0.062592</td>\n",
+ "      <td>0.062289</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>pieces</th>\n",
+ "      <td>0.062739</td>\n",
+ "      <td>0.061950</td>\n",
+ "    </tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " mean median\n",
+ "fragments 0.062740 0.062074\n",
+ "leads 0.062592 0.062289\n",
+ "pieces 0.062739 0.061950"
+ ]
+ },
+ "execution_count": 12,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
"source": [
"# Benchmark for search of all three query sets against 100K and 1M.\n",
- "# This should take around five minutes; these calls can be split up if necessary.\n",
+ "# This should take around five minutes; these calls can be commented out if necessary.\n",
"frag_times_100K = benchmark_query_set(fragments, db.molecules_100K)\n",
- "frag_times_1M = benchmark_query_set(fragments, db.molecules_1M)\n",
"lead_times_100K = benchmark_query_set(leads, db.molecules_100K)\n",
- "lead_times_1M = benchmark_query_set(leads, db.molecules_1M)\n",
"pieces_times_100K = benchmark_query_set(pieces, db.molecules_100K)\n",
- "pieces_times_1M = benchmark_query_set(pieces, db.molecules_1M)\n",
"\n",
- "results = [frag_times_100K, frag_times_1M, lead_times_100K, lead_times_1M, pieces_times_100K, pieces_times_1M]\n",
+ "results = [frag_times_100K, lead_times_100K, pieces_times_100K]\n",
+ "means_100K = [np.mean(times) for times in results]\n",
+ "medians_100K = [np.median(times) for times in results]\n",
"\n",
- "means = [mean(times) for times in results]\n",
- "medians = [median(times) for times in results]\n",
+ "data = {'mean (100K)': means_100K, 'median (100K)': medians_100K}\n",
+ "df = pd.DataFrame(data, index=['fragments', 'leads', 'pieces'])\n",
+ "df"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Benchmark for search of all three query sets against 1M. \n",
+ "# This should take around five minutes; these calls can be commented out if necessary.\n",
+ "frag_times_1M = benchmark_query_set(fragments, db.molecules_1M)\n",
+ "lead_times_1M = benchmark_query_set(leads, db.molecules_1M)\n",
+ "pieces_times_1M = benchmark_query_set(pieces, db.molecules_1M)\n",
"\n",
- "data = {'mean': means, 'median': medians}\n",
+ "results = [frag_times_1M, lead_times_1M, pieces_times_1M]\n",
+ "means_1M = [np.mean(times) for times in results]\n",
+ "medians_1M = [np.median(times) for times in results]\n",
+ "\n",
+ "data = {'mean (1M)': means_1M, 'median (1M)': medians_1M}\n",
+ "df = pd.DataFrame(data, index=['fragments', 'leads', 'pieces'])\n",
"df"
]
},
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Discussion\n",
+ "\n",
+ "Median search times stay below 70 ms for all three query sets against `molecules_100K`. That is comfortably under the 100 ms threshold traditionally used as the UI benchmark for instant feedback, so single-molecule substructure queries on a collection of this size are fast enough for interactive search."
+ ]
+ },
{
"cell_type": "markdown",
"metadata": {},
diff --git a/docs/notebooks/Write and Registration Benchmarking.ipynb b/docs/notebooks/Write and Registration Benchmarking.ipynb
new file mode 100644
index 0000000..a849c11
--- /dev/null
+++ b/docs/notebooks/Write and Registration Benchmarking.ipynb
@@ -0,0 +1,459 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Write and Registration Benchmarks\n",
+ "\n",
+ "These benchmarks were originally run on an early 2015 MacBook Pro with a 2.7 GHz dual-core i5 processor and 8GB of memory. All molecules are written into a data directory stored locally via `--dbpath`.\n",
+ "\n",
+ "They make use of molecules found in the data folder. \n",
+ "\n",
+ "Last updated: 8/24/20 by Christopher Zou\n",
+ "\n",
+ "## Setup Work\n",
+ "### Imports"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from mongordkit.Database import write, registration\n",
+ "from rdkit import Chem\n",
+ "import rdkit\n",
+ "import numpy as np\n",
+ "import time\n",
+ "import pymongo\n",
+ "import mongomock\n",
+ "import matplotlib\n",
+ "import matplotlib.pyplot as plt"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Database Setup\n",
+ "Here we set up a database called `test` that will hold our molecules. We will construct a collection called `molecules_write_testing` to benchmark the speed of writing to a collection."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Initialize the client that will connect to the database.\n",
+ "client = pymongo.MongoClient()\n",
+ "db = client.test\n",
+ "db.molecules_write_testing.drop()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Defining Some Useful Variables"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "hash_functions = registration.HASH_FUNCTIONS\n",
+ "first_200_mols = '../../data/test_data/first_200.props.sdf'\n",
+ "chembl = '../../../chembl_27.sdf'\n",
+ "\n",
+ "# Disable RDLogger to reduce system output.\n",
+ "rdkit.RDLogger.DisableLog('rdApp.*')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Benchmarking Write"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We want to measure the performance of `write.WriteFromSDF`. To do so, we write the first 1,000 to 11,000 molecules of a ChEMBL dataset (in increments of 1,000) using a scheme that contains all 23 available hashes, repeat each write five times, and record the mean write time:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "testing 1000\n",
+ "populating mongodb collection with compounds from SDF...\n",
+ "1000 molecules successfully imported\n",
+ "0 duplicates skipped\n",
+ "populating mongodb collection with compounds from SDF...\n",
+ "1000 molecules successfully imported\n",
+ "0 duplicates skipped\n",
+ "populating mongodb collection with compounds from SDF...\n",
+ "1000 molecules successfully imported\n",
+ "0 duplicates skipped\n",
+ "populating mongodb collection with compounds from SDF...\n",
+ "1000 molecules successfully imported\n",
+ "0 duplicates skipped\n",
+ "populating mongodb collection with compounds from SDF...\n",
+ "1000 molecules successfully imported\n",
+ "0 duplicates skipped\n",
+ "testing 2000\n",
+ "populating mongodb collection with compounds from SDF...\n",
+ "2000 molecules successfully imported\n",
+ "0 duplicates skipped\n",
+ "populating mongodb collection with compounds from SDF...\n",
+ "2000 molecules successfully imported\n",
+ "0 duplicates skipped\n",
+ "populating mongodb collection with compounds from SDF...\n",
+ "2000 molecules successfully imported\n",
+ "0 duplicates skipped\n",
+ "populating mongodb collection with compounds from SDF...\n",
+ "2000 molecules successfully imported\n",
+ "0 duplicates skipped\n",
+ "populating mongodb collection with compounds from SDF...\n",
+ "2000 molecules successfully imported\n",
+ "0 duplicates skipped\n",
+ "testing 3000\n",
+ "populating mongodb collection with compounds from SDF...\n",
+ "3000 molecules successfully imported\n",
+ "0 duplicates skipped\n",
+ "populating mongodb collection with compounds from SDF...\n",
+ "3000 molecules successfully imported\n",
+ "0 duplicates skipped\n",
+ "populating mongodb collection with compounds from SDF...\n",
+ "3000 molecules successfully imported\n",
+ "0 duplicates skipped\n",
+ "populating mongodb collection with compounds from SDF...\n",
+ "3000 molecules successfully imported\n",
+ "0 duplicates skipped\n",
+ "populating mongodb collection with compounds from SDF...\n",
+ "3000 molecules successfully imported\n",
+ "0 duplicates skipped\n",
+ "testing 4000\n",
+ "populating mongodb collection with compounds from SDF...\n",
+ "4000 molecules successfully imported\n",
+ "0 duplicates skipped\n",
+ "populating mongodb collection with compounds from SDF...\n",
+ "4000 molecules successfully imported\n",
+ "0 duplicates skipped\n",
+ "populating mongodb collection with compounds from SDF...\n",
+ "4000 molecules successfully imported\n",
+ "0 duplicates skipped\n",
+ "populating mongodb collection with compounds from SDF...\n",
+ "4000 molecules successfully imported\n",
+ "0 duplicates skipped\n",
+ "populating mongodb collection with compounds from SDF...\n",
+ "4000 molecules successfully imported\n",
+ "0 duplicates skipped\n",
+ "testing 5000\n",
+ "populating mongodb collection with compounds from SDF...\n",
+ "5000 molecules successfully imported\n",
+ "0 duplicates skipped\n",
+ "populating mongodb collection with compounds from SDF...\n"
+ ]
+ }
+ ],
+ "source": [
+ "repetitions = 5\n",
+ "scheme = registration.MolDocScheme()\n",
+ "scheme.add_all_hashes()\n",
+ "times = []\n",
+ "limits = [1000 + (i * 1000) for i in range(11)]\n",
+ "for number in limits:\n",
+ " temp_times = []\n",
+ " print(f'testing {number}')\n",
+ " for i in range(repetitions):\n",
+ " mol_collection = db.molecules_write_testing\n",
+ " start = time.time()\n",
+ " write.WriteFromSDF(mol_collection, chembl, scheme, limit=number)\n",
+ " end = time.time()\n",
+ " duration = end - start\n",
+ " mol_collection.drop()\n",
+ " temp_times.append(duration)\n",
+ " times.append([number, np.mean(temp_times)])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[[2000, 14.187205600738526],\n",
+ " [2100, 14.940004205703735],\n",
+ " [2200, 15.965514373779296],\n",
+ " [2300, 16.910987186431885],\n",
+ " [2400, 23.27452983856201],\n",
+ " [2500, 21.29524955749512],\n",
+ " [2600, 23.414347171783447],\n",
+ " [2700, 25.251486158370973],\n",
+ " [2800, 28.250892400741577],\n",
+ " [2900, 28.316519117355348],\n",
+ " [3000, 32.57432060241699]]"
+ ]
+ },
+ "execution_count": 17,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ "