
addressing issues 5, 8, (part of 13) - new #17

Merged: 19 commits merged into master on Oct 2, 2020

Conversation

erskordi (Collaborator):

No description provided.

@erskordi changed the title from "addressing issues 5, 8, (part of 13)" to "addressing issues 5, 8, (part of 13) - new" on Sep 30, 2020
@@ -0,0 +1,7 @@
import alphazero.config as config
Collaborator:

we can directly modify config variables in the run scripts, so mod.py is not needed

Collaborator Author (erskordi):

ok, mod.py was moved to the stable_radical_optimization folder and renamed to stable_rad_config.py. Loading mod.py was also removed from all scripts in the alphazero folder.

pstjohn (Collaborator) left a comment:

So one other issue: Node.reward is cached, so if we indeed use the game.reset() approach (instead of the current new game per iteration), the cached reward will be a ranked reward, which might no longer be valid. In general, I'm wondering if we want to separate node.reward from node.ranked_reward; but that might also involve changing game.mcts_step, where we propagate up leaf.reward values.

values (%s, %s, %s, %s, %s, %s);""", (
self.smiles, float(reward), atom_type,
float(spin_buried_vol), float(max_spin), atom_index))
INSERT INTO {table}
Collaborator:

Does this work? It might be best practice to use a psycopg2.sql object:
https://www.psycopg.org/docs/sql.html
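
For reference, a minimal sketch of that pattern with psycopg2.sql -- the table name below is just a placeholder, and dbparams plus the reward variables are assumed from the surrounding script:

from psycopg2 import sql

reward_table = "stable_radical_reward"  # placeholder; would come from config.sql_basename

with psycopg2.connect(**dbparams) as conn:
    with conn.cursor() as cur:
        # sql.Identifier quotes the table name safely; the values still go through %s placeholders
        cur.execute(
            sql.SQL("INSERT INTO {table} "
                    "(smiles, real_reward, atom_type, buried_vol, max_spin, atom_index) "
                    "values (%s, %s, %s, %s, %s, %s);").format(table=sql.Identifier(reward_table)),
            (self.smiles, float(reward), atom_type,
             float(spin_buried_vol), float(max_spin), atom_index))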


DROP TABLE IF EXISTS {table2};

CREATE TABLE {table2} (
Collaborator:

so the game table should be a record of recently played games, with their associated (real) rewards. gameid should also be a column
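
For instance, a possible shape for that table; the column names and types here are illustrative rather than a final schema:

cur.execute("""
    CREATE TABLE IF NOT EXISTS {table}_game (
        id serial PRIMARY KEY,
        time timestamp DEFAULT CURRENT_TIMESTAMP,
        experiment_id varchar(50),
        gameid varchar(8),
        real_reward real,
        final_smiles varchar(250));""".format(table=config.sql_basename))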

with psycopg2.connect(**dbparams) as conn:
with conn.cursor() as cur:
cur.execute("""
INSERT INTO {table}
Collaborator:

So this implies that anytime a terminal state is reached (which can happen often throughout a single game), it gets added to the 'game' table. I think ranked reward is something we only want to compute from recently played, final game states. So I imagine this code would go after adding to the replay buffer table.

Collaborator Author (erskordi):

OK, the

cur.execute("""
                INSERT INTO {table}_game
                (experiment_id, gameid, real_reward)
                values (%s, %s, %s);""".format(table=config.sql_basename), (
                    config.experiment_id, gameid, float(reward)))

is moved after adding to the replay buffer table. To distinguish between real and ranked rewards, reward = game[-1].reward returns the real reward, and later the replay buffer stores the ranked reward from get_ranked_rewards(reward).

Collaborator:

It doesn't look like you've pushed any changes yet, but my only concern with this approach is that, as written, game.py expects the node.reward value to be the reward that we propagate up the MCTS tree; i.e., the ranked reward. So until we refactor that (#18), we should have node.get_reward() return the ranked reward and cache the intermediate true_reward somewhere like node._true_reward.
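
Roughly, a sketch of that shape; the reward expression is the placeholder one from this file, and everything other than get_reward and _true_reward is illustrative:

def get_reward(self):
    # ... compute spin_buried_vol and max_spin for this molecule (elided) ...
    true_reward = (1 - max_spin) * 50 + spin_buried_vol
    # cache the real reward so it can later be written to the game table
    self._true_reward = true_reward
    # MCTS propagates node values in [-1, 1], so return the ranked reward
    return get_ranked_rewards(true_reward)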

Collaborator Author (erskordi):

Yes, I haven't pushed anything yet; I'll push ASAP. The responses to comments are mostly for me to check what I'm addressing.

Collaborator Author (erskordi):

Alright, I see what you mean here. I'm proceeding with it now

((gameid, node.smiles, game[-1].smiles, reward, i,
"""INSERT INTO {table}
(experiment_id, gameid, smiles, final_smiles, ranked_reward, position, data) values %s;""".format(table=replay_table),
((config.experiment_id, gameid, node.smiles, game[-1].smiles, reward, i,
node.get_action_inputs_as_binary()),))
Collaborator:

here is where I'd add the game (and final, real reward) to the game table. That does involve making sure that nodes cache their true reward value, rather than just their ranked reward.

Collaborator Author (erskordi):

See comment above, let me know if this is a good strategy

Collaborator:

See reply above. Just to be a bit more explicit: the node.value that gets estimated during MCTS should, for ranked rewards, lie between -1 and 1, and represent the probability that the node is a win or loss under ranked rewards. These are also the values predicted by the policy network. But if we return the true reward in get_reward(), that calculation gets blasted with weird values (i.e., 50) that are way outside that range.
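
To make the range concrete, a minimal sketch of that mapping, where r_alpha is the percentile threshold computed from recently played games (function and argument names are illustrative, not the code in this PR):

import random

def to_ranked_reward(true_reward, r_alpha):
    # win (+1) if the game beats the recent-games percentile threshold, loss (-1) otherwise;
    # ties are broken randomly so games exactly at the threshold aren't always counted one way
    if true_reward > r_alpha:
        return 1.
    elif true_reward < r_alpha:
        return -1.
    else:
        return random.choice([-1., 1.])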

@@ -8,13 +8,11 @@
import rdkit.Chem
from tensorflow.keras.preprocessing.sequence import pad_sequences

from alphazero.config import AlphaZeroConfig
import alphazero.config as config
import alphazero.mod as mod
Collaborator:

unused, remove

Collaborator Author (erskordi):

done

@@ -2,10 +2,9 @@
from tensorflow.keras import layers
import nfp

import alphazero.config as config
import alphazero.mod as mod
Collaborator:

unused, remove

Collaborator Author (erskordi):

done

@@ -10,12 +10,16 @@
from rdkit import DataStructs
import networkx as nx

import alphazero.config as config
import alphazero.mod as mod
Collaborator:

unused, remove

Collaborator Author (erskordi):

done

@@ -28,45 +32,59 @@
'options': f'-c search_path=rl',
}

reward_table = config.sql_basename + "_reward"
Collaborator:

here is where we can just modify config properties directly

Collaborator Author (erskordi):

done

# with conn.cursor() as cur:
# cur.execute("""
# DROP TABLE IF EXISTS StableRewardPSJ;
with psycopg2.connect(**dbparams) as conn:
Collaborator:

from psj: Maybe we want to move these things over to a separate python script? Maybe initialize.py or similar.

then the submit_mcts.sh could do something like

python initialize.py  # Run this one time
srun python run_mcts.py  # Run on all workers

Collaborator Author (erskordi):

Per our discussion, I created stable_rad_config.py in the stable_radical_optimization directory, which overrides values in config.py, and an initialize.py that creates the PostgreSQL tables. I also updated submit_mcts.sh to execute:

python initialize.py
srun python run_mcts.py

with psycopg2.connect(**dbparams) as conn:
with conn.cursor() as cur:
cur.execute("""
DROP TABLE IF EXISTS {table0};
Collaborator:

from psj: Otherwise, we'd want this not to drop tables, but CREATE TABLE IF NOT EXISTS

Collaborator Author (erskordi):

Check comment above, use a separate script for initializing tables

Collaborator:

can you push the new changes?

Collaborator Author (erskordi):

OK, I'll push the updated scripts I have so far and continue working on the rest of the comments/changes

@@ -99,15 +117,44 @@ def get_reward(self):
# This is a bit of a placeholder; but the range for spin is about 1/50th that
# of buried volume.
reward = (1 - max_spin) * 50 + spin_buried_vol

with psycopg2.connect(**dbparams) as conn:
Collaborator:

psj: we should pull out the RR calculation into a separate function. For now, that could just live as a floating function in this file.

def get_ranked_rewards(reward, conn=None):

    if conn is None:
        conn = psycopg2.connect(**dbparams)

    with conn:
        ...  # do the calculation

Collaborator Author (erskordi):

Created the function get_ranked_rewards; do you think it's better to put it in a separate script as well, or keep it in run_mcts.py?

Collaborator:

I'd keep it in run_mcts. We'll likely want it to eventually be a class method of Node

Collaborator Author (erskordi):

OK, sounds good!


class StabilityNode(Node):

def get_reward(self):

with psycopg2.connect(**dbparams) as conn:
with conn.cursor() as cur:
cur.execute("select reward from StableRewardPSJ where smiles = %s", (self.smiles,))
cur.execute("select reward from {table} where smiles = %s".format(table=reward_table), (self.smiles,))
result = cur.fetchone()

if result:
Collaborator:

this returns the actual reward, right? don't we want to return the ranked reward for MCTS?

Collaborator:

Just to make sure: we'll need to re-calculate the ranked reward on the fly, it doesn't make sense to store ranked rewards in the reward buffer

Collaborator Author (erskordi):

The values returned from the entire if-elif-else statement are all "filtered" through the get_ranked_rewards function to return ranked rewards instead of real rewards.

CREATE TABLE {table0} (
id serial PRIMARY KEY,
time timestamp DEFAULT CURRENT_TIMESTAMP,
reward real,
Collaborator:

we want ranked_reward here, too, right?

Collaborator:

I'd disagree -- ranked rewards don't make much sense outside the context of the game they're evaluated in. If a later game discovers the same molecule, we'll want to make sure the ranked reward is current w.r.t the game buffer

Collaborator:

no, you're right; I hadn't thought this out yet, but per our teams chat I agree

id serial PRIMARY KEY,
time timestamp DEFAULT CURRENT_TIMESTAMP,
experiment_id varchar(50),
reward real);
davebiagioni (Collaborator), Sep 30, 2020:

what about ranked_reward and final smiles?

@@ -28,45 +32,59 @@
'options': f'-c search_path=rl',
}

reward_table = config.sql_basename + "_reward"
replay_table = config.sql_basename + "_replay"
game_table = config.sql_basename + "_game"
Collaborator:

More generally, any reason to assign these to variables rather than just doing something like

"CREATE TABLE {basename}_replay".format(basename=config.sql_basename)

Collaborator Author (erskordi):

done

values (%s, %s);""".format(table=game_table), (
config.experiment_id, float(reward)))

df = pd.read_sql_query("""
Collaborator:

can we change this to a count?

Collaborator Author (erskordi):

Yes, changed to

n_rewards = pd.read_sql_query("""
        select count(*) from {table}_reward
        """.format(table=config.sql_basename), conn)

Collaborator:

sounds great! another option might be to try to put this logic into a single SQL query https://www.postgresql.org/docs/9.1/functions-conditional.html. That's a minor change though, let's leave that for a later optimization

Collaborator Author (erskordi):

ok, let's keep it in mind for later : )

@@ -10,12 +10,15 @@
from rdkit import DataStructs
import networkx as nx

import stable_rad_config as config
Collaborator:

this probably won't work; i think you want it to read

import alphazero.config as config
import stable_rad_config

Collaborator Author (erskordi):

Damn, you're right. And come to think of it, I had it like that in the first place.

## This creates the table used to store the rewards
## But, we don't want this to run every time we run the script,
## just keeping it here as a reference
def get_ranked_rewards(reward, conn=None):
Collaborator:

the idea with the conn kwarg is that you likely already have a SQL connection open, so we might as well use it. no idea if the syntax i wrote will work though :)

pstjohn (Collaborator) left a comment:

Just adding some additional code comments via github. I think this is ready to merge now though


# Run the policy network to get value and prior_logit predictions
values, prior_logits = model(parent.policy_inputs_with_children())
prior_logits = prior_logits[1:].numpy().flatten()
values, prior_logits = model.predict(parent.policy_inputs_with_children())
Collaborator:

this seems to fix those retracing errors we were seeing

import alphazero.config as config
import stable_rad_config
# Initialize PostgreSQL tables

parser = argparse.ArgumentParser(description='Initialize the postgres tables.')
parser.add_argument("--drop", action='store_true', help="whether to drop existing tables, if found")
Collaborator:

sometimes we won't want to erase our previous calculations -- I'm not sure this is the right solution long-term though
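
As a sketch of how the flag might be used in initialize.py, assuming the dbparams and config objects from the surrounding scripts (the column list is abbreviated and the types are guesses):

args = parser.parse_args()

with psycopg2.connect(**dbparams) as conn:
    with conn.cursor() as cur:
        if args.drop:
            # wipe previous results and rebuild from scratch
            cur.execute("DROP TABLE IF EXISTS {table}_reward;".format(table=config.sql_basename))
        # CREATE TABLE IF NOT EXISTS leaves existing data alone when --drop isn't passed
        cur.execute("""
            CREATE TABLE IF NOT EXISTS {table}_reward (
                id serial PRIMARY KEY,
                smiles varchar(250) UNIQUE,  -- UNIQUE assumed here; it gives ON CONFLICT something to act on
                real_reward real);""".format(table=config.sql_basename))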

model = tf.keras.models.load_model(
'/projects/rlmolecule/pstjohn/models/20200923_radical_stability_model')
'/projects/rlmolecule/pstjohn/models/20200923_radical_stability_model',
compile=False)
Collaborator:

this removes those 'manual compile' warnings that weren't relevant

""".format(table=config.sql_basename), conn)
with psycopg2.connect(**dbparams) as conn:
with conn.cursor() as cur:
cur.execute("select count(*) from {table}_game;".format(
Collaborator:

so this was a big issue -- the ranked reward needs to be calculated based off recently played games, not recent rewards.

Collaborator Author (erskordi):

I thought about that too; the game table holds the actual real rewards that we want to use for calculating r_alpha. Thanks!

with conn.cursor() as cur:
cur.execute("""
select percentile_cont(%s) within group (order by real_reward)
from (select real_reward from {table}_game
Collaborator:

same as above. Also note that you can use psycopg2 directly (rather than pandas.read_sql_query) for these single-value queries.
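
For example, the percentile query through a bare cursor rather than pandas; the alpha and buffer-size values here are just example numbers, and the subquery ordering is an assumption:

with psycopg2.connect(**dbparams) as conn:
    with conn.cursor() as cur:
        cur.execute("""
            select percentile_cont(%s) within group (order by real_reward)
            from (select real_reward from {table}_game
                  order by id desc limit %s) recent_games;
            """.format(table=config.sql_basename),
            (0.75, 250))  # example alpha percentile and game-buffer size
        r_alpha = cur.fetchone()[0]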

Collaborator Author (erskordi):

Great, good to know!

self._true_reward = 0.
return rr
return config.min_reward
Collaborator:

I think this is what we want -- otherwise, in the initial game playing (before we've built up a minimum buffer of games) we'll choose invalid molecules rather than those that are just not optimal

@@ -109,7 +119,8 @@ def get_reward(self):
cur.execute("""
INSERT INTO {table}_reward
(smiles, real_reward, atom_type, buried_vol, max_spin, atom_index)
values (%s, %s, %s, %s, %s, %s);""".format(table=config.sql_basename), (
values (%s, %s, %s, %s, %s, %s)
ON CONFLICT DO NOTHING;""".format(table=config.sql_basename), (
Collaborator:

so, this is an important point for future reward caching -- if two nodes calculate the same reward simultaneously, we can end up calculating a reward for a molecule already in the reward buffer. So we have to catch that conflict and 'do nothing' (i.e., don't add the duplicate)

#SBATCH -n 4
#SBATCH -c 18
#SBATCH --output=/scratch/eskordil/git-repos/rlmolecule_new/rlmolecule/mcts.%j.out
#SBATCH -n 72
Collaborator:

I did an htop after sshing into one of the worker nodes: I don't think these games benefit from much threading. So we might as well just blast the nodes with a bunch of copies of these; this lets us run 36 concurrent games on each node.

@pstjohn merged commit 6ecab09 into master on Oct 2, 2020