Skip to content

Commit

Permalink
refactor: Optimize rspr calls and improve visualizations (#185)
Browse files Browse the repository at this point in the history
* RSPR approx  and exact scripts

* Add functional documentation and update support threshold

* Fix rspr argument

* Use clustering information from rSPR

* Generate cluster heatmap

* Generate group heatmaps

* Create new grouping strategy

* Update manifest name in config

* Fix compare group size issue

* Save frequency table for dedugging

* Store index with pivot csv

* Convert approx rspr column to numeric for crosstab

* Save approx distance as integer

* Store group csv as well

* Use tight limits for group size

* Run approx rsrp only

* Revert "Run approx rsrp only"

This reverts commit 3ff4365.

* print frequency table

* exit after approx rspr

* Only run approx rspr

* Remove all logging changes

* use different cmap for heatmap

* Reset color map

* Update heatmap figure size

* Revert "Only run approx rspr"

This reverts commit 271d6d3.

* Remove exit in rspr

* Run approx rspr in groups

* update group size temporary

* run rspr approx only

* Fix result output parsing issue

* print total length for debuging

* Fix index out of bound issue

* print total length for debuging

* Clear temp file

* Seek to top of the file before clearing

* Remove debugging lines

* Enable run rspr exact

* Create log scale heatmaps

* Generate cluster network tree

* Pass min and max distances for heatmap generation

* Generate range heatmap if input provided

* Update sub repo config

* Add execution permission to rspr scripts

* feat: Add rooted reference tree to exact/heatmap inputs

* Revert local execution changes for exact rspr

* Debugging code for input dataframe

* Fix read csv issue and add documentation

* Fix local rspr path

* fix: Update rspr image to 1.3.6

* Add clustering part to heatmap to generate overall clustering network

* Fix cluster file name issue

* refactor: Add bio.phylo to rSPR image

Signed-off-by: jvfe <[email protected]>

* feat: Add cluster_file to rSPR workflow

- Adds cluster_file to RSPR_EXACT outputs
- Concatenate cluster_file* in the rSPR subworkflow
- Adds concatenated cluster_file.txt to RSPR_HEATMAP inputs

* Fix error generating cluster network

* Fix reading clusters issue

* Reduce figure size to half

* Use cluster probability instead of branch length

* Fix cluster branch length issue

* Fix extra parameter issue

* Update input output paths

* Fix output channels

* Do not pass reference tree in parameters

* feat: Add light profile to config (#180)

* feat: Add light profile to config

* docs: Add information about the light profile

---------

Signed-off-by: jvfe <[email protected]>

* refactor: Add process_long to evolccm

Signed-off-by: jvfe <[email protected]>

* docs: Add efaecium paper to citations.md

* docs: Add tab for known ARETE issues (#181)

Signed-off-by: jvfe <[email protected]>

* docs: Add note about the RSPR_HEATMAP issue

Signed-off-by: jvfe <[email protected]>

* docs: Add basic contribution guidelines

Signed-off-by: jvfe <[email protected]>

* docs: Add logo and favicon

Signed-off-by: jvfe <[email protected]>

* refactor: Remove old unused nf-core functions (#182)

* refactor: Remove unused nf-core functions

Signed-off-by: jvfe <[email protected]>

* fix: Remove missing functions in RGI module

Signed-off-by: jvfe <[email protected]>

---------

Signed-off-by: jvfe <[email protected]>

* refactor: Add rspr and evolccm to annotation entry (#183)

Signed-off-by: jvfe <[email protected]>

* docs: Update arete logo

Signed-off-by: jvfe <[email protected]>

* docs: Update poppunk website url

Signed-off-by: jvfe <[email protected]>

* docs: Make it clear that dynamics tools are optional

Signed-off-by: jvfe <[email protected]>

* docs(output): Add not about modules not included in report

Signed-off-by: jvfe <[email protected]>

* docs: Add link to full documentation

Signed-off-by: jvfe <[email protected]>

* docs: Add a command that runs all subworkflows

* docs(usage): Add note about storage requirements

Signed-off-by: jvfe <[email protected]>

* docs: Fix header in readme

* refactor: Add 1.0 threshold to subsetting heatmap (#184)

* refactor: Reduce number of tools on ci test

Signed-off-by: jvfe <[email protected]>

* refactor: Remove group outputs from rspr exact

Signed-off-by: jvfe <[email protected]>

* fix: Remove old fork name

Signed-off-by: jvfe <[email protected]>

---------

Signed-off-by: jvfe <[email protected]>
Co-authored-by: KARTIK KAKADIYA <[email protected]>
Co-authored-by: ktkakadiya <[email protected]>
  • Loading branch information
3 people authored Feb 5, 2024
1 parent e051b27 commit b24329d
Show file tree
Hide file tree
Showing 12 changed files with 689 additions and 70 deletions.
272 changes: 243 additions & 29 deletions bin/rspr_approx.py
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,9 @@
import pandas as pd
from collections import defaultdict
from matplotlib import pyplot as plt
from matplotlib.colors import LogNorm
import seaborn as sns
import tempfile


#####################################################################
Expand Down Expand Up @@ -65,6 +67,22 @@ def parse_args(args=None):
default=0.7,
help="Maximum support threshold",
)
parser.add_argument(
"-mnhar",
"--min_heatmap_approx_rspr",
dest="MIN_HEATMAP_RSPR_DISTANCE",
type=int,
default=0,
help="Minimum approximate rSPR distance used to generate heatmap.",
)
parser.add_argument(
"-mxhar",
"--max_heatmap_approx_rspr",
dest="MAX_HEATMAP_RSPR_DISTANCE",
type=int,
default=-1,
help="Maximum approximate rSPR distance used to generate heatmap.",
)
return parser.parse_args(args)


Expand Down Expand Up @@ -149,6 +167,31 @@ def extract_approx_distance(text):
return "0"


#####################################################################
### FUNCTION RUN_APPROX_RSPR
### Run approx rspr algorithm of set of input tree pairs
### results: dataframe to store the approx rspr results
### input_file: path of input trees
### lst_filename: list of file names
### rspr_path: path of rspr software
#####################################################################

def run_approx_rspr(results, input_file, lst_filename, rspr_path):
input_file.seek(0)
result = subprocess.run(
rspr_path, stdin=input_file, capture_output=True, text=True
)

cur_index = 0
output = result.stdout
if output:
for line in output.splitlines():
if "approx drSPR=" in line:
distance = line.split("approx drSPR=")[1].strip()
results.loc[lst_filename[cur_index], "approx_drSPR"] = distance
cur_index += 1


#####################################################################
### FUNCTION APPROX_RSPR
### Run approx rspr algorithm of set of input tree pairs
Expand All @@ -158,7 +201,6 @@ def extract_approx_distance(text):
### max_support_threshold: maximum branching support threshold
#####################################################################


def approx_rspr(
rooted_gene_trees_path, results, min_branch_len=0, max_support_threshold=0.7
):
Expand All @@ -170,14 +212,46 @@ def approx_rspr(
"-length " + str(min_branch_len),
"-support " + str(max_support_threshold),
]
for filename in os.listdir(rooted_gene_trees_path):
gene_tree_path = os.path.join(rooted_gene_trees_path, filename)
with open(gene_tree_path, "r") as infile:
result = subprocess.run(
rspr_path, stdin=infile, capture_output=True, text=True
)
dist = extract_approx_distance(result.stdout)
results.loc[filename, "approx_drSPR"] = dist

group_size = 10000
cur_count = 0
lst_filename = []
with tempfile.TemporaryFile(mode='w+') as temp_file:
for filename in os.listdir(rooted_gene_trees_path):
if cur_count == group_size:
run_approx_rspr(results, temp_file, lst_filename, rspr_path)
temp_file.seek(0)
temp_file.truncate()
lst_filename.clear()
cur_count = 0

gene_tree_path = os.path.join(rooted_gene_trees_path, filename)
with open(gene_tree_path, "r") as infile:
temp_file.write(infile.read() + "\n")
lst_filename.append(filename)
cur_count += 1
if cur_count > 0:
run_approx_rspr(results, temp_file, lst_filename, rspr_path)


#####################################################################
### FUNCTION GENERATE_HEATMAP
### Generate heatmap figure from frequency table
### freq_table: frequency table of tree size and approx rspr distance
### output_path: output path for storing the heatmap
#####################################################################

def generate_heatmap(freq_table, output_path, log_scale=False):
plt.figure(figsize=(12, 12))
ax = sns.heatmap(
freq_table, annot=True, fmt=".0f", norm=LogNorm() if log_scale else None
)
ax.invert_yaxis()
plt.title("Number of trees")
plt.xlabel("Tree size")
plt.ylabel("Approx rSPR distance")
plt.savefig(output_path)
plt.clf()


#####################################################################
Expand All @@ -187,34 +261,157 @@ def approx_rspr(
### output_path: output path for storing the heatmap
#####################################################################


def make_heatmap(results, output_path):
def make_heatmap(results, output_path, min_distance, max_distance):
print("Generating heatmap")

# create sub dataframe
sub_results = results[(results["approx_drSPR"] >= min_distance)]
if max_distance >= 0:
sub_results = sub_results[(sub_results["approx_drSPR"] <= max_distance)]

data = (
results.groupby(["tree_size", "approx_drSPR"]).size().reset_index(name="count")
sub_results.groupby(["tree_size", "approx_drSPR"]).size().reset_index(name="count")
)
data_pivot = data.pivot(
index="approx_drSPR", columns="tree_size", values="count"
).fillna(0)
sns.heatmap(
data_pivot.loc[sorted(data_pivot.index, reverse=True)], annot=True, fmt=".0f"
).set(title="Number of trees")
plt.xlabel("Tree size")
plt.ylabel("Approx rSPR distance")
plt.savefig(output_path)
generate_heatmap(data_pivot.loc[sorted(data_pivot.index, reverse=True)], output_path)


def make_heatmap_from_tsv(input_path, output_path, min_distance, max_distance):
print("Generating heatmap from CSV")
results = pd.read_table(input_path)
make_heatmap(results, output_path, min_distance, max_distance)

#####################################################################
### FUNCTION GET_GROUP_SIZE
### Get preferred group size for generating heatmap
### all_tree_sizes: list of all values
### max_size: maximum number of groups
#####################################################################

def get_heatmap_group_size(all_values, max_groups=15):
group_size = 1
if len(all_values) <= max_groups:
return group_size

multipliers = [2, 2.5, 2]
cur_mul_idx = 0
max_val = max(all_values)
while not (max_val/group_size) <= max_groups:
group_size *= multipliers[cur_mul_idx]
cur_mul_idx = (cur_mul_idx + 1) % len(multipliers)
return int(group_size)


#####################################################################
### FUNCTION MAKE_GROUP_HEATMAP
### Generate heatmap of tree size and approx rspr distance groups
### results: dataframe of the approx rspr distance and tree size groups
### output_path: output path for storing the heatmap
#####################################################################

def make_group_heatmap(results, output_path, min_distance, max_distance):
print("Generating group heatmap")

# create sub dataframe
sub_results = results[(results["approx_drSPR"] >= min_distance)]
if max_distance >= 0:
sub_results = sub_results[(sub_results["approx_drSPR"] <= max_distance)]

data = pd.crosstab(sub_results["approx_drSPR"], sub_results["tree_size"])

all_tree_sizes = data.columns.astype('int32')
tree_group_size = get_heatmap_group_size(all_tree_sizes)
aggregated_df = pd.DataFrame()
if tree_group_size > 1:
for i in range(1, max(all_tree_sizes) + 1, tree_group_size):
group_columns = [col for col in all_tree_sizes if i <= int(col) <= i + tree_group_size - 1]
group_sum = data[group_columns].sum(axis=1)
group_start = i if i > min(all_tree_sizes) else min(all_tree_sizes)
group_end = (i+tree_group_size-1) if (i+tree_group_size-1) < max(all_tree_sizes) else max(all_tree_sizes)
aggregated_df[f'{group_start}-{group_end}'] = group_sum
else:
aggregated_df = data

all_distances = aggregated_df.index.astype('int32')
distance_group_size = get_heatmap_group_size(all_distances)
aggregated_row_df = pd.DataFrame(columns=aggregated_df.columns)
if distance_group_size > 1:
for i in range(0, max(all_distances) + 1, distance_group_size):
group_rows = [row for row in all_distances if i <= int(row) <= i + distance_group_size - 1]
group_sum = aggregated_df.loc[group_rows].sum(axis=0)
group_start = i if i > min(all_distances) else min(all_distances)
group_end = (i+distance_group_size-1) if (i+distance_group_size-1) < max(all_distances) else max(all_distances)
aggregated_row_df.loc[f'{group_start}-{group_end}'] = group_sum
else:
aggregated_row_df = aggregated_df
generate_heatmap(aggregated_row_df, output_path, True)


#####################################################################
### FUNCTION GENERATE_GROUP_SIZES
### Generate groups sizes based on number of tree pairs avaialble
### target_sum: total number of trees available
### RETURN groups of trees
#####################################################################

def generate_group_sizes(target_sum, max_groups=500):
degree = 1
current_sum = 0
group_sizes = []
while current_sum < target_sum:
k = 0
current_sum = 0
group_sizes = []
while k < max_groups:
value = int(2 ** (k /(max_groups / degree)))
if value > 1e20:
break
group_sizes.append(value)
current_sum += value
k += 1
degree += 1
return group_sizes


#####################################################################
### FUNCTION MAKE_GROUPS
### Generate groups of tree pairs based on the approx rspr distnace
### min_limit: minimum approx rspr distance for the first group
### results: the input results dataframe
### RETURN groups of trees
#####################################################################

def make_groups_v1(results, min_limit=10):
print("Generating groups")
min_group = results[results["approx_drSPR"] <= min_limit]["file_name"].tolist()
groups = defaultdict()
first_group = "group_0"
groups[first_group] = min_group

rem_results = results[results["approx_drSPR"] > min_limit].sort_values(
by="approx_drSPR", ascending=False
)
rem_length = len(rem_results)
group_sizes = generate_group_sizes(rem_length)
cur_index, grp_idx = 0, 0
while cur_index < rem_length:
cur_group_names = rem_results.iloc[cur_index : cur_index + group_sizes[grp_idx]]["file_name"].tolist()
groups[f"group_{grp_idx+1}"] = cur_group_names
cur_index += group_sizes[grp_idx]
grp_idx += 1
return groups


#####################################################################
### FUNCTION MAKE_GROUPS
### Generate groups of tree pairs based on the approx rspr distnace
### min_limit: minimum approx rspr distance for the first group
### RETURN groups of trees
#####################################################################

def make_groups(results, min_limit=10):
print("Generating groups")
results["approx_drSPR"] = pd.to_numeric(results["approx_drSPR"])
min_group = results[results["approx_drSPR"] <= min_limit]["file_name"].tolist()
groups = defaultdict()
first_group = "group_0"
Expand All @@ -240,7 +437,7 @@ def make_groups(results, min_limit=10):

def make_groups_from_csv(input_df, min_limit):
print("Generating groups from CSV")
groups = make_groups(input_df, min_limit)
groups = make_groups_v1(input_df, min_limit)
tidy_data = [
(key, val)
for key, value in groups.items()
Expand All @@ -252,12 +449,6 @@ def make_groups_from_csv(input_df, min_limit):
return merged


def make_heatmap_from_csv(input_path, output_path):
print("Generating heatmap from CSV")
results = pd.read_csv(input_path)
make_heatmap(results, output_path)


def join_annotation_data(df, annotation_data):
ann_df = pd.read_table(annotation_data, dtype={"genome_id": "str"})
ann_df.columns = map(str.lower, ann_df.columns)
Expand Down Expand Up @@ -293,9 +484,27 @@ def main(args=None):
args.MIN_BRANCH_LENGTH,
args.MAX_SUPPORT_THRESHOLD,
)

# Generate standard heatmap
results["approx_drSPR"] = pd.to_numeric(results["approx_drSPR"])
fig_path = os.path.join(args.OUTPUT_DIR, "output.png")
make_heatmap(results, fig_path)
make_heatmap(
results,
fig_path,
args.MIN_HEATMAP_RSPR_DISTANCE,
args.MAX_HEATMAP_RSPR_DISTANCE
)

# Generate group heatmap
group_fig_path = os.path.join(args.OUTPUT_DIR, "group_output.png")
make_group_heatmap(
results,
group_fig_path,
args.MIN_HEATMAP_RSPR_DISTANCE,
args.MAX_HEATMAP_RSPR_DISTANCE
)

# Generate groups for exact rSPR
results.reset_index("file_name", inplace=True)
if args.ANNOTATION:
results = join_annotation_data(results, args.ANNOTATION)
Expand All @@ -317,7 +526,12 @@ def main(args=None):
phylo_dir = os.path.join(args.INPUT_DIR_PATH, "phylogenomics")
res_path = os.path.join(phylo_dir, 'output.csv')
fig_path = os.path.join(phylo_dir, 'output.png')
make_heatmap_from_csv(res_path, fig_path)
make_heatmap_from_csv(
results,
fig_path,
args.MIN_HEATMAP_RSPR_DISTANCE,
args.MAX_HEATMAP_RSPR_DISTANCE
)
"""


Expand Down
Loading

0 comments on commit b24329d

Please sign in to comment.