refactor: Optimize rspr calls and improve visualizations (#185)

* RSPR approx and exact scripts * Add functional documentation and update support threshold * Fix rspr argument * Use clustering information from rSPR * Generate cluster heatmap * Generate group heatmaps * Create new grouping strategy * Update manifest name in config * Fix compare group size issue * Save frequency table for dedugging * Store index with pivot csv * Convert approx rspr column to numeric for crosstab * Save approx distance as integer * Store group csv as well * Use tight limits for group size * Run approx rsrp only * Revert "Run approx rsrp only" This reverts commit 3ff4365. * print frequency table * exit after approx rspr * Only run approx rspr * Remove all logging changes * use different cmap for heatmap * Reset color map * Update heatmap figure size * Revert "Only run approx rspr" This reverts commit 271d6d3. * Remove exit in rspr * Run approx rspr in groups * update group size temporary * run rspr approx only * Fix result output parsing issue * print total length for debuging * Fix index out of bound issue * print total length for debuging * Clear temp file * Seek to top of the file before clearing * Remove debugging lines * Enable run rspr exact * Create log scale heatmaps * Generate cluster network tree * Pass min and max distances for heatmap generation * Generate range heatmap if input provided * Update sub repo config * Add execution permission to rspr scripts * feat: Add rooted reference tree to exact/heatmap inputs * Revert local execution changes for exact rspr * Debugging code for input dataframe * Fix read csv issue and add documentation * Fix local rspr path * fix: Update rspr image to 1.3.6 * Add clustering part to heatmap to generate overall clustering network * Fix cluster file name issue * refactor: Add bio.phylo to rSPR image Signed-off-by: jvfe <[email protected]> * feat: Add cluster_file to rSPR workflow - Adds cluster_file to RSPR_EXACT outputs - Concatenate cluster_file* in the rSPR subworkflow - Adds concatenated cluster_file.txt to RSPR_HEATMAP inputs * Fix error generating cluster network * Fix reading clusters issue * Reduce figure size to half * Use cluster probability instead of branch length * Fix cluster branch length issue * Fix extra parameter issue * Update input output paths * Fix output channels * Do not pass reference tree in parameters * feat: Add light profile to config (#180) * feat: Add light profile to config * docs: Add information about the light profile --------- Signed-off-by: jvfe <[email protected]> * refactor: Add process_long to evolccm Signed-off-by: jvfe <[email protected]> * docs: Add efaecium paper to citations.md * docs: Add tab for known ARETE issues (#181) Signed-off-by: jvfe <[email protected]> * docs: Add note about the RSPR_HEATMAP issue Signed-off-by: jvfe <[email protected]> * docs: Add basic contribution guidelines Signed-off-by: jvfe <[email protected]> * docs: Add logo and favicon Signed-off-by: jvfe <[email protected]> * refactor: Remove old unused nf-core functions (#182) * refactor: Remove unused nf-core functions Signed-off-by: jvfe <[email protected]> * fix: Remove missing functions in RGI module Signed-off-by: jvfe <[email protected]> --------- Signed-off-by: jvfe <[email protected]> * refactor: Add rspr and evolccm to annotation entry (#183) Signed-off-by: jvfe <[email protected]> * docs: Update arete logo Signed-off-by: jvfe <[email protected]> * docs: Update poppunk website url Signed-off-by: jvfe <[email protected]> * docs: Make it clear that dynamics tools are optional Signed-off-by: jvfe <[email protected]> * docs(output): Add not about modules not included in report Signed-off-by: jvfe <[email protected]> * docs: Add link to full documentation Signed-off-by: jvfe <[email protected]> * docs: Add a command that runs all subworkflows * docs(usage): Add note about storage requirements Signed-off-by: jvfe <[email protected]> * docs: Fix header in readme * refactor: Add 1.0 threshold to subsetting heatmap (#184) * refactor: Reduce number of tools on ci test Signed-off-by: jvfe <[email protected]> * refactor: Remove group outputs from rspr exact Signed-off-by: jvfe <[email protected]> * fix: Remove old fork name Signed-off-by: jvfe <[email protected]> --------- Signed-off-by: jvfe <[email protected]> Co-authored-by: KARTIK KAKADIYA <[email protected]> Co-authored-by: ktkakadiya <[email protected]>
beiko-lab · Feb 5, 2024 · b24329d · b24329d
1 parent e051b27
commit b24329d
Show file tree

Hide file tree

Showing 12 changed files with 689 additions and 70 deletions.
diff --git a/bin/rspr_approx.py b/bin/rspr_approx.py
@@ -12,7 +12,9 @@
 import pandas as pd
 from collections import defaultdict
 from matplotlib import pyplot as plt
+from matplotlib.colors import LogNorm
 import seaborn as sns
+import tempfile
 
 
 #####################################################################
@@ -65,6 +67,22 @@ def parse_args(args=None):
         default=0.7,
         help="Maximum support threshold",
     )
+    parser.add_argument(
+        "-mnhar",
+        "--min_heatmap_approx_rspr",
+        dest="MIN_HEATMAP_RSPR_DISTANCE",
+        type=int,
+        default=0,
+        help="Minimum approximate rSPR distance used to generate heatmap.",
+    )
+    parser.add_argument(
+        "-mxhar",
+        "--max_heatmap_approx_rspr",
+        dest="MAX_HEATMAP_RSPR_DISTANCE",
+        type=int,
+        default=-1,
+        help="Maximum approximate rSPR distance used to generate heatmap.",
+    )
     return parser.parse_args(args)
 
 
@@ -149,6 +167,31 @@ def extract_approx_distance(text):
     return "0"
 
 
+#####################################################################
+### FUNCTION RUN_APPROX_RSPR
+### Run approx rspr algorithm of set of input tree pairs
+### results: dataframe to store the approx rspr results
+### input_file: path of input trees
+### lst_filename: list of file names
+### rspr_path: path of rspr software
+#####################################################################
+
+def run_approx_rspr(results, input_file, lst_filename, rspr_path):
+    input_file.seek(0)
+    result = subprocess.run(
+        rspr_path, stdin=input_file, capture_output=True, text=True
+    )
+
+    cur_index = 0
+    output = result.stdout
+    if output:
+        for line in output.splitlines():
+            if "approx drSPR=" in line:
+                distance = line.split("approx drSPR=")[1].strip()
+                results.loc[lst_filename[cur_index], "approx_drSPR"] = distance
+                cur_index += 1
+
+
 #####################################################################
 ### FUNCTION APPROX_RSPR
 ### Run approx rspr algorithm of set of input tree pairs
@@ -158,7 +201,6 @@ def extract_approx_distance(text):
 ### max_support_threshold: maximum branching support threshold
 #####################################################################
 
-
 def approx_rspr(
     rooted_gene_trees_path, results, min_branch_len=0, max_support_threshold=0.7
 ):
@@ -170,14 +212,46 @@ def approx_rspr(
         "-length " + str(min_branch_len),
         "-support " + str(max_support_threshold),
     ]
-    for filename in os.listdir(rooted_gene_trees_path):
-        gene_tree_path = os.path.join(rooted_gene_trees_path, filename)
-        with open(gene_tree_path, "r") as infile:
-            result = subprocess.run(
-                rspr_path, stdin=infile, capture_output=True, text=True
-            )
-            dist = extract_approx_distance(result.stdout)
-            results.loc[filename, "approx_drSPR"] = dist
+
+    group_size = 10000
+    cur_count = 0
+    lst_filename = []
+    with tempfile.TemporaryFile(mode='w+') as temp_file:
+        for filename in os.listdir(rooted_gene_trees_path):
+            if cur_count == group_size:
+                run_approx_rspr(results, temp_file, lst_filename, rspr_path)
+                temp_file.seek(0)
+                temp_file.truncate()
+                lst_filename.clear()
+                cur_count = 0
+
+            gene_tree_path = os.path.join(rooted_gene_trees_path, filename)
+            with open(gene_tree_path, "r") as infile:
+                temp_file.write(infile.read() + "\n")
+                lst_filename.append(filename)
+                cur_count += 1
+        if cur_count > 0:
+            run_approx_rspr(results, temp_file, lst_filename, rspr_path)
+
+
+#####################################################################
+### FUNCTION GENERATE_HEATMAP
+### Generate heatmap figure from frequency table
+### freq_table: frequency table of tree size and approx rspr distance
+### output_path: output path for storing the heatmap
+#####################################################################
+
+def generate_heatmap(freq_table, output_path, log_scale=False):
+    plt.figure(figsize=(12, 12))
+    ax = sns.heatmap(
+        freq_table, annot=True, fmt=".0f", norm=LogNorm() if log_scale else None
+    )
+    ax.invert_yaxis()
+    plt.title("Number of trees")
+    plt.xlabel("Tree size")
+    plt.ylabel("Approx rSPR distance")
+    plt.savefig(output_path)
+    plt.clf()
 
 
 #####################################################################
@@ -187,34 +261,157 @@ def approx_rspr(
 ### output_path: output path for storing the heatmap
 #####################################################################
 
-
-def make_heatmap(results, output_path):
+def make_heatmap(results, output_path, min_distance, max_distance):
     print("Generating heatmap")
+
+    # create sub dataframe
+    sub_results = results[(results["approx_drSPR"] >= min_distance)]
+    if max_distance >= 0:
+        sub_results = sub_results[(sub_results["approx_drSPR"] <= max_distance)]
+
     data = (
-        results.groupby(["tree_size", "approx_drSPR"]).size().reset_index(name="count")
+        sub_results.groupby(["tree_size", "approx_drSPR"]).size().reset_index(name="count")
     )
     data_pivot = data.pivot(
         index="approx_drSPR", columns="tree_size", values="count"
     ).fillna(0)
-    sns.heatmap(
-        data_pivot.loc[sorted(data_pivot.index, reverse=True)], annot=True, fmt=".0f"
-    ).set(title="Number of trees")
-    plt.xlabel("Tree size")
-    plt.ylabel("Approx rSPR distance")
-    plt.savefig(output_path)
+    generate_heatmap(data_pivot.loc[sorted(data_pivot.index, reverse=True)], output_path)
+
+
+def make_heatmap_from_tsv(input_path, output_path, min_distance, max_distance):
+    print("Generating heatmap from CSV")
+    results = pd.read_table(input_path)
+    make_heatmap(results, output_path, min_distance, max_distance)
+
+#####################################################################
+### FUNCTION GET_GROUP_SIZE
+### Get preferred group size for generating heatmap
+### all_tree_sizes: list of all values
+### max_size: maximum number of groups
+#####################################################################
+
+def get_heatmap_group_size(all_values, max_groups=15):
+    group_size = 1
+    if len(all_values) <= max_groups:
+        return group_size
+
+    multipliers = [2, 2.5, 2]
+    cur_mul_idx = 0
+    max_val = max(all_values)
+    while not (max_val/group_size) <= max_groups:
+        group_size *= multipliers[cur_mul_idx]
+        cur_mul_idx = (cur_mul_idx + 1) % len(multipliers)
+    return int(group_size)
+
+
+#####################################################################
+### FUNCTION MAKE_GROUP_HEATMAP
+### Generate heatmap of tree size and approx rspr distance groups
+### results: dataframe of the approx rspr distance and tree size groups
+### output_path: output path for storing the heatmap
+#####################################################################
+
+def make_group_heatmap(results, output_path, min_distance, max_distance):
+    print("Generating group heatmap")
+
+    # create sub dataframe
+    sub_results = results[(results["approx_drSPR"] >= min_distance)]
+    if max_distance >= 0:
+        sub_results = sub_results[(sub_results["approx_drSPR"] <= max_distance)]
+
+    data = pd.crosstab(sub_results["approx_drSPR"], sub_results["tree_size"])
+
+    all_tree_sizes = data.columns.astype('int32')
+    tree_group_size = get_heatmap_group_size(all_tree_sizes)
+    aggregated_df = pd.DataFrame()
+    if tree_group_size > 1:
+        for i in range(1, max(all_tree_sizes) + 1, tree_group_size):
+            group_columns = [col for col in all_tree_sizes if i <= int(col) <= i + tree_group_size - 1]
+            group_sum = data[group_columns].sum(axis=1)
+            group_start = i if i > min(all_tree_sizes) else min(all_tree_sizes)
+            group_end = (i+tree_group_size-1) if (i+tree_group_size-1) < max(all_tree_sizes) else max(all_tree_sizes)
+            aggregated_df[f'{group_start}-{group_end}'] = group_sum
+    else:
+        aggregated_df = data
+
+    all_distances = aggregated_df.index.astype('int32')
+    distance_group_size = get_heatmap_group_size(all_distances)
+    aggregated_row_df = pd.DataFrame(columns=aggregated_df.columns)
+    if distance_group_size > 1:
+        for i in range(0, max(all_distances) + 1, distance_group_size):
+            group_rows = [row for row in all_distances if i <= int(row) <= i + distance_group_size - 1]
+            group_sum = aggregated_df.loc[group_rows].sum(axis=0)
+            group_start = i if i > min(all_distances) else min(all_distances)
+            group_end = (i+distance_group_size-1) if (i+distance_group_size-1) < max(all_distances) else max(all_distances)
+            aggregated_row_df.loc[f'{group_start}-{group_end}'] = group_sum
+    else:
+        aggregated_row_df = aggregated_df
+    generate_heatmap(aggregated_row_df, output_path, True)
+
+
+#####################################################################
+### FUNCTION GENERATE_GROUP_SIZES
+### Generate groups sizes based on number of tree pairs avaialble
+### target_sum: total number of trees available
+### RETURN groups of trees
+#####################################################################
+
+def generate_group_sizes(target_sum, max_groups=500):
+    degree = 1
+    current_sum = 0
+    group_sizes = []
+    while current_sum < target_sum:
+        k = 0
+        current_sum = 0
+        group_sizes = []
+        while k < max_groups:
+            value = int(2 ** (k /(max_groups / degree)))
+            if value > 1e20:
+                break
+            group_sizes.append(value)
+            current_sum += value
+            k += 1
+        degree += 1
+    return group_sizes
 
 
 #####################################################################
 ### FUNCTION MAKE_GROUPS
 ### Generate groups of tree pairs based on the approx rspr distnace
-### min_limit: minimum approx rspr distance for the first group
+### results: the input results dataframe
 ### RETURN groups of trees
 #####################################################################
 
+def make_groups_v1(results, min_limit=10):
+    print("Generating groups")
+    min_group = results[results["approx_drSPR"] <= min_limit]["file_name"].tolist()
+    groups = defaultdict()
+    first_group = "group_0"
+    groups[first_group] = min_group
+
+    rem_results = results[results["approx_drSPR"] > min_limit].sort_values(
+        by="approx_drSPR", ascending=False
+    )
+    rem_length = len(rem_results)
+    group_sizes = generate_group_sizes(rem_length)
+    cur_index, grp_idx = 0, 0
+    while cur_index < rem_length:
+        cur_group_names = rem_results.iloc[cur_index : cur_index + group_sizes[grp_idx]]["file_name"].tolist()
+        groups[f"group_{grp_idx+1}"] = cur_group_names
+        cur_index += group_sizes[grp_idx]
+        grp_idx += 1
+    return groups
+
+
+#####################################################################
+### FUNCTION MAKE_GROUPS
+### Generate groups of tree pairs based on the approx rspr distnace
+### min_limit: minimum approx rspr distance for the first group
+### RETURN groups of trees
+#####################################################################
 
 def make_groups(results, min_limit=10):
     print("Generating groups")
-    results["approx_drSPR"] = pd.to_numeric(results["approx_drSPR"])
     min_group = results[results["approx_drSPR"] <= min_limit]["file_name"].tolist()
     groups = defaultdict()
     first_group = "group_0"
@@ -240,7 +437,7 @@ def make_groups(results, min_limit=10):
 
 def make_groups_from_csv(input_df, min_limit):
     print("Generating groups from CSV")
-    groups = make_groups(input_df, min_limit)
+    groups = make_groups_v1(input_df, min_limit)
     tidy_data = [
         (key, val)
         for key, value in groups.items()
@@ -252,12 +449,6 @@ def make_groups_from_csv(input_df, min_limit):
     return merged
 
 
-def make_heatmap_from_csv(input_path, output_path):
-    print("Generating heatmap from CSV")
-    results = pd.read_csv(input_path)
-    make_heatmap(results, output_path)
-
-
 def join_annotation_data(df, annotation_data):
     ann_df = pd.read_table(annotation_data, dtype={"genome_id": "str"})
     ann_df.columns = map(str.lower, ann_df.columns)
@@ -293,9 +484,27 @@ def main(args=None):
         args.MIN_BRANCH_LENGTH,
         args.MAX_SUPPORT_THRESHOLD,
     )
+
+    # Generate standard heatmap
+    results["approx_drSPR"] = pd.to_numeric(results["approx_drSPR"])
     fig_path = os.path.join(args.OUTPUT_DIR, "output.png")
-    make_heatmap(results, fig_path)
+    make_heatmap(
+        results,
+        fig_path,
+        args.MIN_HEATMAP_RSPR_DISTANCE,
+        args.MAX_HEATMAP_RSPR_DISTANCE
+    )
+
+    # Generate group heatmap
+    group_fig_path = os.path.join(args.OUTPUT_DIR, "group_output.png")
+    make_group_heatmap(
+        results, 
+        group_fig_path,
+        args.MIN_HEATMAP_RSPR_DISTANCE,
+        args.MAX_HEATMAP_RSPR_DISTANCE
+    )
 
+    # Generate groups for exact rSPR
     results.reset_index("file_name", inplace=True)
     if args.ANNOTATION:
         results = join_annotation_data(results, args.ANNOTATION)
@@ -317,7 +526,12 @@ def main(args=None):
     phylo_dir = os.path.join(args.INPUT_DIR_PATH, "phylogenomics")
     res_path = os.path.join(phylo_dir, 'output.csv')
     fig_path = os.path.join(phylo_dir, 'output.png')
-    make_heatmap_from_csv(res_path, fig_path)
+    make_heatmap_from_csv(
+        results,
+        fig_path,
+        args.MIN_HEATMAP_RSPR_DISTANCE,
+        args.MAX_HEATMAP_RSPR_DISTANCE
+    )
     """