Steps to refresh all publication-ready figures. All steps assume your current directory is the top of this repository.
See these instructions to obtain the current data release.
We recommend using the download script to obtain data because this will automatically create symlinks in data/
to the latest files.
See these instructions for setting up the project Docker container.
Briefly, the latest version of the project Docker image, which is updated upon commit to master
, can be obtained and run via:
docker pull ccdlopenpbta/open-pbta:latest
docker run \
-e PASSWORD=<password> \
-p 8787:8787 \
-v $(pwd):/home/rstudio/kitematic \
ccdlopenpbta/open-pbta:latest
You may choose to use docker exec
to interact with the container from there or if you'd prefer the RStudio interface, you can navigate to localhost:8787
and enter username rstudio
and the password you set with the run
command above.
This script runs all the intermediate steps needed to generate figures starting with the original data files.
bash figures/generate-figures.sh
Figures are saved to the figures/pngs
folder and will be linked to the accompanying manuscript repository AlexsLemonade/OpenPBTA-manuscript
.
Each figure has its own script stored in the figures/scripts
.
All are called by the main bash script figures/run-figures.sh
.
However, we list information about the resources, intermediate steps, and PBTA data files required for generating each figure below for convenience.
Figure | Individual script | Notes on requirements | Linked analysis modules | PBTA data files consumed |
---|---|---|---|---|
Figure 1 | No individual script | No high RAM requirements | sample-distribution-analysis |
pbta-histologies.tsv |
Figure 2 | scripts/fig2-mutational-landscape.R |
256GB of RAM are needed due to the run_caller_consensus_analysis-pbta.sh handling of large MAF files | snv-callers mutational-signatures |
pbta-snv-lancet.vep.maf.gz pbta-snv-mutect2.vep.maf.gz pbta-snv-strelka2.vep.maf.gz pbta-snv-vardict.vep.maf.gz tcga-snv-lancet.vep.maf.gz tcga-snv-mutect2.vep.maf.gz tcga-snv-strelka2.vep.maf.gz |
CN status heatmap | analyses/copy_number_consensus_call/run_consensus_call.sh and analyses/cnv-chrom-plot/cn_status_heatmap.Rmd |
No high RAM requirements | cnv-chrom-plot |
pbta-cnv-controlfreec.tsv.gz pbta-sv-manta.tsv.gz pbta-cnv-cnvkit.seg.gz |
Figure 3 | No individual script ( analyses/focal-cn-file-preparation/run-prepare-cn.sh and analyses/oncoprint-landscape/run-oncoprint.sh scripts are used) |
24GB of RAM are needed due to the run-prepare-cn.sh handling of large copy number files |
focal-cn-file-preparation oncoprint-landscape |
pbta-histologies.tsv pbta-snv-consensus-mutation.maf.tsv.gz pbta-fusion-putative-oncogenic.tsv consensus_seg_annotated_cn_autosomes.tsv.gz independent-specimens.wgs.primary-plus.tsv |
Transcriptomic overview | scripts/transcriptomic-overview.R | Due to the GSVA steps, we recommend ~32 GB of RAM for generating this figure | transcriptomic-dimension-reduction collapse-rnaseq gene-set-enrichment-analysis immune-deconv |
pbta-histologies.tsv pbta-gene-expression-rsem-fpkm.stranded.rds |
Mutation co-occurrence | No individual script ( analyses/interaction-plots/01-create-interaction-plots.sh is used) |
No high RAM requirements | interaction-plots |
independent-specimens.wgs.primary-plus.tsv pbta-snv-consensus-mutation.maf.tsv.gz |
Telomerase activities | scripts/TelomeraseActivitites.R | No high RAM requirements | telomerase-activity-prediction |
pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds pbta-gene-expression-rsem-fpkm-collapsed.polya.rds pbta-gene-counts-rsem-expected_count-collapsed.stranded.rds pbta-gene-counts-rsem-expected_count-collapsed.polya.rds pbta-histologies.tsv |
This project has a set of unified color palettes.
There are 5 sets of hex color keys to be used for all final figures, stored as 5 TSV files in the figures/palettes
folder.
hex_codes
contains the colors to be passed to your plotting code and color_names
contains short descriptors of each color (e.g. gradient_1
, or divergent_neutral
).
Each palette contains an na_color
that is the same color in all palettes.
This color should be used for all NA
values.
na_color
is always the last value in the palette.
If na_color
is not needed or is supplied separately to a plotting function, you can use a dplyr::filter(hex_code != "na_color")
to remove na_color
.
To see a summary of what colors are used for histology labeling, see mapping-histology-labels.nb.html
Step 1) Read in color palette and select the pertinent columns
There are some extra columns in histology_label_color_table.tsv
that you don't need for plotting per se but are more record-keeping purposes.
With the code chunk below, you can import the columns you need (For example: Kids_First_Biospecimen_ID, display_group, display_order, hex_codes
or Kids_First_Biospecimen_ID, cancer_group, cancer_group_order, cancer_group_hex_codes
and then do a factor reorder to make sure the display_group
(or cancer_group
)is in the order declared by display_order
(cancer_group_order
).
# Import standard color palettes for project
histology_label_mapping <- readr::read_tsv(
file.path(figures_dir, "palettes", "histology_label_color_table.tsv")
) %>%
# Select just the columns we will need for plotting
dplyr::select(Kids_First_Biospecimen_ID, display_group, display_order, hex_codes) %>%
# Reorder display_group based on display_order
dplyr::mutate(display_group = forcats::fct_reorder(display_group, display_order))
Step 2) Use dplyr::inner_join
using Kids_First_Biospecimen_ID
to join by so you can add on the hex_codes
and display_group
for each biospecimen.
display_order
specifies what order the display_group
s should be displayed.
# Read in the metadata
metadata <- readr::read_tsv(metadata_file, guess_max = 10000) %>%
dplyr::inner_join(histology_label_mapping, by = "Kids_First_Biospecimen_ID")
Step 4) Make your plot and use the hex_codes
and display_group
columns.
Using the ggplot2::scale_fill_identity()
or ggplot2::scale_color_identity()
allows you to supply the hex_codes
column from a color palette to ggplot2
with a fill
or color
argument respectively.
For base R plots, you should be able to supply the hex_codes
column as your col
argument.
display_group
should be used as the labels in the plot.
metadata %>%
dplyr::group_by(display_group, hex_codes) %>%
dplyr::summarize(count = dplyr::n()) %>%
ggplot2::ggplot(ggplot2::aes(x = display_group, y = count, fill = hex_codes)) +
ggplot2::geom_bar(stat = "identity") +
ggplot2::scale_fill_identity()
Step 1) Import the palette.
You may want to remove the na_color
at the end of the list depending on whether your data include NA
s or if the plotting function you are using has the na_color
supplied separately.
gradient_col_palette <- readr::read_tsv(
file.path(figures_dir, "palettes", "gradient_color_palette.tsv")
)
If we need the NA
color separated, like for use with ComplexHeatmap
which has a separate argument for the color for NA
values.
na_color <- gradient_col_palette %>%
dplyr::filter(color_names == "na_color")
gradient_col_palette <- gradient_col_palette %>%
dplyr::filter(color_names != "na_color")
Step 2) Make a color function.
In this example, we are building a colorRamp2
function based on a regular interval between the minimum and maximum of our variable df$variable
by using seq
.
However, depending on your data's distribution a regular interval based palette might not represent your data well on the plot.
You can provide any numeric vector to color code a palette using circlize::colorRamp2
as long as that numeric vector is the same length as the palette itself.
gradient_col_val <- seq(from = min(df$variable), to = max(df$variable),
length.out = nrow(gradient_col_palette))
col_fun <- circlize::colorRamp2(gradient_col_val,
gradient_col_palette$hex_codes)
Step 3) Apply to numeric data, or supply to your plotting code.
This step depends on how your main plotting function would like the data supplied.
For example, ComplexHeatmap
wants a function to be supplied to their col
argument.
# Apply to variable directly and make a new column
df <- df %>%
dplyr::mutate(color_key = col_fun(variable))
## OR ##
# Some plotting packages want a color function
ComplexHeatmap::Heatmap(
df,
col = col_fun,
na_col = na_color$hex_codes
)
The non-histologies color palette TSV files are created by running scripts/color_palettes.R
, which can be called by Rscript scripts/color_palettes.R
.
Hex codes for the palettes are hard-coded in this script.
The script can be called from anywhere in this repository (will look for the .git
file).
The hex codes table in figures/README.md
and its swatches should also be updated by using the swatches_table
function at the end of the script and copy and pasting this function's output to the appropriate place in the table.
The histology color palette file is created by running Rscript -e "rmarkdown::render('figures/mapping-histology-labels.Rmd', clean = TRUE)"
.
In general, we will use the ggpubr
package with ggtheme = theme_pubr())
and color palette simpsons
from package ggsci
since it has 16 levels and can accommodate the levels in groups such as molecular_subtype
.
To view the palette:
scales::show_col(ggsci::pal_simpsons("springfield")(16))
For 2+ group comparisons, we will use violin or boxplots with jitter.
Some modules perform group-wise comparisons.
For the manuscript, we may want to output tables of the statistics and/or print the statistical test and p-value directly on the plot.
We use the functions ggpubr::compare_means()
and ggpubr::stat_compare_means()
for this.
Below are the default tests, parameters, and method options for 2 groups or more than two groups for your convenience.
Caution: the default p-values on the plots are uncorrected.
2 groups | 3+ groups | |
---|---|---|
Default test (method) | Wilcoxon | Kruskal-wallis |
Allowed methods | "wilcox.test" (non-parametric) "t.test" (parametric) | "kruskal.test" (non-parametric) "anova" (parametric) |
Default multiple testing (p.adjust.method) | NA | yes, but not bonferroni |
Allowed p.adjust.method | NA | "holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none" |
Below is an example for creating a violin plot with boxplot, jitter, and appropriate statistics.
if(length(unique(df$var_x)) > 2){
method <- "kruskal.test"
} else {
method <- "wilcox.test"
}
p <- ggviolin(df, x = "var_x", y = "var_y",
color = "var_color",
palette = "simpsons",
order = c("a", "b", "c"),
add = c("boxplot", "jitter"),
ggtheme = theme_pubr()) +
# Add pairwise comparisons p-value
stat_compare_means(method = method, label.y = 1.2, label.x.npc = "center") +
xlab("xlab_text") +
ylab("ylab_text") +
rremove("legend")