Commit

Adds PySCENIC wrappers (#328)
* pyscenic grn passing tests

* Minor improvements to grn

* Shed file for pyscenic

* ctx passing tests (test files unavailable for CI though)

* AUCell passing tests locally

* Cosmetic changes

* Apply suggestions from bgruening's code review

Co-authored-by: Björn Grüning <[email protected]>

* Formatting and others

* Test data, formatting, arboretum option

* Fix tests

* ctx missing documentation

* Use macros

* Boolean variables

* Tool version macro and missing boolean

* Apply suggestions from Bjeorn's code review

Co-authored-by: Björn Grüning <[email protected]>

* Hopefully fixes test for ctx

* Missing boolean changes

---------

Co-authored-by: Björn Grüning <[email protected]>
pcm32 and bgruening authored Aug 20, 2024
1 parent eea5c13 commit 4990a52
Showing 6 changed files with 405 additions and 0 deletions.
21 changes: 21 additions & 0 deletions tools/tertiary-analysis/pyscenic/.shed.yml
@@ -0,0 +1,21 @@
categories:
  - Transcriptomics
  - RNA
  - Sequence Analysis
description: "PySCENIC scripts based on usage at https://pyscenic.readthedocs.io/"
long_description: |
  pySCENIC is a lightning-fast python implementation of the SCENIC pipeline (Single-Cell rEgulatory Network Inference and Clustering)
  which enables biologists to infer transcription factors, gene regulatory networks and cell types from single-cell RNA-seq data.
name: suite_pyscenic
owner: ebi-gxa
remote_repository_url: https://github.com/ebi-gene-expression-group/container-galaxy-sc-tertiary/
type: unrestricted
auto_tool_repositories:
  name_template: "{{ tool_id }}"
  description_template: "Wrapper for the pySCENIC tool suite: {{ tool_name }}"
suite:
  name: "suite_pyscenic"
  description: "PySCENIC scripts based on usage at https://pyscenic.readthedocs.io/"
  long_description: |
    pySCENIC is a lightning-fast python implementation of the SCENIC pipeline (Single-Cell rEgulatory Network Inference and Clustering)
    which enables biologists to infer transcription factors, gene regulatory networks and cell types from single-cell RNA-seq data.
28 changes: 28 additions & 0 deletions tools/tertiary-analysis/pyscenic/get_test_data.sh
@@ -0,0 +1,28 @@
#!/usr/bin/env bash
TF_DATA_LINK='https://raw.githubusercontent.com/aertslab/scenic-nf/master/example/allTFs_hg38.txt'
MOTIF2TF_LINK='https://raw.githubusercontent.com/aertslab/scenic-nf/master/example/motifs.tbl'
RANKING_LINK='https://zenodo.org/records/13328724/files/genome-ranking_v2.feather'
LOOM_INPUT_LINK='https://raw.githubusercontent.com/aertslab/scenic-nf/master/example/expr_mat.loom'

REGULONS_LINK='https://zenodo.org/records/13328724/files/regulons.tsv'
TF2TARGETS_LINK='https://zenodo.org/records/13328724/files/tf2targets.tsv'

function get_data {
    local link=$1
    local fname=$2

    # Download only when the file is not already present locally.
    if [ ! -f "$fname" ]; then
        echo "$fname not available locally, downloading.."
        wget -O "$fname" --retry-connrefused --waitretry=1 --read-timeout=20 --timeout=15 -t 3 "$link"
    fi
}

# get test data
mkdir -p test-data
pushd test-data || exit 1
get_data "$TF_DATA_LINK" "allTFs_hg38.txt"
get_data "$MOTIF2TF_LINK" "motifs.tbl"
get_data "$RANKING_LINK" "genome-ranking_v2.feather"
get_data "$LOOM_INPUT_LINK" "expr_mat.loom"
get_data "$REGULONS_LINK" "regulons.tsv"
get_data "$TF2TARGETS_LINK" "tf2targets.tsv"
popd || exit 1
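The skip-if-present behaviour of `get_data` can be tried without network access. The following is a minimal sketch, not part of the commit: `fetch` is a hypothetical stand-in for the `wget` call, the `example.org` URL is a placeholder, and the "already present" message is added here only to make the second call visible.

```shell
# Sketch only: `fetch` is a hypothetical stub replacing wget so that the
# skip-if-present logic can be exercised offline.
fetch() {
    local link="$1"
    local fname="$2"
    printf 'stub content for %s\n' "$link" > "$fname"
}

get_data() {
    local link="$1"
    local fname="$2"

    if [ ! -f "$fname" ]; then
        echo "$fname not available locally, downloading.."
        fetch "$link" "$fname"
    else
        # Not in the original script; added for illustration.
        echo "$fname already present, skipping"
    fi
}

# Placeholder URL and a throwaway directory, both assumptions of this sketch.
workdir=$(mktemp -d)
first=$(get_data 'https://example.org/regulons.tsv' "$workdir/regulons.tsv")
second=$(get_data 'https://example.org/regulons.tsv' "$workdir/regulons.tsv")
echo "$first"
echo "$second"
```

The first call reports a download and creates the file; the second call finds the file and does nothing, which is what keeps repeated runs of the script cheap.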
15 changes: 15 additions & 0 deletions tools/tertiary-analysis/pyscenic/macros.xml
@@ -0,0 +1,15 @@
<macros>
<token name="@TOOL_VERSION@">0.12.1</token>
<xml name="requirements">
<requirements>
<container type="docker">
aertslab/pyscenic:@TOOL_VERSION@
</container>
</requirements>
</xml>
<xml name="citations">
<citations>
<citation type="doi">10.1038/nmeth.4463</citation>
</citations>
</xml>
</macros>
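As a rough illustration of what the `@TOOL_VERSION@` token does, the substitution can be imitated with `sed`. This is only a sketch of the textual effect; Galaxy performs the expansion itself when loading the tool, and the helper name `expand_tokens` is hypothetical.

```shell
# Sketch only: emulate the token substitution with sed. Galaxy's own macro
# expansion is not invoked here.
TOOL_VERSION='0.12.1'   # value of the @TOOL_VERSION@ token in macros.xml

expand_tokens() {
    # Replace every @TOOL_VERSION@ occurrence read from stdin.
    sed "s/@TOOL_VERSION@/${TOOL_VERSION}/g"
}

container=$(printf '%s\n' 'aertslab/pyscenic:@TOOL_VERSION@' | expand_tokens)
version=$(printf '%s\n' '@TOOL_VERSION@+galaxy0' | expand_tokens)
echo "container: $container"
echo "tool version: $version"
```

Keeping the version in one token means the container tag and the tool `version` attribute cannot drift apart when pySCENIC is upgraded.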
104 changes: 104 additions & 0 deletions tools/tertiary-analysis/pyscenic/pyscenic_aucell.xml
@@ -0,0 +1,104 @@
<tool id="pyscenic_aucell" name="PySCENIC AUCell" profile="21.09" version="@TOOL_VERSION@+galaxy0">
<description>calculates AUCell to find relevant regulons/gene sets</description>
<macros>
<import>macros.xml</import>
</macros>
<expand macro="requirements"/>
<command><![CDATA[
ln -s '${expression_mtx_fname}' expr_mat.loom &&
ln -s '${signatures_fname}' regulons.tsv &&
pyscenic aucell expr_mat.loom regulons.tsv
-o aucell.tsv
$transpose
$weights
--num_workers \${GALAXY_SLOTS:-1}
#if str($seed):
--seed '${seed}'
#end if
#if str($rank_threshold):
--rank_threshold '${rank_threshold}'
#end if
#if $auc_threshold
--auc_threshold '${auc_threshold}'
#end if
#if $nes_threshold
--nes_threshold '${nes_threshold}'
#end if
#if $cell_id_attribute
--cell_id_attribute '${cell_id_attribute}'
#end if
#if $gene_attribute
--gene_attribute '${gene_attribute}'
#end if
$sparse
&& mv aucell.tsv '${output}'
]]></command>
<inputs>
<param name="expression_mtx_fname" format="loom" type="data" label="Expression Matrix File" help="The file that contains the expression matrix for the single-cell experiment. Supported formats: csv (rows=cells x columns=genes) or loom (rows=genes x columns=cells)."/>
<param name="signatures_fname" type="data" format="csv,tabular" label="Gene Signatures/Regulons File" help="The file that contains the gene signatures (usually the precomputed regulons). Currently only csv/tsv are supported."/>
<param type="boolean" name="transpose" label="Transpose Expression Matrix" truevalue="-t" falsevalue="" help="Use this if the matrix is cells x genes instead of genes x cells as expected."/>
<param name="weights" type="boolean" label="Use Weights for Recovery Analysis" truevalue="-w" falsevalue="" help="Use weights associated with genes in recovery analysis. Relevant when gene signatures are supplied as json format."/>
<param name="seed" type="integer" label="Seed for Ranking" help="Seed for the expression matrix ranking step. The default is to use a random seed." optional="true"/>
<param name="rank_threshold" type="integer" label="Rank Threshold" help="The rank threshold used for deriving the target genes of an enriched motif (default: 5000)." optional="true"/>
<param name="auc_threshold" type="float" label="AUC Threshold" help="The threshold used for calculating the AUC of a feature as fraction of ranked genes (default: 0.05)." optional="true"/>
<param name="nes_threshold" type="float" label="NES Threshold" help="The Normalized Enrichment Score (NES) threshold for finding enriched features (default: 3.0)." optional="true"/>
<param name="cell_id_attribute" type="text" label="Cell ID Attribute" help="The name of the column attribute that specifies the identifiers of the cells in the loom file." optional="true"/>
<param name="gene_attribute" type="text" label="Gene Attribute" help="The name of the row attribute that specifies the gene symbols in the loom file." optional="true"/>
<param name="sparse" type="boolean" label="Sparse Matrix" truevalue="--sparse" falsevalue="" help="If set, load the expression data as a sparse matrix. Currently applies to the grn inference step only."/>
</inputs>
<outputs>
<data name="output" format="tabular" label="${tool.name} on ${on_string}: AUCell scores for regulons or gene sets."/>
</outputs>
<tests>
<test expect_num_outputs="1">
<param name="expression_mtx_fname" value="expr_mat.loom"/>
<param name="signatures_fname" value="regulons.tsv"/>
<output name="output">
<assert_contents>
<has_n_lines n="101"/>
<has_text text="CEBPB"/>
</assert_contents>
</output>
</test>
</tests>
<help>
<![CDATA[
Run the PySCENIC aucell command to analyze single-cell gene expression data.

**Input Parameters:**

- **expression_mtx_fname**: The name of the file that contains the expression matrix for the single cell experiment. Two file formats are supported: csv (rows=cells x columns=genes) or loom (rows=genes x columns=cells).
- **signatures_fname**: The name of the file that contains the gene signatures (usually the precomputed regulons). This wrapper currently accepts csv/tsv only.

**Options:**

- **-o, --output**: Output file/stream, a matrix of AUC values. Two file formats are supported: csv or loom. If a loom file is specified, it will contain the original expression matrix and the calculated AUC values as extra column attributes.
- **-t, --transpose**: Transpose the expression matrix if supplied as csv (rows=genes x columns=cells).
- **-w, --weights**: Use weights associated with genes in recovery analysis. Only relevant when gene signatures are supplied in json format.
- **--seed**: Seed for the expression matrix ranking step. The default is to use a random seed.

**Motif Enrichment Arguments:**

- **--rank_threshold**: The rank threshold used for deriving the target genes of an enriched motif (default: 5000).
- **--auc_threshold**: The threshold used for calculating the AUC of a feature as fraction of ranked genes (default: 0.05).
- **--nes_threshold**: The Normalized Enrichment Score (NES) threshold for finding enriched features (default: 3.0).

**Loom File Arguments:**

- **--cell_id_attribute**: The name of the column attribute that specifies the identifiers of the cells in the loom file.
- **--gene_attribute**: The name of the row attribute that specifies the gene symbols in the loom file.
- **--sparse**: If set, load the expression data as a sparse matrix. Currently applies to the grn inference step only.
]]>
</help>
<expand macro="citations"/>
</tool>
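The way the command template above emits optional flags only when a value is supplied can be sketched in plain shell. This is an illustration under stated assumptions, not what Galaxy runs: `build_aucell_cmd` is a hypothetical helper, it covers only three of the parameters, and it prints the command line rather than executing `pyscenic`.

```shell
# Hypothetical helper mirroring the template's optional-flag logic.
# Arguments: seed, rank_threshold, transpose flag (each may be empty).
build_aucell_cmd() {
    local seed="$1"
    local rank_threshold="$2"
    local transpose="$3"
    local cmd="pyscenic aucell expr_mat.loom regulons.tsv -o aucell.tsv"

    # Boolean parameters expand to their truevalue or to nothing.
    if [ -n "$transpose" ]; then cmd="$cmd $transpose"; fi

    # GALAXY_SLOTS falls back to 1 when the job runner does not set it.
    cmd="$cmd --num_workers ${GALAXY_SLOTS:-1}"

    # Optional parameters are appended only when they carry a value.
    if [ -n "$seed" ]; then cmd="$cmd --seed $seed"; fi
    if [ -n "$rank_threshold" ]; then cmd="$cmd --rank_threshold $rank_threshold"; fi

    echo "$cmd"
}

build_aucell_cmd '' '' ''
build_aucell_cmd 42 5000 -t
```

With no optional values the command carries only the fixed arguments and `--num_workers`; supplying a seed, a rank threshold and the transpose flag adds `-t`, `--seed 42` and `--rank_threshold 5000`.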
139 changes: 139 additions & 0 deletions tools/tertiary-analysis/pyscenic/pyscenic_ctx.xml
@@ -0,0 +1,139 @@
<tool id="pyscenic_ctx" name="PySCENIC CTX" profile="21.09" version="@TOOL_VERSION@+galaxy0">
<description>
computes active regulons based on a gene regulatory network
</description>
<macros>
<import>macros.xml</import>
</macros>
<expand macro="requirements"/>
<command><![CDATA[
#set PySCENIC_DB = "db.genes_vs_motifs.rankings.feather"
ln -s '${module_fname}' tf2targets.tsv &&
ln -s '${expression_mtx}' expr_mat.loom &&
ln -s '${database_fname}' ${PySCENIC_DB} &&
pyscenic ctx tf2targets.tsv ${PySCENIC_DB}
--expression_mtx_fname expr_mat.loom
--output regulons.tsv
$no_pruning
#if $chunk_size
--chunk_size '${chunk_size}'
#end if
--mode custom_multiprocessing
--num_workers \${GALAXY_SLOTS:-1}
$all_modules
$transpose
#if $rank_threshold
--rank_threshold '${rank_threshold}'
#end if
#if $auc_threshold
--auc_threshold '${auc_threshold}'
#end if
#if $nes_threshold
--nes_threshold '${nes_threshold}'
#end if
#if $min_orthologous_identity
--min_orthologous_identity '${min_orthologous_identity}'
#end if
#if $max_similarity_fdr
--max_similarity_fdr '${max_similarity_fdr}'
#end if
#if $annotations_fname
--annotations_fname '${annotations_fname}'
#end if
#if $thresholds
--thresholds '${thresholds}'
#end if
#if $top_n_targets
--top_n_targets '${top_n_targets}'
#end if
#if $top_n_regulators
--top_n_regulators '${top_n_regulators}'
#end if
#if $min_genes
--min_genes '${min_genes}'
#end if
$mask_dropouts
#if $cell_id_attribute
--cell_id_attribute '${cell_id_attribute}'
#end if
#if $gene_attribute
--gene_attribute '${gene_attribute}'
#end if
$sparse
]]></command>
<inputs>
<param type="data" name="module_fname" format="tabular" label="Module File" help="Signatures or the co-expression modules. Usually the output from pyscenic grn."/>
<param type="data" name="database_fname" label="Database File" help="Regulatory feature databases. Supported formats: feather"/>
<param type="data" name="annotations_fname" format="tabular" label="Annotations File" help="File that contains the motif annotations to use."/>
<param type="data" name="expression_mtx" format="loom" label="Expression Matrix" help="The expression matrix for the single cell experiment."/>
<param type="boolean" name="no_pruning" label="No Pruning" truevalue="--no_pruning" falsevalue="" help="Do not perform pruning, i.e. find enriched motifs."/>
<param type="integer" name="chunk_size" label="Chunk Size" optional="true" help="The size of the module chunks assigned to a node in the dask graph (default: 100)."/>
<param type="boolean" name="all_modules" label="All Modules" truevalue="--all_modules" falsevalue="" help="Include positive and negative regulons in the analysis (default: no, i.e. only positive)."/>
<param type="boolean" name="transpose" label="Transpose Expression Matrix" truevalue="-t" falsevalue="" help="Use this if the matrix is cells x genes instead of genes x cells as expected."/>
<param type="integer" name="rank_threshold" label="Rank Threshold" optional="true" help="The rank threshold used for deriving the target genes of an enriched motif."/>
<param type="float" name="auc_threshold" label="AUC Threshold" optional="true" help="The threshold used for calculating the AUC of a feature as fraction of ranked genes."/>
<param type="float" name="nes_threshold" label="NES Threshold" optional="true" help="The Normalized Enrichment Score (NES) threshold for finding enriched features."/>
<param type="float" name="min_orthologous_identity" label="Minimum Orthologous Identity" optional="true" help="Minimum orthologous identity to use when annotating enriched motifs."/>
<param type="float" name="max_similarity_fdr" label="Maximum Similarity FDR" optional="true" help="Maximum FDR in motif similarity to use when annotating enriched motifs."/>
<param type="text" name="thresholds" label="Thresholds" optional="true" help="Thresholds to use for selecting the features (e.g., motifs)."/>
<param type="integer" name="top_n_targets" label="Top N Targets" optional="true" help="The number of top targets to retain for each feature."/>
<param type="integer" name="top_n_regulators" label="Top N Regulators" optional="true" help="The number of top regulators to retain for each feature."/>
<param type="integer" name="min_genes" label="Minimum Genes" optional="true" help="The minimum number of genes a module needs to have to be considered for regulatory network analysis."/>
<param type="boolean" name="mask_dropouts" label="Mask Dropouts" truevalue="--mask_dropouts" falsevalue="" help="Mask dropouts in the expression matrix."/>
<param type="text" name="cell_id_attribute" label="Cell ID Attribute" optional="true" help="The name of the attribute in the loom expression matrix that contains cell IDs."/>
<param type="text" name="gene_attribute" label="Gene Attribute" optional="true" help="The name of the attribute in the loom expression matrix that contains gene names."/>
<param name="sparse" type="boolean" label="Sparse Matrix" truevalue="--sparse" falsevalue="" help="If set, load the expression data as a sparse matrix. Currently applies to the grn inference step only."/>
</inputs>
<outputs>
<data name="output" format="tabular" from_work_dir="regulons.tsv" label="${tool.name} on ${on_string}: table of enriched motifs and target genes"/>
<!-- Define other output formats as needed -->
</outputs>
<tests>
<test expect_num_outputs="1">
<param name="module_fname" value="tf2targets.tsv"/>
<param name="expression_mtx" value="expr_mat.loom"/>
<param name="database_fname" value="genome-ranking_v2.feather"/>
<param name="annotations_fname" value="motifs.tbl"/>
<output name="output" file="regulons.tsv" compare="sim_size" delta_frac="0.2"/>
</test>
</tests>
<help><![CDATA[
.. class:: infomark
   :name: warning

**pySCENIC ctx: Contextualize GRN**

This tool refines gene regulatory networks (GRNs) by pruning targets that do not have an enrichment for a corresponding motif of the transcription factor (TF). This process effectively separates direct from indirect targets based on the presence of cis-regulatory footprints.

**Inputs:**

- **Module File**: A file containing the signature or co-expression modules. Supported formats include CSV, TSV (adjacencies), YAML, GMT, and DAT (modules).
- **Database Files**: One or more regulatory feature databases. Supported formats include feather or db (legacy).
- **Annotations File**: A file containing the motif annotations to use.

**Optional Parameters:**

- **No Pruning**: Do not perform pruning, i.e., find enriched motifs.
- **Chunk Size**: The size of the module chunks assigned to a node in the dask graph (default: 100).
- **Mode**: The mode to be used for computing (default: custom_multiprocessing).
- **All Modules**: Include positive and negative regulons in the analysis (default: only positive).
- **Transpose**: Transpose the expression matrix (rows=genes x columns=cells).
- **Rank Threshold**: The rank threshold used for deriving the target genes of an enriched motif (default: 5000).
- **AUC Threshold**: The threshold used for calculating the AUC of a feature as a fraction of ranked genes (default: 0.05).
- **NES Threshold**: The Normalized Enrichment Score (NES) threshold for finding enriched features (default: 3.0).
- **Min Orthologous Identity**: Minimum orthologous identity to use when annotating enriched motifs (default: 0.0).
- **Max Similarity FDR**: Maximum FDR in motif similarity to use when annotating enriched motifs (default: 0.001).
- **Thresholds**: The first method to create the TF-modules based on the best targets for each transcription factor (default: 0.75 0.90).
- **Top N Targets**: The second method is to select the top targets for a given TF (default: 50).
- **Top N Regulators**: The alternative way to create the TF-modules is to select the best regulators for each gene (default: 5 10 50).
- **Min Genes**: The minimum number of genes in a module (default: 20).
- **Expression Matrix File**: The name of the file that contains the expression matrix for the single-cell experiment. Supported formats include csv (rows=cells x columns=genes) or loom (rows=genes x columns=cells). Required if modules need to be generated.
- **Mask Dropouts**: Controls whether cell dropouts (cells in which expression of either TF or target gene is 0) are masked when calculating the correlation between a TF-target pair.
- **Cell ID Attribute**: The name of the column attribute that specifies the identifiers of the cells in the loom file.
- **Gene Attribute**: The name of the row attribute that specifies the gene symbols in the loom file.
- **Sparse**: If set, load the expression data as a sparse matrix. Currently applies to the GRN inference step only.
]]></help>
<expand macro="citations"/>
</tool>
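The ctx test above compares the output with `regulons.tsv` by size (`compare="sim_size"`, `delta_frac="0.2"`) rather than by content. The idea of a fractional size tolerance can be sketched as follows. This is an assumption-laden illustration: `within_delta_frac` is a hypothetical helper, and Galaxy's real comparison may apply further rules (such as an absolute byte `delta`) that are not reproduced here.

```shell
# Hypothetical helper: succeeds when the actual size differs from the
# expected size by at most the given fraction of the expected size.
within_delta_frac() {
    # $1 = expected size, $2 = actual size, $3 = allowed fraction
    awk -v e="$1" -v a="$2" -v f="$3" \
        'BEGIN { d = a - e; if (d < 0) d = -d; exit !(d <= e * f) }'
}

# Illustrative sizes in bytes; not taken from the real test data.
expected=1000
for actual in 850 1100 1300; do
    if within_delta_frac "$expected" "$actual" 0.2; then
        echo "$actual bytes: within 20% of $expected"
    else
        echo "$actual bytes: outside 20% of $expected"
    fi
done
```

A size-based check of this kind tolerates run-to-run variation in the regulon table while still catching an empty or truncated output.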