The following describes the minimum GitHub Actions that should be deployed with any production pipeline. The actions are automatically provided via the cookiecutter templates: NextFlow and Snakemake.

1. Documentation (assumes `mkdocs build`; required for all repos)
2. Lintr (required for CCBR projects and new pipelines)
3. Dry-run with test sample data for any PR to the dev branch (required for CCBR projects and new pipelines)
4. Full-run with full sample data for any PR to the main branch (required for CCBR projects and new pipelines)
5. Auto pull/push from source (if applicable for CCBR projects and new pipelines)
Add assigned issues & PRs to user projects.
When an issue or PR is assigned to a CCBR member, this action will automatically add it to their personal GitHub Project, if they have one. This file can be copied and pasted exactly as-is into any CCBR repo from here.
GitHub Pages is a quick and easy way to build static websites for your GitHub repositories. Essentially, you write pages in Markdown, which are then rendered to HTML and hosted on GitHub, free of cost!
CCBR has used GitHub pages to provide extensive, legible and organized documentation for our pipelines. Examples are included below:
Mkdocs is the preferred documentation tool, used with the Material theme, for most of the CCBR GitHub Pages websites.
mkdocs
and the Material for mkdocs theme can be installed using the following:
pip install --upgrade pip
+pip install mkdocs
+pip install mkdocs-material
+
Also install other common dependencies:
pip install mkdocs-pymdownx-material-extras
+pip install mkdocs-git-revision-date-localized-plugin
+pip install mkdocs-git-revision-date-plugin
+pip install mkdocs-minify-plugin
+
Generally, for GitHub repos with GitHub Pages:

- Markdown files live in the `docs` folder at the root level
- Rendered HTML lives in the `gh-pages` branch at root level

The following steps can be followed to build your first website.
### mkdocs.yaml

A `mkdocs.yaml` file needs to be added to the root of the master branch. A template of this file is available in the cookiecutter template.
git clone https://github.com/CCBR/xyz.git
+cd xyz
+vi mkdocs.yaml
+git add mkdocs.yaml
+git commit -m "adding mkdocs.yaml"
+git push
+
Here is a sample `mkdocs.yaml`:
# Project Information
+site_name: CCBR How Tos
+site_author: Vishal Koparde, Ph.D.
+site_description: >-
+ The **DEVIL** is in the **DETAILS**. Step-by-step detailed How To Guides for data management and other CCBR-relevant tasks.
+
+# Repository
+repo_name: CCBR/HowTos
+repo_url: https://github.com/CCBR/HowTos
+edit_uri: https://github.com/CCBR/HowTos/edit/main/docs/
+
+# Copyright
+copyright: Copyright © 2023 CCBR
+
+# Configuration
+theme:
+ name: material
+ features:
+ - navigation.tabs
+ - navigation.top
+ - navigation.indexes
+ - toc.integrate
+ palette:
+ - scheme: default
+ primary: indigo
+ accent: indigo
+ toggle:
+ icon: material/toggle-switch-off-outline
+ name: Switch to dark mode
+ - scheme: slate
+ primary: red
+ accent: red
+ toggle:
+ icon: material/toggle-switch
+ name: Switch to light mode
+ logo: assets/images/doc-book.svg
+ favicon: assets/images/favicon.png
+
+# Plugins
+plugins:
+ - search
+ - git-revision-date
+ - minify:
+ minify_html: true
+
+
+# Customization
+extra:
+ social:
+ - icon: fontawesome/solid/users
+ link: http://bioinformatics.cancer.gov
+ - icon: fontawesome/brands/github
+ link: https://github.com/CCRGeneticsBranch
+ - icon: fontawesome/brands/docker
+ link: https://hub.docker.com/orgs/nciccbr/repositories
+ version:
+ provider: mike
+
+
+# Extensions
+markdown_extensions:
+ - markdown.extensions.admonition
+ - markdown.extensions.attr_list
+ - markdown.extensions.def_list
+ - markdown.extensions.footnotes
+ - markdown.extensions.meta
+ - markdown.extensions.toc:
+ permalink: true
+ - pymdownx.arithmatex:
+ generic: true
+ - pymdownx.betterem:
+ smart_enable: all
+ - pymdownx.caret
+ - pymdownx.critic
+ - pymdownx.details
+ - pymdownx.emoji:
+ emoji_index: !!python/name:materialx.emoji.twemoji
+ emoji_generator: !!python/name:materialx.emoji.to_svg
+ - pymdownx.highlight
+ - pymdownx.inlinehilite
+ - pymdownx.keys
+ - pymdownx.magiclink:
+ repo_url_shorthand: true
+ user: squidfunk
+ repo: mkdocs-material
+ - pymdownx.mark
+ - pymdownx.smartsymbols
+ - pymdownx.snippets:
+ check_paths: true
+ - pymdownx.superfences
+ - pymdownx.tabbed
+ - pymdownx.tasklist:
+ custom_checkbox: true
+ - pymdownx.tilde
+
+# Page Tree
+nav:
+ - Intro : index.md
+
### index.md

Create the `docs` folder and add your `index.md` there.
mkdir docs
+echo "### Testing" > docs/index.md
+git add docs/index.md
+git commit -m "adding landing page"
+git push
+
`mkdocs` can now be used to render the `.md` files to HTML:
mkdocs build
+INFO - Cleaning site directory
+INFO - Building documentation to directory: /Users/$USER/Documents/GitRepos/parkit/site
+INFO - Documentation built in 0.32 seconds
+
The above command creates a local `site` folder, which is the root of your "to-be-hosted" website. You can open the HTML files in the `site` folder locally to ensure the rendered pages look the way you want. If not, edit the `.md` files and rebuild the site.
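If you prefer a live preview while editing, `mkdocs` also ships a small development server (standard mkdocs behavior; it serves on port 8000 by default):

```bash
# rebuild and live-reload the site on every edit; open http://127.0.0.1:8000
mkdocs serve
```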
NOTE: You do not want to push the `site` folder back to GitHub, so it needs to be added to the `.gitignore` file:
echo "**/site/*" > .gitignore
+git add .gitignore
+git commit -m "adding .gitignore"
+git push
+
The following command will auto-create a `gh-pages` branch (if it does not exist) and copy the contents of the `site` folder to the root of that branch. It will also print the URL of your newly created website.
mkdocs gh-deploy
+INFO - Cleaning site directory
+INFO - Building documentation to directory: /Users/kopardevn/Documents/GitRepos/xyz/site
+INFO - Documentation built in 0.34 seconds
+WARNING - Version check skipped: No version specified in previous deployment.
+INFO - Copying '/Users/kopardevn/Documents/GitRepos/xyz/site' to 'gh-pages' branch and pushing to
+ GitHub.
+Enumerating objects: 51, done.
+Counting objects: 100% (51/51), done.
+Delta compression using up to 16 threads
+Compressing objects: 100% (47/47), done.
+Writing objects: 100% (51/51), 441.71 KiB | 4.29 MiB/s, done.
+Total 51 (delta 4), reused 0 (delta 0), pack-reused 0
+remote: Resolving deltas: 100% (4/4), done.
+remote:
+remote: Create a pull request for 'gh-pages' on GitHub by visiting:
+remote: https://github.com/CCBR/xyz/pull/new/gh-pages
+remote:
+To https://github.com/CCBR/xyz.git
+ * [new branch] gh-pages -> gh-pages
+INFO - Your documentation should shortly be available at: https://CCBR.github.io/xyz/
+
Now if you point your web browser to the URL reported by the `gh-deploy` command (i.e., https://CCBR.github.io/xyz/), you will see your HTML hosted on GitHub. After creating your docs, the cookiecutter template includes a GitHub action which will automatically perform the above tasks whenever a push is performed to the main branch.
- Click the gear icon next to "About" on the repository homepage
- Under "Website", select "Use your GitHub Pages website"
- Click "Save Changes"
All CCBR-developed pipelines and techdev efforts should be created under CCBR's GitHub Org account, and all CCBR team members should have a minimum of read-only permission to the repository.
Note: The above templates are themselves under active development! As we continue building a multitude of analysis pipelines, we keep expanding the list of "commonalities" between these analysis pipelines that need to be added to the template itself. Hence, templates are updated from time to time.
To create a new repository on GitHub using the gh CLI, you can run the following commands on Biowulf after you update the new repository name (`<ADD NEW REPO NAME>`) and the repository description (`<ADD REPO DESCRIPTION>`) in the commands below.
Naming Nomenclature:

- All Repositories: Do not remove the `CCBR/` leading the repository name, as this will correctly place the repository under the CCBR organization account.
- CCBR Projects: These should be named with the CCBR#, e.g. `CCBR/CCBR1155`.
- CCBR Pipelines: These should be descriptive of the pipeline's main process; acronyms are encouraged, e.g. `CCBR/CARLISLE`.
- TechDev projects: These should be descriptive of the techDev project to be performed and should begin with `techdev_`, e.g. `CCBR/techdev_peakcall_benchmarking`.
gh repo create CCBR/<ADD NEW REPO NAME> \
+--template CCBR/CCBR_NextflowTemplate \
+--description "<ADD REPO DESCRIPTION>" \
+--public \
+--confirm
+
gh repo create CCBR/<ADD NEW REPO NAME> \
+--template CCBR/CCBR_SnakemakeTemplate \
+--description "<ADD REPO DESCRIPTION>" \
+--public \
+--confirm
+
gh repo create CCBR/techdev_<ADD NEW REPO NAME> \
+--template CCBR/CCBR_TechDevTemplate \
+--description "<ADD REPO DESCRIPTION>" \
+--public \
+--confirm
+
Once the repo is created, then you can clone a local copy of the new repository:
gh repo clone CCBR/<reponame>
+
If you start from one of the above templates, you'll have these files already. However, if you're updating an established repository, you may need to add some of these manually.
The changelog file should be at the top level of your repo. One exception is if your repo is an R package, it should be called NEWS.md
instead. You can see an example changelog of a pipeline in active development here.
The version file should be at the top level of your repo. If your repo is a Python package, it can instead be at the package root, e.g. `src/pkg/VERSION`. If your repo is an R package, the version should be inside the `DESCRIPTION` file instead. Every time a PR is opened, the PR owner should add a line to the changelog to describe any user-facing changes such as new features added, bugs fixed, documentation updates, or performance improvements.
You will also need a CLI command to print the version. The implementation will be different depending on your repo's language and structure.
The citation file must be at the top level of your repo. See template here.
You will also need a CLI command to print the citation. The implementation will be different depending on your repo's language and structure.
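The exact implementation depends on the language and structure of the repo; as a rough sketch for a bash-based CLI (the `mytool` wrapper name is hypothetical, and it assumes top-level VERSION and CITATION.cff files):

```bash
#!/usr/bin/env bash
# Hypothetical sketch of version/citation subcommands for a bash-based CLI.
# Assumes VERSION and CITATION.cff live at the top level of the repo/install.
TOOL_DIR="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"

case "${1:-}" in
  --version)  cat "${TOOL_DIR}/VERSION" ;;
  --citation) cat "${TOOL_DIR}/CITATION.cff" ;;
  *) echo "usage: mytool [--version|--citation]" >&2; exit 1 ;;
esac
```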
Pre-commit config files
Pre-commit hooks provide an automated way to style code, draw attention to typos, and validate commit messages. Learn more about pre-commit here. These files can be customized depending on the needs of the repo. For example, you don't need the hook for formatting R code if your repo does not and will never contain any R code.
.github/PULL_REQUEST_TEMPLATE.md
The Pull Request template file helps developers and collaborators remember to write descriptive PR comments, link any relevant issues that the PR resolves, write unit tests, update the docs, and update the changelog. You can customize the Checklist in the template depending on the needs of the repo.
Issue templates (optional)
Issue templates help users know how they can best communicate with maintainers to report bugs and request new features. These are helpful but not required.
GitHub Actions are automated workflows that run on GitHub's servers to execute unit tests, render documentation, build docker containers, etc. Most Actions need to be customized for each repo. Learn more about them here.
The following outlines basic git functions to `push` and `pull` from your repository. It also includes information on creating a new branch and deleting a branch. These commands should be used in line with guidance on GitHub Repo Management.
Check which files have been changed.
git status
+
Stage files that need to be pushed
git add <thisfile>
+git add <thatfile>
+
Push changes to branch named new_feature
git push origin new_feature
+
Pull changes from branch new_feature
into your branch old_feature
git checkout old_feature
+git pull origin new_feature
+
If you have conflicting changes in the `old_feature` branch, there are two options:

1) Ignore your local changes and pull the remote anyway. This will discard the changes you've made in your local repository:
git reset --hard
+git pull
+
2) Stash your local changes, pull the remote, and then re-apply your changes on top:
git stash
+git pull
+git stash pop
+
This is a two step process.
Create the branch locally
git checkout -b <newbranch>
+
Push the branch to remote
git push -u origin <newbranch>
+
git push -u origin HEAD
+
This pushes the current branch to `origin` and tracks it, so that you don't need to specify `origin HEAD` in the future.

Delete the branch locally:

git branch -d <BranchName>
+
Delete the branch on the remote:
git push origin --delete <BranchName>
+
Pre-commit should be added to all GitHub repositories on Biowulf and any clones created elsewhere to ensure cohesive and informative commit messages. After creating the repository, run the following commands to initialize the pre-commit hook and establish the following requirements for all commit messages:
Pre-commit has been installed as a module on Biowulf. Set up an interactive session, and follow the steps below. A pre-commit configuration file is needed, and can be copied from the CCBR template repo.
# load module on biowulf
+module load precommit
+
+# CD into the GitHub repo
+cd <repo_name>
+
+# install precommit in the repo - this is the only time you need to do this, per local repo location
+pre-commit install
+
+# copy the precommit config (.pre-commit-config.yaml) from the CCBR template repo and update it as needed
+touch .pre-commit-config.yaml
+
Commits must follow the format listed below, as designated by Angular:
<type>(<scope>): <subject>
+<body>
+<BLANK LINE>
+<footer>
+
The type
must be one of the following options:
The scope
must be one of the following options:
The subject
must be a succinct description of the change. It should follow these rules:
The body
should include the motivation for the change and contrast this with previous behavior. It should follow this rule:
The `footer` should contain any information about Breaking Changes and is also the place to reference GitHub issues that this commit closes. Closed bugs should be listed on a separate line in the footer from Breaking Changes, prefixed with the "Closes" keyword (Closes #234 or Closes #123, #245, #992).

Below are some examples of properly formatted commit messages:
# example
+docs(changelog): update changelog to beta.5
+
+# example
+fix(release): need to depend on latest rxjs and zone.js
+
+# example
+feat($browser): onUrlChange event (popstate/hashchange/polling)
+Added new event to $browser:
+- forward popstate event if available
+- forward hashchange event if popstate not available
+- do polling when neither popstate nor hashchange available
+
+Breaks $browser.onHashChange, which was removed (use onUrlChange instead)
+
+# example
+fix($compile): couple of unit tests for IE9
+Older IEs serialize html uppercased, but IE9 does not...
+Would be better to expect case insensitive, unfortunately jasmine does
+not allow to user regexps for throw expectations.
+
+Closes #392
+Breaks foo.bar api, foo.baz should be used instead
+
The `gh` CLI is installed on Biowulf at `/data/CCBR_Pipeliner/db/PipeDB/bin/gh_1.7.0_linux_amd64/bin/gh`. You can run the following lines to edit your `~/.bashrc` file and add `gh` to your `$PATH`:
echo "export PATH=$PATH:/data/CCBR_Pipeliner/db/PipeDB/bin/gh_1.7.0_linux_amd64/bin" >> ~/.bashrc
+source ~/.bashrc
+
Alternatively, you can use the git
commands provided through a Biowulf module
module load git
+
A Personal Access Token (PAT) is required to access GitHub (GH) without having to authenticate by other means (like a password) every single time. You will need the gh CLI installed on your laptop, or use `/data/CCBR_Pipeliner/db/PipeDB/bin/gh_1.7.0_linux_amd64/bin/gh` on Biowulf, as described above. You can create a PAT by going here. Then you can copy the PAT and save it into a file on Biowulf (say `~/gh_token`). Next, run the following command to set everything up correctly on Biowulf (or your laptop):
gh auth login --with-token < ~/gh_token
+
If you hate re-entering your (username and) password every time you push/pull to/from GitHub (or run mkdocs gh-deploy), it is well worth spending a couple of minutes to set up SSH keys for automatic authentication. The instructions to do this are available here.
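As a minimal sketch of the standard GitHub SSH setup (the email and key path below are placeholders), generate a key, add it to the agent, and paste the public key into GitHub under Settings > SSH and GPG keys:

```bash
# generate an ed25519 key pair (accept the default path, e.g. ~/.ssh/id_ed25519)
ssh-keygen -t ed25519 -C "your_email@example.com"

# start the ssh agent and add the new private key
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_ed25519

# print the public key, then add it to your GitHub account
cat ~/.ssh/id_ed25519.pub
```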
Users should follow these links to learn more about setting up the repository, before reviewing the best practices below:
All pipelines should provide users with documentation for usage, test data, expected outputs, and troubleshooting information. Mkdocs is the recommended tool for this; however, other tools may be utilized. The templates (NextFlow, Snakemake) were written for mkdocs and provide basic YAML and markdown files for this use. They should be edited according to the pipeline's function and user needs. Examples of the requirements for each page are provided in the templates.
ref:https://nvie.com/posts/a-successful-git-branching-model/
- Main/master branch: merges require a `pass` from GitHub Actions. See GitHub Actions #4 for more information on testing requirements for merge.
- Dev branch: merges require a `pass` from GitHub Actions. See GitHub Actions #3 for more information on testing requirements for merge.
- Feature branches: created with `git flow feature start unique_feature_name`, followed by `git flow feature publish unique_feature_name`.
- Hotfix branches: created with `git flow hotfix start unique_hotfix_name`.
NOTE: While the `git flow feature start` command is recommended for creating feature branches, `git flow feature finish` is not. Using the `finish` command will automatically merge the `feature` branch into the `dev` branch, without any testing, and regardless of divergence that may have occurred during feature development.
- Initialize git flow in the repo with `git flow init`.
- Start a feature branch with `git flow feature start unique_feature_name` and then publish it with `git flow feature publish unique_feature_name`. The unique_feature_name branch will be created from the develop branch.

The following format of versioning should be followed:
v.X.Y.Z
+
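For example, a release is then tagged and pushed like this (the version number below is purely illustrative):

```bash
# tag the release commit and push the tag to GitHub
git tag -a v1.2.3 -m "v1.2.3"
git push origin v1.2.3
```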
The following rules should be applied when determining the version release:
Other notes:
The following information is meant to outline test data requirements for all pipelines; however, it should be altered to fit the needs of the specific pipeline or project developed.
- Test data should be stored in the `.test` directory, as found in all templates.
- Documentation should be included in the `.test` directory, to include the following information:

At a minimum, three test sets should be available:
1) Sub-sampled inputs, to test the pipeline's functionality and to be used as the tutorial test set.
2) Full-sample inputs, of high quality, to test the robustness of the pipeline's resources.
3) Full-sample inputs, of expected project-level quality, to test the robustness of the pipeline's error handling.

Test data should come from a CCBR project or a publicly available source. Care should be taken when choosing test data sets, to ensure that the robustness of the pipeline will be tested, as well as the ability of the pipeline to handle both high- and low-quality data. Multiple test sets may need to be created to meet these goals.
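For test set 1, sub-sampled FASTQ inputs can be generated with a tool such as seqtk (a sketch; the seed, read count, and file names are placeholders):

```bash
module load seqtk

# subsample 10,000 read pairs; reusing the same seed keeps R1/R2 in sync
seqtk sample -s100 sample1_R1.fastq.gz 10000 | gzip > .test/sample1_sub_R1.fastq.gz
seqtk sample -s100 sample1_R2.fastq.gz 10000 | gzip > .test/sample1_sub_R2.fastq.gz
```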
Users should follow these links to learn more about setting up the repository, before reviewing the best practices below:
All pipelines should provide users with:
Markdown pages can be hosted directly within the repo using GH Pages. Mkdocs is the recommended tool for this; however, other tools may be utilized. The TechDev template (written for mkdocs) provides basic YAML and markdown files for this use, which should be edited according to the project's function and user needs. Also, track blockers/hurdles using GitHub Issues.
Analysis
2.1. Background
2.2. Resources - This page should include any relevant tools that were used for testing. If these tools were loaded via Biowulf, include all version numbers. If they were installed locally, provide information on installation, source location, and any reference documents used.
2.3. Test Data - This will include information on the test data included within the project. Species information, source information, and references should be included. Any manipulation performed on the source samples should also be included. Manifest information should also be outlined. Provide location to toy dataset or real dataset or both. It is best to park these datasets at a commonly accessible location on Biowulf and share that location here. Also, make the location read-only to prevent accidental edits by other users.
2.4. Analysis Plan - Provide a summary of the plan of action. - If benchmarking tools, include the dataset used, where the source material was obtained, and all relevant metadata. - Include the location of any relevant config / data files used in this analysis - Provide information on the limits of the analysis - IE this was only tested on Biowulf, this was only performed in human samples, etc.
2.5. Results - What were the findings of this TechDev exercise? Include links to where intermediate files may be located. Include tables, plots, etc. This will serve as a permanent record of the TechDev efforts to be shared within the team and beyond.
2.6 Conclusions - Provide brief, concise conclusion drawn from the analysis, including any alterations being made to pipelines or analysis.
Contributions
We encourage the use of the Git Flow tools for some actions, available on Biowulf. Our current branching strategy is based on the Git Flow strategy shown below:
- Master (named main or master): merges require a `pass` from GitHub Actions. See GitHub Actions #4 for more information on testing requirements for merge.
- Dev: merges require a `pass` from GitHub Actions. See GitHub Actions #3 for more information on testing requirements for merge.
- Test data should be stored in the `.test` directory, as found in all templates.
- Documentation should be included in the `.test` directory, to include the following information:

HPC_DME_APIs provides command line utilities (CLUs) to interface with HPCDME. This document describes some of the initial setup steps to get the CLUs working on Biowulf.
The repo can be cloned at a location accessible to you:
cd /data/$USER/
+git clone https://github.com/CBIIT/HPC_DME_APIs.git
+
Create a log directory and an empty log file for the CLUs:
mkdir -p /data/$USER/HPCDMELOG/tmp
+touch /data/$USER/HPCDMELOG/tmp/hpc-cli.log
+
hpcdme.properties
is the file that all CLUs look into for various parameters like authentication password, file size limits, number of CPUs, etc. Make a copy of the template provided and prepare it for customization.
cd /data/$USER/HPC_DME_APIs/utils
+cp hpcdme.properties-sample hpcdme.properties
+
Some of the parameters in this file have become obsolete over time and are commented out. Change paths and default values as needed.
Note: replace
$USER
with your actual username in the properties file. Bash variables will not be interpolated.
#HPC DME Server URL
+#Production server settings
+hpc.server.url=https://hpcdmeapi.nci.nih.gov:8080
+hpc.ssl.keystore.path=hpc-client/keystore/keystore-prod.jks
+#hpc.ssl.keystore.password=hpcdmncif
+hpc.ssl.keystore.password=changeit
+
+#UAT server settings
+#hpc.server.url=https://fr-s-hpcdm-uat-p.ncifcrf.gov:7738/hpc-server
+#hpc.ssl.keystore.path=hpc-client/keystore/keystore-uat.jks
+#hpc.ssl.keystore.password=hpc-server-store-pwd
+
+#Proxy Settings
+hpc.server.proxy.url=10.1.200.240
+hpc.server.proxy.port=3128
+
+hpc.user=$USER
+
+#Globus settings
+#default globus endpoint to be used in registration and download
+hpc.globus.user=$USER
+hpc.default.globus.endpoint=ea6c8fd6-4810-11e8-8ee3-0a6d4e044368
+
+#Log files directory
+hpc.error-log.dir=/data/$USER/HPCDMELOG/tmp
+
+###HPC CLI Logging START####
+#ERROR, WARN, INFO, DEBUG
+hpc.log.level=ERROR
+hpc.log.file=/data/$USER/HPCDMELOG/tmp/hpc-cli.log
+###HPC CLI Logging END####
+
+#############################################################################
+# Please use caution changing following properties. They don't change usually
+#############################################################################
+#hpc.collection.service=collection
+#hpc.dataobject.service=dataObject
+#Log files directory
+#hpc.error-log.dir=.
+
+#Number of thread to run data file import from a CSV file
+hpc.job.thread.count=1
+
+upload.buffer.size=10000000
+
+#Retry count and backoff period for registerFromFilePath (Fixed backoff)
+hpc.retry.max.attempts=3
+#hpc.retry.backoff.period=5000
+
+#Multi-part upload thread pool, threshold and part size configuration
+#hpc.multipart.threadpoolsize=10
+#hpc.multipart.threshold=1074790400
+#hpc.multipart.chunksize=1073741824
+
+#globus.nexus.url=nexus.api.globusonline.org
+#globus.url=www.globusonline.org
+
+#HPC DME Login token file location
+hpc.login.token=tokens/hpcdme-auth.txt
+
+#Globus Login token file location
+#hpc.globus.login.token=tokens/globus-auth.txt
+#validate.md5.checksum=false
+
+# JAR version
+#hpc.jar.version=hpc-cli-1.4.0.jar
+
NOTE: The current java version used is:
java -version
openjdk version "1.8.0_181"
OpenJDK Runtime Environment (build 1.8.0_181-b13)
OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)
### ~/.bashrc

Add the CLUs to your PATH by adding the following to the `~/.bashrc` file:
# export environment variable HPC_DM_UTILS pointing to directory where
+# HPC DME client utilities are, then source functions script in there
+export HPC_DM_UTILS=/data/$USER/HPC_DME_APIs/utils
+source $HPC_DM_UTILS/functions
+
Next, source it
source ~/.bashrc
+
Now you are all set to generate a token. This prevents you from having to re-enter your password every time.
dm_generate_token
+
If the token generation takes longer than 45 seconds, check the connection:
ping hpcdmeapi.nci.nih.gov
+
If the connection responds, try exporting the following proxy, and then re-run the `dm_generate_token` command:
export https_proxy=http://dtn01-e0:3128
+
Done! You are now all set to use CLUs.
Rawdata or Project folders from Biowulf can be parked at a secure location after the analysis has reached an endpoint. Traditionally, CCBR analysts have been using GridFTP Globus Archive for doing this. But, this Globus Archive has been running relatively full lately and it is hard to estimate how much space is left there as the volume is shared among multiple groups.
parkit is designed to assist analysts in archiving project data from the NIH's Biowulf/Helix systems to the HPC-DME storage platform. It provides functionalities to package and store data such as raw FastQ files or processed data from bioinformatics pipelines. Users can automatically:

- create tarballs of their data (including `.filelist` and `.md5sum` files),
- generate metadata,
- create collections on HPC-DME, and
- deposit tar files into the system for long-term storage.

parkit also features comprehensive workflows that support both folder-based and tarball-based archiving. These workflows are integrated with the SLURM job scheduler, enabling efficient execution of archival tasks on the Biowulf HPC cluster. This integration ensures that bioinformatics project data is securely archived and well-organized, allowing for seamless long-term storage.
NOTE: HPC DME API CLUs should already be setup as per these instructions in order to use parkit
NOTE:
HPC_DM_UTILS
environment variable should be set to point to theutils
folder under theHPC_DME_APIs
repo setup. Please see these instructions.
parkit_folder2hpcdme
is the preferred parkit command to completely archive an entire folder as a tarball on HPCDME using SLURM.
### parkit_folder2hpcdme usage

parkit_folder2hpcdme --help
+
usage: parkit_folder2hpcdme [-h] [--restartfrom RESTARTFROM] [--executor EXECUTOR] [--folder FOLDER] [--dest DEST]
+ [--projectdesc PROJECTDESC] [--projecttitle PROJECTTITLE] [--rawdata] [--cleanup] [--makereadme]
+ --hpcdmutilspath HPCDMUTILSPATH [--version]
+
+End-to-end parkit: Folder 2 HPCDME
+
+options:
+ -h, --help show this help message and exit
+ --restartfrom RESTARTFROM
+ if restarting then restart from this step. Options are: createemptycollection, createmetadata, deposittar
+ --executor EXECUTOR slurm or local
+ --folder FOLDER project folder to archive
+ --dest DEST vault collection path (Analysis goes under here!)
+ --projectdesc PROJECTDESC
+ project description
+ --projecttitle PROJECTTITLE
+ project title
+ --rawdata If tarball is rawdata and needs to go under folder Rawdata
+ --cleanup post transfer step to delete local files
+ --makereadme make readme file with destination location on vault
+ --hpcdmutilspath HPCDMUTILSPATH
+ what should be the value of env var HPC_DM_UTILS
+ --version print version
+
### parkit_folder2hpcdme testing

# make a tmp folder
+mkdir -p /data/$USER/parkit_tmp
+# copy dummy project folder into the tmp folder
+cp -r /data/CCBR/projects/CCBR-12345 /data/$USER/parkit_tmp/CCBR-12345-$USER
+# check if HPC_DM_UTILS has been set
+echo $HPC_DM_UTILS
+
# source conda
+. "/data/CCBR_Pipeliner/db/PipeDB/Conda/etc/profile.d/conda.sh"
+# activate parkit or parkit_dev environment
+conda activate parkit
+# check version of parkit
+parkit --version
+
v2.0.2-dev
+
### parkit_folder2hpcdme

parkit_folder2hpcdme --folder /data/$USER/parkit_tmp/CCBR-12345-$USER --dest /CCBR_Archive/GRIDFTP/Project_CCBR-12345-$USER --projectdesc "some_description" --projecttitle "some_title" --makereadme --hpcdmutilspath $HPC_DM_UTILS --executor local
+
################ Running createtar #############################
+parkit createtar --folder "/data/$USER/parkit_tmp/CCBR-12345-kopardevn"
+tar cvf /data/$USER/parkit_tmp/CCBR-12345-kopardevn.tar /data/$USER/parkit_tmp/CCBR-12345-kopardevn > /data/$USER/parkit_tmp/CCBR-12345-kopardevn.tar.filelist
+createmetadata: /data/$USER/parkit_tmp/CCBR-12345-kopardevn.tar file was created!
+createmetadata: /data/$USER/parkit_tmp/CCBR-12345-kopardevn.tar.filelist file was created!
+createmetadata: /data/$USER/parkit_tmp/CCBR-12345-kopardevn.tar.md5 file was created!
+createmetadata: /data/$USER/parkit_tmp/CCBR-12345-kopardevn.tar.filelist.md5 file was created!
+################################################################
+############ Running createemptycollection ######################
+parkit createemptycollection --dest "/CCBR_Archive/GRIDFTP/Project_CCBR-12345-kopardevn" --projectdesc "description" --projecttitle "title"
+module load java/11.0.21 && source $HPC_DM_UTILS/functions && dm_register_collection /dev/shm/995b4648-08c2-44b7-a728-470408cb539a.json /CCBR_Archive/GRIDFTP/Project_CCBR-12345-kopardevn
+cat /dev/shm/995b4648-08c2-44b7-a728-470408cb539a.json && rm -f /dev/shm/995b4648-08c2-44b7-a728-470408cb539a.json
+module load java/11.0.21 && source $HPC_DM_UTILS/functions && dm_register_collection /dev/shm/f2d4badf-b7e6-4e10-8e93-2df9da6cdbbf.json /CCBR_Archive/GRIDFTP/Project_CCBR-12345-kopardevn/Analysis
+module load java/11.0.21 && source $HPC_DM_UTILS/functions && dm_register_collection /dev/shm/f2d4badf-b7e6-4e10-8e93-2df9da6cdbbf.json /CCBR_Archive/GRIDFTP/Project_CCBR-12345-kopardevn/Rawdata
+cat /dev/shm/f2d4badf-b7e6-4e10-8e93-2df9da6cdbbf.json && rm -f /dev/shm/f2d4badf-b7e6-4e10-8e93-2df9da6cdbbf.json
+################################################################
+########### Running createmetadata ##############################
+parkit createmetadata --tarball "/data/$USER/parkit_tmp/CCBR-12345-kopardevn.tar" --dest "/CCBR_Archive/GRIDFTP/Project_CCBR-12345-kopardevn"
+createmetadata: /data/$USER/parkit_tmp/CCBR-12345-kopardevn.tar.metadata.json file was created!
+createmetadata: /data/$USER/parkit_tmp/CCBR-12345-kopardevn.tar.filelist.metadata.json file was created!
+################################################################
+############# Running deposittar ###############################
+parkit deposittar --tarball "/data/$USER/parkit_tmp/CCBR-12345-kopardevn.tar" --dest "/CCBR_Archive/GRIDFTP/Project_CCBR-12345-kopardevn"
+module load java/11.0.21 && source $HPC_DM_UTILS/functions && dm_register_dataobject /data/$USER/parkit_tmp/CCBR-12345-kopardevn.tar.filelist.metadata.json /CCBR_Archive/GRIDFTP/Project_CCBR-12345-kopardevn/Analysis/CCBR-12345-kopardevn.tar.filelist /data/$USER/parkit_tmp/CCBR-12345-kopardevn.tar.filelist
+module load java/11.0.21 && source $HPC_DM_UTILS/functions && dm_register_dataobject_multipart /data/$USER/parkit_tmp/CCBR-12345-kopardevn.tar.metadata.json /CCBR_Archive/GRIDFTP/Project_CCBR-12345-kopardevn/Analysis/CCBR-12345-kopardevn.tar /data/$USER/parkit_tmp/CCBR-12345-kopardevn.tar
+################################################################
+
NOTE: change
--executor local
to--executor slurm
when submitting to SLURMNOTE: add
--rawdata
when folder contains raw fastqs
Transfer can be verified by logging into HPC DME web interface.
Delete unwanted collection from HPC DME.
# load java
+module load java
+# delete collection recursively
+dm_delete_collection -r /CCBR_Archive/GRIDFTP/Project_CCBR-12345-$USER
+
Reading properties from /data/kopardevn/GitRepos/HPC_DME_APIs/utils/hpcdme.properties
+WARNING: You have requested recursive delete of the collection. This will delete all files and sub-collections within it recursively. Are you sure you want to proceed? (Y/N):
+Y
+Would you like to see the list of files to delete ?
+N
+The collection /CCBR_Archive/GRIDFTP/Project_CCBR-12345-kopardevn and all files and sub-collections within it will be recursively deleted. Proceed with deletion ? (Y/N):
+Y
+Executing: https://hpcdmeapi.nci.nih.gov:8080/collection
+Wrote results into /data/kopardevn/HPCDMELOG/tmp/getCollections_Records20241010.txt
+Cmd process Completed
+Oct 10, 2024 4:43:09 PM org.springframework.shell.core.AbstractShell handleExecutionResult
+INFO: CLI_SUCCESS
+
Reach out to Vishal Koparde in case you run into issues.
Host data on Helix or Biowulf that is publicly accessible through a URL.
# Helix
+ssh -Y username@helix.nih.gov
+
+# Biowulf
+ssh -Y username@biowulf.nih.gov
+
Create a new dir (tutorial
) in the datashare path:
cd /data/CCBR/datashare/
+mkdir tutorial
+
NOTE: For all steps below, an example is shown for Helix, but the same process is applicable for Biowulf, after changing the helix.nih.gov
to biowulf.nih.gov
Now you can transfer your data to the new directory. One method is to use scp
to copy data from your local machine to Helix.
Here is an example of using scp
to copy the file file.txt
from a local directory to Helix.
scp /data/$USER/file.txt username@helix.nih.gov:/data/CCBR/datashare/tutorial/
+
To copy multiple directories recursively, you can include the `-r` flag with `scp` and run it from the top-level directory:
scp -r /data/$USER/ username@helix.nih.gov:/data/CCBR/datashare/tutorial/
+
When the data has been successfully copied, we need to open the permissions.
NOTE: This will give open access to anyone with the link. Ensure this is appropriate for the data type
# cd to the shared dir
+cd /data/CCBR/datashare/
+
+# run CHMOD, twice
+chmod -R 777 tutorial
+chmod -R 777 tutorial/*
+
+# run SETFACL
+setfacl -m u:webcpu:r-x tutorial/*
+
NOTE: You must be logged into HPC in order to access these files from a web browser.
Files will be available for access through a browser, via tools like wget
and UCSC genome track browser
via the following format:
http://hpc.nih.gov/~CCBR/tutorial/file.txt
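For example, a file hosted as above can be pulled down with wget (using the hypothetical tutorial file from this walkthrough):

```bash
wget http://hpc.nih.gov/~CCBR/tutorial/file.txt
```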
For more information and a tutorial for creating UCSC tracks, visit the CCBR HowTo Page.
Submit a pipeline to Zenodo in order to create a DOI for publication.
Use the link for full information, summarized below: https://www.youtube.com/watch?v=A9FGAU9S9Ow
The GitHub repository should include the following:
CITATION.cff
; Example hereGo to Zenodo
Select username
in the top right >> Profile
. Select `GitHub``
Click Sync Now
(top right) to update repos. NOTE: You may have to refresh the page
Toggle the On
button on the repo you wish to publish. This will move the pipeline to the Enable Repositories
list.
Go to GitHub and find the repository page.
Select Releases
>> Draft a new release
Create a tag, following naming semantics described here
Describe the tag with the following: "Connecting pipeline to Zenodo"
Go to Zenodo
Select My dashboard
>> Edit
Update the following information:
Go to Zenodo
Select username
in the top right >> Profile
. Select `GitHub``
Click Sync Now
(top right) to update repos. NOTE: You may have to refresh the page
Copy the DOI for the repository
Return to the GitHub repository and edit the README
of the GitHub repo, adding the DOI link.
Update the CITATION.cff
as needed.
Create a new tagged version.
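Tagged releases can be drafted from the GitHub web UI as above, or with the gh CLI; a small sketch (the version number and notes are placeholders):

```bash
# create a tagged release; Zenodo will pick up the new tag and mint a versioned DOI
gh release create v1.0.1 --title "v1.0.1" --notes "Add DOI badge and update CITATION.cff"
```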
To create a Conda package, you first need to make a new release and tag in the GitHub repo of the R package you would like to turn into a Conda package.
For more information on best practices in R package creation, review this documentation.
Before creating the release on github, please check for proper dependencies listed in the files NAMESPACE
and DESCRIPTION
.
These files should have the same list of dependencies, and the version numbers for dependencies can be specified in the DESCRIPTION
file. The DESCRIPTION
file must be edited manually, while the NAMESPACE file should not be edited manually, but rather created automatically using the document() function.
The DESCRIPTION
file must also be correctly formatted. For more information, see the following website.
To download the most recent release from the most recent tag on Github, activate Conda then use Conda skeleton
to pull the correct URL. In the example below, replace $githubURL
with the URL to your R package's github repo.
conda activate
+
+conda skeleton cran $githubURL
+
A folder is then created for the downloaded release. For example, running the following:
conda skeleton cran https://github.com/NIDAP-Community/DSPWorkflow
+
creates the folder:
r-dspworkflow
+
Within this newly created folder is a file named meta.yaml
. You will need to edit this file to include the channels and edit any information on the package version number or dependency version numbers.
Here is an example of the top of the meta.yaml
file with the channels section added:
{% set version = '0.9.5.2' %}
+
+{% set posix = 'm2-' if win else '' %}
+{% set native = 'm2w64-' if win else '' %}
+
+package:
+ name: r-dspworkflow
+ version: {{ version|replace("-", "_") }}
+
+channels:
+ - conda-forge
+ - bioconda
+ - default
+ - file://rstudio-files/RH/ccbr-projects/Conda_package_tutorial/local_channel/channel
+
+source:
+
+ git_url: https://github.com/NIDAP-Community/DSPWorkflow
+ git_tag: 0.9.5
+
+build:
+ merge_build_host: True # [win]
+ # If this is a new build for the same version, increment the build number.
+ number: 0
+ # no skip
+
+ # This is required to make R link correctly on Linux.
+ rpaths:
+ - lib/R/lib/
+ - lib/
+
+ # Suggests: testthat (== 3.1.4)
+requirements:
+ build:
+ - {{ posix }}filesystem # [win]
+ - {{ posix }}git
+ - {{ posix }}zip # [win]
+
Here is an example of the sections for specifying dependency versions from the meta.yaml
file:
host:
+ - r-base =4.1.3=h2f963a2_5
+ - bioconductor-biobase =2.54.0=r41hc0cfd56_2
+ - bioconductor-biocgenerics =0.40.0=r41hdfd78af_0
+ - bioconductor-geomxtools =3.1.1=r41hdfd78af_0
+ - bioconductor-nanostringnctools =1.2.0
+ - bioconductor-spatialdecon =1.4.3
+ - bioconductor-complexheatmap =2.10.0=r41hdfd78af_0
+ - r-cowplot =1.1.1=r41hc72bb7e_1
+ - r-dplyr =1.0.9=r41h7525677_0
+ - r-ggforce =0.3.4=r41h7525677_0
+ - r-ggplot2 =3.3.6=r41hc72bb7e_1
+ - r-gridextra =2.3=r41hc72bb7e_1004
+ - r-gtable =0.3.0=r41hc72bb7e_3
+ - r-knitr =1.40=r41hc72bb7e_1
+ - r-patchwork =1.1.2=r41hc72bb7e_1
+ - r-reshape2 =1.4.4=r41h7525677_2
+ - r-scales =1.2.1=r41hc72bb7e_1
+ - r-tibble =3.1.8=r41h06615bd_1
+ - r-tidyr =1.2.1=r41h7525677_1
+ - r-umap =0.2.9.0=r41h7525677_1
+ - r-rtsne =0.16=r41h37cf8d7_1
+ - r-magrittr =2.0.3=r41h06615bd_1
+ - r-rlang =1.1.0=r41h38f115c_0
+
+ run:
+ - r-base =4.1.3=h2f963a2_5
+ - bioconductor-biobase =2.54.0=r41hc0cfd56_2
+ - bioconductor-biocgenerics =0.40.0=r41hdfd78af_0
+ - bioconductor-geomxtools =3.1.1=r41hdfd78af_0
+ - bioconductor-nanostringnctools =1.2.0
+ - bioconductor-spatialdecon =1.4.3
+ - bioconductor-complexheatmap =2.10.0=r41hdfd78af_0
+ - r-cowplot =1.1.1=r41hc72bb7e_1
+ - r-dplyr =1.0.9=r41h7525677_0
+ - r-ggforce =0.3.4=r41h7525677_0
+ - r-ggplot2 =3.3.6=r41hc72bb7e_1
+ - r-gridextra =2.3=r41hc72bb7e_1004
+ - r-gtable =0.3.0=r41hc72bb7e_3
+ - r-knitr =1.40=r41hc72bb7e_1
+ - r-patchwork =1.1.2=r41hc72bb7e_1
+ - r-reshape2 =1.4.4=r41h7525677_2
+ - r-scales =1.2.1=r41hc72bb7e_1
+ - r-tibble =3.1.8=r41h06615bd_1
+ - r-tidyr =1.2.1=r41h7525677_1
+ - r-umap =0.2.9.0=r41h7525677_1
+ - r-rtsne =0.16=r41h37cf8d7_1
+ - r-magrittr =2.0.3=r41h06615bd_1
+ - r-rlang =1.1.0=r41h38f115c_0
+
In the above example, each of the dependencies has been assigned a conda build string, so that when conda builds the package it will only use that specific build of the dependency from the listed conda channels. The above example is very restrictive; the dependencies can also be listed in the `meta.yaml` file more openly, in which case conda will choose a build string that fits with the other resolved dependency build strings, based on what is available in the channels.
Also note that the "host" section matches the "run" section.
Here are some examples of a more open setup for these dependencies:
host:
+ - r-base >=4.1.3
+ - bioconductor-biobase >=2.54.0
+ - bioconductor-biocgenerics >=0.40.0
+ - bioconductor-geomxtools >=3.1.1
+ - bioconductor-nanostringnctools >=1.2.0
+ - bioconductor-spatialdecon =1.4.3
+ - bioconductor-complexheatmap >=2.10.0
+ - r-cowplot >=1.1.1
+ - r-dplyr >=1.0.9
+ - r-ggforce >=0.3.4
+ - r-ggplot2 >=3.3.6
+ - r-gridextra >=2.3
+ - r-gtable >=0.3.0
+ - r-knitr >=1.40
+ - r-patchwork >=1.1.2
+ - r-reshape2 >=1.4.4
+ - r-scales >=1.2.1
+ - r-tibble >=3.1.8
+ - r-tidyr >=1.2.1
+ - r-umap >=0.2.9.0
+ - r-rtsne >=0.16
+ - r-magrittr >=2.0.3
+ - r-rlang >=1.1.0
+
When the `meta.yaml` has been prepared, you can now build the Conda package. To do so, run the command below, replacing:

- `$r-package` with the name of the R package folder that was created after running conda skeleton (the folder where the meta.yaml is located)
- `$build_log_name.log` with the name for the log file, such as the date, time, and initials

conda-build $r-package 2>&1|tee $build_log_name.log
+
Example
conda-build r-dspworkflow 2>&1|tee 05_12_23_330_nc.log
+
The log file will list how conda has built the package, including which dependency version numbers and corresponding build strings were used to resolve the conda environment. These dependencies are what we specified in the `meta.yaml` file. The log file will be useful for troubleshooting a failed build.
Be aware, the build can take anywhere from several minutes to an hour to complete, depending on the size of the package and the number of dependencies.
The conda package will be built as a tar.bz2 file.
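The output location depends on your `.condarc` settings (see the conda setup section below); as a sketch, you can ask conda-build for the path and then serve the output directory as a local channel (the paths below reuse the example `.condarc` values and are purely illustrative):

```bash
# print where the built tar.bz2 will be / was written, without rebuilding
conda build r-dspworkflow --output

# index the output directory so it can be used as a local channel
conda index /rstudio-files/ccbr-data/users/Ned/conda-bld/conda-output

# install the package from that local channel
conda install -c file:///rstudio-files/ccbr-data/users/Ned/conda-bld/conda-output r-dspworkflow
```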
An important consideration for Conda builds is the list of dependencies, specified versions, and compatibility with each of the other dependencies.
If the meta.yaml
and DESCRIPTION
file specify specific package versions, Conda's ability to resolve the Conda environment also becomes more limited.
For example, if the Conda package we are building has the following requirements:
Dependency A version == 1.0
+Dependency B version >= 2.5
+
And the Dependencies located in our Conda channel have the following dependencies:
Dependency A version 1.0
+ - Dependency C version == 0.5
+
+Dependency A version 1.2
+- Dependency C version >= 0.7
+
+Dependency B version 2.7
+ - Dependency C version >= 0.7
+
As you can see, the Conda build will not be able to resolve the environment because Dependency A version 1.0 needs an old version of Dependency C, while Dependency B version 2.7 needs a newer version.
In this case, if we changed our package's DESCRIPTION
and meta.yaml
file to be:
Dependency A version >= 1.0
+Dependency B version >= 2.5
+
The conda build will now be able to resolve. This is a simplified version of what are often more complex dependency structures, but it is an important concept in conda package building that will inevitably arise as a package's dependencies become more specific.
To check on the versions of packages that are available in a Conda channel, use the command:
conda search $dependency
+
Replace $dependency with the name of package you would like to investigate. There are more optional commands for this function which can be found here
To check the dependencies of packages that exist in your Conda cache, go to the folder specified for your conda cache (that we specified earlier). In case you need to find that path you can use:
conda info
Here there will be a folder for each of the packages that has been used in a conda build (including the dependencies). In each folder is another folder called "info" and a file called "index.json" that lists information, such as depends for the package.
Here is an example:
```
cat /rstudio-files/ccbr-data/users/Ned/conda-cache/r-ggplot2-3.3.6-r41hc72bb7e_1/info/index.json
{
  "arch": null,
  "build": "r41hc72bb7e_1",
  "build_number": 1,
  "depends": [
    "r-base >=4.1,<4.2.0a0",
    "r-digest",
    "r-glue",
    "r-gtable >=0.1.1",
    "r-isoband",
    "r-mass",
    "r-mgcv",
    "r-rlang >=0.3.0",
    "r-scales >=0.5.0",
    "r-tibble",
    "r-withr >=2.0.0"
  ],
  "license": "GPL-2.0-only",
  "license_family": "GPL2",
  "name": "r-ggplot2",
  "noarch": "generic",
  "platform": null,
  "subdir": "noarch",
  "timestamp": 1665515494942,
  "version": "3.3.6"
}
```
If you would like to specify an exact package to use in a conda channel for your conda build, specify the build string in your "meta.yaml" file. In the above example for ggplot version 3.3.6, the build string is listed in the folder name for package as well as in the index.json file for "build: "r41hc72bb7e_1".
You will first need to set up Conda in order to use the Conda tools for creating your Conda package.
The documentation for getting started can be found here, including installation guidelines.
In a space shared with other users that may use Conda, your personal Conda cache needs to be specified. To edit how your cache is saved perform the following steps:
1) Create a new directory where you would like to store the conda cache called 'conda-cache'
mkdir conda/conda-cache
+
2) In your home directory, create the file .condarc
touch ~/.condarc
+
3) Open the new file .condarc
and add the following sections:
pkgs_dirs
envs_dirs
conda-build
channels
In each section you will add the path to the directories you would like to use for each section.
Example:
pkgs_dirs:
  - /rstudio-files/ccbr-data/users/Ned/conda-cache
envs_dirs:
  - /rstudio-files/ccbr-data/users/Ned/conda-envs
conda-build:
  root-dir: /rstudio-files/ccbr-data/users/Ned/conda-bld
  build_folder: /rstudio-files/ccbr-data/users/Ned/conda-bld/conda-build
  output_folder: /rstudio-files/ccbr-data/users/Ned/conda-bld/conda-output
channels:
  - file://rstudio-files/RH/ccbr-projects/Conda_package_tutorial/local_channel/channel
  - conda-forge
  - bioconda
  - defaults
To check that conda has been set up with the specified paths from `.condarc`, start conda:
conda activate
+
Then check the conda info:
conda info
+
Step-by-step guide for setting up and learning to use Snakemake, with examples and use cases
In order to use the genomic browser features, sample files must be created.
For individual samples, where peak density is to be observed, bigwig formatted files must be generated. If using the CCBR pipelines these are automatically generated as outputs of the pipeline (WORKDIR/results/bigwig
). In many cases, scaling or normalization of bigwig is required to visualize multiple samples in comparison with each other. See various deeptools options for details/ideas. If not using CCBR pipelines, example code is provided below for the file generation.
module load ucsc bedtools

fragments_bed="/path/to/sample1.fragments.bed"
bg="/path/to/sample1.bedgraph"
bw="/path/to/sample1.bigwig"
# chromosome sizes file for the genome (tab-delimited chrom and length), used by both tools below
genome_len="/path/to/genome.chrom.sizes"

# if using a spike-in scale, the scaling factor should be applied
# while not required, it is recommended for CUT&RUN experiments
spikein_scale="spike_in_value"

# create bedgraph file
bedtools genomecov -bg -scale $spikein_scale -i $fragments_bed -g $genome_len > $bg

# create bigwig file
bedGraphToBigWig $bg $genome_len $bw
For contrasts, where peak differences are to be observed, bigbed formatted files must be generated. If using the CCBR/CARLISLE pipeline these are automatically generated as outputs of the pipeline (WORKDIR/results/peaks/contrasts/contrast_id/). If not using this pipeline, example code is provided below for the file generation.
module load ucsc

bed="/path/to/sample1_vs_sample2_fragmentsbased_diffresults.bed"
bigbed="/path/to/output/sample1_vs_sample2_fragmentsbased_diffresults.bigbed"
# chromosome sizes file for the genome (tab-delimited chrom and length)
genome_len="/path/to/genome.chrom.sizes"

# create bigbed file
bedToBigBed -type=bed9 $bed $genome_len $bigbed
For all sample types, data must be stored in a shared directory. It is recommended that hard links be created from the source location to this shared directory to ensure that minimal disk space is used. Example code for creating the links is provided below.
# single sample
+## set source file location
+source_loc="/WORKDIR/results/bigwig/sample1.bigwig "
+
+## set destination link location
+link_loc="/SHAREDDIR/bigwig/sample1.bigwig"
+
+## create hard links
+ln $source_loc $link_loc
+
# contrast
+## set source file location
+source_loc="WORKDIR/results/peaks/contrasts/sample1_vs_sample2/sample1_vs_sample2_fragmentsbased_diffresults.bigbed "
+
+## set destination link location
+link_loc="/SHAREDDIR/bigbed/sample1_vs_sample2.bigbed"
+
+## create hard links
+ln $source_loc $link_loc
+
Once the links have been generated, the data folder must be opened for read access.
## set destination link location
+link_loc="/SHAREDDIR/bigbed/"
+
+# open dir
+chmod -R a+rX $link_loc
+
It's recommended to create a text file of all sample track information to ease in editing and submission to the UCSC browser website. A single line of code is needed for each sample which will provide the track location, sample name, description of the sample, whether to autoscale the samples, max height of the samples, view limits, and color. An example is provided below.
track type=bigWig bigDataUrl=https://hpc.nih.gov/~CCBR/ccbr1155/${dir_loc}/bigwig/${complete_sample_id}.bigwig name=${sample_id} description=${sample_id} visibility=full autoScale=off maxHeightPixels=128:30:1 viewLimits=1:120 color=65,105,225
+
Users may find it helpful to create a single script which would create this text file for all samples. An example of this is listed below, which assumes that input files were generated using the CARLISLE pipeline. It can be edited to adapt to other output files, as needed.
Generally, each "track" line should have at least the following key value pairs: - name : label for the track - description : defines the center lable displayed - type : BAM, BED, bigBed, bigWig, etc. - bigDataUrl : URL of the data file - for other options see here
# input arguments
+sample_list_input=/"path/to/samples.txt"
+track_dir="/path/to/shared/dir/"
+track_output="/path/to/output/file/tracks.txt
+peak_list=("norm.relaxed.bed" "narrowPeak" "broadGo_peaks.bed" "narrowGo_peaks.bed")
+method_list=("fragmentsbased")
+dedup_list=("dedup")
+
+# read sample file
+IFS=$'\n' read -d '' -r -a sample_list < $sample_list_input
+
+run_sample_tracks (){
+ sample_id=$1
+ dedup_id=$2
+
+ # sample name
+ # eg siNC_H3K27Ac_1.dedup.bigwig
+ complete_sample_id="${sample_id}.${dedup_id}"
+
+ # set link location
+ link_loc="${track_dir}/bigwig/${complete_sample_id}.bigwig"
+
+ # echo track info
+ echo "track type=bigWig bigDataUrl=https://hpc.nih.gov/~CCBR/ccbr1155/${dir_loc}/bigwig/${complete_sample_id}.bigwig name=${sample_id} description=${sample_id} visibility=full autoScale=off maxHeightPixels=128:30:1 viewLimits=1:120 color=65,105,225" >> $track_output
+}
+
+# iterate through samples
+# at the sample level only DEDUP matters
+for sample_id in ${sample_list[@]}; do
+ for dedup_id in ${dedup_list[@]}; do
+ run_sample_tracks $sample_id $dedup_id
+ done
+done
+
It's recommended to create a text file of all sample track information to ease in editing and submission to the UCSC browser website. A single line of code is needed for each contrast which will provide the track location, contrast name, file type, and whether to color the sample. An example is provided below.
track name=${sample_id}_${peak_type} bigDataUrl=https://hpc.nih.gov/~CCBR/ccbr1155/${dir_loc}/bigbed/${complete_sample_id}_fragmentsbased_diffresults.bigbed type=bigBed itemRgb=On
+
Users may find it helpful to create a single script which would create this text file for all contrasts. An example of this is listed below, which assumes that input files were generated using the CARLISLE pipeline. It can be edited to adapt to other output files, as needed.
# input arguments
+sample_list_input=/"path/to/samples.txt"
+track_dir="/path/to/shared/dir/"
+track_output="/path/to/output/file/tracks.txt
+peak_list=("norm.relaxed.bed" "narrowPeak" "broadGo_peaks.bed" "narrowGo_peaks.bed")
+method_list=("fragmentsbased")
+dedup_list=("dedup")
+
+# read sample file
+IFS=$'\n' read -d '' -r -a deg_list < $sample_list_input
+
+run_comparison_tracks (){
+ peak_type=$1
+ method_type=$2
+ dedup_type=$3
+ sample_id=$4
+
+ # sample name
+ # eg siSmyd3_2m_Smyd3_0.25HCHO_500K_vs_siNC_2m_Smyd3_0.25HCHO_500K__no_dedup__norm.relaxed
+ complete_sample_id="${sample_id}__${dedup_type}__${peak_type}"
+
+ # set link location
+ link_loc="${track_dir}/bigbed/${complete_sample_id}_${method_type}_diffresults.bigbed"
+
+ # echo track info
+ echo "track name=${sample_id}_${peak_type} bigDataUrl=https://hpc.nih.gov/~CCBR/ccbr1155/${dir_loc}/bigbed/${complete_sample_id}_fragmentsbased_diffresults.bigbed type=bigBed itemRgb=On" >> $track_info
+}
+
+# iterate through samples / peaks / methods / dedup
+for sample_id in ${deg_list[@]}; do
+ for peak_id in ${peak_list[@]}; do
+ for method_id in ${method_list[@]}; do
+ for dedup_id in ${dedup_list[@]}; do
+ run_comparison_tracks $peak_id $method_id $dedup_id $sample_id
+ done
+ done
+ done
+done
+
Users can also change the colors of the tracks using standard HTML color features. Common colors used are provided below:
Red=205,92,92
+Blue=65,105,225
+Black=0,0,0
+
Biowulf/Helix hosts its own instance of the UCSC Genome Browser which is behind the NIH firewall.
TIP: Unindexed file formats like `bed` and `gtf` take significantly longer to load in the Genome Browser, and it is recommended to convert them to indexed formats like `bigBed` and `bigWig` prior to adding them to your session.
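As a sketch of such a conversion (the file names are placeholders; `fetchChromSizes` and `bedToBigBed` ship with the `ucsc` module, and the BED must be coordinate-sorted first):

```bash
module load ucsc

# fetch chromosome sizes for the assembly of interest (hg38 as an example)
fetchChromSizes hg38 > hg38.chrom.sizes

# bedToBigBed requires coordinate-sorted input
sort -k1,1 -k2,2n peaks.bed > peaks.sorted.bed
bedToBigBed peaks.sorted.bed hg38.chrom.sizes peaks.bb
```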
The UCSC Genome Browser allows for visualization of genomic data in an interactive and shareable format. Users must create accounts with their NIH credentials and have an active Biowulf account to create the tracks. In addition, users have to be connected to VPN in order to view and create the tracks. Once bigwig files are generated and stored in a shared data location, genomic tracks can be edited and permanent links created, accessible for collaborators to view.
{"use strict";/*!
+ * escape-html
+ * Copyright(c) 2012-2013 TJ Holowaychuk
+ * Copyright(c) 2015 Andreas Lubbe
+ * Copyright(c) 2015 Tiancheng "Timothy" Gu
+ * MIT Licensed
+ */var Va=/["'&<>]/;qn.exports=za;function za(e){var t=""+e,r=Va.exec(t);if(!r)return t;var o,n="",i=0,a=0;for(i=r.index;i