Welcome to the course web page for Computational Approaches in Molecular Ecology. The course was previously taught as an experimental course titled Advanced Computing and Bioinformatics for Conservation Genomics. We will be covering computing, analysis and data-organization strategies for bioinformatics and analysis of high-throughput sequencing data for ecology, evolution, and conservation.
Modern high-throughput sequencing can provide extraordinary amounts of data, enabling researchers to tackle a wide range of questions and problems in ecology, evolution, conservation, and fisheries and wildlife management. Preparing and processing these data for use, however, requires multiple bioinformatic steps, and subsequent analysis of these large, complex data sets must rely on specialized computer programs. Mastering these skills presents a high bar for students originating from outside of computer science and related fields. At present, in many institutions, such skills are typically learned from peers within experienced laboratories, or through a variety of workshops. This course aims to comprehensively teach the computing and analytical skills necessary to use genomic data from high-throughput sequencing in the context of ecological research. During the first 2/3 of the course, the focus is on aligning DNA sequence data and identifying variants across multiple individuals. In the last 1/3 of the course we consider a series of case studies in how such data are used to make inference for applications in fisheries, wildlife, and conservation. Outside of the bioinformatic utilities that run within a Unix framework, emphasis is placed on using the R programming language and RStudio for project management and documentation.
The proposed course topics appear, by week, in the table below. Each week of the course is structured as a different chapter in the navigation panel on the left. In order to figure out what we are doing in the course each week, that will be the first place to check. The week's objectives, readings, and exercises will be listed there.
The course schedule is:
- Tuesdays 10:00-10:50 in BIO 133. Lecture
- Thursdays 10:00-11:50 in BIO 133. Discussion/Computer lab
Students are expected to bring a laptop to both lectures and labs.
Office hours: Wednesdays 10:00-11:30 AM.
Eric holds his office hours in the comfy chairs in the northeast corner of the 3rd floor of the Biology Building.
Although this course takes place in person, there will be video links for several remote students to connect. If you are in Fort Collins, you are expected to be in class, in person. The links are primarily available for our remote students; however, in-person students can take advantage of them on a limited basis, for example, if they are isolating with COVID, if the roads are incredibly icy and treacherous, etc.
For the remote sessions, we will use Google Meet. You can log in to https://meet.google.com/vwh-xbhs-tzi during class meeting times and/or office hours if you can't be here in person.
Upon successful completion of the course, students will be able to:
- Organize and execute a complex bioinformatic data-analysis project in a manner that makes it easily understood and reproduced by others.
- Describe the main data formats used in genomic analysis, and know how to generate and manipulate them.
- Work with a wide range of the bioinformatic tools available in the Unix environment and understand how to script these tools into pipelines for DNA sequence alignment, variant calling, and analysis.
- Understand how to break down complex genomic analysis projects into small, independent chunks and execute those using job arrays or a workflow management system on a high performance computing cluster.
- Perform a variety of computational analyses central to molecular ecology and conservation genetics.
Assessment will be based mostly on problem sets. These will sometimes require considerable time and thought, but they will be critical for solidifying the concepts and procedures in the course. Students will also be undertaking individual projects in which they apply the skills they have learned in the course to a data set relevant in some way to their own research or to an interesting question relevant to some existing data, after discussion with the instructors (see below). Finally, students are expected to actively engage in the reading material (and will be assessed on that with short quizzes) and to contribute to discussion and participation in the course, including (and most importantly) being helpful to one another in order to learn challenging material, together, in a supportive environment.
Assessment Component | Percentage of Grade |
---|---|
Problem sets | 45% |
Individual analysis projects | 30% |
Quizzes | 15% |
Class participation | 10% |
The schedule is subject to change as the semester proceeds, but this is what we are shooting for.
Week | Lecture Component | Lab Component |
---|---|---|
1 | Introduction, RStudio Projects, RMarkdown | git and GitHub, connecting to Alpine, srun interactive |
2 | Unix, directory structure, utilities | Unix, .bashrc, tmux, file transfer |
3 | sequence data, alignment conventions, FASTA, SAM | conda/mamba, Samtools (faidx), bwa, map one individual |
4 | Shell scripting, regular expressions, sed and awk | Shell scripting, bwa map everyone. Mark duplicates, samtools stats |
5 | High Performance Computing Clusters, SLURM | map, sort, compress, via sbatch and a job array |
6 | Snakemake | setting up Snakemake and running it on Alpine |
7 | Variant calling and genotype likelihoods, fundamental concepts | GATK, gVCFs, VCFs |
8 | Variant calling and genotype likelihoods | using ANGSD |
9 | The Coalescent and the site-frequency spectrum | Exploring the coalescent |
10 | Estimating the SFS with lcWGS | ANGSD, winsfs |
11 | Application of SFS: Fst in sliding windows | doing SFS in sliding windows. |
12 | PCA with lcWGS PCANGSD | PCA with PCANGSD |
13 | Genome wide association studies | ANGSD doAsso |
14 | Inbreeding, runs of homozygosity | bcftools roh |
The purpose of the individual projects is to allow the students to use many of the skills learned, and to gain experience in preparing a reproducible research project. Some students likely already have their own data sets that they are working on, but we expect that many will not. We will be able to provide data and interesting questions to tackle from our own research. Additionally, we will encourage students to take on related projects so that they can work together on different parts of a single question.
Today there was a campus closure due to weather, so students will have a little more self-guided prep to do for the next class.
RStudio Projects, RMarkdown, git/GitHub, and connecting to Alpine.
Click here for full details
RStudio and git preps:
- Ensure that you have a recent version of R, and the latest version of RStudio, installed on your laptop.
- Make sure that you have `git` installed on your laptop. Good instructions (possibly a little dated) can be found in Jenny Bryan's HappyGitWithR web book.
- "Introduce Yourself To Git" following the directions here
- If you don't already have one, get yourself an account on GitHub, and once you have successfully logged onto GitHub in your web browser, send your email account name to [email protected].
- Download the RStudio IDE Cheatsheet and study it, especially the "Version Control" section.
Rmarkdown preps:
- Download the RMarkdown cheatsheet and study it.
- In R, do
install.packages("bookdown")
That will trigger the download of a lot of other packages that are useful to have.
Connecting to Alpine (or another cluster) preps:
Click here for full details
Introductions:
- Introductions of the course and students
- Brief overview of tentative syllabus
- Example data (pre-introduction)
RStudio/GitHub:
- RStudio and Git Configs:
- Check git/GitHub connectivity, etc
- Create 2 RStudio projects, commit them and push them
- Create a con-gen-csu-githubname project and repo and push that up
- RMarkdown. Make sure everyone is compiling the template.
  - Get the repo with: `git clone https://github.com/eriqande/bioinf-rmarkdown-introduction`
- Introduce the assignment.
- Make sure everyone can log on to Alpine (or their respective Unix cluster)
  - Getting an account on Alpine for CSU students
  - Logging in if you already have an account: go to the page above and find the section on "Remote Login"
Click here for full details
**This will be described in class, and is due by Monday, January 22nd at 6 PM.**
For a video demonstration of the execution of these steps (and a better explanation of how to push commits to GitHub from RStudio) you can check out this 11 minute video on how to complete and turn in your "About Me" assignment.
- Create an RStudio project on your laptop called `con-gen-csu-githubusername`, where you replace `githubusername` with your actual GitHub name/handle.
- Create an empty GitHub repository named `con-gen-csu-githubusername` and push the contents of your RStudio project to it.
- Clone the repository https://github.com/eriqande/bioinf-rmarkdown-introduction, open the RStudio project and make sure you can knit the `about-me-example.Rmd` file, at least to HTML.
- Try knitting that to PDF or DOCX format as well.
- Copy the `about-me.Rmd` template file and the file `references.bib` from the repo to a directory called `001-about-me` in your own `con-gen-csu-githubname` project, and edit it to provide information about yourself, and to practice using RMarkdown.
- When you are done editing it and you have knitted it:
  - email the `about-me.html` file to [email protected] with subject line "About me!"
  - commit your `about-me.Rmd` to your con-gen-csu-githubname repo and push it up to GitHub.
Logging into Alpine; Git on Alpine; Basic Unix Stuff;
Click here for full details
- Read from the eca-bioinf-handbook from the beginning of [Chapter 4](https://eriqande.github.io/eca-bioinf-handbook/essential-unixlinux-terminal-knowledge.html#essential-unixlinux-terminal-knowledge) up to and including [Handling, Manipulating, and Viewing files and streams](https://eriqande.github.io/eca-bioinf-handbook/essential-unixlinux-terminal-knowledge.html#handling-manipulating-and-viewing-files-and-streams)
Click here for full details
For a video demonstration of the execution of these steps see the 9 minute video about logging into Alpine and getting setup with git and some SSH keys for GitHub.
- Logging into Alpine
ssh [email protected]@login11.rc.colorado.edu
# password is eidpassword,push
# (you have to add ,push to the end, then use the DUO app)
- Setting up `git` on Alpine:
git config --global user.name "Your Name"
git config --global user.email "[email protected]"
git config --global core.editor nano
- setting up SSH key pair on Alpine for GitHub
# if you already have ~/.ssh/id_ed25519 and ~/.ssh/id_ed25519.pub
# then you don't have to set these up, just go to the next step.
# If not, then it is simple, do this:
ssh-keygen -t ed25519 -C "FOR GITHUB"
# when prompted, save in default location and leave password
# blank by just hitting return.
- Copy the public key and put it on GitHub
cat ~/.ssh/id_ed25519.pub
# then copy it and go to GitHub->Settings->SSH and GPG keys
# and add the Key.
- Add a command to your `~/.bashrc` to start the ssh agent on CURC. Do `nano ~/.bashrc`, add the following line to the file, and then save it:
alias gitup='eval "$(ssh-agent -s)"; ssh-add ~/.ssh/id_ed25519'
- Source your `~/.bashrc` and then run the `gitup` command.
source ~/.bashrc
gitup
- Test your connection to GitHub:
ssh -T [email protected]
For a video demonstration of these steps, check out the 6.5 minute video on forking and cloning the class repository
Forking is the process of making a clone of somebody else's repository on GitHub in your own GitHub account. We will use this to deploy homework, moving forward.
The steps here are:
- Make sure you are signed into GitHub on your browser, then navigate to our course repository at https://github.com/eriqande/con-gen-csu.
- Find the "Fork" button in the upper right and click it. When the forking process is done, your browser should be at your own copy of the repository.
- `cd` to your `projects` directory. This is very important! Do the following, but replace `CSUeid` with your CSU eid.
cd /projects/[email protected]/
- Once again verify that you are in your projects directory:
pwd
- Clone your fork of the `con-gen-csu` repository to Alpine: Verify your browser is at your own fork of the repository, then, from the green "Code" button, get the SSH address for the repository and put it after `git clone` in the terminal. That will look like:
git clone [email protected]:YourGitHubHandle/con-gen-csu.git
with `YourGitHubHandle` actually being your true GitHub handle.
Click here for full details
This is due by 5 PM Friday, January 26, 2024
Here are the steps to start working on your homework. We will go over these in class. Don't start doing them until we have discussed it in class.
Here is a video that explains all the steps below: Video about running two clones of your repo and editing the homework file in RStudio, while doing the Unix commands on the cluster.
- Sync your fork of the course repository on GitHub with any changes I have made on the original
- Pull down any changes to the copy of your fork of the class repo that is on Alpine in your projects directory.
git pull origin main
# if that doesn't work you might need to do:
source ~/.bashrc
gitup
# and then try again
- You will be working on the command line on Alpine to test things out, but I think it might be easier to edit your homework files using RStudio. Also, doing this will help you get comfortable with having two separate clones of a repo---something that I end up doing quite a bit. So, I will walk you through that in the following.
- Clone your fork of the class repo onto your laptop and open it with RStudio.
- Copy the file `002-unix-intro/unix-intro-TEMPLATE.sh` to `002-unix-intro/unix-intro.sh`. The latter is the file that you will be modifying and ultimately committing and pushing back to your fork. Please be careful not to modify `002-unix-intro/unix-intro-TEMPLATE.sh`.
- Then start working through the homework in RStudio, making changes to `002-unix-intro/unix-intro.sh`. I will give an example of the first few on the video.
Instructions for submitting the homework
When you are done with everything, you are going to submit this by:
- In RStudio, making a new branch called `unix-intro` on your laptop clone of your fork of the repository.
- Committing the completed state of your homework to that branch.
- Pushing that commit on that branch to GitHub.
- Sending me a pull request from GitHub.
That is a lot of weird steps, so here is a video to see what I mean: Submitting the unix-intro homework.
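For anyone who prefers the terminal, the branch, commit, and push steps have command-line equivalents. The following is only a sketch under stated assumptions: it builds a throwaway repository in a temporary directory, and a local bare repository stands in for your fork on GitHub. The user name, email, and commit messages are made up for the demo. The pull request itself is still opened on the GitHub website.

```sh
# Throwaway setup (assumption for the demo): a local bare repo plays the
# role of your fork on GitHub, and a clone of it plays your laptop clone.
demo=$(mktemp -d)
git init -q --bare "$demo/fork.git"
git clone -q "$demo/fork.git" "$demo/clone" 2>/dev/null
cd "$demo/clone"
git config user.name "Demo Student"          # hypothetical identity
git config user.email "student@example.com"  # hypothetical identity
git symbolic-ref HEAD refs/heads/main        # make sure the first branch is main
mkdir -p 002-unix-intro
echo "# template" > 002-unix-intro/unix-intro-TEMPLATE.sh
git add 002-unix-intro
git commit -q -m "Add homework template"
git push -q -u origin main

# The homework steps themselves:
cp 002-unix-intro/unix-intro-TEMPLATE.sh 002-unix-intro/unix-intro.sh
git switch -q -c unix-intro                  # 1. make the new branch and switch to it
git add 002-unix-intro/unix-intro.sh         # 2. stage and commit the homework file
git commit -q -m "Complete unix-intro homework"
git push -q -u origin unix-intro             # 3. push the branch to the fork
```

In a real clone of your fork, only the last five commands would be needed, since the repository, the template file, and the `origin` remote already exist.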
Unix discussion period; installing mamba
Click here for full details
- Read the rest of Chapter 4 in the handbook from [Chapter 4.5 to the end](https://eriqande.github.io/eca-bioinf-handbook/essential-unixlinux-terminal-knowledge.html#unix-env)
- It is recommended to read the introductory section of the handbook in [Section 7.6](https://eriqande.github.io/eca-bioinf-handbook/working-on-remote-servers.html#installing-software-on-an-hpcc).
Click here for full details
Here are the steps to install mamba on your Alpine account. For a video of these steps, see: Installing mamba via miniforge
- Check to see if you have conda already. Just type `conda` at the command line:
conda
If that returns help information, then you already have conda.
- Check to see if you have mamba already:
mamba
If that returns help information, then you have mamba and there is not much to do. If you have conda but not mamba, then things are a little more complex. Conda is quite slow, and you really need to have mamba, but the cleanest way to do this is to entirely remove your conda installation. This can have unforeseen consequences, so talk to Eric about how to proceed.
If you have neither mamba nor conda, then proceed with installing mamba.
- Get onto the compile node on Alpine after logging in:
module load slurm/alpine
acompile
- Download mamba from miniforge. This can be done with a few shell commands:
curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
bash Miniforge3-$(uname)-$(uname -m).sh
- During the installation procedure, you will be prompted a few times for input. It is important to NOT PUT THE miniforge3 DIRECTORY in your home directory (which is the default). It should go in `/projects/[email protected]/miniforge3`, where you replace `csu_eID` with your actual CSU eid.
- At the end it asks if you want to update your shell profile to automatically activate conda. You want to type `yes`.
- Then log out of the `acompile` node by typing `ctrl-d`. Then log out of your login session with `ctrl-d`.
- Finally, log back into Alpine, and you should have conda/mamba.
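The "do I already have conda or mamba?" checks in the steps above can also be done without printing pages of help text. This is a minimal sketch using the standard `command -v` lookup. The function name `have_tool` is made up for illustration.

```sh
# Succeed (exit status 0) if the named command can be found by the shell.
have_tool() {
  command -v "$1" >/dev/null 2>&1
}

# Report on the two tools the installation steps care about.
for tool in conda mamba; do
  if have_tool "$tool"; then
    echo "$tool: found at $(command -v "$tool")"
  else
    echo "$tool: not found"
  fi
done
```

Running this once before installing and once after logging back in gives a quick before/after confirmation.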
Click here for full details
Due at the beginning of class on Tuesday January 30, 2024
Complete the questions in Eric's captioned video of Illumina sequencing.
These are questions you can see in the video, but they are in text format in the course repo to make it easier to complete.
Note that the readings which are prep for the following class are very relevant to these questions, so it is probably best to do them before or while watching the video:
- Chapters 16.1 through 16.3 in the handbook.
- Chapter 17.2 through 17.2.2, inclusive, then all of 17.3 in the Handbook.
Homework Directions:
For a video running through these steps, see: How to get and submit the Illumina sequencing homework.
- Sync your fork of the course repo on GitHub with the original course repo.
- On your laptop clone of your fork, in RStudio, make sure that you are on the main branch. Doing this might involve changing back to main from unix-intro.
- Pull any changes from your fork down to the main branch of your laptop clone of your fork.
- Once that is done, make a new branch called `illumina-seq`, and switch to it.
- Copy the file `assignments/003-illumina-sequencing-questions/illumina-seq-homework-TEMPLATE.md` to `assignments/003-illumina-sequencing-questions/illumina-seq-homework.md`.
- Add answers to `assignments/003-illumina-sequencing-questions/illumina-seq-homework.md` and save the file.
- When you are done, commit `assignments/003-illumina-sequencing-questions/illumina-seq-homework.md` to the `illumina-seq` branch.
- Push the `illumina-seq` branch back up to GitHub.
- Send me a pull request for your changes on the `illumina-seq` branch.
Sequencing technologies and FASTA and FASTQ format.
Click here for full details
Read:
- Chapters 16.1 through 16.3 in the handbook.
- Chapter 17.2 through 17.2.2, inclusive, then all of 17.3 in the Handbook.
Describing our course example data set and discussing trimming and batch effects
Click here for full details
Read:
- Thompson et al., 2020. A complex phenotype in salmon controlled by a simple change in migratory timing. Science (This is where our example data come from).
- Lou and Therkildsen. 2021. Batch effects in population... Mol Ecol Res (This is a classic paper on why you might want to trim your sequence data).
- Handbook section 7.4, up through and including 7.4.1. This is just an avuncular overview of how tmux works.
Click here for full details
- Start with a discussion of the starting point of branches, keeping `main` synced to `main` and then branching off of `main`.
- Also, discuss Illumina Seq Homework #6
- I will present a little about the Chinook salmon data.
- We will discuss the batch effects paper.
- We might encourage the Mac users to start getting tmux integrated into iTerm. They can do this by using the instructions in Handbook section 7.5, and following along with the video Setting up tmux integration with iTerm2 to access Alpine or any other remote server.
Click here for full details
- Read Bolger et al. 2014, the academic paper about Trimmomatic.
- Read the Trimmomatic manual (click the download link to get a proper PDF version). (If you already read this, then it is good to know, but it is no longer required reading.)
- Read Chen et al. 2018, the academic paper about the `fastp` preprocessor.
- Read the updated, 2023 paper about fastp: Chen 2023
- Read the manual for fastp that is the README on their GitHub page
Click here for full details
- Team quiz.
- Running fastp on some data on Alpine. Instructions are below, and you can check out the short video, Running fastp on one pair of fastq files on Alpine.
module load slurm/alpine
srun --partition atesting -t 2:00:00 --pty /bin/bash
# if you don't have a fastp environment already
mamba create -n fastp -c bioconda fastp
conda activate fastp
cd INTO_YOUR_CSU_CON_GEN_DIRECTORY
mkdir -p results/trimmed results/qc/fastp
fastp -i data/fastqs/DPCh_plate1_B10_S22_R1.fq.gz -I data/fastqs/DPCh_plate1_B10_S22_R2.fq.gz \
  -o results/trimmed/DPCh_plate1_B10_S22_R1.fq.gz -O results/trimmed/DPCh_plate1_B10_S22_R2.fq.gz \
  -h results/qc/fastp/DPCh_plate1_B10_S22.html -j results/qc/fastp/DPCh_plate1_B10_S22.json \
  --adapter_sequence=AGATCGGAAGAGCACACGTCTGAACTCCAGTCA --adapter_sequence_r2=AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
  --detect_adapter_for_pe \
  --cut_right --cut_right_window_size 4 --cut_right_mean_quality 20
- Transferring files from the cluster to your laptop. In order to view the html report from fastp, we can bring it to our laptop. How? There are many ways, but one easy GUI way, which is endorsed by CURC here, is to use FileZilla. The directions to do so are at the link above, and I made a short FileZilla video that walks through these steps:
# on alpine
cd ~  # change to your home directory
# make symbolic links to your projects and scratch directories
# to make it easy to get to them from your home directory
ln -s /projects/USERNAME projects
ln -s /scratch/alpine/USERNAME scratch
# Then, on your laptop, download and install FileZilla
# and follow the steps on the CURC page.
- Let's think about what it would take to do this for every pair of fastq files (which will get us to thinking about shell scripting and more!).
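As a starting point for that thinking, here is a hedged sketch of the core idea: loop over the read-1 files, derive the matching read-2 file and a sample name from each, and build the command. It assumes the `_R1.fq.gz` / `_R2.fq.gz` naming seen in the fastp example. The helper names and the empty stand-in files are made up for the demo, and the loop only prints commands (a dry run) rather than running fastp.

```sh
# Given a read-1 path, print the matching read-2 path
# (assumes files end in _R1.fq.gz and _R2.fq.gz).
r2_for() {
  printf '%s\n' "${1%_R1.fq.gz}_R2.fq.gz"
}

# Given a read-1 path, print the sample name (file name minus the suffix).
sample_of() {
  basename "$1" _R1.fq.gz
}

# Demo data: empty stand-in files in a temporary directory.
demo=$(mktemp -d)
mkdir -p "$demo/data/fastqs"
for s in sampleA sampleB; do
  touch "$demo/data/fastqs/${s}_R1.fq.gz" "$demo/data/fastqs/${s}_R2.fq.gz"
done

# Dry run: print one fastp command per pair instead of executing it.
for r1 in "$demo"/data/fastqs/*_R1.fq.gz; do
  r2=$(r2_for "$r1")
  s=$(sample_of "$r1")
  echo "fastp -i $r1 -I $r2 -o results/trimmed/${s}_R1.fq.gz -O results/trimmed/${s}_R2.fq.gz"
done
```

Printing the commands first makes it easy to check the file name logic before anything is actually run.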
Shell programming.
Click here for full details
- Read the eca-bioinf-handbook section from section [5.2](https://eriqande.github.io/eca-bioinf-handbook/shell-programming.html#the-structure-of-a-bash-script) through section 5.10, inclusive.
Click here for full details
- We will work together through the Shell Programming section. For this to work well for you, you will need to sync the main branch of your fork (on the GitHub website), and then pull that down into the `main` branch of your clone on your cluster:
git switch main
git pull origin main
- After the shell programming "hands-on" we will talk a little about Shell Scripts
Assignment due at beginning of class, Tuesday Feb. 13 (make a shell script to automate running fastp on multiple files)
Click here for full details
This assignment is about making a shell script to automate running fastp on the multiple files in `data/fastqs` in the course repository. Detailed instructions for the assignment are in the README for assignment 004.
And a short video showing all the steps (except actually doing the homework) is available here.
SLURM intro
Click here for full details
- Read about HPCCs and SLURM in the handbook. [Chapter 8, up through and including all of 8.2](https://eriqande.github.io/eca-bioinf-handbook/chap-HPCC.html)
Click here for full details
- We will work together through the [SLURM Intro](https://eriqande.github.io/con-gen-csu/nmfs-bioinf/slurm.html). This is about a cluster called SEDNA, but many of the principles are the same for Alpine---they both use SLURM.
SLURM: `sbatch` and slurm job arrays (applied to sequence alignment)
Click here for full details
- Read [Chapter 19](https://eriqande.github.io/eca-bioinf-handbook/alignment-of-sequence-data-to-a-reference-genome-and-associated-steps.html) of the handbook, up to, and including section 19.3.
- If you want to (i.e., if you are excited by mathematical notation) feel free to peruse some papers about `bwa`:
  + [BWA original paper](https://academic.oup.com/bioinformatics/article/25/14/1754/225615)
  + [BWA-mem Paper on Arxiv](https://arxiv.org/abs/1303.3997)
Click here for full details
- Quiz: Sync your fork; navigate to assignments/quizzes/read-groups-and-bwa-quiz.md; edit the file to turn `- [ ]` into `- [x]` where appropriate; commit changes to a new branch called `read-groups-quiz`; send Eric a pull request from that branch to `main`.
- We will be going through one section of the updated NMFS workshop notes:
  + [Submitting jobs with sbatch](https://eriqande.github.io/con-gen-csu/nmfs-bioinf/sbatch.html)
Dispatching jobs via slurm job arrays.
Click here for full details
- [Slurm Job Arrays](https://eriqande.github.io/con-gen-csu/nmfs-bioinf/slurm-arrays.html)
Click here for full details
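As a rough illustration of the idea behind a job array: every task in the array runs the same script, and each task uses its own index to pick one unit of work, for example one sample from a list. The sketch below is an assumption-laden toy, not the course's workflow. The sample names and the helper `sample_for_task` are made up, and the task index falls back to 1 so the script also runs outside of SLURM. Inside a real array job, SLURM sets `SLURM_ARRAY_TASK_ID` for each task.

```sh
# Hypothetical sample list, one sample per line.
sample_list=$(mktemp)
printf '%s\n' sampleA sampleB sampleC > "$sample_list"

# Print the sample found on line N of the list file.
# Usage: sample_for_task LIST_FILE N
sample_for_task() {
  sed -n "${2}p" "$1"
}

# SLURM provides SLURM_ARRAY_TASK_ID inside each array task; default to 1
# here so that the sketch can be tried on a laptop.
task_id=${SLURM_ARRAY_TASK_ID:-1}
sample=$(sample_for_task "$sample_list" "$task_id")
echo "array task $task_id would process $sample"
```

Submitted with something like `sbatch --array=1-3`, a script built this way would run three times, once for each line of the list.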
- We have an assignment due Friday, Feb 23, 2024. The instructions for it are [here](https://github.com/eriqande/con-gen-csu/tree/main/assignments/005-slurm-and-bwa-mem2/README.md). The instructions are not super explicit. This is a chance for everyone to go from a verbal description to the finished assignment.
Alignment; SAM/BAM formats; samtools
Click here for full details
- Read (and follow along with the computer if you are inclined) the entire [Alignments](https://eriqande.github.io/con-gen-csu/nmfs-bioinf/bioinf-formats.html#sambamfiles) section (i.e., all of section 9.4).
- Read (and follow along with the computer if you are inclined) the entire [Processing alignment output with `samtools`](https://eriqande.github.io/con-gen-csu/nmfs-bioinf/sequence-alignment.html#samtools) section (i.e., all of section 10.4).
Click here for full details
- We will be going through the readings and discussing them to cement those ideas.
Snakemake Concepts and Basics
Click here for full details
- Read the first full section of the [snakemake overview chapter](https://eriqande.github.io/con-gen-csu/nmfs-bioinf/snake.html)
Click here for full details
- This is all done in the form of quarto/revealjs [slides](https://eriqande.github.io/con-gen-csu/snake-slides.html#/section).
Click here for full details
- The explanation of the assignment is in the [README](https://github.com/eriqande/con-gen-csu/blob/main/assignments/006-simple-snakemake-maneuvers/README.md) of the assignment directory.
Continuing with Snakemake, and also using Open OnDemand on Alpine
Click here for full details
- Continuing through these: [slides](https://eriqande.github.io/con-gen-csu/snake-slides.html#/section).
More Snakemake
Click here for full details
- For people with access to Alpine, read the [Chapter on OpenOnDemand Browser and RStudio Server Access to Alpine](https://eriqande.github.io/con-gen-csu/nmfs-bioinf/open-on-demand-alpine.html), and definitely do the steps in there to get onto Open OnDemand on Alpine. I guarantee you _this will change your life_.
- For everyone, read [Snakemake-relevant Python for R Users](https://eriqande.github.io/con-gen-csu/nmfs-bioinf/snakemake-relevant-python.html). This is a minimal introduction to Python for people who speak R already. It will help you to make the most of Snakemake when writing your own workflows. Please work through the examples in your RStudio Python REPL and get familiar with Python if you have not yet used it much.
- Note that these are fairly newly-written sections. Please help me out by sending me a pull request to fix any typos/errors/illogical things. You can do this by clicking the GitHub "Edit this page" link at the bottom of the right column of the page. That will let you edit the page on your fork of the repo and then send a pull request for the edits. Special bonus points and recognition to whoever makes the most corrections.
Click here for full details
- Continuing through these: [slides](https://eriqande.github.io/con-gen-csu/snake-slides.html#/section)
More Snakemake. SLURM profiles, snakemake configuration files, and input functions
Click here for full details
- The [Snakemake Embellishments](https://eriqande.github.io/con-gen-csu/nmfs-bioinf/snakemake-embellishments.html) chapter.
Click here for full details
- We will be talking a lot about making snakemake talk to SLURM, and also a bit about YAML and tabular configuration of Snakemake workflows. Here are the [slides](https://eriqande.github.io/con-gen-csu/snake-embellish-slides.html#/section)
Assignment (due the first Tuesday after spring break at the start of class---a reading and voting assignment)
- First, read this excellent, and very thorough paper. The citation is: Pečnerová, P., Garcia-Erill, G., Liu, X., Nursyifa, C., Waples, R. K., Santander, C. G., ... & Hanghøj, K. (2021). High genetic diversity and low differentiation reflect the ecological versatility of the African leopard. Current Biology, 31(9), 1862-1871. The link will download the paper and the supplements, all in one document. Be sure to read through the supplement, and especially read through the STAR*Methods section.
- Once you have read through the paper, marvel at how thorough the authors were, and then think about which parts of the STAR*Methods were most interesting to you or most relevant to your own work.
- Sync your fork and then go to `assignments/007-vote-for-topics` and follow the directions in the README there.
Variant Calling
- We will just be working through select sections of this variant calling section
Hard filtering of GATK-produced VCF files; the disastrous decision by GATK's developer to violate the VCF specification
Reading
- The standard GATK recommendations for Hard filtering
- The saga of GATK versions calling missing data as reference homozygotes:
- Start with some messages I sent them: start here and read a few exchanges
- The above apparently inspired a GATK blog post
- Then not much happened apparently until people started upgrading their GATK versions and found many of their pipelines were broken. For that start reading the comments to the blog post that started coming in around summer of 2023 and Feb of 2024
Hands On Learning
The materials for the hands-on can be obtained in RStudio with the following steps:
if(!("usethis" %in% rownames(installed.packages()))) {
install.packages("usethis")
}
usethis::use_course("eriqande/ngs-genotype-models")
- You will need to answer “Yes” to a few questions. This will download an RStudio project and open it.
- From this RStudio project’s file browser, you can open the RMarkdown files, like: `001-allele-freq-estimation.Rmd`.
- If the message at the top of the file says you need some new packages, click the install option.
- Then click the “Run Document” button at the top middle of the source code file. This runs an interactive shiny program.
Read and do the activities in the Shiny Notebooks: `001-allele-freq-estimation.Rmd` and `002-genotype-likelihoods-from-reads.Rmd`.
There is nothing to turn in, but please read through these and play with them for a couple of hours.
- Summarizing everyone's votes on topics for the rest of the semester.
- The votes are summarized in syllabus-ranks.csv and leopard-ranks.csv
- Learning about VCF files and the format:
- Reading through this
- Playing around with and looking at: `data/vcf/all.vcf.gz` (i.e., `bcftools view data/vcf/all.vcf.gz | less -S`)
- Working on hard filtering of our example data.
- We will be applying hard filtering to our example Snakemake workflow.
Click here for full details
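To make the layout of a VCF file a little more concrete, here is a small sketch that writes a toy VCF and summarizes it with standard Unix tools. Everything in the toy file (positions, alleles, the sample names `fish1` and `fish2`) is invented for illustration, and the helper function names are made up. It relies only on general features of the format: meta lines start with `##`, the column header starts with `#CHROM`, sample genotypes begin in column 10, and a missing diploid genotype is written `./.`.

```sh
# Write a toy VCF (invented contents) to a temporary file.
vcf=$(mktemp)
printf '%s\n' \
  '##fileformat=VCFv4.2' \
  '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">' \
  > "$vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tfish1\tfish2\n' >> "$vcf"
printf 'chr1\t100\t.\tA\tG\t50\t.\t.\tGT\t0/1\t./.\n' >> "$vcf"
printf 'chr1\t200\t.\tC\tT\t60\t.\t.\tGT\t./.\t./.\n' >> "$vcf"
printf 'chr1\t300\t.\tG\tA\t70\t.\t.\tGT\t1/1\t0/0\n' >> "$vcf"

# Number of "##" meta lines in the header.
count_meta() { grep -c '^##' "$1"; }

# Number of variant records (lines that do not start with "#").
count_records() { grep -vc '^#' "$1"; }

# Number of missing genotype calls across all samples and records.
count_missing() {
  awk -F '\t' '
    !/^#/ {
      for (i = 10; i <= NF; i++) {
        split($i, f, ":")                    # GT is the first FORMAT field here
        if (f[1] == "./." || f[1] == ".|.") n++
      }
    }
    END { print n + 0 }' "$1"
}

echo "meta lines: $(count_meta "$vcf")"
echo "records:    $(count_records "$vcf")"
echo "missing GT: $(count_missing "$vcf")"
```

For real files, `bcftools` is the better tool. The point of the toy is only to show where the header, the records, and the per-sample genotype columns sit.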
- Read the chapter on bcftools. You can also run through the examples if you would like. We will be going over this and discussing it in class.
- I am traveling to Santa Cruz. Hopefully I can reserve the fishbowl. Colorado (and Montana), use the google video links at the top of this README.
- TURN IN A BRIEF SKETCH OF YOUR CLASS PROJECT. This doesn't have to be more than a couple of paragraphs, but you can also go into more detail if you want. Turn it in by emailing me at [email protected].
- In class:
- We will talk about projects.
- I will give everyone some reading on our analysis topics
- Work on your projects.
- Eric will post some readings.
- Eric is figure skating in front of a panel of judges.
- Work on your projects.
- Eric will post some readings.
- Eric is a scientist-in-residence at the mobile High Altitude Venue for Ecological Analysis, Genetics, and Statistics, mHAVEAGAS, working on parentage in admixed populations.
Before class, please read two papers. They are pretty mathematical, so it is OK if you don't get it all easily. You can skim the mathematical sections!
- Nielsen et al. 2012, SNP Calling, Genotype Calling, and Sample Allele Frequency Estimation from New-Generation Sequencing Data
- [Rasmussen et al. 2022. Estimation of site frequency spectra from low-coverage sequencing data using stochastic EM reduces overfitting, runtime, and memory usage](https://academic.oup.com/genetics/article/222/4/iyac148/6730749)
Eric will give a brief overview of the SFS and SFS estimation. We will explore these steps in Eric's workflow for estimating Fst from lcWGS data: https://github.com/eriqande/mega-lcwgs-pw-fst-snakeflow. See https://eriqande.github.io/con-gen-csu/nmfs-bioinf/sfs-fst.html
We will look at https://github.com/eriqande/mega-post-bcf-exploratory-snakeflows
Homework: Make a Repo for your project and share it with me. No huge data sets, but commit and push a README and a Skeleton. Then share it with Eric. That means making me a collaborator if it is a private repo. Otherwise just send me a link to it, (if it is public).
In class: following along with: https://eriqande.github.io/con-gen-csu/nmfs-bioinf/mega-post.html