Add new F2 material
blue-moon22 committed Mar 15, 2022
1 parent bbdc163 commit b97b9a7
Showing 28 changed files with 319 additions and 3 deletions.
Binary file modified .DS_Store
Binary file not shown.
1 change: 1 addition & 0 deletions .gitignore
@@ -0,0 +1 @@
.DS_Store
35 changes: 33 additions & 2 deletions F2/README.md
@@ -1,3 +1,34 @@
# F2
<h1 style="text-align:center"><span style="color:#246CAA; font-size:1.70em">Welcome to Fundamental 2</span></h1>

Coming soon!
This training is the second fundamental training module. The module is designed for participants with little or no prior experience of sequence data analysis. We recommend that you complete [Fundamental 1 (F1)](https://training.bactgen.sanger.ac.uk/#/F1/), or have experience using the tools covered in F1, before proceeding with this module. In F2, you will use online tools to analyse genome sequence data. For the most part, these are drag-and-drop web tools that take an input file (fastq or fasta) and analyse the genomic data. The output files can then be saved and used as input files for visualisation tools such as [Microreact](https://microreact.org/showcase) and [Phandango](https://github.com/jameshadfield/phandango/wiki). The tools and methods described here may not be suitable for large-scale data analysis, which will be addressed in the upcoming advanced-level bioinformatics courses A1 and A2.

# What this course covers
We begin by introducing you to fastq format data, which is the default output format of the majority of sequencers. Next, we show you how to assess the quality of sequence data, interpret base quality scores and spot contamination of sequence reads. The sequence reads will then be used to generate an assembly, which will serve as input for the downstream data analysis.

We will then introduce you to genotyping, which involves characterising bacterial strains based on their DNA sequence, including the presence of genes that encode phenotypes of interest such as antimicrobial resistance. You will then use a free online tool to genotype some example isolates. This includes identifying the bacterial species, performing multilocus sequence typing (MLST), identifying antimicrobial resistance (AMR) and assigning lineage using a clustering method. In the Fundamental 1 module we covered the basics of phylogenetics and the interpretation of phylogenetic trees. In this module we will build on that knowledge, and you will visualise the genotype data that you generate alongside a phylogeny in Microreact.

## Duration
Approximately 3 hours.

## Files for this module
Please find the required files for this module here.

## Support
There is a Slack channel available for you to ask questions and discuss your thoughts. From date1 to date2, members of the core GPS and Juno project teams will be available to answer your questions. Access to the Slack channel is only available to GPS and Juno project partners. If you are not involved in either project, you are welcome to use the training materials; however, no support will be provided. We will hold a ‘wrap-up’ session via webinar on 27th April to address any outstanding questions; please fill in the short questionnaire at the end of the module to let us know of any questions or topics that you would like to hear more about.

>**Educators**
<br/>Narender Kumar, Kate Mellor, Stephanie W. Lo, Victoria Carr, Uzma Khan, Jolynne Mokaya, Ana Ferreira, Gemma Murray.
>**Contributors**
<br/>Narender Kumar, Kate Mellor, Christine Boinett, Victoria Carr, Nil Shchelov, Jolynne Mokaya, Ana Ferreira, Gareth Peat, John Lees, Stephanie W. Lo, Dorota Jamrozy, Neil MacAlasdair, and Stephen Bentley.
>**Funding**
<br/>The training is provided as part of the [Juno](https://www.gbsgen.net/) and [GPS2](https://www.pneumogen.net/gps/) projects funded by [The Bill and Melinda Gates Foundation](https://www.gatesfoundation.org/).
>**Educators**
<br/>Christine Boinett (Lead educator), Stephanie W. Lo, Dorota Jamrozy and Stephen Bentley.
>**Contributors**
<br/>Christine Boinett, Gareth Peat, Stephanie W. Lo, Dorota Jamrozy, Neil MacAlasdair, Kate Baker and Stephen Bentley.
</br>&copy; [Wellcome Sanger Institute](https://www.sanger.ac.uk/)
9 changes: 8 additions & 1 deletion F2/_sidebar.md
@@ -1,3 +1,10 @@
[Home](/)

# [F2](/F2/)
# [F2](/F2/)

* [Welcome to the Module](/F2/)
* [Introduction](/F2/introduction.md)
* [Sequence quality](/F2/quality.md)
* [Assembly](/F2/assembly.md)
* [Genotyping](/F2/genotyping.md)
* [End of module](/F2/endF2.md)
45 changes: 45 additions & 0 deletions F2/assembly.md
@@ -0,0 +1,45 @@
<h1 style="text-align:center"><span style="color:#246CAA; font-size:1.5em">Assembly</span></h1>

In the previous section, we learnt how to assess the quality of raw reads. In this section we will use the reads to perform the process of assembly.

Genome assembly refers to the process of reconstructing the order of nucleotides in a genome from sequence reads. Assembly is required because the sequence reads generated by sequencers are much shorter than a genome, or even a gene. For example, the read length generated by Illumina sequencers ranges from 100 to 300 bases, while the average genome size of _S. pneumoniae_ is ~2 Mb. Traditionally, the assembly process involved joining the reads together, based on their overlapping sequence, into [contigs](https://en.wikipedia.org/wiki/Contig) (long continuous sequences). With the evolution of rapid sequencers, reads have become shorter and throughput has increased, making assembly more challenging. A number of bioinformatic tools, such as SPAdes and Velvet, have been developed to perform assembly; assemblers use one of three approaches: greedy, overlap-layout-consensus or de Bruijn graph. The details of these algorithms are beyond the scope of this module, but if you are interested in finding out more, please refer to [Advances in Genetics](https://www.sciencedirect.com/topics/agricultural-and-biological-sciences/genome-assembly), chapter 7.2 on sequence assembly. The majority of these tools are command-line based and require some command-line skills. However, there are some web-based tools that can take in sequence reads and perform assembly. Here we will be using Pathogenwatch, which was introduced in the [F1 module](https://training.bactgen.sanger.ac.uk/#/F1/), to perform assembly and assess assembly quality.
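
To make the idea of joining overlapping reads into a contig more concrete, the following is a minimal Python sketch of the greedy approach on three made-up toy reads. It is an illustration only: the function names, the toy reads and the `min_overlap` value are assumptions introduced here, and real assemblers such as SPAdes and Velvet use far more sophisticated methods that also handle sequencing errors, repeats and millions of reads.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Return the length of the longest suffix of `left` that is a prefix of `right`."""
    longest = min(len(left), len(right))
    for size in range(longest, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0  # no overlap of at least `min_overlap` bases


def greedy_assemble(reads: list[str], min_overlap: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences that share the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads, keep order
    while True:
        best_size, best_i, best_j = 0, -1, -1
        # Find the ordered pair (left, right) with the largest overlap.
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i == j:
                    continue
                size = overlap_length(left, right, min_overlap)
                if size > best_size:
                    best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # nothing left to merge
        # Join the two sequences, writing the shared bases only once.
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Hypothetical toy reads (far shorter than real 100 to 300 base reads).
toy_reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
print(greedy_assemble(toy_reads))  # ['ATGGCGTACCTGA']
```

In this toy case the three reads overlap by four bases each and collapse into a single contig. With real data, gaps in coverage and repeated regions mean that the reads usually assemble into many contigs rather than one.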

# Example: Assembly of the sequence reads

In this section we will assemble the sequence reads of the _S. pneumoniae_ isolate (21999_7#106).

Firstly, navigate to the [Pathogenwatch website](https://pathogen.watch/). Then click on the 'upload' button on the Pathogenwatch home dashboard (Figure 6). You will then be required to sign in.

![alt_text](img/figure6.png "Pathogenwatch")
**Figure 6: Landing page of Pathogenwatch website**

You can then upload your fastq data. We recommend that you watch the Pathogenwatch video ([sequence assembly](https://training.bactgen.sanger.ac.uk/#/F1/pathogenwatch), time stamp 0:54 onwards) to learn how to upload your reads and assemble the contigs.

Once the process is complete you can select “view genomes” (Figure 7) to see a tabulated summary of the uploaded genomes as shown in Figure 8.

![alt_text](img/figure7.png "Complete analysis")
**Figure 7: Completed analysis of the sequence data**

Now select the isolate you just uploaded (21999_7#106) (circle 1, Figure 8) and click on the “selected genomes” tab in the top-right corner (circle 2, Figure 8). To access the assembly statistics, click on the option “Download file” (circle 3, Figure 8).

![alt_text](img/figure8.png "Accessing results")
**Figure 8: Accessing the analysis results**

This will open another small window (Figure 9) listing many features of the isolate that Pathogenwatch has identified from the sequence reads. To access the assembly statistics, select the “stats” option (circle 1, Figure 9) and save the resulting .csv file on your computer.

![alt_text](img/figure9.png "Downloading the assembly statistics")
**Figure 9: Downloading the assembly statistics**

Open the .csv file you just downloaded, and you will see the assembly statistics as shown in Figure 10. Important metrics that reflect the quality of the assembly are highlighted by red boxes in Figure 10.

![alt_text](img/figure10.png "Assembly statistics of the sequence data")
**Figure 10: Assembly statistics of the sequence data**

A good assembly has fewer than 500 contigs, a total assembled length within the range expected for the species and a GC% that matches the species of interest. For _S. pneumoniae_, the assembled length (genome length) should be between 1.9 and 2.3 Mb. Since the example used here is _S. pneumoniae_, the number of contigs is &lt;500, the total length is between 1.9 and 2.3 Mb and the GC% is between 38.5 and 40.0, the assembled contigs are of good quality and can be used for further downstream analysis. A higher number of contigs (>500) could indicate possible contamination of the sequence reads. Additionally, an assembled genome size below the expected range could indicate low overall sequencing coverage or throughput, while a larger genome size could indicate contamination.
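
The quality criteria above can be written down as a short check. The Python sketch below is a hedged illustration, not part of Pathogenwatch: the `AssemblyStats` class, the function names and the example values are all hypothetical, and the values would need to be copied by hand from the downloaded .csv file, since the exact column names in that file are not assumed here. Only the thresholds (fewer than 500 contigs, 1.9 to 2.3 Mb, GC% of 38.5 to 40.0) come from the text.

```python
from dataclasses import dataclass


@dataclass
class AssemblyStats:
    """Hypothetical container for the three metrics discussed above."""
    contig_count: int
    total_length: int  # assembled length in bases
    gc_percent: float


def summarise_contigs(contigs: list[str]) -> AssemblyStats:
    """Compute contig count, total length and GC% from contig sequences."""
    total = sum(len(c) for c in contigs)
    gc = sum(c.upper().count("G") + c.upper().count("C") for c in contigs)
    gc_percent = 100.0 * gc / total if total else 0.0
    return AssemblyStats(len(contigs), total, gc_percent)


def check_pneumo_assembly(stats: AssemblyStats) -> list[str]:
    """Return a list of warnings; an empty list means all criteria are met."""
    problems = []
    if stats.contig_count >= 500:
        problems.append("500 or more contigs: possible contamination")
    if stats.total_length < 1_900_000:
        problems.append("shorter than 1.9 Mb: possible low coverage or throughput")
    elif stats.total_length > 2_300_000:
        problems.append("longer than 2.3 Mb: possible contamination")
    if not 38.5 <= stats.gc_percent <= 40.0:
        problems.append("GC% outside 38.5 to 40.0: may not match S. pneumoniae")
    return problems


# Invented example values, not the real statistics of any isolate in this module.
good = AssemblyStats(contig_count=120, total_length=2_050_000, gc_percent=39.6)
poor = AssemblyStats(contig_count=650, total_length=2_600_000, gc_percent=42.1)
print(check_pneumo_assembly(good))  # []
print(check_pneumo_assembly(poor))  # three warnings
```

A warning from a check like this is a prompt to look more closely at the reads, for example for contamination, rather than proof that the assembly is unusable.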

><span style="color:#FC8E22; font-size:1.5em">Exercise 2</span>
<br/>**Now perform the assembly of the sequence reads for the isolates (21999_7#180, 28184_2#97 and 33816_1#103) provided to you and answer the following questions:**
<br/>1. Do all three isolates belong to _S. pneumoniae_?
<br/>2. Is the sequence assembly of good quality for all three isolates?
</br>&copy; [Wellcome Sanger Institute](https://www.sanger.ac.uk/)
7 changes: 7 additions & 0 deletions F2/endF2.md
@@ -0,0 +1,7 @@
<h1 style="text-align:center"><span style="color:#246CAA; font-size:1.70em">End of module</span></h1>

# Module Feedback

Please complete this short [Google Form](https://docs.google.com/forms/d/1mjLhb3JeLrYZsajrglXVJHW99z9viCQ1UJ8l773_Q18/edit) to let us know how you found this training module, what topics you would like to be covered in the ‘wrap-up’ webinar and to register your interest for advanced genomics training (in-person or online).

</br>&copy; [Wellcome Sanger Institute](https://www.sanger.ac.uk/)