Adding image assets and analysis tool pages #42

Merged
merged 3 commits into from
Dec 31, 2024
20 changes: 20 additions & 0 deletions docs/analysis-tools/abba.md
@@ -0,0 +1,20 @@

## ABBA

Given a set of interesting genes, do other genes have similar relationships to known sets of genes? For example, given a set of genes known to be related to drug abuse, what other genes share similar expression patterns in drug abuse gene sets? By answering this question, it becomes possible to elucidate under-studied or obfuscated genes that may play a role in complex phenotypes.

We have developed a new GeneWeaver tool to address this question, which we call __Anchored Biclique of Biomolecular Associations (ABBA)__. This tool takes advantage of the large volume of collected data and cross-species integration to find new genes for investigation.

The search begins with a user-provided list of genes of interest, such as highly-studied genes with known pathways and relationships. The database then finds any gene sets that contain at least N of the genes in the provided list. From the resulting list of gene sets, ABBA then isolates any genes that occur in at least M GeneSets but not in the initial list. These resulting genes share similar gene set overlap with the original input set, but may not have been previously considered in relation to the gene set of interest.
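
The search described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not GeneWeaver's implementation: the function name `abba_candidates`, the dictionary layout of `gene_sets`, and the toy data are all hypothetical.

```python
from collections import Counter


def abba_candidates(genes_of_interest, gene_sets, n, m):
    """Return (names of matching gene sets, candidate genes).

    gene_sets maps a gene set name to the set of genes it contains
    (a hypothetical in-memory layout, not the GeneWeaver schema).
    """
    seeds = set(genes_of_interest)

    # Step 1: keep gene sets that contain at least N of the input genes.
    matching = {
        name: genes
        for name, genes in gene_sets.items()
        if len(seeds & genes) >= n
    }

    # Step 2: for every gene outside the input list, count how many
    # matching gene sets contain it.
    counts = Counter(
        gene for genes in matching.values() for gene in genes - seeds
    )

    # Step 3: keep genes that occur in at least M of the matching gene sets.
    candidates = {gene for gene, count in counts.items() if count >= m}
    return sorted(matching), sorted(candidates)


# Toy data, made up for illustration only.
toy_sets = {
    "GS1": {"A", "B", "X", "Y"},
    "GS2": {"A", "B", "C", "X"},
    "GS3": {"A", "Z"},
    "GS4": {"B", "C", "X", "Y"},
}
matched, found = abba_candidates(["A", "B", "C"], toy_sets, n=2, m=2)
print(matched)  # gene sets sharing at least 2 input genes
print(found)    # non-input genes present in at least 2 of those sets
```

Raising either threshold shrinks the output: in the toy data, `m=3` leaves only the gene that appears in all three matching gene sets.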

!["ABBA applied to a set of 4 genes of interest"](../assets/images/abba.png)

In the above figure, the lighter nodes indicate less overlap. Using N=2 produces a collection of 37 GeneSets as of 7 July 2010. For brevity, only the top 5 results are shown above. With M=15, the following table lists genes in the result having similar relationships to the input set.


![](../assets/images/abba_2.png)


Without reasonable thresholds, the results quickly become overwhelming. As of this writing, a simple set of 4 genes of interest results in 555 GeneSets and over 38,000 genes in the candidate list. Increasing the input set to 7 genes of interest results in 983 GeneSets and almost 40,000 genes. Simply requiring gene sets to contain at least 3 of the input genes reduces the search space to 11 and 37 GeneSets, respectively.

![](../assets/images/abba_3.png)
91 changes: 91 additions & 0 deletions docs/analysis-tools/boolean-algebra.md
@@ -0,0 +1,91 @@
## Boolean Algebra

The Boolean Algebra Tool performs basic set operations on at least two Gene Sets.
Results are displayed as lists of genes belonging to one of three types of
set operations: Union, Intersect, and Symmetric Difference. Furthermore, results allow
users to quickly determine new relationships between Gene Sets and create a new Gene Set
based on set-derived findings.

### Using the Boolean Algebra Tool

Access the Boolean Algebra Tool through
the [Analyze Genesets](index.md#analyze-gene-sets-tab) tab, located in the left-hand
column and distinguished by the Venn diagram icon.

![](../assets/images/boolean_algebra_options.png)

To generate Boolean Algebra results, select either a Project of two or more Gene Sets or
at least two individual Gene Sets from a project. Next, select the appropriate Boolean
Algebra function. These functions are based on basic _Set Algebra_: **Union**,
**Intersection**, **Symmetric Difference**.

* **Union**: This tool generates a set of all genes found in any of the selected sets. It
removes duplicates by default. The results will display which homology mapping was used to
generate a gene entry.

This result shows the union of three Gene Sets, two mouse and one human.

![](../assets/images/boolean_algebra_union.png)

* **Intersection**: This option will cause the Boolean tool to return the genes that the
selected Gene Set inputs have in common. It has an additional option (_"Genes must
intersect in at least X"_) that specifies the minimum number of Gene Sets in which a gene
must appear to be returned. If the minimum overlap is set to _3_, for example, only genes
found in 3 or more of the input Gene Sets will be returned. In addition, results are
divided into separate groups based on the number of Gene Sets in which the genes
intersect.

These three Gene Sets have 4 genes in common. All of them are homologs between mouse and
human.

![](../assets/images/boolean_algebra_intersect.png)

Changing the overlap to 2 created two sets of results, those in all 3 Gene Sets and
those in only 2 of the Gene Sets.

![](../assets/images/boolean_algebra_intersect3.png)

* **Symmetric Difference**: This tool will create a set of genes that are unique to the
Gene Sets selected as input. It effectively finds the Union of all Gene Sets minus the
intersection of those Gene Sets.

In this example, there is a result set of unique genes for each input Gene Set.

![](../assets/images/boolean_algebra_except.png)
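
The three operations can be illustrated with plain Python sets. This is a sketch only: the function names are hypothetical, homology mapping between species is ignored, and the _"at least X"_ option is assumed to count the number of Gene Sets a gene appears in, as the overlap-of-2 example above suggests.

```python
from collections import Counter


def union(gene_sets):
    """All genes found in any input set, with duplicates removed."""
    return set().union(*gene_sets)


def intersection_groups(gene_sets, at_least):
    """Group genes by how many input sets contain them (count >= at_least)."""
    counts = Counter(gene for genes in gene_sets for gene in genes)
    groups = {}
    for gene, count in counts.items():
        if count >= at_least:
            groups.setdefault(count, set()).add(gene)
    return groups


def unique_genes(gene_sets):
    """For each input set, the genes that appear in no other input set."""
    counts = Counter(gene for genes in gene_sets for gene in genes)
    return [{gene for gene in genes if counts[gene] == 1} for genes in gene_sets]


# Toy gene sets with placeholder gene names.
s1 = {"g1", "g2", "g3"}
s2 = {"g1", "g2", "g4"}
s3 = {"g1", "g3", "g5"}
inputs = [s1, s2, s3]

print(sorted(union(inputs)))
print(intersection_groups(inputs, at_least=2))
print(unique_genes(inputs))
```

With `at_least=2`, the toy data yields two groups, one for the gene shared by all three sets and one for genes shared by exactly two, which mirrors the grouped results shown in the screenshots.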

### Managing Results

A table located just below the circle overlap diagram and above the results
displays a broad survey of the genes included in the input Gene Sets, categorized by
species. It lists: _Genes Specific to Species_, _Genes In Common with at Least One Other
Species_, and _Total Number of Genes_. These values are based on the total number of
genes in the input sets, and may not specifically represent results. The table is
intended to help in choosing the species to which the results are mapped when
new Gene Sets are created.

![](../assets/images/boolean_algebra_table.png)

Genes returned by the Boolean Algebra tool can be added to new Gene Sets. To do this,
click on the **Create New Gene Set From Results** button for the group you want.

Since results can contain genes from a mixed set of species, a species must be selected
for mapping the genes in the new Gene Set.

![](../assets/images/boolean_algebra_select_species.png)

The standard Upload GeneSet page will open. The genes will be listed in the gene
information section. If no species is selected, no genes will be listed. You can now
edit any of the fields to change the Gene Set name, description, etc. Follow
the [Upload GeneSet](#uploading-gene-sets) procedure. It is also important to note that
very large gene lists may take a few moments to load, during which time the user may
experience a dimmed 'Loading' screen.

### Circle Overlap Diagram

If the user selects 10 or fewer Gene Sets, a gene overlap diagram will appear near the
top of the results page. The **Circle Overlap** representation is an approximation of
Euler fractional overlaps. It represents how the input genesets relate to each other. It
uses the same homology mapping as the Boolean Algebra tool to render the approximate
fractional overlap of the genes shared between each set.

![](../assets/images/bool_image.png)
239 changes: 239 additions & 0 deletions docs/analysis-tools/clustering.md
@@ -0,0 +1,239 @@
**Clustering**
==============


Why Use the Clustering Tool
----------

Clustering is one of the most powerful tools in bioinformatics. Where strict classifications are too rigid to capture distinctions in the data, clustering gives the user a more graded evaluation of how gene sets relate to one another.


### Using the Clustering Tool

1. Select the gene sets from your list of projects that you would like
to analyze.
- You need a minimum of 3 gene sets in total to run the tool.

2. Select whether homology is to be included or excluded.
- Homology is included by default.

3. Select the method of clustering.
- Average is the default method of clustering.
- There are five methods of clustering. They are listed in the
methods section.

### Understanding your Results

#### Visualization Types

There are two methods for visualizing your clustering results.

**Force Directed Graph**

![](../assets/images/Forced-directed-graph.png "fig:Forced-directed-graph.png")

- Tree representation of each cluster.
- Clear depiction of hierarchy.
- The most opaque node of a tree represents the cluster's root.

- Each node is classified as one of the following:
- **Cluster** - Grouping of gene sets
- The opacity of the nodes is based on the Jaccard Similarity of its children. The more similar the gene sets, the darker the cluster.
- On Hover: Reveals Jaccard Similarity of its child nodes. Reveals set notation of the containing hierarchy.

![](../assets/images/Cluster-onHover.png "fig:Cluster-onHover.png")

* On Click: Collapses (absorbs its children).


![](../assets/images/Cluster-onClick.png "fig:Cluster-onClick.png")

- **Gene Set** - A set of genes
- Colored based on the species contained in the gene set study.
- Sized based on the relative size of the gene set.
- On Hover: Reveals abbreviated gene set information.
- On Click: Reveals and cycles through genes in groups of ten.
- On Double Click: Opens a new page containing extensive gene set information.

- **Gene**
- On Hover: Reveals the name of the gene.

- **Edges**
- Connect each node to its children.
- The opacity of edges leading from a cluster node is based on that cluster node's Jaccard Similarity, following the same scale as above.

**Partitioned Sunburst**

![](../assets/images/Partitioned-sunburst.png "fig:Partitioned-sunburst.png")

- Top-down view of each tree.
- Center represents the root.
- Partitioned sub-circles represent clusters, gene sets, or genes.

- **Partition**
- Partitions are the equivalent of nodes in a tree.
- Each partition is classified as one of the following:
- **Cluster** - Grouping of gene sets
- On Hover: Reveals Jaccard Similarity of its child
partition and highlights all nodes within the cluster.
- On Right Click: Opens a new "View GeneSet Overlap" page
using all gene sets in the cluster as input.
- **Gene Set** - A set of genes
- Colored based on the species contained in the gene
set study.
- Drawn arc sizes are based on the relative size of the
gene set.

- On Right Click: Opens a new "View GeneSet Details" page for the
given gene set.
- **Rings**
- Each ring represents a level in the tree.
- The outermost levels are gene sets.
- The levels leading up to a gene set represent the hierarchy of
the cluster.


Clustering Methods
------------------

Listed below are the five clustering methods that the user can choose from
while running the tool. Each method runs on the selected genesets and
displays a force directed tree and a partitioned sunburst based on the
clustered genesets.

All five of the given clustering methods are agglomerative hierarchical
clustering methods that start with each geneset belonging to its own
cluster. They then combine the clusters at each iteration based on a
described linkage method that determines how the distance between two
clusters is defined. The clusters are combined until there are no more
clusters that are similar to each other (the distance between them is
too large).
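
Every method below starts from the Jaccard Similarity between pairs of genesets. The following sketch shows how that starting table could be computed. The function names and the `frozenset`-keyed dictionary are assumptions made for illustration, not the tool's internal data structures.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |A n B| / |A u B|; 0.0 when both sets are empty."""
    union_size = len(a | b)
    return len(a & b) / union_size if union_size else 0.0


def pairwise_similarity(gene_sets):
    """Map each unordered pair of gene set names to its Jaccard similarity."""
    return {
        frozenset((x, y)): jaccard(gene_sets[x], gene_sets[y])
        for x, y in combinations(sorted(gene_sets), 2)
    }


# Toy gene sets with placeholder gene names.
toy_sets = {
    "GS1": {"a", "b", "c"},
    "GS2": {"b", "c", "d"},
    "GS3": {"x"},
}
similarity = pairwise_similarity(toy_sets)

# The pair with the highest similarity is the first candidate to be joined.
best_pair = max(similarity, key=similarity.get)
print(sorted(best_pair), similarity[best_pair])
```

The linkage methods differ only in how they score a pair of clusters once a cluster holds more than one geneset.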

### McQuitty

The McQuitty clustering method uses a linkage method where distance
depends on the combination of clusters instead of the individual
genesets within each cluster. When two clusters are joined together, the
distance of the new cluster to any other cluster is calculated as the
average distance between the two clusters that are being joined and the
other cluster. For example, if clusters 2 and 4 have the greatest
similarity and we are going to combine them into a new cluster called
2+4, then the distance from 2+4 to 1 is the average of the distances
from 2 to 1 and 4 to 1.

- **Algorithm**
- Each gene set is initialized as its own cluster.
- The initial similarity between each cluster is the Jaccard
Similarity of the two genesets.
- While we still have similar clusters:
- Clusters with highest similarity are clustered together.
- Calculates the similarity between the new cluster and all
the rest based on the McQuitty linkage method
- **Time Complexity**
- O(n^2^ log n)
- This method is the most time efficient.
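
The McQuitty update from the example above (joining clusters 2 and 4, then recomputing their relation to cluster 1) can be sketched as follows. The function name, the table layout, and the similarity values are hypothetical and chosen only to illustrate the averaging step.

```python
def mcquitty_merge(sim, a, b):
    """Merge clusters a and b in a similarity table using McQuitty linkage.

    sim maps frozenset({x, y}) to the similarity of clusters x and y.
    Returns a new table in which a and b are replaced by the tuple (a, b).
    """
    merged = (a, b)
    others = {c for pair in sim for c in pair} - {a, b}

    # Entries that involve neither a nor b are carried over unchanged.
    new_sim = {
        pair: value
        for pair, value in sim.items()
        if a not in pair and b not in pair
    }

    for c in others:
        # Plain average of the two old values, regardless of cluster sizes.
        new_sim[frozenset((merged, c))] = (
            sim[frozenset((a, c))] + sim[frozenset((b, c))]
        ) / 2
    return new_sim


# Made-up similarities between four clusters labelled 1 to 4.
sim = {
    frozenset((2, 4)): 0.75,
    frozenset((1, 2)): 0.5,
    frozenset((1, 4)): 0.25,
    frozenset((1, 3)): 0.125,
    frozenset((2, 3)): 0.25,
    frozenset((3, 4)): 0.5,
}
after = mcquitty_merge(sim, 2, 4)
print(after[frozenset(((2, 4), 1))])  # average of 0.5 and 0.25
```

Because each update only averages two existing values, the individual genesets inside a cluster never need to be revisited.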

### Ward

The Ward clustering method uses a linkage method where the distance
between two clusters is based on the Jaccard Similarity score
between them. When two clusters are joined together, the new cluster
takes the union of the genesets in the two clusters being joined and
uses that as its geneset. It then calculates the new geneset's
similarity score against the geneset of every other cluster, and that
score is used as the distance between the new cluster and each of the
other clusters.

- **Algorithm**
- Each gene set is initialized as its own cluster
- The initial distance between clusters is the Jaccard Similarity
score between each of the cluster's genesets
- While we have clusters that are similar to each other:
- Clusters with highest similarity are clustered together.
- The new cluster contains a geneset which is the union of its
children's genesets
- Recalculates the Jaccard Similarity score between the new
cluster and all the other clusters
- **Time Complexity**
- O(n^3^)
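
A minimal sketch of the union-based update described in this section is shown below. It follows the description given here rather than any particular library's Ward implementation, and the helper names and toy clusters are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity |A n B| / |A u B|; 0.0 when both sets are empty."""
    union_size = len(a | b)
    return len(a & b) / union_size if union_size else 0.0


def merge_by_union(clusters, a, b):
    """Replace clusters a and b with one cluster holding the union of their genes."""
    merged = {name: genes for name, genes in clusters.items() if name not in (a, b)}
    merged[f"{a}+{b}"] = clusters[a] | clusters[b]
    return merged


def similarities_to(clusters, name):
    """Jaccard similarity between cluster `name` and every other cluster."""
    return {
        other: jaccard(clusters[name], genes)
        for other, genes in clusters.items()
        if other != name
    }


# Toy clusters, each starting as a single gene set.
clusters = {
    "A": {"g1", "g2"},
    "B": {"g2", "g3"},
    "C": {"g1", "g3", "g4"},
}
clusters = merge_by_union(clusters, "A", "B")
print(similarities_to(clusters, "A+B"))
```

Unlike the matrix-reuse methods, this update recomputes a set intersection and union for every remaining cluster after each merge.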

### Complete

The Complete clustering method uses a linkage method where the distance
between two clusters is the lowest similarity score between any of the
genesets in one cluster compared to any of the genesets in the other
cluster. When two clusters are combined, the genesets within each of the
clusters are put into a new cluster. No new calculations are needed at
each iteration because we are simply reusing the similarity scores of
all the genesets compared to each other.

- **Algorithm**
- Each gene set is initialized as its own cluster.
- The similarity scores of all the genesets compared to each
other are saved in a matrix
- While we still have clusters that are similar:
- Determine which two clusters to join:
- The distance between two clusters is the lowest
similarity score between a geneset in one cluster and a
geneset in the other cluster
- The highest of these distances determines which two
clusters will be joined
- Combines the two clusters to create a new cluster that has
all the genesets that were present in the two children
clusters
- **Time Complexity**
- O(n^3^)
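
As an illustration of the Complete rule, the score between two clusters can be read straight from the saved geneset-to-geneset matrix. The function name, the table layout, and the values below are assumptions for the example.

```python
from itertools import product


def complete_linkage(sim, cluster_a, cluster_b):
    """Lowest pairwise similarity between members of the two clusters."""
    return min(sim[frozenset((x, y))] for x, y in product(cluster_a, cluster_b))


# Made-up similarities between gene sets A, B (one cluster) and C, D (another).
sim = {
    frozenset(("A", "C")): 0.5,
    frozenset(("A", "D")): 0.25,
    frozenset(("B", "C")): 0.75,
    frozenset(("B", "D")): 0.125,
}
print(complete_linkage(sim, ("A", "B"), ("C", "D")))  # weakest cross-cluster pair
```

Two clusters score highly under this rule only if every cross-cluster pair of genesets is similar.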

### Average

The Average clustering method uses a linkage method where the distance
between two clusters is the average similarity score between all of the
genesets in one cluster compared to all of the genesets in the other
cluster. When two clusters are combined, the genesets within each of the
clusters are put into a new cluster. No new calculations are needed at
each iteration because we are simply reusing the similarity scores of
all the genesets compared to each other.

- **Algorithm**
- Each gene set is initialized as its own cluster.
- The similarity scores of all the genesets compared to each
other are saved in a matrix
- While we still have clusters that are similar:
- Determine which two clusters to join:
- The distance between two clusters is the average
similarity score between every geneset in one cluster
and every geneset in the other cluster
- The highest of these distances determines which two
clusters will be joined
- Combines the two clusters to create a new cluster that has
all the genesets that were present in the two children
clusters
- **Time Complexity**
- O(n^3^)
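
The Average rule can be sketched in the same way, taking the mean over every cross-cluster pair in the saved matrix. The function name, the table layout, and the values are hypothetical.

```python
from itertools import product


def average_linkage(sim, cluster_a, cluster_b):
    """Mean pairwise similarity between members of the two clusters."""
    scores = [sim[frozenset((x, y))] for x, y in product(cluster_a, cluster_b)]
    return sum(scores) / len(scores)


# Made-up similarities between gene sets A, B (one cluster) and C, D (another).
sim = {
    frozenset(("A", "C")): 0.5,
    frozenset(("A", "D")): 0.25,
    frozenset(("B", "C")): 0.75,
    frozenset(("B", "D")): 0.125,
}
print(average_linkage(sim, ("A", "B"), ("C", "D")))  # mean of the four values
```

Averaging makes the score less sensitive to a single unusually high or low pair than the Single or Complete rules.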

### Single

The Single clustering method uses a linkage method where the distance
between two clusters is the highest similarity score between any of the
genesets in one cluster compared to any of the genesets in the other
cluster. When two clusters are combined, the genesets within each of the
clusters are put into a new cluster. No new calculations are needed at
each iteration because we are simply reusing the similarity scores of
all the genesets compared to each other.

- **Algorithm**
- Each gene set is initialized as its own cluster.
- The similarity scores of all the genesets compared to each
other are saved in a matrix
- While we still have clusters that are similar:
- Determine which two clusters to join:
- The distance between two clusters is the highest
similarity score between any geneset in one cluster and
any geneset in the other cluster
- The highest of these distances determines which two
clusters will be joined
- Combines the two clusters to create a new cluster that has
all the genesets that were present in the two children
clusters
- **Time Complexity**
- O(n^3^)
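
The sketch below puts the Single rule inside a complete merge loop. It is an illustration only: the function names are hypothetical, and `threshold` is an assumed cutoff standing in for the point at which clusters are no longer considered similar, which this page does not specify.

```python
from itertools import combinations, product


def single_linkage(sim, cluster_a, cluster_b):
    """Highest pairwise similarity between members of the two clusters."""
    return max(sim[frozenset((x, y))] for x, y in product(cluster_a, cluster_b))


def cluster(names, sim, threshold):
    """Join the most similar pair of clusters until none exceeds `threshold`."""
    clusters = [(name,) for name in names]  # each gene set starts alone
    merges = []
    while len(clusters) > 1:
        a, b = max(
            combinations(clusters, 2),
            key=lambda pair: single_linkage(sim, *pair),
        )
        score = single_linkage(sim, a, b)
        if score <= threshold:
            break  # remaining clusters are not similar enough to join
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, score))
    return clusters, merges


# Made-up similarity matrix for four gene sets.
sim = {
    frozenset(("A", "B")): 0.75,
    frozenset(("A", "C")): 0.5,
    frozenset(("A", "D")): 0.0,
    frozenset(("B", "C")): 0.25,
    frozenset(("B", "D")): 0.0,
    frozenset(("C", "D")): 0.125,
}
final_clusters, merges = cluster(["A", "B", "C", "D"], sim, threshold=0.2)
print(final_clusters)
print([score for _, _, score in merges])
```

Swapping `single_linkage` for a function that takes the lowest or the mean pairwise score turns the same loop into the Complete or Average method.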