New GeometricSeqPhenotype class #22

Merged
merged 65 commits into from
Jun 20, 2023
@thienktran thienktran commented Aug 14, 2022

Description

GeometricSeqPhenotype is a subclass of GeometricPhenotype. Along with the parameters that determine the position of the object in Euclidean space, it stores the nucleotide sequence and the number of epitope and non-epitope mutations in fields. Only the nucleotide sequence is stored, to improve space and time complexity; the translation of a nucleotide sequence to a protein sequence can be done after the simulation. However, this does not mean amino acids don't play an important role in our model. For each site in the sequence, a matrix of vectors is precomputed before the simulation runs. The vectors are drawn from a gamma distribution whose parameters can be changed in parameters.yml. The number of epitope sites, which ranges from 0 to (the starting sequence length / 3), is set in that file as well. Epitope sites remain the same until the program exits. The matrices for epitope sites hold vectors drawn from a gamma distribution with different parameters than those for non-epitope sites.
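
To make the per-site matrix concrete, here is a minimal sketch of how one could be built and queried. It is not the code in this PR: the class name `SiteMatrixSketch`, the 2D layout `{dx, dy}`, the ordering of the amino-acid string, and the choice to leave the diagonal at zero are all assumptions made for illustration. The step size and angle are passed in as suppliers so the sketch does not depend on any particular gamma sampler.

```java
import java.util.function.DoubleSupplier;

final class SiteMatrixSketch {
    // Assumed ordering of the 20 amino acids; the real Parameters.AMINO_ACIDS may differ.
    static final String AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY";

    // vectors[m][n] = {dx, dy} for a wild type at index m mutating to index n.
    private final double[][][] vectors;

    SiteMatrixSketch(DoubleSupplier stepSize, DoubleSupplier theta) {
        int k = AMINO_ACIDS.length();
        vectors = new double[k][k][2];
        for (int m = 0; m < k; m++) {
            for (int n = 0; n < k; n++) {
                if (m == n) {
                    continue; // assumption: no amino-acid change means no movement
                }
                double r = stepSize.getAsDouble();
                double t = theta.getAsDouble();
                vectors[m][n][0] = r * Math.cos(t);
                vectors[m][n][1] = r * Math.sin(t);
            }
        }
    }

    // Returns a copy of the precomputed displacement for wildType -> mutant.
    double[] displacement(char wildType, char mutant) {
        int m = AMINO_ACIDS.indexOf(wildType);
        int n = AMINO_ACIDS.indexOf(mutant);
        if (m < 0 || n < 0) {
            throw new IllegalArgumentException("Unknown amino acid: " + wildType + " or " + mutant);
        }
        return vectors[m][n].clone();
    }
}
```

Because the matrix is filled once up front, a lookup during the simulation is a constant-time array access rather than a fresh random draw.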

mutate()

This method randomly selects an index in the nucleotide sequence to mutate. Based on the transition-transversion ratio, a cumulative sum distribution array is used to determine what the nucleotide will mutate to. If the mutant codon is a stop codon, the process repeats, starting again with randomly selecting an index to mutate. Once a valid mutation occurs, where the object moves in space is determined by the vector in the site's matrix at entry (m, n), where m and n are the indices of the wild type and mutant amino acids in Parameters.AMINO_ACIDS, respectively. The current object (this) is not updated. Instead, a new GeometricSeqPhenotype is created with the updated nucleotide sequence, new position parameters, and updated counts of epitope and non-epitope mutations. There must be a mutation if this method is called, so the nucleotide sequence must be different, and exactly one of the two counts (epitope mutations or non-epitope mutations) must increase by 1. However, the position parameters might not change, since a nucleotide mutation does not necessarily cause a mutation in the protein sequence.
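
The selection step described above can be sketched as follows. This is an illustration under stated assumptions, not the method in this PR: the class and method names are hypothetical, the ratio is interpreted as P(transition) / P(any transversion), and the sequence length is assumed to be a multiple of 3.

```java
import java.util.Random;

final class NucleotideMutationSketch {
    // A transition stays within purines (A, G) or within pyrimidines (C, T).
    static char transitionOf(char n) {
        switch (n) {
            case 'A': return 'G';
            case 'G': return 'A';
            case 'C': return 'T';
            case 'T': return 'C';
            default: throw new IllegalArgumentException("Not a nucleotide: " + n);
        }
    }

    static char[] transversionsOf(char n) {
        return (n == 'A' || n == 'G') ? new char[] {'C', 'T'} : new char[] {'A', 'G'};
    }

    // Cumulative sums over {transition, first transversion, second transversion}.
    static double[] cumulativeDistribution(double ratio) {
        double pTransition = ratio / (ratio + 1.0);
        double pEachTransversion = (1.0 - pTransition) / 2.0;
        return new double[] {pTransition, pTransition + pEachTransversion, 1.0};
    }

    // u is a uniform draw in [0, 1).
    static char mutate(char wildType, double[] cumulative, double u) {
        if (u < cumulative[0]) {
            return transitionOf(wildType);
        }
        char[] tv = transversionsOf(wildType);
        return u < cumulative[1] ? tv[0] : tv[1];
    }

    static boolean isStopCodon(String codon) {
        return codon.equals("TAA") || codon.equals("TAG") || codon.equals("TGA");
    }

    // Redraws the site and the mutant nucleotide until the affected codon is not a stop codon.
    static String mutateSequence(String sequence, double ratio, Random rng) {
        double[] cumulative = cumulativeDistribution(ratio);
        while (true) {
            int index = rng.nextInt(sequence.length());
            char mutant = mutate(sequence.charAt(index), cumulative, rng.nextDouble());
            String candidate = sequence.substring(0, index) + mutant + sequence.substring(index + 1);
            int codonStart = index - (index % 3);
            if (!isStopCodon(candidate.substring(codonStart, codonStart + 3))) {
                return candidate; // a new string; the input sequence is left untouched
            }
        }
    }
}
```

Returning a new string mirrors the behaviour described above, where the original phenotype is left unchanged and a new object carries the mutated sequence.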

Closes #17

Tests

GeometricSeqPhenotype’s representation invariant, a condition that must be true throughout an object’s existence, is verified throughout an entire simulation using the debug flag in GeometricSeqPhenotype.java. Whenever a method of the object is called, the representation invariant is checked to confirm that the nucleotide sequence doesn’t change in length (point mutation rather than frameshift mutation) or contain any stop codons.
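
A representation-invariant check of this kind is commonly written as a private `checkRep()` method guarded by a debug flag. The sketch below shows the pattern only; the class name, field names, and exception type are hypothetical and are not taken from GeometricSeqPhenotype.java.

```java
final class RepInvariantSketch {
    // Set to false to skip the checks in production runs.
    static final boolean DEBUG = true;

    private final String nucleotideSequence;
    private final int expectedLength; // length of the starting sequence

    RepInvariantSketch(String nucleotideSequence, int expectedLength) {
        this.nucleotideSequence = nucleotideSequence;
        this.expectedLength = expectedLength;
        checkRep();
    }

    String getSequence() {
        checkRep();
        return nucleotideSequence;
    }

    private static boolean isStopCodon(String codon) {
        return codon.equals("TAA") || codon.equals("TAG") || codon.equals("TGA");
    }

    // Throws if the sequence changed length or contains a stop codon.
    private void checkRep() {
        if (!DEBUG) {
            return;
        }
        if (nucleotideSequence.length() != expectedLength || expectedLength % 3 != 0) {
            throw new IllegalStateException("Sequence length changed or is not a multiple of 3");
        }
        for (int i = 0; i < nucleotideSequence.length(); i += 3) {
            if (isStopCodon(nucleotideSequence.substring(i, i + 3))) {
                throw new IllegalStateException("Stop codon at nucleotide index " + i);
            }
        }
    }
}
```

Calling `checkRep()` at the start of every public method (and at the end of every constructor) is what lets the check run throughout a simulation at a cost that can be switched off.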

The JUnit tests in TestGeometricSeqPhenotype.java are used to make sure the constructors and methods of GeometricSeqPhenotype.java are working as expected:

  • GeometricSeqPhenotype()
  • getTraits()
  • getSequence()
  • distance()
  • mutate()
  • riskOfInfection()
  • toString()

The values returned by the constructors and methods are tested against results calculated by hand. These JUnit tests only check for logic errors. For example, testMutate() makes sure that the nucleotide at the given site in the nucleotide sequence mutates to the given nucleotide. It does not test the ratio of the number of transitions to the number of transversions.

In order to do sanity checks that Antigen is actually taking biology and statistics into account, we instead have to visualize data from the simulation separately.

Transition-Transversion Ratio

The transition-transversion ratio can be specified in parameters.yml. The default value is 5.0. Each nucleotide mutation in a simulation is recorded and saved in a CSV file, mutations.csv. mutations.csv has one column with the following format: XY, where X is the wild type nucleotide and Y is the mutant nucleotide.

The graph below shows the frequency of each possible nucleotide mutation. Notice that transitions occur more frequently than transversions. The calculated ratio is 5.087. It’s not exactly 5.0 for a couple of reasons: the starting sequence doesn’t contain the same number of each nucleotide, and a mutation that produces a stop codon is discarded and the site is mutated again.
image
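
For reference, a ratio like the one quoted above can be recomputed from the XY records with a few lines of Java. This is a sketch, not the script used to produce the graph: the class name is hypothetical, and it assumes each record is exactly two characters from A, C, G, T with X different from Y. Reading mutations.csv from disk is left out.

```java
import java.util.List;

final class TsTvRatioSketch {
    static boolean isPurine(char n) {
        return n == 'A' || n == 'G';
    }

    // A record "XY" is a transition when X and Y are both purines or both pyrimidines.
    static boolean isTransition(String record) {
        if (record.length() != 2 || record.charAt(0) == record.charAt(1)) {
            throw new IllegalArgumentException("Expected XY with X != Y, got: " + record);
        }
        return isPurine(record.charAt(0)) == isPurine(record.charAt(1));
    }

    // Number of transitions divided by number of transversions.
    static double ratio(List<String> records) {
        long transitions = 0;
        long transversions = 0;
        for (String record : records) {
            if (isTransition(record)) {
                transitions++;
            } else {
                transversions++;
            }
        }
        if (transversions == 0) {
            throw new ArithmeticException("No transversions recorded; ratio is undefined");
        }
        return (double) transitions / transversions;
    }
}
```

For example, the records AG, GA, CT, TC, AG, AC contain five transitions and one transversion, so `ratio` returns 5.0.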

Epitope and Non-epitope sites ~Gamma

The effects of an epitope or non-epitope mutation can be specified in parameters.yml. For each amino acid site’s corresponding matrix of vectors, the mutation notation, size of vector, and theta of each entry are recorded and saved in a CSV file, test/valuesGammaDistribution/0_siteX.csv where X is the amino acid site number.

The graph for non-epitope sites (meanStep: 0.0001 and sdStep: 0.0001) shows that mutations in non-epitope sites don’t move the phenotype very far in antigenic space.
image

The graphs for epitope sites (meanStep: 2.0 and sdStep: 1.0) are slightly different for each amino acid site, which is what we want. All epitope site distributions are consistent with one another. (The orange line is the distribution used in the original Antigen and is shown for reference.)
image
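
To relate the meanStep and sdStep values quoted in this section to a gamma distribution, the sketch below converts a mean and standard deviation to shape and scale by moment matching (shape = mean² / sd², scale = sd² / mean) and draws step sizes with the Marsaglia-Tsang method. Both choices are assumptions made for illustration; this PR does not state how the parameters are mapped or which sampler is used, and the class name is hypothetical.

```java
import java.util.Random;

final class GammaStepSketch {
    // Moment matching: a gamma with this shape and scale has the given mean and sd.
    static double[] shapeAndScale(double meanStep, double sdStep) {
        double shape = (meanStep * meanStep) / (sdStep * sdStep);
        double scale = (sdStep * sdStep) / meanStep;
        return new double[] {shape, scale};
    }

    // Marsaglia-Tsang sampler; valid for shape >= 1 only.
    static double sampleGamma(double shape, double scale, Random rng) {
        if (shape < 1.0) {
            throw new IllegalArgumentException("This sketch only handles shape >= 1");
        }
        double d = shape - 1.0 / 3.0;
        double c = 1.0 / Math.sqrt(9.0 * d);
        while (true) {
            double x;
            double v;
            do {
                x = rng.nextGaussian();
                v = 1.0 + c * x;
            } while (v <= 0.0);
            v = v * v * v;
            double u = rng.nextDouble();
            // Cheap squeeze test first, then the exact acceptance test.
            if (u < 1.0 - 0.0331 * x * x * x * x
                    || Math.log(u) < 0.5 * x * x + d * (1.0 - v + Math.log(v))) {
                return d * v * scale;
            }
        }
    }
}
```

Under this mapping, meanStep 2.0 with sdStep 1.0 gives shape 4.0 and scale 0.5, while meanStep 0.0001 with sdStep 0.0001 gives shape 1.0 (an exponential distribution) with a scale of about 0.0001, which matches the very small movements reported for non-epitope sites.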

Checklist:

  • The code uses informative and accurate variable and function names
  • The functionality is factored out into functions and methods with logical interfaces
  • Comments are up to date, document intent, and there are no commented-out code blocks
  • Commenting and/or documentation is sufficient for others to be able to understand intent and implementation
  • TODOs have been eliminated from the code
  • The corresponding issue number (e.g. #278) has been searched for in the code to find relevant notes
  • Documentation has been redeployed

(Sorry about all the nontrivial updates. Here’s a reminder on how to hide white space changes).


@zorian15 zorian15 left a comment

It was nice to refresh on this code -- nice job @thienktran !
General thoughts:

  • We should think about adding support for reading in fasta files or the like so users don't have to copy and paste the genetic sequence into a params.yml file.
  • In the eventual documentation for antigen-prime -- or even just now in the README -- we should add some details about the formats we expect for certain file parameters (e.g., the DMS file, epitope sites (if we support making that a file), etc.).

@zorian15 zorian15 merged commit 25ad694 into main Jun 20, 2023
@zorian15 zorian15 deleted the 17-geometric-sequence branch June 20, 2023 20:14
Development

Successfully merging this pull request may close these issues.

Combining GeometricPhenotype with SequencePhenotype