Commit

Excluded few questions
vishnushiri Shyamsaisundar authored and vishnushiri Shyamsaisundar committed Jan 30, 2024
1 parent e00fbb6 commit d004565
Showing 2 changed files with 11 additions and 19 deletions.
14 changes: 6 additions & 8 deletions notes/Work_documented.md
Expand Up @@ -2,7 +2,7 @@
id: r423m96u71ix4pb458fk8u2
title: Work_documented
desc: 'This file contains all the steps done for the master thesis'
updated: 1705505300109
updated: 1706599860070
created: 1700240700998
---
# Objective
Expand Down Expand Up @@ -126,7 +126,7 @@ ten_country_mut_data/* \
## Finding positions under pressure (BIG GOAL)
The big goal is to find the positions under pressure. To obtain this, firstly the frequency of each position(RBD spike mutations in position 330-530)([[question on the position|Work_documented.possible_questions#7-aaccording-to-uniprot-the-rbd-region-in-spike--is-319-541aa]]) in the aa_substitution has to be first calculated and interpolated to get the daily data.
The big goal is to find the positions under pressure. To do this, the frequency of each position (RBD spike mutations in positions 330-530) ([[Question on the position|Work_documented.possible_questions#5-aaccording-to-uniprot-the-rbd-region-in-spike--is-319-541aa]]) in the aa_substitution must first be calculated and then interpolated to get daily data.
![frequency interpolation](assets/Pics/Frequency_interpolation.png)
- For each country, the mutation data from GISAID was used as the input.
Expand All @@ -139,9 +139,7 @@ The big goal is to find the positions under pressure. To obtain this, firstly th
> Frequency of pos_373 on 01-01-2022 = $\frac{\text{count of pos\_373 on 01-01-2022}}{\text{number of sequences on 01-01-2022}}$
- To have frequency data for every day from Jan 1 2022 to Oct 31 2023, linear [[interpolation|Glossary#interpolation]] is done using the approximate method. The result of the interpolation is added to the corresponding date in the country data_frame.
>- [[Question on combining daywise|Work_documented.possible_questions#4-where-could-combining-the-data-on-daily-basis-and-then-interpolating-them-to-get-the-missing-day-data-go-wrong]]
>- [[Question on week-day interpolation|Work_documented.possible_questions#5-if-data-is-combined-weekly-how-should-this-frequency-be-distributed-among-the-week-to-get-the-week-daily-interpolation]]
>- [[Question on the interpolation method|Work_documented.possible_questions#6-why-do-we-do-linear-interpolation-why-not-spline-interpolation]]
[[Question on interpolation|Work_documented.possible_questions#4-why-do-we-do-linear-interpolation-why-not-spline-interpolation]]
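The frequency calculation and the linear interpolation described above can be sketched as follows. This is a minimal, dependency-free Python illustration and not the code used in the thesis (the notes suggest the actual work was done in R). The function name `daily_frequency`, the dictionary inputs, and the choice to leave days outside the observed range empty are all assumptions made for this example.

```python
from datetime import date, timedelta


def daily_frequency(pos_counts, seq_totals, start, end):
    """Return {day: frequency} for every day from start to end (inclusive).

    pos_counts maps a date to the count of one position on that date.
    seq_totals maps a date to the number of sequences on that date.
    Assumption: a day with sequences but no count for the position has
    frequency 0. Days without any sequences are filled by linear
    interpolation between the nearest observed days. Days before the first
    or after the last observation are left as None.
    """
    observed = {
        d: pos_counts.get(d, 0) / n for d, n in seq_totals.items() if n > 0
    }
    known = sorted(observed)
    result = {}
    day = start
    while day <= end:
        if day in observed:
            result[day] = observed[day]
        else:
            before = [d for d in known if d < day]
            after = [d for d in known if d > day]
            if before and after:
                d0, d1 = before[-1], after[0]
                # Fraction of the way from d0 to d1, measured in days.
                w = (day - d0).days / (d1 - d0).days
                result[day] = observed[d0] + w * (observed[d1] - observed[d0])
            else:
                result[day] = None
        day += timedelta(days=1)
    return result


# Hypothetical counts: position seen in 2 of 10 sequences on Jan 1
# and in 6 of 10 sequences on Jan 5, with no data in between.
counts = {date(2022, 1, 1): 2, date(2022, 1, 5): 6}
totals = {date(2022, 1, 1): 10, date(2022, 1, 5): 10}
daily = daily_frequency(counts, totals, date(2022, 1, 1), date(2022, 1, 5))
```

In this toy input the three missing days fall on the straight line between the two observed frequencies, which is the behaviour linear interpolation is meant to give.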
## Computing the pressure on the position
Expand All @@ -151,15 +149,15 @@ $\\ P(pos,t)=\sum_{s=t_0}^{t}e^{-k[t-s]}\times f(pos,s) \\$
- Where $f(pos,s)$ is the frequency of the position at time $s$.
- $e^{-k[t-s]}$ is the discount factor: mutation frequencies that occurred $[t-s]$ days ago are discounted according to the half-life of neutralising antibodies.
- $k\sim \frac{\ln(2)}{45+14}$
- By suggestion of the Prof. the vector for discount factor was first computed. For this the date range for each of the country_df was found. If the difference between the sart day and the end day is 9 then [t-s] could be in the range 0-9. Hence with this as base the discount factor was computed for [t-s] ranging 0-[difference between the start day to end day in the dataframe]. All these values are stored in a vector.
- At the suggestion of the Prof., the vector of discount factors was computed first. For this, the date range of each country_df was found. If the difference between the start day and the end day is 9, then [t-s] can range from 0 to 9. On this basis, the discount factor was computed for [t-s] ranging from 0 to [difference between the start day and the end day in the dataframe]. All these values are stored in a vector.
- For the selected $t_0$ and $t$, the discount factor vector was sliced and matrix multiplied with the frequencies of the particular position over the period $t_0$ to $t$ to get the pressure on the position.
- This was done for all the RBD positions in a country and repeated for all the 10 countries.
- The output has two columns: the RBD position and the pressure on the position.
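The discount vector and the sliced multiplication described above can be sketched as follows. This is an illustrative Python version using only the standard library, not the thesis code. The names `discount_vector` and `pressure`, and the convention that days are addressed by integer index, are assumptions made for this example. It computes the discounted sum $\sum_{s=t_0}^{t} e^{-k[t-s]} f(pos,s)$ with $k = \ln(2)/(45+14)$.

```python
import math

# Decay rate from the notes: k ~ ln(2) / (45 + 14).
K = math.log(2) / (45 + 14)


def discount_vector(n_days, k=K):
    """Discount factors exp(-k * lag) for lag = 0 .. n_days."""
    return [math.exp(-k * lag) for lag in range(n_days + 1)]


def pressure(freqs, t0, t, discounts):
    """Discounted sum of daily frequencies from day index t0 to t (inclusive).

    freqs[s] is the frequency of one position on day index s.
    The frequency on day s is weighted by the discount for lag t - s,
    so the slice of the discount vector is reversed before multiplying.
    """
    window = freqs[t0:t + 1]
    weights = discounts[:t - t0 + 1][::-1]
    return sum(w * f for w, f in zip(weights, window))


# Hypothetical frequencies for one position over three days.
freqs = [0.0, 0.0, 1.0]
discounts = discount_vector(len(freqs) - 1)
p = pressure(freqs, 0, 2, discounts)
```

Computing the discount vector once per country and slicing it for each window avoids recomputing the exponentials for every position.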
## Masking
- The objective of masking is to find the exposed positions among the RBD spike positions.
- To know the exposed residues the solvent accessibility of each of the residues in the spike protein were found. [[question on mutations and solvent accessibility|Work_documented.possible_questions#8-if-a-rbd-spike-position-in-the-wildtype-is-occupied-by-a-hydrophobic-residue-and-it-is-replaced-by-hydrophilic-residue-the-solvent-accessibility-might-changes-right-due-to-possible-difference-in-the-fold-in-that-case-should-we-study-these-positions-in-each-of-the-voi]]
- To identify the exposed residues, the solvent accessibility of each residue in the spike protein was determined. [[Question on solvent accessibility|Work_documented.possible_questions#6-if-a-rbd-spike-position-in-the-wildtype-is-occupied-by-a-hydrophobic-residue-and-it-is-replaced-by-hydrophilic-residue-the-solvent-accessibility-might-change-probably-due-to-the-difference-in-the-fold--in-that-case-should-we-study-these-positions-in-each-of-the-voi]]
- To find the solvent accessibility of the protein, various tools were used. These can be found here: [[Work_documented.Finding_surface_residues]]
- From the output of each tool, the Spike RBD surface positions were found. This was direct for the output from GetArea and Netsurf3.0, but for DSSP, relative solvent accessibility was first computed from the absolute solvent accessibility in the dssp output file. Using this computed relative solvent accessibility, the surface residues were found.
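The DSSP step above can be sketched as follows: relative solvent accessibility is the absolute accessibility divided by a per-residue maximum, and a residue counts as exposed when that ratio reaches a cutoff. This Python sketch is only an illustration. The notes do not say which maximum-accessibility scale or cutoff was used, so the function name `surface_positions`, the cutoff of 0.25, and every number in the example are placeholders, not values from the thesis or from a published scale.

```python
def surface_positions(abs_asa, max_asa, threshold=0.25):
    """Return the sorted positions whose relative solvent accessibility
    (absolute ASA divided by the residue's maximum ASA) is >= threshold.

    abs_asa maps position -> (residue letter, absolute ASA).
    max_asa maps residue letter -> maximum ASA for that residue type.
    The threshold default is an assumption, not taken from the notes.
    """
    surface = []
    for pos in sorted(abs_asa):
        residue, asa = abs_asa[pos]
        rsa = asa / max_asa[residue]
        if rsa >= threshold:
            surface.append(pos)
    return surface


# Made-up residues and accessibility values, for illustration only.
example_abs_asa = {330: ("A", 80.0), 331: ("G", 20.0), 332: ("A", 25.0)}
example_max_asa = {"A": 100.0, "G": 200.0}  # placeholder scale
exposed = surface_positions(example_abs_asa, example_max_asa)
```

In practice the maximum values would come from whichever reference scale the analysis adopts, and the cutoff would be reported alongside the results.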
Expand All @@ -174,7 +172,7 @@ $\\ P(pos,t)=\sum_{s=t_0}^{t}e^{-k[t-s]}\times f(pos,s) \\$
## Visualization of the positions under pressure
- To see how the positions under pressure differ by country, I thought a heat map would be a good choice.
- To do this a dataframe was created with a column of all the Spike RBD positions given by the tool. Other 10 columns belonging to each of the 10 countries. These columns contain the pressure for each of the positions that was computed using the mutation . If a position has no record of mutation in a country it is assigned to zero.
- To do this, a dataframe was created with one column holding all the exposed Spike RBD positions given by the tool and 10 further columns, one for each of the 10 countries. These columns contain the pressure on each position that was computed earlier. If a position has no record of mutation in a country, its pressure is set to zero.
- This dataframe is reshaped to make it usable for geom_tile.
- The heat map is then plotted on this reshaped dataframe. ![Netsurf output heatmap](assets/plots/netsurf_based_output.png)
- This was done for the outputs from all 3 tools, and the heatmaps are saved as separate PDFs: ```Work/Data_Analysis/netsurf_based_output.pdf```, ```Work/Data_Analysis/dssp_based_output.pdf```, ```Work/Data_Analysis/getArea_based_output.pdf```.
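The zero filling and the reshaping for the heat map can be sketched as follows. A tile-based heat map such as geom_tile needs one row per tile (position, country, value), so the wide table with one column per country has to be turned into a long one. This Python sketch only illustrates that reshaping and is not the thesis code. The function name `wide_to_long`, the nested-dictionary input, and the country label `country_b` are assumptions made for this example.

```python
def wide_to_long(positions, pressure_by_country):
    """Build one row per (position, country) pair for a tile heat map.

    positions is the list of exposed RBD positions reported by a tool.
    pressure_by_country maps country -> {position: pressure}.
    A position with no recorded mutation in a country gets pressure 0.0.
    """
    rows = []
    for country, pressures in pressure_by_country.items():
        for pos in positions:
            rows.append(
                {
                    "position": pos,
                    "country": country,
                    "pressure": pressures.get(pos, 0.0),
                }
            )
    return rows


# Hypothetical pressures for two positions in two countries.
example_positions = [373, 484]
example_pressures = {
    "australia": {373: 1.5},
    "country_b": {373: 0.5, 484: 2.0},
}
long_rows = wide_to_long(example_positions, example_pressures)
```

Each resulting row maps directly onto one tile: the position and country give the tile's coordinates and the pressure gives its fill.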
Expand Down
16 changes: 5 additions & 11 deletions notes/Work_documented.possible_questions.md
Expand Up @@ -2,7 +2,7 @@
id: xhw6w5ghbjhkzbo7huxzhwg
title: Possible_questions
desc: 'This note covers all the questions that are to be raised to understand the work'
updated: 1705315915317
updated: 1706599314359
created: 1701937898390
---

Expand Down Expand Up @@ -31,20 +31,14 @@ australia|13257|13261

- ANS: The Jaccard index is not the right way to go about this; we have to compute the distance and create a distance table based on the spike mutations.

### 4. Where could combining the data on daily basis and then interpolating them to get the missing day data go wrong?
### 4. Why do we do linear interpolation, why not spline interpolation?

-ANS:

### 5. If data is combined weekly, how should this frequency be distributed among the week to get the week-daily interpolation?
### 5. Aaccording to Uniprot the RBD region in spike is 319-541aa ![spike rbd uniprot](assets/Pics/uniprot_spikeRBD.png)

-ANS:

### 6. Why do we do linear interpolation, why not spline interpolation?

-ANS:

### 7. Aaccording to Uniprot the RBD region in spike is 319-541aa ![spike rbd uniprot](assets/Pics/uniprot_spikeRBD.png)
ANS:

### 8. If a RBD spike position in the wildtype is occupied by a hydrophobic residue and it is replaced by hydrophilic residue, the solvent accessibility might changes right due to possible difference in the fold? In that case should we study these positions in each of the VOI?
### 6. If a RBD spike position in the wildtype is occupied by a hydrophobic residue and it is replaced by hydrophilic residue, the solvent accessibility might change probably due to the difference in the fold. In that case should we study these positions in each of the VOI?

ANS:
