visualise_implicon_methylation_consistency_mouse_non_allele_specific.Rmd

---
title: "Visualisation of methylation consistency in imprinted Amplicons"
output:
  html_document: default
  html_notebook: default
author: Felix Krueger
---


```{r}
library(tidyverse)
```


## Data Import

Data come in the following form:

```
readID	sample	implicon	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	31
1	BxC_2_clone_A5	Impact	0	0	0	0	0	0	0	NA	NA	NA	0	0	0	0	0
2	BxC_2_clone_A5	Prickle1	NA	1	1	1	1	1	1	0	1	1	1
3	BxC_2_clone_A5	Commd1	NA	0	1	0	0	0	0	0	0	0	0	0	0
4	BxC_2_clone_A5	Gnas	NA	0	0	0	0	0	0	0	0	0
```

Each read is annotated with its sample name, the implicon it overlaps, and the call at each CpG position: methylated ('1') or unmethylated ('0'). If a position of the implicon was not covered for any reason, a value of 'NA' was assigned. The read IDs are all unique.
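
To make the layout concrete, here is a minimal sketch that parses a small excerpt in the same format and checks the two properties described above (unique read IDs, 'NA' for uncovered positions). The excerpt, the name `toy_reads`, and the restriction to four CpG positions are assumptions made for illustration; passing literal text via `I()` assumes readr 2.0 or later.

```{r}
library(readr)

# Hypothetical excerpt: three reads, only four CpG positions (not the real data)
toy_tsv <- "readID\tsample\timplicon\t1\t2\t3\t4
1\tBxC_2_clone_A5\tImpact\t0\t0\tNA\t0
2\tBxC_2_clone_A5\tPrickle1\tNA\t1\t1\t1
3\tBxC_2_clone_A5\tGnas\tNA\t0\t0\t0
"

# Same column specification as used for the full table: everything is a double
# except the two annotation columns
toy_reads <- read_tsv(
  I(toy_tsv),
  col_types = cols(.default = "d", sample = col_character(), implicon = col_character())
)

# Read IDs are expected to be unique
stopifnot(!anyDuplicated(toy_reads$readID))

# Number of uncovered (NA) calls per CpG position
colSums(is.na(toy_reads[, -(1:3)]))
```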

```{r}
# This file is a collation of all reads that overlap targeted implicons for all samples
# Edit 26 March 2021: We had to change the input column type to "double", as a few columns were auto-detected as "character"

read_tsv("methylation_state_consistency.txt", col_types = cols(.default = "d", sample = col_character(), implicon = col_character())) -> input
colour.meth   <- "#3e4444"
colour.unmeth <- "#82b74b"
colours <- c(colour.meth,colour.unmeth)
```


## Defining a function to do the following

The function takes a sample name and an implicon name, and then:

- filters the comprehensive data table for a single implicon
- filters for a single sample
- randomly extracts up to 5000 reads
- transforms the data to long format
- groups the data by annotated CpG position, and excludes positions that were not covered at all (only NA calls)
- counts the methylated calls per read, so that reads can be sorted from highly to lowly methylated, and finally
- plots the reads for that implicon and sample
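
The reshaping steps in this list can be tried out on a toy table first. The following is a minimal sketch, not the function itself: the table `toy_wide`, its values, and the column name `n_meth` are made up for illustration, and it uses `pivot_longer()` (the newer tidyr equivalent of `gather()`) rather than the call used in the function.

```{r}
library(dplyr)
library(tidyr)

# Hypothetical toy table: 3 reads, 3 CpG positions; position "3" is never covered
toy_wide <- tibble(
  readID   = c(1, 2, 3),
  sample   = "toy_sample",
  implicon = "toy_implicon",
  `1` = c(1, 0, 1),
  `2` = c(1, NA, 0),
  `3` = c(NA_real_, NA_real_, NA_real_)
)

toy_long <- toy_wide %>%
  # wide to long: one row per read and CpG position
  pivot_longer(-c(readID, sample, implicon),
               names_to = "position", values_to = "state",
               names_transform = list(position = as.numeric)) %>%
  group_by(position) %>%
  filter(any(!is.na(state))) %>%                   # drop positions without any coverage
  group_by(readID) %>%
  mutate(n_meth = sum(state, na.rm = TRUE)) %>%    # methylated calls per read
  ungroup()

toy_long
```

Position 3 disappears because all of its calls are NA, while the single NA of read 2 at position 2 is kept, since that position is covered by other reads.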


```{r}

plot_implicons <- function(sample.name,gene.name){
  #sample.name <- ""
  #gene.name <- ""
 
  print(paste0("Now processing data set: ", sample.name, ", Implicon name: ", gene.name))
         
  # Filtering the Comprehensive Input table
  input %>% 
    filter(implicon == gene.name ) %>% 
      filter(sample == sample.name) %>%
          sample_n(min(5000,nrow(.))) -> filtered   # random subset of at most 5000 reads
  nrow(filtered) -> rows.filtered
  
  # Transform the filtered data set to Tidyverse Long Format
  filtered %>% 
    gather(key = position,value=state, -readID, -implicon, -sample) %>% 
    type_convert(cols(position=col_double())) -> filtered.tidy

  # further filter tidied dataset to remove positions that were not covered at all and sort reads by readID
  filtered.tidy %>%
    group_by(position) %>% 
    mutate(keep = any(!is.na(state))) %>% # keep is FALSE for every row of a position
    # whose calls are all NA, i.e. a position that was never covered
    filter(keep) %>%                      # this removes positions from the dataset that were set to FALSE
    select(-keep) %>%                     # remove the column 'keep' again
    ungroup() %>%                         # remove grouping by position
    group_by(readID) %>%                  # now group by readID
    mutate(sum=sum(state,na.rm=TRUE)) %>% # count up all methylated calls, and save in new column called 'sum'
    ungroup() %>%                         # remove grouping by readID
    arrange(readID) -> filtered.tidy2  # now sort by readID

  
  # Now plot the data

  filtered.tidy2 %>% 
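    arrange(sum) %>%    # ascending by number of methylated calls
    # the first factor level is drawn at the bottom of the y axis,
    # so the most highly methylated reads end up at the top of the plot
    mutate(readID = factor(readID,levels=unique(readID))) %>%  # fix the level order so the reads are not re-sorted by readID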
    ggplot(aes(position, readID,fill=factor(state,levels=c("1","0")))) +
    geom_tile() +
    labs(title= paste0("Gene: ", gene.name), 
        subtitle = paste0("# reads: ",rows.filtered),
        caption = paste0("Sample name: ", sample.name),
         x = "\nCpG position") +
    ylab(NULL) +                    # no y-axis title
    coord_cartesian(expand=FALSE) + # this prevents empty space on either side of the plot
    scale_fill_manual(values=colours, name="Methylation State",labels = c("meth", "unmeth")) +
    theme(plot.title = element_text(hjust = 0.5), 
          legend.title = element_text(colour = "steelblue4",  face = "bold"),
          legend.text  = element_text(face = "italic", colour="steelblue4"),
          axis.line.y = element_blank(),
          axis.ticks.y = element_blank(),
          axis.text.x = element_text(vjust = 0.5),
          axis.text.y = element_blank(),
          axis.ticks.x = element_line()) -> plot.mouse

  print (plot.mouse)
 
    
}
```
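
The read ordering inside `plot_implicons` relies on factor levels rather than on row order. A minimal sketch of that idea, using a made-up table `toy_order` with a hypothetical `n_meth` column:

```{r}
library(dplyr)

# Hypothetical reads with their number of methylated calls
toy_order <- tibble(readID = c(10, 11, 12), n_meth = c(2, 0, 5)) %>%
  arrange(n_meth) %>%                                       # least methylated read first
  mutate(readID = factor(readID, levels = unique(readID)))  # freeze that order as factor levels

levels(toy_order$readID)
```

The levels come out as "11", "10", "12", that is, in order of increasing methylation. On a discrete y axis ggplot2 places the first level at the bottom, so the most methylated read is drawn at the top.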

## Loop through all implicon and sample names

```{r}
input %>%
  distinct(sample) %>% 
  pull(sample) -> samplenames
  # as_vector()-> samplenames    ### as_vector() works as well.

input %>%
  distinct(implicon) %>% 
  pull(implicon) -> implicons

# Looping through all Implicons, and then plotting all samples for each Implicon
for (imp in implicons){
#  # print (imp) 
  invisible(sapply(samplenames, plot_implicons,imp))
}

# Just a single sample for testing purposes
# invisible(sapply("Tx", plot_implicons, "snrpn"))

```
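
As an alternative to the nested loop, the implicon and sample combinations could be enumerated in one table and walked over in a single call. This is only a sketch under assumptions: `report_combination` is a hypothetical stand-in for `plot_implicons`, and the sample and implicon names are made up.

```{r}
library(tidyr)
library(purrr)

# Hypothetical stand-in for plot_implicons(); it only reports what it would plot
report_combination <- function(sample.name, gene.name) {
  message("Would plot sample ", sample.name, " for implicon ", gene.name)
}

# All implicon x sample combinations, one row each
combos <- crossing(implicon = c("Impact", "Gnas"),
                   sample   = c("sample_A", "sample_B"))

# walk2() calls the function once per row, purely for its side effect
walk2(combos$sample, combos$implicon, report_combination)
```

With the real table, `distinct(input, implicon, sample)` would instead list only the combinations that actually have reads.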