figures/6_single_cell_hnsc.Rmd

---
title: "Figure 6: Application of DualSimplex to single cell data"
output:
  html_document:
    df_print: paged
    self_contained: yes
---

# Setup
```{r setup}
source("../R/setup.R")
library(ggplot2)
library(Seurat)
library(systemfonts)
library(magrittr)

```


# Read original data (Seurat object)
You actually can use just matrix (see other examples)
```{r load}
so <- readRDS("../data/large/GSE103322.rds")
```

## Set proper names for annotation
Here we just rename some annotation names, to make it more readable.

```{r, fig.width=7, fig.height=5}
ct_mapping = list(
  "0" = "Cancer",
  "Fibroblast" = "Fibroblast",
  "B cell" = "B cell",
  "myocyte" = "Myocyte",
  "-Fibroblast" = "CAF",
  "Macrophage" = "Macrophage",
  "Endothelial" = "Endothelial",
  "T cell" = "T cell",
  "Dendritic" = "Dentritic",
  "Mast" = "Mast"
)
so@meta.data[, "ct"] <- unname(sapply(so@meta.data$ct, function(x) ct_mapping[[x]]))
Idents(so) <- "ct"
DimPlot(so)
```

## Linearize the data
```{r}
data <- so@assays$RNA@counts
data <- as.matrix(data)
data <- 2^data
dim(data)
```

## Subset variable features
Here we take only 3000 most variable genes to make caluclations faster. 
We use Seurat method to do this for now. 
You can play with different types of filtering (e.g. MAD filtering).

### Find variable features (genes)
```{r}
so <- Seurat::FindVariableFeatures(so, nfeatures = 3000)
flt_genes <- so@assays$RNA@var.features
flt_genes <- flt_genes[flt_genes %in% rownames(data)]
```

### Plot selected genes on VariableFeature plot
```{r, fig.width=7, fig.height=5}
top10 <- head(VariableFeatures(so), 10)
plot <- VariableFeaturePlot(so)
plot <- LabelPoints(plot = plot, points = top10, repel = TRUE)
plot
```
## Finally, subset the data which we take for analysis

```{r}
data <- data[flt_genes, ]
```


# DualSimplex method
## Create an object
```{r}
dso <- DualSimplexSolver$new()
dso$set_data(data) # This  will perform sinkhorn transformation. Can be time consuming for big matrices and weak PCs
dim(dso$get_data())
```

## Inspect the data. Plot SVD to estimate number of clusters
Usually it is not clearly visible how many clusters there are without additional filtering. 
But our goal is to get some kind of elbow plot. The if for specific K the change in variance becomes minor, this is our K parameter.
```{r}
dso$plot_svd(1:100)
```
## Project data into low dimensional space.
Let's start with number of clusters equal 16. 
Since we Sinkhorn transformed matrix, the 16-dimensional space of  SVD vectors should represent simplex structures of our data points

### Projection itself
This is how we project our points
```{r}
K <- 16
dso$project(K)
```

#### Visualize projeciton
Since we can not plot or interpret 16 dimensions directly, we will use UMAP for visualization purposes. It seems that UMAP is able to catch spatial distribution of simplexes. (corner points on UMAP are usually corners in original space)

```{r}
dso$run_umap(neighbors_X = 30, neighbors_Omega = 15)
```

Make plot
```{r, fig.width=12, fig.height=6}
dso$plot_projected(use_dims = NULL)
```

### Make several diagnostic plots to decide if we need to filter something
Once we got some projection we now are interested in how good this projection is.
Question we try to answer is: Should we remove some outlier points to make projection look better?
Here we combined several plots to look at
- Histograms of zero/plain distance for both genes and samples
Here we can see if some points are way too far from the zero than others or they just don't fit into K-dimensional space. 
Such points are usually outliers and we want to remove them
- Plane/Zero distance dimplot
This is how to look into these number jontly. Point which is just far from the zero could be a corner of a simplex.
But point which is far zero and also far from the hypothetical simplex is considered as outlier which ruins projection by pulling all the hyperplane to this point.
- Points colored by these numbers
This is again same thing but on UMAP Sometimes you can visually see some small clusters which should be outliers.
- SVD plot
After removing noise in the data we should observe some improvement in this plot as well.

```{r, fig.width=12, fig.height=6}
dso$plot_projection_diagnostics()
```

### Filtering
Let's now start filtering iterations. 
For big and noisy data it is usually multiple iterations
After each filtering we do projection again, since we removed poitns which were ruining the projecion and it could improve.

#### Filtering step 1. Basic filter
Here we remove some predefined "bad" genes. These are non-coding, mito and house keeping genes.
In general, we have a set of regular expression filters for gene names which we applied in this paper

Note: we also have MAD filtering here, but since single cell is very different from the bulk data, we are not applying it by setting -inf threshold
Since we removed some points from original data we should repeat umap with new data
```{r}
dso$basic_filter(remove_true_cols_default=c("RPLS", "LOC", "ORF", "SNOR", "MT", "RRNA"), log_mad_gt = -Inf)
dso$project(K)
dso$run_umap(neighbors_X = 20, neighbors_Omega = 15)
```

Let's see how projection changed so far
```{r, fig.width=12, fig.height=6}
dso$plot_projection_diagnostics()
DimPlot(so, cells.highlight = list(filtered = colnames(so)[!colnames(so) %in% colnames(dso$get_data())]))

```


####  Filtering step 2. Distance filter for genes
Let's remove genes which are far from zero (it is possible that these genes are just outliers)
```{r}
dso$distance_filter(zero_d_lt = 0.1, genes = T) # apply distance filter
dso$project(K)
dso$run_umap(neighbors_X = 20, neighbors_Omega = 15)
```
Let's see how projection changed so far
```{r fig.width=12, fig.height=6}
dso$plot_projection_diagnostics()

DimPlot(so,
        cells.highlight = list(filtered = colnames(so)[!colnames(so) %in% colnames(dso$get_data())]),
        sizes.highlight = 0.1)
```
Looks like after gene-filtering, some extreme outlier cells are more close to the center.

####  Filtering step 3. Distance filter for genes
Let's make one more distance filtering of genes for this new projection
```{r}
dso$distance_filter(zero_d_lt = 0.05, genes = T) # apply distance filter
dso$project(K)
dso$run_umap(neighbors_X = 20, neighbors_Omega = 15)
```
Let's see how projection changed so far
```{r fig.width=12, fig.height=6}
dso$plot_projection_diagnostics()

DimPlot(so,
        cells.highlight = list(filtered = colnames(so)[!colnames(so) %in% colnames(dso$get_data())]),
        sizes.highlight = 0.1)
```
Looks like it doesn't change much. Now let's try filtering cells.


####  Filtering step 4. Distance filter for cells
Now we will remove cells which are far from zero point and basically disturbing the simplex structure
```{r}
dso$distance_filter(zero_d_lt = 0.1, genes = F) # apply distance filter
dso$project(K)
dso$run_umap(neighbors_X = 20, neighbors_Omega = 15)
```
Let's see how projection changed so far
```{r fig.width=12, fig.height=6}
dso$plot_projection_diagnostics()

DimPlot(so,
        cells.highlight = list(filtered = colnames(so)[!colnames(so) %in% colnames(dso$get_data())]),
        sizes.highlight = 0.1)
```

####  Filtering step 5. Distance filter for genes
One more zero distance filter for genes
```{r}
dso$distance_filter(zero_d_lt = 0.05, genes = T) # apply distance filter
dso$project(K)
dso$run_umap(neighbors_X = 20, neighbors_Omega = 15)
```
Let's see how projection changed so far
```{r fig.width=12, fig.height=6}
dso$plot_projection_diagnostics()

DimPlot(so,
        cells.highlight = list(filtered = colnames(so)[!colnames(so) %in% colnames(dso$get_data())]),
        sizes.highlight = 0.1)
```

####  Filtering step 6. Distance filter for genes
Now let's remove not only genes far from zero but also those points which are far from the hypothetical hyperplane in (R1,R2,R3). Such points has some variance incorporated in additional coordinates (R4, R5, ...)
```{r}
dso$distance_filter(zero_d_lt = 0.06, plane_d_lt = 0.12, genes = T) # apply distance filter
dso$project(K)
dso$run_umap(neighbors_X = 20, neighbors_Omega = 15)
```
Let's see how projection changed so far
```{r fig.width=12, fig.height=6}
dso$plot_projection_diagnostics()

DimPlot(so,
        cells.highlight = list(filtered = colnames(so)[!colnames(so) %in% colnames(dso$get_data())]),
        sizes.highlight = 0.1)
```
It looks like the genes are kinda stable. Now let's do the final filtering on samples.


####  Filtering step 7. Distance filter for samples
Same for cells. We remove cells which are both far from zero and far from the hyperplane.
```{r}
dso$distance_filter(zero_d_lt = 0.075, plane_d_lt = 0.2, genes = F) # apply distance filter
dso$project(K)
dso$run_umap(neighbors_X = 20, neighbors_Omega = 15)
```
Let's see how projection changed so far
```{r fig.width=12, fig.height=6}
dso$plot_projection_diagnostics()

DimPlot(so,
        cells.highlight = list(filtered = colnames(so)[!colnames(so) %in% colnames(dso$get_data())]),
        sizes.highlight = 0.1)
```


### Summarize filtering
Just print all the steps we have done
```{r}
dso$get_filtering_stats()
```

## Subsample Seurat Object according to the filtering
```{r}
so_flt <- so[, colnames(so) %in% colnames(dso$get_data())]
```

```{r, fig.width=7, fig.height=5}
DimPlot(so_flt)
```


## Finalize projection
### Track SVD history for each filtering step done
```{r}
dso$plot_svd_history(n_dims = 30)
```
Observed that we obtain a steeper elbow plot after filtering, which indicates first few components
explains more variance.

### Final projection for points (just to be sure that every filtering step was taken into account)
```{r}
dso$project(K)
```
### Run the final UMAP and plot it

```{r fig.width=13, fig.height=6}
set.seed(24)
dso$run_umap(neighbors_X = 20, neighbors_Omega = 10)
dso$plot_projected(use_dims = NULL, with_solution = F, with_history = F)
dso$plot_projected(
  color_genes = "zero_distance",
  color_samples = as.factor(so_flt[, colnames(dso$get_data())]$ct),   # Add annotation colors to cells
  use_dims = NULL,
  pt_size = 0.5,
  with_legend = T,
  with_solution = F,
  with_history = F
)
```

### Plot final projection diagnostic plots

```{r, fig.width=12, fig.height=6}
# Plot zero_distance and plane_distance in the original UMAP
so_flt$zero_distance <- pData(dso$get_data())[colnames(so_flt), "zero_distance"]
so_flt$plane_distance <- pData(dso$get_data())[colnames(so_flt), "plane_distance"]
pal <- rev(RColorBrewer::brewer.pal(11, "Spectral"))
cowplot::plot_grid(
  FeaturePlot(so_flt, feature = "zero_distance", cols = pal) + NoLegend(),
  FeaturePlot(so_flt, feature = "plane_distance", cols = pal) + NoLegend()
)
```

### Save result dso object for further usage
Now we finished all the filtering. We still don't know coordinates for vertices of the simplex we will find them in the next section.
For now just save filered object
```{r save-state}
dso$save_state("../out/single_cell_hnsc_save_3kg_16ct/")
```


# Optimizing deconvolution solution and clustering of cells

## Load the model with data already filtered
```{r load-state}
dso <- DualSimplexSolver$from_state("../out/single_cell_hnsc_save_3kg_16ct/")
```
## Initialize initial solution, define optimization parameters and train
Now we can start analyses with filtered data. Here we will use an adaptive learning rate and 
coefficients for lambda and beta constrains.
```{r}
# Initialization
set.seed(42)
dso$init_solution("random")

# Training params
start_with_hinge_H <-  1 
start_with_hinge_W <-  1
params_increase <- 100
PARAMETERS_INCREASE_STEPS <-  3

lr_x <-  0.1
lr_omega <-  0.1
lr_decay <- 0.1
LR_DECAY_STEPS <-  12

RUNS_EACH_STEP <-  2000

original_lambda_term <- start_with_hinge_H  #coef_hinge_H
original_beta_term <- start_with_hinge_W #coef_hinge_W
original_lr_x <-  lr_x
original_lr_omega <-  lr_omega

# Training loop. Start with defined parameters and change them in the process of training.
## For all learning rate steps
for  (lr_step in 1:LR_DECAY_STEPS) {
  lambda_term <-  original_lambda_term * lr_x * lr_x
  beta_term <- original_beta_term   * lr_omega * lr_omega
  RUNS_EACH_STEP <-  RUNS_EACH_STEP * 0.7
  ## For all constraints increase steps
  for (incr_step in 1:PARAMETERS_INCREASE_STEPS) {
    ## Do training
    ## This is the main training method, you can just run this
    dso$optim_solution(RUNS_EACH_STEP, 
                       optim_config(coef_hinge_H = lambda_term,
                                    coef_hinge_W =beta_term,
                                    coef_der_X = lr_x, 
                                    coef_der_Omega = lr_omega,
                                    coef_pos_D_w = 0,
                                    coef_pos_D_h = 0
                       ))
    lambda_term <- lambda_term * params_increase
    beta_term <-  beta_term * params_increase
  }
    lr_x <- lr_x * lr_decay
    lr_omega <-  lr_omega * lr_decay
}
solution <- dso$finalize_solution()
```
## Plot result solution on umap
Here we plot several plots to ensure that we reached some kind of convergence
- plot projected points with solution points added to the plot
- plot error history for each term
- plot changes in negativity of basis and coefficients
```{r fig.width = 13, fig.height = 6}
dso$plot_projected(use_dims = NULL, with_legend = T)

dso$plot_error_history()

dso$plot_negative_basis_change()
dso$plot_negative_proportions_change()
```


## Check corner points
### Assign clusters
Now we want to validate points we found.
We will use solution points as a centroids for clustering of the data. This is done with simple KNN procedure
```{r, fig.width=12, fig.height=6}
# Assign clusters to genes simplex with centroids provided by our solution
knns <- FNN::get.knnx(dso$st$proj$X, dso$st$solution_proj$X, k = 40)
corner_genes <- as.list(as.data.frame(
  apply(knns$nn.index, 1, function(x) rownames(dso$get_data())[x])
))
# Assign clusters to samples simplex with centroids provided by our solution
knns_cells <- FNN::get.knnx(dso$st$proj$Omega, dso$st$solution_proj$Omega, k = 70)
corner_cells <- as.list(as.data.frame(
  apply(knns_cells$nn.index, 1, function(x) colnames(dso$get_data())[x])
))
kmeans_cells_clustering <- stats::kmeans(dso$st$proj$Omega, centers = dso$st$solution_proj$Omega)
```

### Plot clusters on projection plot
```{r, fig.width=12, fig.height=6}

# this is plot with original annotation for cells
dso$plot_projected(
  color_genes = "zero_distance",
  color_samples = as.factor(so_flt[, colnames(dso$get_data())]$ct),
  use_dims = NULL,
  pt_size = 0.5,
  with_legend = T,
  with_solution = F,
  with_history = F
)
# this is plot with corner genes and cells highlighted
dso$plot_projected(
  color_genes = which_marker(rownames(dso$get_data()), corner_genes),
  color_samples = which_marker(colnames(dso$get_data()), corner_cells),
  with_solution = T,
  with_history = F,
  with_legend = F,
  use_dims = NULL
)
# this is plot with result clusters highlighted for cells
dso$plot_projected(
  color_genes = which_marker(rownames(dso$get_data()), corner_genes),
  color_samples = as.factor(kmeans_cells_clustering$cluster),
  with_solution = T,
  with_history = F,
  with_legend = F,
  use_dims = NULL
)
```

### Plot clusters on original tSNE plot of seurat
```{r fig.width=5, fig.height=5}
gg_color_hue <- function(n) {
  hues = seq(15, 375, length = n + 1)
  hcl(h = hues, l = 65, c = 100)[1:n]
}

so_flt$marker_cell <- which_marker(colnames(so), corner_cells)
so_flt$kmeans_cluster <- kmeans_cells_clustering$cluster[colnames(so_flt)]

DimPlot(so_flt, group.by = "marker_cell", na.value = "grey92", cols = gg_color_hue(dso$get_n_cell_types())) + NoLegend() + ggtitle("") + NoAxes()
DimPlot(so_flt, group.by = "kmeans_cluster", label=T) + NoLegend() + ggtitle("") + NoAxes()
DimPlot(so, label = T) + NoLegend() + ggtitle("") + NoAxes()
DimPlot(so)
```

```{r save-solution}
dso$save_solution("../out/single_cell_hnsc_save_3kg_16ct/")
```

We will also saved the pseudo-bulk expression according to the clustering results.
```{r}
ae <- AverageExpression(so_flt, group.py = "kmeans_cluster")
write.csv(ae$RNA, "../out/pseudo_bulk_16ct.csv", quote = F)
```


# Reproducing figure 6
```{r, fig6a, fig.width=7, fig.height=5}
plt <- DimPlot(so_flt, raster = T) +
  Seurat::NoAxes() +
  ggtitle("") +
  theme(text=element_text(family = "Helvetica", size = 12))

ggsave("../out/sc_a_gse103322_umap.svg", plt, width = 7, height = 5, device = svglite::svglite)
plt
```


```{r fig6b, fig.width=20, fig.height=10}
plotlist <- list()
for (use_dims in list(2:3, 4:5, 6:7, 8:9)) {
  pt_size <- 0.7
  plts <- dso$plot_projected(
    color_genes = "zero_distance",
    color_samples = as.factor(so_flt[, colnames(dso$get_data())]$ct),
    use_dims = use_dims,
    pt_size = pt_size,
    with_legend = T,
    with_solution = F,
    with_history = F,
    wrap = F
  )
  linewidth <- 0.3

  plotlist[[use_dims[[2]] - 1]] <- plts[[2]] + ggtitle("Cells") + theme(
    legend.position = "none",
    axis.line = element_line(color = "black", size = linewidth),
    axis.title = element_text(size = 18),
    axis.text = element_blank(),
    panel.grid = element_blank(),
    plot.title = element_text(size = 16, face = "bold")
  ) + xlab(paste0("S", use_dims[[1]])) + ylab(paste0("S", use_dims[[2]]))

  plotlist[[use_dims[[1]] - 1]] <- plts[[1]] + ggtitle("Genes") + theme(
    legend.position = "none",
    axis.line = element_line(color = "black", size = linewidth),
    axis.title = element_text(size = 18),
    axis.text = element_blank(),
    panel.grid = element_blank(),
    plot.title = element_text(size = 16, face = "bold")
  ) + xlab(paste0("R", use_dims[[1]])) + ylab(paste0("R", use_dims[[2]]))
}

plt <- cowplot::plot_grid(plotlist = plotlist, ncol = 4)

ggsave("../out/sc_b_first_dims.svg", plt, width = 20, height = 10, device = svglite::svglite)
plt
```


```{r fig6c, fig.width=4, fig.height=8}
plts <- dso$plot_projected(
  color_genes = which_marker(rownames(dso$get_data()), corner_genes),
  color_samples = which_marker(colnames(dso$get_data()), corner_cells),
  use_dims = NULL,
  pt_size = 1,
  with_legend = F,
  with_solution = T,
  with_history = T,
  color_history = F,
  wrap = F
)
linewidth <- 0.3

plts[[2]] <- plts[[2]] + ggtitle("Cells") + theme(
  axis.line = element_line(color = "black", size = linewidth),
  axis.title = element_text(size = 14),
  axis.text = element_blank(),
  panel.grid = element_blank(),
  plot.title = element_text(size = 14, face = "bold", vjust = -5, hjust = 0.02)
) + xlab("UMAP1") + ylab("UMAP2")

plts[[1]] <- plts[[1]] + ggtitle("Genes") + theme(
  axis.line = element_line(color = "black", size = linewidth),
  axis.title = element_text(size = 14),
  axis.text = element_blank(),
  panel.grid = element_blank(),
  plot.title = element_text(size = 14, face = "bold", vjust = -5, hjust = 0.02)
) + xlab("UMAP1") + ylab("UMAP2")


ggsave(
  "../out/sc_c_corners_genes.svg",
  plts[[1]],
  width = 4,
  height = 4,
  device = svglite::svglite
)
ggsave(
  "../out/sc_c_corners_samples.svg",
  plts[[2]],
  width = 4,
  height = 4,
  device = svglite::svglite
)

cowplot::plot_grid(plotlist = plts, ncol = 1)
```


```{r fig6d, fig.width=26, fig.height=12}
plotlist <- list()
for (ctn in 1:dso$get_n_cell_types()) {
  ct <- dso$get_ct_names()[[ctn]]
  so_flt[[ct]] <- pmin(dso$get_solution()$H[ct, ], 0.5)
  mar <- 0
  plotlist[[ct]] <- FeaturePlot(so_flt, feature = ct, raster = T) +
    NoLegend() +
    NoAxes() +
    ggtitle(ctn) +
    theme(
      plot.title = element_text(size = 24, face = "bold", vjust = -7, hjust = 0.05),
      plot.margin = margin(t = mar, r = mar, b = mar, l = mar)
    )
}
plt <- cowplot::plot_grid(plotlist = plotlist, ncol = 6, scale = 1.05)
ggsave("../out/sc_d_proportions.svg", plt, width = 26, height = 12, device = svglite::svglite)
plt
```


```{r enrichment-mapping}
# Above plot -> Fig. 6d
enrichment_mapping <- c(
  "V1" = 2,
  "V2" = 12,
  "V3" = 8,
  "V4" = 10,
  "V5" = 4,
  "V6" = 1,
  "V7" = 3,
  "V8" = 6,
  "V9" = 9,
  "V10" = 14,
  "V11" = 13,
  "V12" = 11,
  "V13" = 5,
  "V14" = 7,
  "V15" = 15,
  "V16" = 16
)
```


```{r check-if-any-missing}
setdiff(1:16, enrichment_mapping)
```
All the cell type signatures reported are captured.


```{r fig6e, fig.width=5, fig.height=5}
fn <- function(x) {
  mx <- which.max(x)
  if (x[mx] > 0) {
    return(mx)
  } else {
    return(NA)
  }
}
so_flt$dualsimplex_cluster <- apply(dso$get_solution()$H, 2, fn)[colnames(so_flt)]
plt <- (DimPlot(so_flt, group.by = "dualsimplex_cluster", label = F, raster = T) +
  NoLegend() +
  NoAxes() +
  ggtitle("")) %>% LabelClusters(id = "dualsimplex_cluster", fontface = "bold", color = "black", repel = 0.1, size = 3, box = T)

ggsave("../out/sc_e_clustering.svg", plt, width = 5, height = 5, device = svglite::svglite)
plt
```


```{r fig6f, fig.width=17, fig.height=4}
basis <- dso$get_solution()$W

# Min-max scaling for display, like in phantasus
basis <- as.data.frame(t(apply(basis, 1, function(x) (x - min(x)) / (max(x) - min(x)))))

ct_order <- c(
  6, 4, 11, 16,  # Fibroblasts
  12, 1, 2, 13, 3,  # Cancer and near epithelium
  8, 9, 14, 10, 15, 7,  # Immune
  5  # Endothelial
)
n_markers <- 5
gaps_after <- c(4, 9, 15)
gaps_after <- rep(gaps_after, each = 3)
col_gaps <- gaps_after * n_markers
col_gaps_small <- cumsum(rep(n_markers, dso$get_n_cell_types()))
col_gaps_small <- col_gaps_small[!col_gaps_small %in% col_gaps]
col_gaps <- c(col_gaps_small, col_gaps)
sele <- unlist(get_signature_markers(dso$get_solution()$W, n_markers)[ct_order])
colnames(basis) <- seq(1, ncol(basis))
plt <- pheatmap::pheatmap(
  t(basis[sele, ct_order]),
  cluster_rows = F,
  cluster_cols = F,
  color = colorRampPalette(c("blue", "white", "red"))(100),
  legend = F,
  gaps_col = col_gaps,
  gaps_row = gaps_after,
  angle_col = 315,
  fontsize_col = 10
)
ggsave("../out/sc_f_marker_genes.svg", plt, width = 15, height = 4, device = svglite::svglite)

plt
```