Commit

Updated transformation step
oliviaAB committed Feb 23, 2024
1 parent 1d33c74 commit 77866db
Showing 8 changed files with 155 additions and 108 deletions.
3 changes: 3 additions & 0 deletions _quarto.yml
@@ -5,6 +5,9 @@ project:
book:
title: "The moiraine R package user manual"
author: "Olivia Angelin-Bonnet"
search: true
repo-url: https://github.com/PlantandFoodResearch/moiraine
sharing: [twitter]
chapters:
- index.qmd
- example_dataset.qmd
184 changes: 95 additions & 89 deletions _targets/meta/meta

Large diffs are not rendered by default.

6 changes: 3 additions & 3 deletions comparison.qmd
@@ -158,14 +158,14 @@ comparison_heatmap_corr(
"sO2PLS" = c(
"joint component 1",
paste("metabolome specific component", 1:2),
- paste("rnaseq specific component", 1:2)
+ "rnaseq specific component 1"
),
"MOFA" = paste("Factor", 1:4)
)
)
```

- From the heatmaps, we can see that some latent dimensions constructed by the different methods seem to capture similar trends in the data. For example, MOFA factor 1, sPLS component 1, DIABLO component 1 and sO2PLS joint component 1 are all strongly correlated in terms of their samples score. Their correlation in terms of features weight is a bit lower, which is due to the fact that some methods perform features selection, therefore all non-selected features are given a weight of 0. Note that the sign of the correlation is interesting but not very important. We can also see some latent dimensions that seem correlated with respect to one metric but not the other. For example, there is a strong correlation between the samples score of MOFA factor 4 and sPLS component 2, but this is not reflected in their features weight. Again, that can be because one method performs latent selection and not the other. On the other hand, the correlation between sPLS component 2 and DIABLO component 2 is stronger when looking at their features weight than at their samples score.
+ From the heatmaps, we can see that some latent dimensions constructed by the different methods seem to capture similar trends in the data. For example, MOFA factor 1, sPLS component 1, DIABLO component 1 and sO2PLS joint component 1 are all strongly correlated in terms of their samples score. Their correlation in terms of features weight is a bit lower, because some methods perform feature selection, so all non-selected features are given a weight of 0. Note that the sign of the correlation is interesting but not very important. We can also see some latent dimensions that seem correlated with respect to one metric but not the other. For example, there is a strong correlation between the samples score of MOFA factor 4 and sPLS component 2, but this is not reflected in their features weight. Again, that can be because one method performs feature selection and the other does not. On the other hand, the correlation between sPLS component 3 and DIABLO component 2 is stronger when looking at their features weight than at their samples score.

When using the function to compare the results of the two MOFA runs, the names that we gave to the list of outputs will be used as the method names, and these are the names to use in order to select latent dimensions of interest. For example:

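A minimal sketch of such a call (the object `output_list_mofa` and the method names used below are hypothetical placeholders for the named list containing the two MOFA outputs):

```{r comparison-heatmap-two-mofa-sketch}
#| eval: false

comparison_heatmap_corr(
  output_list_mofa,   # hypothetical named list of the two MOFA results
  latent_dimensions = list(
    "MOFA run 1" = paste("Factor", 1:4),
    "MOFA run 2" = paste("Factor", 1:4)
  )
)
```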
@@ -231,7 +231,7 @@ comparison_plot_correlation(
output_list[2:1],
by = "samples",
latent_dimensions = list(
- "sO2PLS" = c("joint component 1", paste("rnaseq specific component", 1:2))
+ "sO2PLS" = c("joint component 1", "rnaseq specific component 1")
)
)
```
8 changes: 4 additions & 4 deletions interpretation.qmd
@@ -457,12 +457,12 @@ The two disease status groups are well separated in this space.

#### sO2PLS

- To showcase this function, we will look at the first two RNAseq-specific latent components:
+ To showcase this function, we will look at the first two metabolomics-specific latent components:

```{r so2pls-plot-samples-score-pair}
plot_samples_score_pair(
so2pls_output,
- paste("rnaseq specific component", 1:2),
+ paste("metabolome specific component", 1:2),
mo_data = mo_set_complete,
colour_by = "status",
shape_by = "gender"
@@ -724,7 +724,7 @@ All genes and most metabolites selected for the two components are different.
```{r so2pls-features-weight-pair}
plot_features_weight_pair(
so2pls_output,
- paste("rnaseq specific component", 1:2),
+ paste("metabolome specific component", 1:2),
mo_data = mo_set_complete,
label_cols = list(
"rnaseq" = "Name",
@@ -733,7 +733,7 @@ plot_features_weight_pair(
)
```

- The plot shows an interesting pattern: it seems that for the genes with a negative weight for the first RNAseq-specific component, their importance score for this component is highly correlated with their importance score for the second RNAseq-specific component, except that for the second one they have a positive weight. For the genes with a positive weight for the first RNAseq-specific component, their importance score for this component is highly correlated with their importance score for the second RNAseq-specific component, in which they are also given a positive weight.
+ The plot shows an interesting pattern: the weights of the compounds for these two specific components are negatively correlated, except for a couple of outliers.

#### MOFA

42 changes: 41 additions & 1 deletion mofa.qmd
@@ -56,6 +56,10 @@ A characteristic of the MOFA2 package is that the model training part of the ana

MOFA offers the option to assign samples to groups (referred to as the multi-group framework within MOFA). It is important to understand what this does before using it. In particular, if the aim of the analysis is to find features that separate samples based on a grouping, you should not use this multi-group framework. When sample groups are set, MOFA ignores the sources of variation that discriminate the groups. Instead, this multi-group framework will seek sources of variation that are shared between the different groups, and sources of variation that are exclusive to a specific group. For example, in [this analysis](https://raw.githack.com/bioFAM/MOFA2_tutorials/master/R_tutorials/scNMT_gastrulation.html), the multi-group framework was used to split samples (cells) based on their developmental stage. The goal of the analysis was to find coordinated variation between the different datasets, and detect at which developmental stage (i.e. in which group) it occurred, rather than trying to differentiate the different developmental stages.

### MEFISTO

MEFISTO is an extension of MOFA that explicitly accounts for some continuous structure amongst the samples, i.e. temporal or 2D spatial relationships between the observations. The association of the constructed latent factors with the covariate is assessed, to see whether they represent smooth variation that correlates with the covariate, or whether they are independent of it. The trained model can then be used to extrapolate to unseen values of the covariate (e.g. missing time-points or physical locations). When using the multi-group framework, MEFISTO can also align the values of the covariate between the different groups, for example to account for differences in developmental speed between two groups of patients.

## Creating the MOFA input

The first step is to transform the `MultiDataSet` object into a suitable format for the `MOFA2` package. This is done through the `get_input_mofa()` function. In addition to the `MultiDataSet` object, this function accepts as arguments:
@@ -181,6 +185,42 @@ mofa_input

We did not use the multi-group framework, so there is only one samples group.
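
For reference, a minimal sketch of what the `get_input_mofa()` call creating this object can look like; it mirrors the MEFISTO example shown in the next section, and the option values are illustrative assumptions rather than the exact ones used in the pipeline:

```{r get-input-mofa-sketch}
#| eval: false

get_input_mofa(
  mo_presel_supervised,
  options_list = list(
    data_options = list(scale_views = TRUE),
    model_options = list(likelihoods = c(
      "snps" = "poisson",
      "rnaseq" = "gaussian",
      "metabolome" = "gaussian")
    ),
    training_options = list(seed = 43)
  ),
  only_common_samples = FALSE
)
```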

### MEFISTO input

In a similar way, we can construct the input object for a MEFISTO run with the `get_input_mefisto()` function. It is very similar to `get_input_mofa()`, except that it expects as its second argument the name or names of the columns in the samples metadata tables corresponding to the continuous covariate(s) which represent(s) the structure amongst the samples. Note that there can be at most two covariates. For example, if we had a time-series dataset with information about the time at which each observation was taken in a column called `time`, we would use:

::: {.targets-chunk}
```{targets get-input-mefisto}
tar_target(
mefisto_input,
get_input_mefisto(
mo_presel_supervised,
"time",
options_list = list(
data_options = list(scale_views = TRUE),
model_options = list(likelihoods = c(
"snps" = "poisson",
"rnaseq" = "gaussian",
"metabolome" = "gaussian")
),
training_options = list(seed = 43)
),
only_common_samples = FALSE
)
)
```
:::

MEFISTO has some specific options that can be set in a similar way to the data, model and training options for MOFA. The most important are:

* `warping`: Logical, whether the covariate(s) should be aligned between the groups. Only when using the multi-group framework.

* `warping_ref`: Character, if using the warping option, the name of the group that should be used as reference (the covariates for the other groups will be aligned to the covariates for this group).

* `new_values`: Numeric vector, values of the covariate(s) for which the factors should be predicted (for inter/extrapolation).

Any of these options can be passed as a list to the `options_list` parameter, in the same way as the data, model and training options, as sketched below.
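
For instance, a sketch of what this could look like; the name of the corresponding `options_list` element (assumed here to be `mefisto_options`) and the values shown are assumptions for illustration:

```{r get-input-mefisto-options-sketch}
#| eval: false

get_input_mefisto(
  mo_presel_supervised,
  "time",
  options_list = list(
    training_options = list(seed = 43),
    # assumed element name for the MEFISTO-specific options
    mefisto_options = list(
      new_values = c(11, 12),  # time-points at which to extrapolate the factors
      warping = FALSE          # only relevant with the multi-group framework
    )
  )
)
```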

## Visualising the MOFA input

It is possible to represent the samples that are present or missing across the different datasets with the `plot_data_overview()` function implemented in `MOFA2`:
@@ -256,7 +296,7 @@ plot_variance_explained(
)
```

- Here, we see that factor 1 explains a lot (almost 50%) of variation in the transcriptomics dataset, a little (around 10%) in the metabolomics dataset and almost none in the genomics dataset. Factor 2 explains around 20% of variation in the genomics dataset, and none in the other datasets. Factors 3 and 4 seem specificto the transcriptomics datasets, and factors 5 to 15 explain very little variation; seeing this, we could re-train the MOFA model by setting the number of factors to 4 or 5.
+ Here, we see that factor 1 explains a lot (almost 50%) of variation in the transcriptomics dataset, a little (around 10%) in the metabolomics dataset and almost none in the genomics dataset. Factor 2 explains around 20% of variation in the genomics dataset, and none in the other datasets. Factors 3 and 4 seem specific to the transcriptomics dataset, and factors 5 to 15 explain very little variation; seeing this, we could re-train the MOFA model by setting the number of factors to 4 or 5.
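
A sketch of how this could be done when re-creating the MOFA input, assuming (as in the `MOFA2` package) that the relevant model option is called `num_factors`:

```{r get-input-mofa-fewer-factors-sketch}
#| eval: false

get_input_mofa(
  mo_presel_supervised,
  options_list = list(
    model_options = list(num_factors = 4)  # assumed option name, as in MOFA2
  )
)
```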


In addition, the `plot_variance_explained()` function can create a second plot displaying the total percentage of variance explained by all the factors for each dataset, if the argument `plot_total` is set to `TRUE`:
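
A minimal sketch of such a call (`mofa_trained` is a placeholder name for the trained MOFA output used in this chapter):

```{r mofa-plot-variance-explained-total-sketch}
#| eval: false

plot_variance_explained(
  mofa_trained,      # placeholder for the trained MOFA output
  plot_total = TRUE  # also plot the total variance explained per dataset
)
```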
16 changes: 7 additions & 9 deletions preprocessing.qmd
@@ -73,7 +73,7 @@ After inspection of the density plots for the different datasets (see @sec-inspe

### Transforming a single dataset

- The transformation of one dataset is done through the `transform_dataset()` function, which takes as input a `MultiDataSet` object, the name of the dataset to transform, and the name of the transformation to be applied, which should be one of `vsn`, `vst-deseq2`, `best-normalize-auto` or `best-normalize-manual`. For the latter, the name of the normalisation method from the `BestNormalize` package to use must also be supplied through the `method` argument.
+ The transformation of one dataset is done through the `transform_dataset()` function, which takes as input a `MultiDataSet` object, the name of the dataset to transform, and the name of the transformation to be applied, which should be one of `vsn`, `vst-deseq2`, `logx`, `best-normalize-auto` or `best-normalize-manual`. For the latter, the name of the normalisation method from the `BestNormalize` package to use must also be supplied through the `method` argument.

The `return_multidataset` argument determines whether a `MultiDataSet` object with the corresponding dataset transformed should be returned. If it is set to `FALSE`, the function will instead return a list with the transformed dataset as a matrix as well as other useful information returned by the transformation function applied. It is possible to only return the transformed matrix, by setting `return_matrix_only` to `TRUE`. This can be useful to quickly assess the effects of the transformation outside of the analysis pipeline.
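
For instance, a minimal sketch of this quick check, using the `mo_set_de` object and the `vst-deseq2` transformation used later in this chapter:

```{r transform-dataset-matrix-only-sketch}
#| eval: false

# return only the transformed matrix, without building a new MultiDataSet object
rnaseq_vst_mat <- transform_dataset(
  mo_set_de,
  "rnaseq",
  "vst-deseq2",
  return_matrix_only = TRUE
)
rnaseq_vst_mat[1:5, 1:5]
```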

Expand Down Expand Up @@ -101,7 +101,7 @@ rnaseq_vst$info_transformation
rnaseq_vst$transformation
```

- If we instead want to apply a log2 transformation to the dataset, we will use the `logx` transformation option instead. In that case, we have to specify the log base to use (here 2) through the `log_base` argument, as well as the function that should be applied to the matrix prior to the log-transformation through the `pre_log_function` argument. This function is here mostly to take care of zero values that could cause issue during the log transformation. By default, the `zero_to_half_min()` function is used, which replaces zero values in the dataset with half of the minimum non-null value found in the dataset. However, it is possible to pass any custom function instead, for example `function(x) x` would not make any change to the dataset before the log-transformation:
+ If we instead want to apply a log2 transformation to the dataset, we will use the `logx` transformation option. In that case, we have to specify the log base to use (here 2) through the `log_base` argument, as well as the function that should be applied to the matrix prior to the log-transformation through the `pre_log_function` argument. This function is here mostly to take care of zero values that could cause an issue during the log transformation. By default, the `zero_to_half_min()` function is used, which replaces zero values in the dataset with half of the minimum non-null value found in the dataset. However, it is possible to pass any custom function instead, for example `function(x) x` would not make any change to the dataset before the log-transformation:

```{r apply-log2-example}
rnaseq_log2 <- transform_dataset(
@@ -122,7 +122,7 @@ rnaseq_log2
get_dataset_matrix(rnaseq_log2, "rnaseq")[1:5, 1:5]
```

- The two other transformation options are `best-normalize-auto` and `best-normalize-manual`. Both rely on the `bestNormalize` package to apply a transformation **independently** to each feature in the dataset. With `best-normalize-auto`, the `bestNormalize::bestNormalize()` function automatically selects for each feature the transformation that results in the most normal distribution. With the `best-normalize-manual` option, the same transformation is applied to each feature. The transformation to use must be specified through the `method` argument, and should be the name of one of the transformation implemented in `bestNormalize`: `"arcsinh_x"`, `"boxcox"`, `"center_scale"`, `"exp_x"`, `"log_x"`, `"orderNorm"`, `"sqrt_x"`, `"yeojohnson"`. Note that with this option, even if the same transformation is applied to each feature, the parameters for the transformation, if not specified, will be independently estimated for each feature. For example, with the `log_x` method, the function automatically adds a small offset (`a`) to the feature's measurements if there are any zero values. The value used for this offset might be different for each feature. We can instead set a value for `a` to be used for all features, as follows:
+ The two other transformation options are `best-normalize-auto` and `best-normalize-manual`. Both rely on the `bestNormalize` package to apply a transformation **independently** to each feature in the dataset. With `best-normalize-auto`, the `bestNormalize::bestNormalize()` function automatically selects for each feature the transformation that results in a distribution of values as close to Gaussian as possible. Note that as the function is applied independently to each feature, a different transformation may be chosen for different features. With the `best-normalize-manual` option, the same transformation is applied to each feature. The transformation to use must be specified through the `method` argument, and should be the name of one of the transformations implemented in `bestNormalize`: `"arcsinh_x"`, `"boxcox"`, `"center_scale"`, `"exp_x"`, `"log_x"`, `"orderNorm"`, `"sqrt_x"` or `"yeojohnson"`. Note that with this option, even if the same transformation is applied to each feature, the parameters for the transformation, if not specified, will be independently estimated for each feature. For example, with the `log_x` method, the function automatically adds a small offset (`a`) to the feature's measurements if there are any zero values. The value used for this offset might be different for each feature. We can instead set a value for `a` to be used for all features, as follows:

```{r bestnormalize-manual-example}
#| eval: false
Expand All @@ -138,7 +138,7 @@ transform_dataset(
```

::: {.callout-note}
- Before using the `best-normalize-manual` option, it is strongly recommended to have a look at the documentation of the corresponding function from the `bestNormalize` package, as they might have behaviours that are not expected. For example, by default the `bestNormalize::log_x()` function uses a log based 10, and centers and scales the transformed values, which not be what we want.
+ Before using the `best-normalize-manual` option, it is strongly recommended to have a look at the documentation of the corresponding function from the `bestNormalize` package, as it might have behaviours that are not expected. For example, by default the `bestNormalize::log_x()` function uses a log base 10, and centers and scales the transformed values, which may not be what we want.
:::
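
For instance, a sketch of how these defaults could be overridden, assuming (as in an earlier version of this pipeline) that additional arguments such as `b` and `standardize` are passed on to `bestNormalize::log_x()`:

```{r bestnormalize-log-x-defaults-sketch}
#| eval: false

transform_dataset(
  mo_set_de,
  "metabolome",
  "best-normalize-manual",
  method = "log_x",
  b = 2,               # log base 2 rather than the default base 10
  standardize = FALSE  # do not centre and scale the transformed values
)
```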

### Transformation target factory
@@ -389,11 +389,9 @@ list(
transformation_datasets_factory(
mo_set_de,
c("rnaseq" = "vst-deseq2",
- "metabolome" = "best-normalize-manual"),
- methods = c("metabolome" = "log_x"),
- a = 0.01,
- b = 2,
- standardize = FALSE,
+ "metabolome" = "logx"),
+ log_bases = 2,
+ pre_log_functions = zero_to_half_min,
transformed_data_name = "mo_set_transformed"
),
2 changes: 1 addition & 1 deletion so2pls.qmd
@@ -293,7 +293,7 @@ These statistics can be visualised with the `so2pls_plot_summary()` function:
so2pls_plot_summary(so2pls_final_run)
```

- The plot on the left shows the percentage of variation explained by the joint, orthogonal/specific and residual parts for each dataset. In our example, the joint parts explain around 27% and 11% of the variation in the transcriptomics and metabolomics datasets, respectively, while the specific components explain around 35% and 18% of the variation in each dataset. Taken together, the joint and specific parts explain 61.5% of the variation present in the transcriptomics dataset, but only 29% in the metabolomics dataset. The plot on the right depicts the percentage of joint variation of a dataset explained by the joint variation of the other dataset, i.e. it gives an indication of how correlated the joint parts of the two datasets are. In our case, the joint part of each dataset explains between 78% of the joint variation in the other dataset, which is indicative of a good covariance between the datasets.
+ The plot on the left shows the percentage of variation explained by the joint, orthogonal/specific and residual parts for each dataset. In our example, the joint parts explain around 33% and 15% of the variation in the transcriptomics and metabolomics datasets, respectively, while the specific components explain around 12% and 33% of the variation in each dataset. Taken together, the joint and specific parts explain 45% of the variation present in the transcriptomics dataset and 48% in the metabolomics dataset. The plot on the right depicts the percentage of joint variation of a dataset explained by the joint variation of the other dataset, i.e. it gives an indication of how correlated the joint parts of the two datasets are. In our case, the joint part of each dataset explains around 80% of the joint variation in the other dataset, which is indicative of a good covariance between the datasets.

To get more information, we can have a look at the screeplot generated by the `so2pls_screeplot()` function, which shows the percentage of variance explained by each individual joint or specific component for each dataset:

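A minimal sketch of that call, using the final sO2PLS run object from above:

```{r so2pls-screeplot-sketch}
#| eval: false

so2pls_screeplot(so2pls_final_run)
```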