diff --git a/_targets.R b/_targets.R index b51ec4e..b8557dc 100644 --- a/_targets.R +++ b/_targets.R @@ -523,7 +523,7 @@ list( ), training_options = list(seed = 43) ), - only_common_samples = TRUE + only_common_samples = FALSE ) ), diff --git a/_targets/meta/meta b/_targets/meta/meta index 27e5dc4..ac6980a 100644 --- a/_targets/meta/meta +++ b/_targets/meta/meta @@ -52,8 +52,8 @@ mo_set_complete|stem|d79178dd6687be97|d76a85ef218e0cef|952f06b34b51e191|17398688 mo_set_de|stem|2b7e2ace37c0bed5|50f81c565a9dc6aa|1d4929f8d72b5561|1698094396||t19739.0034448046s|fa7930e7b638c680|37442539|rds|local|vector|||0.167|rnaseq dataset 1594 feature IDs missing from df dataframe.| mo_set_transformed|stem|28334a82bdfebaa7|1a49152e06c5222d|6f732476e2857317|385723480||t19739.9580970037s|5c19986154657d81|57773059|rds|local|vector|||0.252|| mo_set_with_names|stem|e64ebcd89ddb70bd|b88cddcfa5d2a267|6eebd2f68cb73320|-1640374052||t19738.0568285842s|593ace30823e21f2|36753195|rds|local|vector|||0.114|| -mofa_input|stem|55da511de7664cd8|a8548080dfc5818e|b1ff01b4e620143d|1284502458||t19746.154401993s|e22ca446e6bece21|837460|rds|local|vector|||0.324|Dataset snps is to be modelled with a poisson likelihood, but is not integer. Transforming to integer.| -mofa_trained|stem|b38f267e60c1989a|ec971dccb1086e60|91cc7bf157eacf61|1854191013||t19746.1563374787s|29bcdeb5af24197c|966299|rds|local|vector|||166.175|No output filename provided. Using tmpRtmpHWxlYdmofa_20240124164221.hdf5 to store the trained model.. Factors 1 are strongly correlated with the total number of expressed features for at least one of your omics. Such factors appear when there are differences in the total levels between your samples, sometimes because of poor normalisation in the preprocessing steps.| +mofa_input|stem|4d8e1c14434288d1|3d5db0316055019b|b1ff01b4e620143d|1284502458||t19746.8899165855s|4110beb5e0ca03cd|877459|rds|local|vector|||0.201|Dataset snps is to be modelled with a poisson likelihood, but is not integer. 
Transforming to integer.| +mofa_trained|stem|f523901c69cc8163|ec971dccb1086e60|dec5748f700e63fc|1854191013||t19746.8912693121s|b341cd3de1b49e2f|1004645|rds|local|vector|||116.597|No output filename provided. Using tmpRtmpaM7XWKmofa_20240125102128.hdf5 to store the trained model.. Factors 1 are strongly correlated with the total number of expressed features for at least one of your omics. Such factors appear when there are differences in the total levels between your samples, sometimes because of poor normalisation in the preprocessing steps.| omicspls_input|stem|488cc165105ef587|5103f8e1eb31dfb4|3fe933389d7e1821|1468063698||t19746.0144579806s|8dac9e868840a7ff|957717|rds|local|vector|||0.009|| pca_mats_list|pattern|c831051151e4add2|72b302a65c0920ed||1078442447||||12164806|rds|local|list||pca_mats_list_302d7473*pca_mats_list_84d36937*pca_mats_list_64d37e6c|0.09|| pca_mats_list_302d7473|branch|e839c68784be56c5|72b302a65c0920ed|0f98021e133aac1d|-46563577||t19739.9581108254s|1b793e6da03f6fbf|1616638|rds|local|list|pca_mats_list||0.072|| diff --git a/docs/data_import.html b/docs/data_import.html index 33fe26f..6510d97 100644 --- a/docs/data_import.html +++ b/docs/data_import.html @@ -228,6 +228,11 @@

3&nbs + + @@ -241,7 +246,7 @@

3&nbs @@ -256,7 +261,7 @@

3&nbs @@ -379,7 +384,7 @@

tar_read(dataset_file_geno)
-#> [1] "/powerplant/workspace/hrpoab/RENV_CACHE/v5/R-4.2/x86_64-pc-linux-gnu/moiraine/0.0.0.9000/ea944811306c200792f0808f5fa53cec/moiraine/extdata/genomics_dataset.csv"
+#> [1] "/powerplant/workspace/hrpoab/RENV_CACHE/v5/R-4.2/x86_64-pc-linux-gnu/moiraine/0.0.0.9000/c5fcba1231fe0ad084ba7ffc17cfc3e0/moiraine/extdata/genomics_dataset.csv"

The next step is to import this dataset into R. We use the import_dataset_csv() function for that, rather than readr::read_csv() or similar functions, as it ensures that the data is imported in the correct format for further use with the moiraine package. When importing a dataset, we need to specify the path to the file, as well as the name of the column in the csv file that contains the row names (through the col_id argument). In addition, we need to specify whether the features are represented as rows or as columns in the csv file. This is done through the argument features_as_rows. For example, we can load the genomics dataset through:

diff --git a/docs/diablo.html b/docs/diablo.html index 2831522..de76a39 100644 --- a/docs/diablo.html +++ b/docs/diablo.html @@ -4,7 +4,7 @@ -The moiraine R package user manual - 10  Integration with DIABLO +The moiraine R package user manual - 11  Integration with DIABLO + + + + + + + + + + + + + + + + + + + + + + +
+
+ + + +
+

10  Integration with MOFA

+
+ + + +
+ + + + +
+ + +
+
library(targets)
+library(moiraine)
+
+## For integration with MOFA
+library(MOFA2)
+
+

Now that the omics datasets have been appropriately pre-processed and pre-filtered, we are ready to perform the actual data integration step. In this chapter, we will show how to perform multi-omics data integration with the MOFA method from the MOFA2 package.

+

As a reminder, here is what the _targets.R script should look like so far:

+
_targets.R script +
+
+ +
+
+

+10.1 What is MOFA?

+

MOFA (for Multi-Omics Factor Analysis) is a method for the unsupervised integration of two or more omics datasets. It aims at uncovering the main axes of variation shared by all or a subset of the datasets, through the construction of a small number of latent factors. It can be seen as a generalisation of PCA (Principal Component Analysis) to multiple datasets. The latent factors are constructed as sparse linear combinations of the omics features, highlighting the features that contribute to each axis of variation.

+

MOFA uses a Bayesian framework to decompose each dataset (referred to as a “view” in the package documentation) into the product of a latent factor matrix (representing the main axes of variation) and a matrix of feature weights (indicating the extent to which the features contribute to the different latent factors). MOFA applies two levels of sparsity. The first level of sparsity is used to detect whether a given latent factor is active (i.e. explains variation) in each dataset, which reveals which sources of variation are shared across the datasets. The second level of sparsity is applied to the features, making it possible to assess which features contribute to each source of variation.

+

MOFA requires as input matrices of omics measurements that must have at least some samples in common, but it accepts samples being absent from some of the datasets. It is recommended to have at least 15 samples in common across the datasets in order to obtain sensible results. The number of features in each dataset should not be too small (i.e. at least 15). It is critical that each dataset is properly normalised, with batch effects removed. In addition, it is strongly recommended to pre-filter the datasets to retain highly variable features, keeping a similar number of features across the datasets. Missing values are not a problem for MOFA. MOFA can also accept samples and features metadata as data-frames. Lastly, there are a number of parameters that can be customised. The most important ones will be mentioned in this chapter; for a complete list see the MOFA tutorial.
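As a quick sanity check on the samples-in-common requirement, the number of samples shared across datasets can be computed from a list of measurement matrices. This is a base R sketch; the toy matrices and the samples-as-columns orientation are assumptions for illustration:

```r
## Hypothetical list of omics matrices, with samples as columns
mats <- list(
  snps   = matrix(0, nrow = 1000, ncol = 140,
                  dimnames = list(NULL, paste0("s", 1:140))),
  rnaseq = matrix(0, nrow = 994, ncol = 144,
                  dimnames = list(NULL, paste0("s", 1:144)))
)

## Samples present in every dataset
common <- Reduce(intersect, lapply(mats, colnames))
length(common)  ## 140 in this toy example; at least 15 is recommended for MOFA
```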

+
+
+
+ +
+
+Note +
+
+
+

A characteristic of the MOFA2 package is that the model training part of the analysis is done in Python with the mofapy2 module. MOFA2 interacts with Python from R through the reticulate package. This is important to know, as it might affect the installation and execution of the package depending on the environment in which it is used.

+
+
+

+10.1.1 A note on groups in MOFA

+

MOFA offers the option to assign samples to groups (referred to as the multi-group framework within MOFA). It is important to understand what this does before using it. In particular, if the aim of the analysis is to find features that separate samples based on a grouping, you should not use this multi-group framework. When sample groups are set, MOFA ignores the sources of variation that discriminate the groups. Instead, the multi-group framework seeks sources of variation that are shared between the different groups, and sources of variation that are exclusive to a specific group. For example, in this analysis, the multi-group framework was used to split samples (cells) based on their developmental stage. The goal of the analysis was to find coordinated variation between the different datasets, and to detect at which developmental stage (i.e. in which group) it occurred, rather than to differentiate the developmental stages.

+

+10.2 Creating the MOFA input

+

The first step is to transform the MultiDataSet object into a suitable format for the MOFA2 package. This is done through the get_input_mofa() function. In addition to the MultiDataSet object, this function accepts as arguments:

+
    +
  • datasets: a vector of dataset names that dictates which datasets from the multi-omics set should be included in the analysis (by default, all datasets are included);

  • +
  • groups: the name of the column in the samples metadata to be used as group indicator if using the multi-group framework (if not set, the multi-group framework will not be used);

  • +
  • options_list: a list of options to customise the MOFA model – more information below;

  • +
  • only_common_samples: whether only the samples present in all datasets should be retained. The default value is FALSE; this argument should be set to TRUE when some datasets contain many more samples than others. When only a few samples are missing from some of the datasets, it is fine to leave it as FALSE.

  • +
+
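To make these arguments concrete, a minimal call might look like the following. This is a sketch: the mo_set object and the dataset names are placeholders, and the full call used in this analysis appears later in the chapter.

```r
library(moiraine)

## Restrict the analysis to two datasets and to their shared samples
mofa_input <- get_input_mofa(
  mo_set,                                ## a MultiDataSet object (placeholder)
  datasets = c("rnaseq", "metabolome"),  ## which datasets to include
  only_common_samples = TRUE             ## keep only samples present in both
)
```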

We expand on the options that can be set with the options_list argument below.

+

+10.2.1 MOFA parameters

+

MOFA has three types of parameters (referred to as options) that the user can control.

+

+10.2.1.1 Data options

+

These are options for data handling when creating the MOFA model. The two most important options are:

+
    +
  • scale_views: logical, whether the datasets (views) should be scaled to unit variance. The default is FALSE, but it is recommended to set it to TRUE if the scale difference between the datasets is high.

  • +
  • scale_groups: logical, whether to scale the groups to unit variance. Irrelevant unless using the multi-group framework. The default is FALSE, but it is recommended to set it to TRUE if the scale difference between the groups is high.

  • +
+

See the documentation for get_default_data_options() for a full list.

+

+10.2.1.2 Model options

+

These are the options defining different aspects of the MOFA model. The most important ones are:

+
    +
  • num_factors: The maximum number of factors to construct (note that this is a maximum, i.e. MOFA can return a lower number of factors if there is not a lot of variation in the datasets). The default is set to 15, which is a good start. This can be adjusted after running MOFA with the default value.

  • +
  • likelihoods: Named character vector, the type of likelihood to use for each dataset. MOFA offers three options: Gaussian likelihood ('gaussian') for continuous data, Bernoulli likelihood ('bernoulli') for binary data, and Poisson likelihood ('poisson') for count data. It is highly recommended to transform the datasets in order to use a Gaussian likelihood: for example, applying a variance-stabilising transformation on RNAseq data rather than using the raw read counts with a Poisson likelihood. By default, a Gaussian likelihood is chosen for all datasets.

  • +
+

See the documentation for get_default_model_options() for a full list.

+

+10.2.1.3 Training options

+

These are the options that control how the model is trained. The most important one is:

+
    +
  • seed: the random seed. Setting it is standard practice, to make sure that the results are reproducible.
  • +
+

See the documentation for get_default_training_options() for a full list.
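All three sets of defaults can be inspected from an untrained MOFA object. Assuming mofa_input has been created with get_input_mofa(), these MOFA2 helpers each return a named list of option values:

```r
library(MOFA2)

## Each helper takes the untrained MOFA object and returns a named list
data_opts  <- get_default_data_options(mofa_input)
model_opts <- get_default_model_options(mofa_input)
train_opts <- get_default_training_options(mofa_input)

data_opts$scale_views   ## FALSE by default
model_opts$num_factors  ## 15 by default
```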

+

+10.2.1.4 Passing the parameters to get_input_mofa +

+

When creating a MOFA input object with get_input_mofa(), all options are set to their default values. It is possible to customise the values of some options through the options_list parameter. This should be a named list with up to three elements (one per type of options that MOFA accepts): data_options, model_options, and training_options. Each element is itself a named list, in which each element corresponds to a specific option. Not all three elements have to be present; for example, if we only want to specify a value for the model option num_factors, we can set options_list to:

+
+
list(
+  model_options = list(num_factors = 10)
+)
+
+

If we also want to set the likelihoods and the random seed, then options_list becomes:

+
+
list(
+  model_options = list(num_factors = 10),
+  training_options = list(seed = 43)
+)
+
+

For our example, we’ll make sure that each dataset is scaled to unit variance, and that a Poisson likelihood is used for the genomics data, for which we have variant dosages. Since only 9 samples are not present in all omics datasets, we can keep them in the analysis. We’ll set the random seed to ensure that the results are reproducible:

+
+
tar_target(
+  mofa_input,
+  get_input_mofa(
+    mo_presel_supervised,
+    options_list = list(
+      data_options = list(scale_views = TRUE),
+      model_options = list(likelihoods = c(
+        "snps" = "poisson",
+        "rnaseq" = "gaussian",
+        "metabolome" = "gaussian")
+      ),
+      training_options = list(seed = 43)
+    ),
+    only_common_samples = FALSE
+  )
+)
+
+

This will produce a warning:

+
+
#> Warning: Dataset snps is to be modelled with a poisson likelihood, but is not
+#> integer. Transforming to integer.
+
+

The warning informs us that we have chosen a Poisson likelihood for the genomics dataset, but the dataset does not contain integer data. This is because the missing values imputed with NIPALS-PCA (see Section 6.2.3) are continuous rather than discrete. This is taken care of by automatically rounding the dataset; in other settings, however, this warning can be an indication that the Poisson likelihood is not appropriate for the dataset.
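Before opting for a Poisson likelihood, it is easy to check whether a matrix truly contains non-negative integer counts. This is a base R sketch; is_count_data and the toy matrix are illustrative, not part of the package:

```r
## TRUE only if all (non-missing) values are non-negative integers
is_count_data <- function(mat) {
  all(mat >= 0, na.rm = TRUE) &&
    all(mat == round(mat), na.rm = TRUE)
}

mat <- matrix(c(0, 1, 2, 1.5), nrow = 2)  ## toy matrix with one non-integer value
is_count_data(mat)  ## FALSE: a Poisson likelihood would trigger the warning above
```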

+
+
+
+ +
+
+Note +
+
+
+

When constructing the MOFA object, if a column named group exists in a samples metadata table of the MultiDataSet object, that column is renamed group_metadata in the resulting MOFA input object. This is because MOFA needs a group column in its own version of the samples metadata (for the multi-group framework), so there cannot be another column named group. Note that this has no effect on the MultiDataSet object, but it is important to remember if you want to use some of the plotting functionalities that MOFA2 offers.

+
+
+

The output of the get_input_mofa() function is a MOFA object, more specifically an untrained model:

+
+
tar_load(mofa_input)
+mofa_input
+#> Untrained MOFA model with the following characteristics: 
+#>  Number of views: 3 
+#>  Views names: snps rnaseq metabolome 
+#>  Number of features (per view): 1000 994 55 
+#>  Number of groups: 1 
+#>  Groups names: group1 
+#>  Number of samples (per group): 144 
+#> 
+
+

We did not use the multi-group framework, so there is only one samples group.

+

+10.3 Visualising the MOFA input

+

It is possible to represent the samples that are present or missing across the different datasets with the plot_data_overview() function implemented in MOFA2:

+
+
plot_data_overview(mofa_input)
+
+

+
+
+

The plot generated shows the dimensions of the datasets that will be analysed.

+

For reporting purposes, it can also be useful to summarise the different options used to create the MOFA model. This can be done with the options_list_as_tibble() function, which turns a list of parameters into a tibble presenting the name and value of each parameter. For example, we can list the data options used:

+
+
options_list_as_tibble(mofa_input@data_options)
+#> # A tibble: 6 × 2
+#>   Parameter     Value                         
+#>   <chr>         <chr>                         
+#> 1 scale_views   TRUE                          
+#> 2 scale_groups  FALSE                         
+#> 3 center_groups TRUE                          
+#> 4 use_float32   TRUE                          
+#> 5 views         'snps', 'rnaseq', 'metabolome'
+#> 6 groups        group1
+
+

+10.4 Training the model

+

Once we have prepared our MOFA input, we can train the model with the run_mofa() function (from the MOFA2 package). This is done through Python, so there might be some issues when trying to run the function for the first time after installing MOFA2 if the configuration is not correct. If that is the case, see the MOFA2 tutorial (e.g. their troubleshooting section) for tips on how to solve this.

+

By default, the function saves the resulting trained model in a temporary file, in the .hdf5 format. It can be preferable to save the result into the output folder of your project; here, as we are using targets to save the results of each step of the analysis, we do not need to do so. In addition, it is strongly recommended to set the save_data parameter to TRUE, as otherwise some of the visualisation options will not be available without the data. Lastly, the use_basilisk parameter might have to be switched to FALSE if there are any issues with calling the Python module:

+
+
tar_target(
+  mofa_trained,
+  run_mofa(
+    mofa_input,
+    save_data = TRUE,
+    use_basilisk = TRUE
+  )
+)
+
+

In our case, the following warnings are returned:

+
+
#> Warning: No output filename provided. Using tmpRtmpaM7XWKmofa_20240125102128.hdf5 to store the trained model.
+#> 
+#> Factors 1 are strongly correlated with the total number of expressed features for at least one of your omics. Such factors appear when there are differences in the total levels between your samples, sometimes because of poor normalisation in the preprocessing steps.
+
+

The first warning has to do with saving the model into a temporary file. We will explain the second warning in more detail when analysing the resulting model.
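To avoid this first warning, an explicit output path can be passed to run_mofa(). This is a sketch; the output/ folder name is an assumption about your project layout:

```r
tar_target(
  mofa_trained,
  run_mofa(
    mofa_input,
    outfile = "output/mofa_model.hdf5",  ## persistent location for the trained model
    save_data = TRUE,
    use_basilisk = TRUE
  )
)
```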

+

The output of the function is a trained MOFA model, with 15 latent factors:

+
+
tar_load(mofa_trained)
+mofa_trained
+#> Trained MOFA with the following characteristics: 
+#>  Number of views: 3 
+#>  Views names: snps rnaseq metabolome 
+#>  Number of features (per view): 1000 994 55 
+#>  Number of groups: 1 
+#>  Groups names: group1 
+#>  Number of samples (per group): 144 
+#>  Number of factors: 15
+
+

+10.5 Results interpretation

+

In Chapter 12, we show the different functionalities implemented in the moiraine package that facilitate the interpretation of the results from an integration tool. In this section, we show some of the MOFA-specific plots that can be generated to help interpret the results of a MOFA run.

+

+10.5.1 Variance explained

+

When training the model, MOFA2 constructs latent factors, which are sparse combinations of the features, and which represent the main axes of variation shared across all or a subset of the datasets. One of the first steps when analysing the trained model is to assess the amount of variance in the datasets explained by each factor, similarly to what we would do when running a PCA.

+

The function plot_variance_explained() from MOFA2 displays the percentage of variance in each dataset explained by each factor as a heatmap. The x and y arguments (and also split_by argument when using the multi-group framework) control whether the factors or the datasets should be represented as the rows or columns in the heatmap:

+
+
plot_variance_explained(
+  mofa_trained,
+  x = "view",  ## datasets on the x-axis
+  y = "factor" ## factors on the y-axis
+) 
+
+

+
+
+

Here, we see that factor 1 explains a lot (almost 50%) of the variation in the transcriptomics dataset, a little (around 10%) in the metabolomics dataset, and almost none in the genomics dataset. Factor 2 explains around 20% of the variation in the genomics dataset, and none in the other datasets. Factors 3 and 4 seem specific to the transcriptomics dataset, and factors 5 to 15 explain very little variation; seeing this, we could re-train the MOFA model by setting the number of factors to 4 or 5.
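Re-training with fewer factors only requires changing num_factors in the options passed to get_input_mofa(). This is a sketch based on the call shown earlier in the chapter; the choice of 5 factors comes from reading the variance-explained plot:

```r
get_input_mofa(
  mo_presel_supervised,
  options_list = list(
    data_options = list(scale_views = TRUE),
    model_options = list(
      num_factors = 5,  ## reduced from the default of 15
      likelihoods = c(
        "snps" = "poisson",
        "rnaseq" = "gaussian",
        "metabolome" = "gaussian"
      )
    ),
    training_options = list(seed = 43)
  ),
  only_common_samples = FALSE
)
```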

+

In addition, the plot_variance_explained() function can create a second plot displaying the total percentage of variance explained by all the factors for each dataset, if the argument plot_total is set to TRUE:

+
+
plot_variance_explained(
+  mofa_trained,
+  x = "view",
+  y = "factor",
+  plot_total = TRUE
+)[[2]] ## show only the 2nd plot
+
+

+
+
+

The factors explain almost 85% of the variation in the transcriptomics dataset, but less than 30% in the genomics and metabolomics datasets.

+

+10.5.2 Correlation between factors

+

Before moving to the interpretation, it is always good practice to check the correlation between the different computed factors. All factors should be mostly uncorrelated; if a large correlation is observed between some factors, it could be an indication of poor model fit, either because too many factors were computed (in this case, try reducing the number of factors to compute or increasing the number of iterations for the model training via the convergence_mode training option) or maybe because of an improper normalisation of the datasets. The function plot_factor_cor() from MOFA2 displays the correlation between all factors via the corrplot::corrplot() function:

+
+
plot_factor_cor(
+  mofa_trained,
+  type = "upper",
+  diag = FALSE,
+  tl.cex = 0.8
+)
+
+

+
+
+

There is a moderate positive correlation between factors 2 and 5, but nothing to be concerned about.

+

+10.5.3 Correlation between factors and samples covariates

+

To assess which sources of variation each factor represents, we can check whether the factors are correlated with known covariates recorded in the samples metadata. The mofa_plot_cor_covariates() function displays the correlation between each factor and the samples covariates. Note that categorical covariates are transformed into numerical group indicators in order to compute the correlations. This function is similar to the MOFA2 function correlate_factors_with_covariates(), except that it returns a ggplot (rather than creating a base R plot) and offers a few convenient features through its optional arguments. By default, the function uses all covariates recorded in the samples metadata; however, we can focus on a subset of them by passing the column names to the covariates argument.

+
+
mofa_plot_cor_covariates(
+  mofa_trained
+)
+
+

+
+
+

From this plot, we can spot some interesting trends. For example, factor 1, which explains variation in the transcriptomics and metabolomics datasets, is strongly correlated with disease status. This makes sense, since the transcriptomics dataset has been filtered to retain the genes most associated with differences between healthy and infected animals, so we expect this to be the strongest source of variation in the dataset. Factor 2, which explains variation only in the genomics dataset, is strongly correlated with the genomic composition of the animals, which also makes sense, as more related individuals share similar genomic profiles. It is equally unsurprising that this variation is less present in the other two datasets. Factor 5, which explains variation only in the transcriptomics dataset, seems to be modestly correlated with the RNAseq batches.

+

In general, it is a good idea to interpret this correlation plot together with the plot representing the percentage of variation explained in the datasets by each factor. However, it is necessary to investigate more before making claims about what the factors are representing.

+

+10.5.4 Visualising the top features

+

The MOFA2 package offers a number of very useful visualisations to further interpret the results. For a given factor and dataset of interest, we can visualise how the top contributing features vary with the factor values (i.e. the sample scores). For example, let’s have a look at the top contributing features from the transcriptomics dataset for factor 1:

+
+
plot_data_scatter(
+  mofa_trained,
+  factor = 1,
+  view = "rnaseq",
+  features = 6,
+  color_by = "status",
+  shape_by = "feedlot"
+)
+
+

+
+
+

For all six genes, their expression is higher in infected animals compared with healthy ones.

+

We can also visualise the measurements of the top contributing features across the samples as a heatmap, for a given dataset and factor. The annotation_samples argument allows us to add annotations on top of the heatmap to represent certain characteristics of the samples; for example, here we’ll add information about the disease status of the samples:

+
+
MOFA2::plot_data_heatmap(
+  mofa_trained,
+  factor = 1,
+  view = "rnaseq",
+  features = 20,
+  annotation_samples = "status",
+  fontsize_col = 5
+)
+
+

+
+
+

Note that moiraine implements similar functions to represent the top contributing features, as we will see in Chapter 12.

+

+10.6 Recap – targets list

+

For convenience, here is the list of targets that we created in this section:

+
+Targets list for MOFA analysis +
+
list(
+  ## Creating MOFA input
+  tar_target(
+    mofa_input,
+    get_input_mofa(
+      mo_presel_supervised,
+      options_list = list(
+        data_options = list(scale_views = TRUE),
+        model_options = list(likelihoods = c(
+          "snps" = "poisson",
+          "rnaseq" = "gaussian",
+          "metabolome" = "gaussian")
+        ),
+        training_options = list(seed = 43)
+      ),
+      only_common_samples = FALSE
+    )
+  ),
+  
+  ## Overview plot of the samples in each dataset
+  tar_target(
+    mofa_input_plot,
+    plot_data_overview(mofa_input)
+  ),
+  
+  ## Training MOFA model
+  tar_target(
+    mofa_trained,
+    run_mofa(
+      mofa_input,
+      save_data = TRUE,
+      use_basilisk = TRUE
+    )
+  ),
+  
+  ## Formatting MOFA output
+  tar_target(
+    mofa_output,
+    get_output(mofa_trained)
+  ),
+  
+  ## Plots of variance explained
+  tar_target(
+    mofa_var_explained_plot,
+    plot_variance_explained(
+      mofa_trained,
+      x = "view",  ## datasets on the x-axis
+      y = "factor" ## factors on the y-axis
+    )
+  ),
+  tar_target(
+    mofa_total_var_explained_plot,
+    plot_variance_explained(
+      mofa_trained,
+      x = "view",
+      y = "factor",
+      plot_total = TRUE
+    )[[2]]
+  ),
+  
+  ## Plot of factors correlation with covariates
+  tar_target(
+    mofa_factors_covariates_cor_plot,
+    mofa_plot_cor_covariates(mofa_trained)
+  )
+)
+
+
+ +
+
+ + + + \ No newline at end of file diff --git a/docs/mofa_files/figure-html/mofa-plot-cor-covariates-1.png b/docs/mofa_files/figure-html/mofa-plot-cor-covariates-1.png new file mode 100644 index 0000000..722dfa2 Binary files /dev/null and b/docs/mofa_files/figure-html/mofa-plot-cor-covariates-1.png differ diff --git a/docs/mofa_files/figure-html/plot-data-heatmap-1.png b/docs/mofa_files/figure-html/plot-data-heatmap-1.png new file mode 100644 index 0000000..93e7840 Binary files /dev/null and b/docs/mofa_files/figure-html/plot-data-heatmap-1.png differ diff --git a/docs/mofa_files/figure-html/plot-data-overview-1.png b/docs/mofa_files/figure-html/plot-data-overview-1.png new file mode 100644 index 0000000..46f7abd Binary files /dev/null and b/docs/mofa_files/figure-html/plot-data-overview-1.png differ diff --git a/docs/mofa_files/figure-html/plot-data-scatter-1.png b/docs/mofa_files/figure-html/plot-data-scatter-1.png new file mode 100644 index 0000000..467a215 Binary files /dev/null and b/docs/mofa_files/figure-html/plot-data-scatter-1.png differ diff --git a/docs/mofa_files/figure-html/plot-factors-correlation-1.png b/docs/mofa_files/figure-html/plot-factors-correlation-1.png new file mode 100644 index 0000000..ecce413 Binary files /dev/null and b/docs/mofa_files/figure-html/plot-factors-correlation-1.png differ diff --git a/docs/mofa_files/figure-html/plot-total-variance-explained-1.png b/docs/mofa_files/figure-html/plot-total-variance-explained-1.png new file mode 100644 index 0000000..9fb3168 Binary files /dev/null and b/docs/mofa_files/figure-html/plot-total-variance-explained-1.png differ diff --git a/docs/mofa_files/figure-html/plot-variance-explained-1.png b/docs/mofa_files/figure-html/plot-variance-explained-1.png new file mode 100644 index 0000000..c2f272c Binary files /dev/null and b/docs/mofa_files/figure-html/plot-variance-explained-1.png differ diff --git a/docs/prefiltering.html b/docs/prefiltering.html index e5bd5a4..3f85730 100644 --- 
a/docs/prefiltering.html +++ b/docs/prefiltering.html @@ -228,6 +228,11 @@

7&nbs + + @@ -241,7 +246,7 @@

7&nbs @@ -256,7 +261,7 @@

7&nbs diff --git a/docs/preprocessing.html b/docs/preprocessing.html index 35e4c1d..8241b0d 100644 --- a/docs/preprocessing.html +++ b/docs/preprocessing.html @@ -209,6 +209,11 @@

6&nbs + + @@ -222,7 +227,7 @@

6&nbs @@ -237,7 +242,7 @@

6&nbs @@ -265,7 +270,7 @@

6&nbs
  • 6.3 Recap – targets list
  • @@ -685,7 +690,7 @@

    -

    +

    6.2.3 Missing values imputation

    We can check that the complete multi-omics set constructed has no more missing values:

    diff --git a/docs/references.html b/docs/references.html index f5bc878..1a789d2 100644 --- a/docs/references.html +++ b/docs/references.html @@ -178,6 +178,11 @@

    References

    + + @@ -191,7 +196,7 @@

    References

    @@ -206,7 +211,7 @@

    References

    @@ -422,7 +427,7 @@

    References

    diff --git a/docs/so2pls_files/figure-html/plot-cv-adj-1.png b/docs/so2pls_files/figure-html/plot-cv-adj-1.png index 0e91fb9..3ce1f60 100644 Binary files a/docs/so2pls_files/figure-html/plot-cv-adj-1.png and b/docs/so2pls_files/figure-html/plot-cv-adj-1.png differ diff --git a/docs/spls.html b/docs/spls.html index 6ea5a30..68fe0bc 100644 --- a/docs/spls.html +++ b/docs/spls.html @@ -228,6 +228,11 @@

    8&nbs + + @@ -241,7 +246,7 @@

    8&nbs @@ -256,7 +261,7 @@

    8&nbs @@ -433,76 +438,121 @@

    get_input_spls() function. It consists of a data-frame with observations as rows, and an integer ID for the corresponding biological individual in the first column:

    +
    attr(spls_input_multilevel1, "multilevel")
    +
    +
    +
    #>           plant_id
    +#> sample_1         1
    +#> sample_2         1
    +#> sample_3         2
    +#> sample_4         2
    +#> sample_5         3
    +#> sample_6         3
    +#> sample_7         4
    +#> sample_8         4
    +#> sample_9         5
    +#> sample_10        5
    +
    +

    In order to use the multilevel option with two factors, the multilevel argument should instead be populated with a vector of three column names from the samples metadata table. The first value, similarly to a one-factor decomposition, should be the name of the column containing the ID of the biological individuals. The second and third values should be the name of the columns containing the levels of the two factors of interest. Continuing on the previous example, supposing that in a similar study, we take measurements on the plants at three different time-points after the application of each treatment. The samples metadata table might look like this:

    +
| id        | plant_id | treatment | time |
|-----------|----------|-----------|------|
| sample_1  | plant_1  | A         | 1    |
| sample_2  | plant_1  | A         | 2    |
| sample_3  | plant_1  | A         | 3    |
| sample_4  | plant_1  | B         | 1    |
| sample_5  | plant_1  | B         | 2    |
| sample_6  | plant_1  | B         | 3    |
| sample_7  | plant_2  | A         | 1    |
| sample_8  | plant_2  | A         | 2    |
| sample_9  | plant_2  | A         | 3    |
| sample_10 | plant_2  | B         | 1    |
| sample_11 | plant_2  | B         | 2    |
| sample_12 | plant_2  | B         | 3    |


    get_samples_metadata_combined(mo_set_multilevel2)
    #>                  id plant_id treatment time
    -#> sample_1   sample_1  plant_1         A    1
    -#> sample_2   sample_2  plant_1         A    2
    -#> sample_3   sample_3  plant_1         A    3
    -#> sample_4   sample_4  plant_1         B    1
    -#> sample_5   sample_5  plant_1         B    2
    -#> sample_6   sample_6  plant_1         B    3
    -#> sample_7   sample_7  plant_2         A    1
    -#> sample_8   sample_8  plant_2         A    2
    -#> sample_9   sample_9  plant_2         A    3
    -#> sample_10 sample_10  plant_2         B    1
    -#> sample_11 sample_11  plant_2         B    2
    -#> sample_12 sample_12  plant_2         B    3

    with similar columns to the previous example, and an additional time column indicating the time-point at which the measurement was taken. In that case, we should set multilevel = c("plant_id", "treatment", "time") to use the multilevel option, as follows:

    spls_input_multilevel2 <- get_input_spls(
    @@ -687,7 +737,7 @@ 
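The call above is cut short by the diff hunk. As a sketch only, based on the preceding paragraph, the full call might look like the following — only the `multilevel` argument is taken from the text, and passing the `MultiDataSet` object as the first argument is an assumption, not checked against the `moiraine` documentation:

```r
## Sketch (not run): assumes get_input_spls() takes the MultiDataSet
## object first, followed by the `multilevel` argument described above
spls_input_multilevel2 <- get_input_spls(
  mo_set_multilevel2,
  multilevel = c("plant_id", "treatment", "time")
)
```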

    8.6 Results interpretation

    -

    In Chapter 11, we show the different functionalities implemented in the moiraine package that facilitate the interpretation of the results from an integration tool. In this section, we show some of the sPLS-specific plots that can be generated to help interpret the results of an sPLS run.

    +

    In Chapter 12, we show the different functionalities implemented in the moiraine package that facilitate the interpretation of the results from an integration tool. In this section, we show some of the sPLS-specific plots that can be generated to help interpret the results of an sPLS run.

    8.6.1 Samples plots

    We can represent the samples in the subspace spanned by the latent components for each dataset, using the mixOmics::plotIndiv() function.
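As a sketch of how such a call could look (the object name `final_spls_run` stands for the trained sPLS model from the pipeline and is an assumption):

```r
## Sketch (not run): samples plot in the latent-component subspace,
## one panel per dataset; `final_spls_run` is assumed from context
library(mixOmics)

plotIndiv(
  final_spls_run,
  comp   = 1:2,  ## latent components used as x- and y-axes
  legend = TRUE  ## add a legend for the sample groups
)
```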

    diff --git a/mofa.qmd b/mofa.qmd index 734a02b..b9b272e 100644 --- a/mofa.qmd +++ b/mofa.qmd @@ -10,6 +10,10 @@ library(targets) library(moiraine) library(MOFA2) + +## For printing error message +library(dplyr) +library(stringr) ``` ```{r setup-visible} @@ -24,7 +28,7 @@ library(MOFA2) ``` -In this vignette, we assume that the omics datasets and associated metadata have been imported into a `MultiDataSet` object (see vignette on [Importing the data](data-import.html)), appropriately transformed (see vignette on [Datasets preprocessing](datasets-preprocessing.html)) and pre-filtered ([Datasets prefiltering](prefiltering.html)). We will show here how to perform multi-omics data integration with the MOFA method from the [`MOFA2`](https://biofam.github.io/MOFA2/index.html) package. +Now that the omics datasets have been appropriately pre-processed and pre-filtered, we are ready to perform the actual data integration step. In this chapter, we will show how to perform multi-omics data integration with the MOFA method from the [`MOFA2`](https://biofam.github.io/MOFA2/index.html) package. As a reminder, here is what the `_targets.R` script should look like so far: @@ -38,13 +42,15 @@ As a reminder, here is what the `_targets.R` script should look like so far: ## What is MOFA? -MOFA (for Multi-Omics Factor Analysis) is a method for the unsupervised integration of two or more omics datasets. It aims at uncovering the main axes of variations shared by all or a subset of the datasets, through the construction of a small number of latent factors. It can be assimilated to a generalisation of the PCA (Principal Components Analysis) to multiple datasets. The latent factors are constructed as sparse linear combinations of the features (variants, genes, compounds), highlighting the features that contribute to each axis of variation. +MOFA (for Multi-Omics Factor Analysis) is a method for the unsupervised integration of two or more omics datasets. 
It aims at uncovering the main axes of variation shared by all or a subset of the datasets, through the construction of a small number of latent factors. It can be assimilated to a generalisation of the PCA (Principal Components Analysis) to multiple datasets. The latent factors are constructed as sparse linear combinations of the omics features, highlighting the features that contribute to each axis of variation. -MOFA uses a Bayesian framework to decompose each dataset (referred to as view in the package documentation) into the product of a latent factor matrix (which corresponds to the main axes of variations) and a matrix of feature weights (which indicate the extent to which the features contribute to the different latent factors). MOFA applies two levels of sparsity. The first level of sparsity is used to detect whether a given latent factor is active (i.e. explains variation) in each dataset. This allows to assess which sources of variations are shared across the datasets. The second level of sparsity is applied to the features, allowing to assess which features contribute to each source of variation. +MOFA uses a Bayesian framework to decompose each dataset (referred to as "view" in the package documentation) into the product of a latent factor matrix (representing the main axes of variation) and a matrix of feature weights (indicating the extent to which the features contribute to the different latent factors). MOFA applies two levels of sparsity. The first level of sparsity is used to detect whether a given latent factor is active (i.e. explains variation) in each dataset. This allows us to assess which sources of variation are shared across the datasets. The second level of sparsity is applied to the features, allowing us to assess which features contribute to each source of variation. -MOFA requires as input matrices of omics measurements that must have at least some samples in common, and accept some samples being absent from some of the datasets.
It is recommended to have at least 15 samples in common across the datasets in order to obtain sensible results. The number of features in each dataset must not be too small (i.e. at least 15). It is critical that each dataset is properly normalised, with batch effects removed. In addition, it is strongly recommended that the datasets are pre-filtered to retain highly variable features; a similar number of features should be retained across the datasets. Missing values are not a problem with MOFA. MOFA can also accept samples and features metadata as data-frames. Lastly, there is a number of parameters that can be customised. The most important ones will be mentioned in this vignette; for a complete list see the [MOFA tutorial](https://raw.githack.com/bioFAM/MOFA2_tutorials/master/R_tutorials/getting_started_R.html). +MOFA requires as input matrices of omics measurements that must have at least some samples in common, but accepts some samples being absent from some of the datasets. It is recommended to have at least 15 samples in common across the datasets in order to obtain sensible results. The number of features in each dataset must not be too small (i.e. at least 15). It is critical that each dataset is properly normalised, with batch effects removed. In addition, it is strongly recommended that the datasets are pre-filtered to retain highly variable features; a similar number of features should be retained across the datasets. Missing values are not a problem with MOFA. MOFA can also accept samples and features metadata as data-frames. Lastly, there are a number of parameters that can be customised. The most important ones will be mentioned in this chapter; for a complete list see the [MOFA tutorial](https://raw.githack.com/bioFAM/MOFA2_tutorials/master/R_tutorials/getting_started_R.html). +::: {.callout-note} A characteristic of the MOFA2 package is that the model training part of the analysis is done in Python with the `mofapy2` module.
MOFA2 interacts with Python from R with the `reticulate` package. This is important to know as it might affect the installation and run depending on the environment in which it is used. +::: ### A note on groups in MOFA @@ -52,15 +58,15 @@ MOFA offers the option to assign samples to groups (referred to as the multi-gro ## Creating the MOFA input -The first step is to transform the `MultiDataSet` object into a suitable format for the `MOFA2` package. This can be done through the `get_input_mofa()` function. In addition to the `MultiDataSet` object, this function accepts as arguments: +The first step is to transform the `MultiDataSet` object into a suitable format for the `MOFA2` package. This is done through the `get_input_mofa()` function. In addition to the `MultiDataSet` object, this function accepts as arguments: -- `datasets`: a vector of dataset names that dictates which datasets from the multi-omics set will be included in the analysis (by default, all datasets are included); +- `datasets`: a vector of dataset names that dictates which datasets from the multi-omics set should be included in the analysis (by default, all datasets are included); - `groups`: the name of the column in the samples metadata to be used as group indicator if using the multi-group framework (if not set, the multi-group framework will not be used); - `options_list`: a list of options to customise the MOFA model -- more information below; -- `only_common_samples`: whether only the samples present in all datasets should be retained. The default value is `FALSE`; this argument should be set to `TRUE` when some datasets contain much more samples than others; when only a few samples are missing from some of the datasets, it is ok to leave it to `FALSE`. +- `only_common_samples`: whether only the samples present in all datasets should be retained. 
The default value is `FALSE`; this argument should be set to `TRUE` when some datasets contain many more samples than others; when only a few samples are missing from some of the datasets, it is ok to leave it to `FALSE`. We expand on the options that can be set with the `options_list` argument below. @@ -90,24 +96,36 @@ See the documentation for `get_default_model_options()` for a full list. #### Training options -These are the options that control how the model is trained. The most important ones are: +These are the options that control how the model is trained. The most important one is: -- `seed`: setting the random seed. This is standard practice, to make sure that the results are be reproducible. +- `seed`: setting the random seed. This is standard practice, to make sure that the results are reproducible. See the documentation for `get_default_training_options()` for a full list. #### Passing the parameters to `get_input_mofa` -When creating a MOFA input object with `get_input_mofa()`, all options will be set to their default values. The user can customise the values for some of the options through the `options_list` parameter. This should be a named list, with up to three elements (one per type of options that MOFA accepts): `data_options`, `model_options`, and `training_options`. Each element is itself a named list, where each element of the list corresponds to a specific option. All three elements do not have to be present; for example, if we want to only specify a value of the model option `num_factors`, we can set `options_list` to `list(model_options = list(num_factors = 10))`. If we also want to set the likelihoods and the random seed, then `options_list` becomes: +When creating a MOFA input object with `get_input_mofa()`, all options will be set to their default values. It is possible to customise the values for some of the options through the `options_list` parameter. 
This should be a named list, with up to three elements (one per type of options that MOFA accepts): `data_options`, `model_options`, and `training_options`. Each element is itself a named list, where each element of the list corresponds to a specific option. All three elements do not have to be present; for example, if we want to only specify a value for the model option `num_factors`, we can set `options_list` to: + +```{r options-list-example1} +#| eval: false + +list( + model_options = list(num_factors = 10) +) +``` + +If we also want to set the likelihoods and the random seed, then `options_list` becomes: + +```{r options-list-example2} +#| eval: false -```{r options_list_example, eval = FALSE} list( model_options = list(num_factors = 10), training_options = list(seed = 43) ) ``` -For our example, we'll make sure that each dataset is scaled to unit variance, and that a Poisson likelihood is used for the genomics data, for which we have variants dosage. We'll also restrict the dataset to only those samples that are present in all three datasets, since the genomics dataset has many more samples than the other two datasets. We'll set the random seed to ensure that the results are reproducible: +For our example, we'll make sure that each dataset is scaled to unit variance, and that a Poisson likelihood is used for the genomics data, for which we have variants dosage. Since there are only 9 samples that are not present in all omics datasets, we can keep them in the analysis. 
We'll set the random seed to ensure that the results are reproducible: ```{targets get-input-mofa} tar_target( @@ -123,14 +141,17 @@ tar_target( ), training_options = list(seed = 43) ), - only_common_samples = TRUE + only_common_samples = FALSE ) ) ``` -This will produce 2 warnings: +This will produce a warning: + +```{r get-input-mofa-warnings} +#| echo: false +#| message: false -```{r get-input-mofa-warnings, echo=FALSE, message=FALSE} res <- get_input_mofa(tar_read(mo_presel_supervised), options_list = list( data_options = list(scale_views = TRUE), @@ -143,9 +164,11 @@ res <- get_input_mofa(tar_read(mo_presel_supervised), ) ``` -The first warning message informs us that in the MOFA input object, the `group` column in the samples metadata has been renamed `group_metadata`. This is because MOFA needs a `group` column its own version of the samples metadata (for the multi-group framework), so there cannot be another column named `group`. Note that will have no effect on the `MultiDataSet` object. +The warning informs us that we have chosen to use a Poisson likelihood for the genomics dataset, but the latter does not have integer data. This is because the missing values imputed with NIPALS-PCA (see @sec-preprocessing-missing-values) are continuous rather than discrete. This is taken care of by automatically rounding the dataset; but in other settings this warning can be an indication that the Poisson likelihood is not appropriate for the dataset. -The second warning informs us that we have chosen to use a Poisson likelihood for the genomics dataset, but the latter does not have integer data. This is because the genotypes are posterior mean genotypes, so they are real values rather than integer (since they include uncertainty about dosage calling). This is taken care of by automatically rounding the dataset; but in other settings this warning can be an indication that the Poisson likelihood is not appropriate for the dataset. 
+::: {.callout-note} +When constructing the MOFA object, if there exists a column named `group` in a samples metadata table in the `MultiDataSet` object, the column will be renamed as `group_metadata` in the resulting MOFA input object. This is because MOFA needs a `group` column in its own version of the samples metadata (for the multi-group framework), so there cannot be another column named `group`. Note that this will have no effect on the `MultiDataSet` object, but is important to remember if you want to use some of the plotting functionalities that `MOFA2` offers. +::: The output of the `get_input_mofa()` function is a MOFA object, more specifically an untrained model: @@ -154,36 +177,30 @@ tar_load(mofa_input) mofa_input ``` -We did not use the multi-group framework, so there is only one sample group. +We did not use the multi-group framework, so there is only one samples group. ## Visualising the MOFA input -Several functions are available both through `MOFA2` and `moiraine` to visualise and summarise the MOFA input object. - -First, it is possible to represent the samples that are present or missing across the different datasets with the `MOFA2::plot_data_overview()` function: +It is possible to represent the samples that are present or missing across the different datasets with the `plot_data_overview()` function implemented in `MOFA2`: ```{r plot-data-overview} plot_data_overview(mofa_input) ``` -Since we retained only the samples that are present in all datasets, there are no missing samples. +The plot generated shows the dimensions of the datasets that will be analysed. For reporting purposes, it can also be useful to summarise the different options used to create the MOFA model. This can be done with the `options_list_as_tibble()` function, which turns a list of parameters into a tibble presenting the name and value of each parameter.
For example, we can list the data options used: -```{r mofa-options-list-as-tibble, eval = FALSE} +```{r mofa-options-list-as-tibble} options_list_as_tibble(mofa_input@data_options) ``` -```{r mofa-options-list-as-tibble-kable, echo = FALSE} -options_list_as_tibble(mofa_input@data_options) |> - kbl() -``` ## Training the model -Once we have prepared our MOFA input, we can train the model with the `MOFA2::run_mofa()` function. This is done through Python, so there might be some issues when trying to run the function for the first time after installing `MOFA2` if the configuration is not correct. If that is the case, see the MOFA2 tutorial (e.g. their [troubleshooting section](https://biofam.github.io/MOFA2/troubleshooting.html)) for tips on how to solve this. +Once we have prepared our MOFA input, we can train the model with the `run_mofa()` function (from the `MOFA2` package). This is done through Python, so there might be some issues when trying to run the function for the first time after installing `MOFA2` if the configuration is not correct. If that is the case, see the MOFA2 tutorial (e.g. their [troubleshooting section](https://biofam.github.io/MOFA2/troubleshooting.html)) for tips on how to solve this. -By default, the function will save the resulting trained model in a temporary file, with the `.hdf5` format. It can be preferable to save the result into the output folder of your project; here as we are using targets to save the results of each step of the analysis, we do not need to do so. In addition, is it strongly recommended to set the `save_data` parameter to `TRUE`, as otherwise without the data, some of the visualisation options might not be available. Lastly, the `use_basilisk` parameter can be a source of error depending on your computer configuration: +By default, the function will save the resulting trained model in a temporary file, with the `.hdf5` format. 
It can be preferable to save the result into the output folder of your project; here, as we are using targets to save the results of each step of the analysis, we do not need to do so. In addition, it is strongly recommended to set the `save_data` parameter to `TRUE`, as otherwise without the data, some of the visualisation options might not be available. Lastly, the `use_basilisk` parameter might have to be switched to `FALSE` if there are any issues with calling the Python module: ```{targets run-mofa} tar_target( @@ -198,10 +215,13 @@ In our case, the following warnings are returned: -```{r run-mofa-warnings, echo = FALSE} +```{r run-mofa-warnings} +#| echo: false + tar_meta(fields = "warnings") |> - dplyr::filter(name == "mofa_trained") |> - dplyr::pull(warnings) |> + filter(name == "mofa_trained") |> + pull(warnings) |> + str_replace("\\.\\. ", ".\n\n") |> warning() ``` @@ -214,15 +234,15 @@ tar_load(mofa_trained) mofa_trained ``` -The next sections of this vignette will show you how to display and interpret the trained model. +## Results interpretation -## Variance explained and diagnostic plots +In @sec-interpretation, we show the different functionalities implemented in the `moiraine` package that facilitate the interpretation of the results from an integration tool. In this section, we show some of the MOFA-specific plots that can be generated to help interpret the results of a MOFA run. -When training the model, MOFA2 constructs latent factors, which are sparse combination of the features, and which represent main axes of variation shared across all or a subsets of the datasets. One of the first steps to take when analysing the trained model is to assess the amount of variance in the datasets explained by each factor; similarly to what we would do when running a PCA.
+### Variance explained -### Variance explained per factor and dataset +When training the model, MOFA2 constructs latent factors, which are sparse combinations of the features, and which represent main axes of variation shared across all or a subset of the datasets. One of the first steps to take when analysing the trained model is to assess the amount of variance in the datasets explained by each factor; similarly to what we would do when running a PCA. -The function `MOFA2::plot_variance_explained()` displays the percentage of variance in each dataset explained by each factor as a heatmap. The `x` and `y` arguments (and also `split_by` argument when using the multi-group framework) control whether the factors or the datasets should be represented as the rows or columns: +The function `plot_variance_explained()` from `MOFA2` displays the percentage of variance in each dataset explained by each factor as a heatmap. The `x` and `y` arguments (and also `split_by` argument when using the multi-group framework) control whether the factors or the datasets should be represented as the rows or columns in the heatmap: ```{r plot-variance-explained} plot_variance_explained( @@ -232,11 +252,10 @@ ) ``` -Here, we see that factor1 explains a lot (~35%) of variation in the transcriptomics dataset, but no variation in the other datasets. Factor 2 explains some variation in the genomics dataset and a little bit in the transcriptomics dataset, but not at all in the metabolomics datasets. Factors 3 to 7 are specific to the metabolomics dataset. Factors 8 to 15 seem to explain very little variation; seeing this, we could re-train the MOFA model by setting the number of factors to 7. This also explains in part the second warning we got (which indicated that we should re-train the model with less factors).
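The percentages shown in the heatmap can also be extracted as numbers. As a sketch (assuming `mofa_trained` is the trained model loaded earlier in the chapter), `MOFA2::calculate_variance_explained()` returns them directly:

```r
## Sketch (not run): variance explained as matrices rather than a heatmap
## (assumes `mofa_trained` is the trained MOFA model from the pipeline)
r2 <- MOFA2::calculate_variance_explained(mofa_trained)

r2$r2_total      ## total variance explained per dataset
r2$r2_per_factor ## variance explained per factor and dataset
```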
+Here, we see that factor 1 explains a lot (almost 50%) of variation in the transcriptomics dataset, a little (around 10%) in the metabolomics dataset and almost none in the genomics dataset. Factor 2 explains around 20% of variation in the genomics dataset, and none in the other datasets. Factors 3 and 4 seem specific to the transcriptomics dataset, and factors 5 to 15 explain very little variation; seeing this, we could re-train the MOFA model by setting the number of factors to 4 or 5. -### Total variance explained per dataset -In addition, the `MOFA2::plot_variance_explained()` function can create a second plot displaying the total percentage of variance explained by all the factors for each dataset, if the argument `plot_total` is set to `TRUE` : +In addition, the `plot_variance_explained()` function can create a second plot displaying the total percentage of variance explained by all the factors for each dataset, if the argument `plot_total` is set to `TRUE`: ```{r plot-total-variance-explained} plot_variance_explained( @@ -247,11 +266,11 @@ )[[2]] ## show only the 2nd plot ``` -The factors explain around 55% of the variation in the metabolomics dataset, 50% in the transcriptomics dataset, but less than 20% in in the genomics dataset. +The factors explain almost 85% of the variation in the transcriptomics dataset, but less than 30% in the genomics and metabolomics datasets. ### Correlation between factors -Finally, it is always good practice to check the correlation between the different computed factors. All factors should be mostly uncorrelated; if a large correlation is observed between some factors, it could be an indication of poor model fit, either because too many factors were computed (in this case, try reducing the number of factors to compute) or maybe because of an improper normalisation of the datasets.
The function `MOFA2::plot_factor_cor()` displays the correlation between all factors via the `corrplot::corrplot()` function: +Before moving to the interpretation, it is always good practice to check the correlation between the different computed factors. All factors should be mostly uncorrelated; if a large correlation is observed between some factors, it could be an indication of poor model fit, either because too many factors were computed (in this case, try reducing the number of factors to compute or increasing the number of iterations for the model training via the `convergence_mode` training option) or maybe because of an improper normalisation of the datasets. The function `plot_factor_cor()` from `MOFA2` displays the correlation between all factors via the `corrplot::corrplot()` function: ```{r plot-factors-correlation} plot_factor_cor( @@ -262,30 +281,20 @@ plot_factor_cor( ) ``` -There is a moderate positive correlation between factors 2 and 3, which would have triggered the third warning that we got. In this case, in addition to reducing the total number of factors, we could also increase the amount of iterations for which we let the model train, by setting the training option `convergence_mode` to `'medium'` or even `'slow'`. - -## Factors association with samples covariates +There is a moderate positive correlation between factors 2 and 5, but nothing to be concerned about. -The next step in our analysis is to assess which sources of variation each factor is representing. ### Correlation between factors and samples covariates -One way to do this is to check whether the factors are correlated with some known covariates that were recorded in the samples metadata. The `mofa_plot_cor_covariates()` function displays the correlation between each factor and the samples covariates. Note that when covariates that are factors are transformed to numerical group indicators in order to compute the correlation with factors. 
This function is similar to the `MOFA2` function `correlate_factors_with_covariates()` except that it returns a ggplot (rather than creating a base R plot), and offers a few convenient features through the optional arguments. By default, the function uses all covariates recorded in the samples metadata; however we can focus on a subset of them by passing the column names to the `covariates` argument: +To assess which sources of variation each factor is representing, we can check whether the factors are correlated with some known covariates that were recorded in the samples metadata. The `mofa_plot_cor_covariates()` function displays the correlation between each factor and the samples covariates. Note that factor covariates are transformed to numerical group indicators in order to compute the correlations. This function is similar to the `MOFA2` function `correlate_factors_with_covariates()` except that it returns a ggplot (rather than creating a base R plot), and offers a few convenient features through the optional arguments. By default, the function uses all covariates recorded in the samples metadata; however we can focus on a subset of them by passing the column names to the `covariates` argument. ```{r mofa-plot-cor-covariates} mofa_plot_cor_covariates( - mofa_trained, - covariates = c( - "bruising_group", - "bruising_mean", - "Family", - "other_parent", - "Rep", - "Sample") + mofa_trained ) ``` -From this plot, we can spot some interesting trends. For example, factors 1 and 2 (which explain variation in the transcriptomics and genomics datasets, respectively) are moderately associated with both the bruising response and the parent of the samples. Factor 3, however, which explains variation in the metabolomics dataset, is correlated with bruising response but not the pedigree information. This is very interesting: we know that there is some confounding between pedigree and bruising response, as some families yielded all high- or low-bruising samples. 
Therefore, the first two factors are picking up differences between bruising groups that overlap with differences between the families, as genomics data is strongly reflective of pedigree information. On the other hand, the metabolomics dataset does not seem to contain information about pedigree, and therefore factor 3 highlights differences between the two bruising groups that are independent of family differences. Factor 6, which again mostly explains variation from the metabolomics dataset, is correlated with the covariate `Sample`, i.e. samples ID. +From this plot, we can spot some interesting trends. For example, factor 1, which explains variation in the transcriptomics and metabolomics datasets, is strongly correlated with disease status. This makes sense, since the transcriptomics dataset has been filtered to retain genes most associated with differences between healthy and infected animals, so we expect it to be the strongest source of variation in the dataset. Factor 2, which explains variation only in the genomics dataset, is strongly correlated with the genomic composition of the animals, which makes a lot of sense as more related individuals share similar genomic profiles. It also makes sense that this variation is less present in the other two datasets. Factor 5, which explains variation only in the transcriptomics dataset, seems to be modestly correlated with the RNASeq batches. In general, it is a good idea to interpret this correlation plot together with the plot representing the percentage of variation explained in the datasets by each factor. However, it is necessary to investigate more before making claims about what the factors are representing. @@ -293,7 +302,7 @@ ### Visualising the top features -The `MOFA2` package offers a number of very useful visualisations to further interpret the results.
For example, for a given factor and dataset of interest, we can visualise how the top contributing features vary with the factor values (i.e. the samples score). For example, let's have a look at the top contributing features for factor 1: +The `MOFA2` package offers a number of very useful visualisations to further interpret the results. For example, for a given factor and dataset of interest, we can visualise how the top contributing features vary with the factor values (i.e. the sample scores). Let's have a look at the top contributing features from the transcriptomics dataset for factor 1: ```{r plot-data-scatter} plot_data_scatter( @@ -301,14 +310,14 @@ factor = 1, view = "rnaseq", features = 6, - color_by = "bruising_mean", - shape_by = "bruising_group" + color_by = "status", + shape_by = "feedlot" ) ``` -For all features with a negative weight, their abundance decreases when the factor value increases. For features with positive weights, their abundance increases with the factor values. +For all six genes, their expression is higher in infected animals compared with healthy ones. -Lastly, we can visualise the measurements of the top contributing features across the samples as a heatmap. Note that the `annotation_samples` argument allows us to add annotations on top of the heatmap to represent certain characteristics of the samples (the `annotation_features` does the same thing for features). For example here we'll add information about the phenotype group: +We can also visualise the measurements of the top contributing features across the samples as a heatmap, for a given dataset and factor. Note that the `annotation_samples` argument allows us to add annotations on top of the heatmap to represent certain characteristics of the samples.
For example, here we'll add information about the phenotype group (disease status): ```{r plot-data-heatmap} MOFA2::plot_data_heatmap( @@ -316,11 +325,12 @@ MOFA2::plot_data_heatmap( factor = 1, view = "rnaseq", features = 20, - annotation_samples = "bruising_group", + annotation_samples = "status", fontsize_col = 5 ) ``` +Note that `moiraine` implements similar functions to represent the top contributing features, as we will see in @sec-interpretation. ## Recap -- targets list @@ -345,7 +355,7 @@ list( ), training_options = list(seed = 43) ), - only_common_samples = TRUE + only_common_samples = FALSE ) ), @@ -393,15 +403,7 @@ list( ## Plot of factors correlation with covariates tar_target( mofa_factors_covariates_cor_plot, - mofa_plot_cor_covariates(mofa_trained, - covariates = c( - "bruising_group", - "bruising_mean", - "Family", - "other_parent", - "Rep", - "Sample") - ) + mofa_plot_cor_covariates(mofa_trained) ) ) ``` diff --git a/preprocessing.qmd b/preprocessing.qmd index a171984..02912a0 100644 --- a/preprocessing.qmd +++ b/preprocessing.qmd @@ -340,7 +340,7 @@ plot_samples_coordinates_pca( ``` -### Missing values imputation +### Missing values imputation {#sec-preprocessing-missing-values} We can check that the complete multi-omics set constructed has no more missing values: diff --git a/so2pls.qmd b/so2pls.qmd index 0218690..98914a8 100644 --- a/so2pls.qmd +++ b/so2pls.qmd @@ -90,15 +90,7 @@ In a first step, an adjusted cross-validation procedure is used to compute the o [^so2pls-1]: i.e. the prediction of the joint part of each dataset given the joint part of the other dataset. -In this example, we will test 1 to 5 joint components, and 0 to 10 specific components for each dataset (testing only even values). We use a 10-fold cross validation (`nr_folds` parameter specifying the number of folds to use for the cross-validation), and distribute the computation over 6 cores to reduce the running time (through the `nr_cores` argument). 
We will also set the seed to ensure the reproducibility of the results, through the `seed` argument. Note that in order to use `seed`, you need to be using the latest version of the `OmicsPLS` package from GitHub, which can be installed as follows: - -```{r installing-omicspls} -#| eval: false - -require(devtools) -devtools::install_github("selbouhaddani/OmicsPLS") -``` - +In this example, we will test 1 to 5 joint components, and 0 to 10 specific components for each dataset (testing only even values). We use a 10-fold cross-validation (the `nr_folds` parameter specifying the number of folds to use for the cross-validation), and distribute the computation over 6 cores to reduce the running time (through the `nr_cores` argument). We will also set the seed to ensure the reproducibility of the results, through the `seed` argument. ```{targets adj-cv} tar_target( @@ -115,6 +107,10 @@ tar_target( ) ``` +::: {.callout-note} +In order to use the `seed` argument of the `so2pls_crossval_o2m_adjR2()` function, you need to be using the latest version of the `OmicsPLS` package from GitHub, which can be installed with `devtools::install_github("selbouhaddani/OmicsPLS")`. +::: + Despite using multiple cores, this function can take a while to run (this example took around 8 minutes to execute). 
The function returns a data-frame giving, for each tested value of $n$, the values of $n_x$ and $n_y$ that yielded the lowest prediction error, as well as the prediction error obtained for these values of the parameters: ```{r load-cv-adj-res} @@ -480,6 +476,62 @@ list( so2pls_cv_res, so2pls_cv_sparsity_res ) + ), + + ## Summary plot of percentage of variance explained + tar_target( + so2pls_summary_plot, + so2pls_plot_summary(so2pls_final_run) + ), + + ## Screeplot + tar_target( + so2pls_screeplot, + so2pls_screeplot(so2pls_final_run) + ), + + ## Comparison of samples score for joint components + tar_target( + so2pls_joint_components_comparison_plot, + so2pls_compare_samples_joint_components( + so2pls_final_run, + mo_data = mo_set_de, + colour_by = "status", + shape_by = "feedlot" + ) + ), + + ## Coefficient plot for joint components + tar_target( + so2pls_joint_components_coefficients_plot, + so2pls_plot_joint_components_coefficients(so2pls_final_run) + ), + + ## Joint component samples score plot + tar_target( + so2pls_joint_components_samples_score_plot, + so2pls_plot_samples_joint_components( + so2pls_final_run, + mo_data = mo_set_de, + colour_upper = "status", + scale_colour_upper = scale_colour_brewer(palette = "Paired"), + shape_upper = "feedlot" + ) + + theme(legend.box = "vertical") + ), + + ## Specific components samples score plot + tar_target( + so2pls_specific_components_samples_score_plot, + so2pls_plot_samples_specific_components( + so2pls_final_run, + mo_data = mo_set_de, + colour_upper = "feedlot", + scale_colour_upper = scale_colour_brewer(palette = "Paired"), + colour_lower = "rnaseq_batch", + shape_upper = "gender" + ) |> + map(\(x) x + theme(legend.box = "vertical")) ) ) ``` diff --git a/spls.qmd b/spls.qmd index 97ceea6..65f7bf2 100644 --- a/spls.qmd +++ b/spls.qmd @@ -96,25 +96,26 @@ spls_input_multilevel1 <- get_input_spls( The resulting design table that will be used by the `mixOmics` package is stored in the `multilevel` attribute of the 
object returned by the `get_input_spls()` function. It consists of a data-frame with observations as rows, and an integer ID for the corresponding biological individual in the first column: -```{r multilevel-input-one-factor-res} -#| echo: false +```{r multilevel-input-one-factor-show} +#| eval: false -tar_read(spls_multilevel1) |> - kable() +attr(spls_input_multilevel1, "multilevel") ``` -In order to use the multilevel option with two factors, the `multilevel` argument should instead be populated with a vector of three column names from the samples metadata table. The first value, similarly to a one-factor decomposition, should be the name of the column containing the ID of the biological individuals. The second and third values should be the name of the columns containing the levels of the two factors of interest. Continuing on the previous example, supposing that in a similar study, we take measurements on the plants at three different time-points after the application of each treatment. The samples metadata table might look like this: -```{r multilevel-two-factors-smetadata} -#| eval: false +```{r multilevel-input-one-factor-res} +#| echo: false -get_samples_metadata_combined(mo_set_multilevel2) +tar_read(spls_multilevel1) ``` +In order to use the multilevel option with two factors, the `multilevel` argument should instead be populated with a vector of three column names from the samples metadata table. The first value, similarly to a one-factor decomposition, should be the name of the column containing the ID of the biological individuals. The second and third values should be the names of the columns containing the levels of the two factors of interest. Continuing with the previous example, suppose that in a similar study we take measurements on the plants at three different time-points after the application of each treatment. 
The samples metadata table might look like this: + ```{r multilevel-two-factors-table} #| echo: false -tar_read(spls_smeta2) +tar_read(spls_smeta2) |> + kable() ``` with similar columns to the previous example, and an additional `time` column indicating the time-point at which the measurement was taken. In that case, we should set `multilevel = c("plant_id", "treatment", "time")` to use the multilevel option, as follows:
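To make the two-factor design concrete, here is a minimal base-R sketch of what such a design table looks like. The plant IDs, treatment and time-point labels below are made up for illustration; in practice the table is built for you by `get_input_spls()` from the column names passed to `multilevel`. Each row is one observation, the first column is an integer ID for the biological individual, and the next two columns give the levels of the two within-individual factors:

```{r multilevel-design-sketch}
## Hypothetical samples metadata: 4 plants, each measured under
## 2 treatments at 3 time-points (24 observations in total)
smeta <- expand.grid(
  treatment = c("control", "treated"),
  time      = c("T1", "T2", "T3"),
  plant_id  = paste0("plant_", 1:4)
)

## Corresponding multilevel design table: integer individual IDs in
## the first column, then the two factors of interest
design <- data.frame(
  sample    = as.integer(factor(smeta$plant_id)),
  treatment = smeta$treatment,
  time      = smeta$time
)

head(design)
```

This mirrors the one-factor design shown above, extended with the second factor column.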