Conditionally & persistently rare taxa #83

antagomir · 2024-05-31T17:51:15Z

In microbiome time series analyses we have the definitions of abundant taxa, conditionally rare taxa, persistently rare taxa and other rare taxa (see e.g. the two refs below).

Conditionally rare taxa are defined for instance as taxa with a maximum relative abundance at least N times higher than their minimum value.

Persistently rare taxa are taxa whose maximum relative abundance never exceeds X times greater than the minimum.

https://doi.org/10.1093%2Ffemsec%2Ffix126
https://doi.org/10.1128/mbio.01371-14

It could be helpful to have a function or example showing how to fetch these taxa sets from microbiome time series.

ginkgozh · 2024-06-05T07:32:44Z

This article categorizes microorganisms based on threshold values：always AT (AAT) with a relative abundance of ≥1% in the dataset; conditionally AT (CAT) with ≥1% relative abundance in some samples, but never <0.01%; always RT (ART) with <0.01% relative abundance in all samples; conditionally RT (CRT) with <1% relative abundance in all initial samples, but <0.01% in some sample; moderate taxa (MT) with between 0.01% and 1% relative abundances in all data; and conditionally abundant and rare taxa (CRAT) whose relative abundances ranged from rare values of <0.01% to abundant values of ≥1%. As also previously described, CRAT, CAT, and AAT were collectively referred to as abundant taxa (AT), while ART and CRT comprised rare taxa (RT).（https://www.sciencedirect.com/science/article/pii/S0013935123028347#sec2）.

However,` I am currently facing the same issue and cannot find the R code to classify them, nor do I know how to calculate their diversity indices after classification. If you have already solved this problem, please contact me (email: [email protected]). Below is the only information I have obtained, the R code to classify them.

`otu <- read.table("C:/Users/heng/Desktop/micro/BF.txt", header = TRUE, row.names = 1, sep = "\t")

#（i）稀有类群（rare taxa，RT），在所有样本中丰度均 ≤0.1% 的 OTU
otu_ART <- otu[apply(otu, 1, function(x) max(x) <= 0.0001), ]
#（ii）丰富类群（abundant taxa，AT），在所有样本中丰度均 ≥1% 的 OTU
otu_AAT <- otu[apply(otu, 1, function(x) min(x) >= 0.01), ]
#（iii）中间类群（moderate taxa，MT），在所有样本中丰度均 >0.1% 且 <1% 的 OTU
otu_MT <- otu[apply(otu, 1, function(x) min(x) > 0.0001 & max(x) < 0.01), ]
#（iv）条件稀有类群（conditionally rare taxa，CRT），在所有样本中丰度均 <1%，且仅在部分样本中丰度 <0.1% 的 OTU
otu_CRT <- otu[apply(otu, 1, function(x) min(x) < 0.0001 & max(x) < 0.01), ]
otu_CRT <- otu_CRT[which(! rownames(otu_CRT) %in% rownames(otu_ART)), ] #CRT 和 ART 是没有重叠的，在所有样本中丰度均 ≤0.1% 的 OTU 是不可取的
#（v）条件丰富类群（conditionally abundant taxa，CAT），在所有样本中丰度均 >0.1%，且仅在部分样本中丰度 >1% 的 OTU
otu_CAT <- otu[apply(otu, 1, function(x) min(x) > 0.0001 & max(x) > 0.01), ]
otu_CAT <- otu_CAT[which(! rownames(otu_CAT) %in% rownames(otu_AAT)), ] #CAT 和 AAT 是没有重叠的，在所有样本中丰度均 ≥1% 的 OTU 是不可取的

#（vi）条件稀有或丰富类群（conditionally rare or abundant taxa，CRAT），丰度跨越从稀有（最低丰度 ≤0.1%）到丰富（最高丰度 ≥1%）的 OTU
otu_CRAT <- otu[apply(otu, 1, function(x) min(x) <= 0.0001 & max(x) >= 0.01), ]

#备注：这 6 个类群没有重叠，总数即等于 OTU 表的总数，相对丰度总和 100%
otu[which(rownames(otu) %in% rownames(otu_ART)),'taxa'] <- 'ART'
otu[which(rownames(otu) %in% rownames(otu_AAT)),'taxa'] <- 'AAT'
otu[which(rownames(otu) %in% rownames(otu_MT)),'taxa'] <- 'MT'
otu[which(rownames(otu) %in% rownames(otu_CRT)),'taxa'] <- 'CRT'
otu[which(rownames(otu) %in% rownames(otu_CAT)),'taxa'] <- 'CAT'
otu[which(rownames(otu) %in% rownames(otu_CRAT)),'taxa'] <- 'CRAT'

library(openxlsx)
write.xlsx(otu_stat, file = "CK_otu_stat.xlsx",rowNames = TRUE)

for (i in 1:(ncol(otu)-1)) otu[[i]] <- ifelse(as.character(otu[[i]]) == '0', NA, otu[[ncol(otu)]])
otu_stat <- data.frame(apply(otu[-ncol(otu)], 2, table))
otu_stat$taxa <- rownames(otu_stat)
otu_stat <- reshape2::melt(otu_stat, id = 'taxa')

library(ggplot2)
ggplot(otu_stat, aes(variable, value, fill = taxa)) +
geom_col(position = 'fill', width = 0.6) +
theme(panel.grid = element_blank(), panel.background = element_rect(color = 'gray', fill = 'transparent')) +
scale_y_continuous(expand = c(0, 0)) +
labs(x = '样本', y = '不同丰富或稀有类群的占比')`

antagomir · 2024-07-09T09:19:43Z

This might use addPrevalence functions from mia, and just add argument to define the conditional abundance.

antagomir · 2024-11-03T20:36:00Z

This should be feasible measure to implement per each taxa, could you give a try @Daenarys8

Daenarys8 · 2024-11-04T09:12:55Z

we could have a wrapper function/method that classifies taxa into six distinct categories and return a list: always abundant taxa (AAT), conditionally abundant taxa (CAT), always rare taxa (ART), conditionally rare taxa (CRT), moderate taxa (MT), and conditionally rare and abundant taxa (CRAT).

classifyTaxa <- function(x, 
                                             abundant_threshold = 0.01, 
                                             rare_threshold = 0.0001, 
                                             assay.type = "relabundance", 
                                             ...) {
    aat_taxa <- getPrevalent(x, prevalence = 1, 
                              detection = abundant_threshold, 
                              assay.type = assay.type)
    
    art_taxa <- getRare(x, prevalence = 1, 
                        detection = rare_threshold, 
                        assay.type = assay.type)
    
    non_AAT_ART_taxa <- setdiff(rownames(rel_abund_mat), 
                                 c(aat_taxa, art_taxa))
    mt_taxa <- non_AAT_ART_taxa[apply(rel_abund_mat[non_AAT_ART_taxa, , 
                                                      drop = FALSE], 1, 
                                       function(x) min(x, na.rm = TRUE) > 
                                       rare_threshold & 
                                       max(x, na.rm = TRUE) < 
                                       abundant_threshold)]
    
    cat_candidates <- setdiff(getPrevalent(x, prevalence = 0.1, 
                                            detection = abundant_threshold, 
                                            assay.type = assay.type), 
                              aat_taxa)
    cat_taxa <- cat_candidates[apply(rel_abund_mat[cat_candidates, , 
                                                    drop = FALSE], 1, 
                                      min) > rare_threshold]
    
    crt_candidates <- setdiff(getRare(x, prevalence = 0.1, 
                                       detection = rare_threshold, 
                                       assay.type = assay.type), 
                              art_taxa)
    crt_taxa <- crt_candidates[apply(rel_abund_mat[crt_candidates, , 
                                                    drop = FALSE], 1, 
                                      max) < abundant_threshold]
    

    crat_taxa <- rownames(rel_abund_mat)[apply(rel_abund_mat, 1, 
        function(x) min(x, na.rm = TRUE) <= rare_threshold & 
                    max(x, na.rm = TRUE) >= abundant_threshold)]
    

    classified_taxa <- list(
        AAT = aat_taxa,
        CAT = cat_taxa,
        ART = art_taxa,
        CRT = crt_taxa,
        MT = mt_taxa,
        CRAT = crat_taxa,
        abundant_taxa = unique(c(aat_taxa, cat_taxa, crat_taxa)),
        rare_taxa = unique(c(art_taxa, crt_taxa))
    )

  return(classified_taxa)
}

for some example:

data(minimalgut)
tse <- transformAssay(minimalgut, method = "relabundance")

classified_taxa <- classifyTaxa(tse)

@antagomir @TuomasBorman

antagomir · 2024-11-05T06:42:17Z

Do we have all these definitions in the indicated (or other) literature, for instance MT and CRAT, or did you come up with these definitions on your own? They might be useful but good to know.

Some suggestions:

this should be implemented in the same way than e.g. prevalence measures have been implemented in mia. In fact, if one assumes that all data comes from a single subject, then unless I am wrong, some of the definitions (like always rare, always abundant) can be directly obtained with getPrevalence, and perhaps some of the rest (CRT, CAT?) could be obtained as suitable complements of such sets. Hence, the first step is to check systematically what can be done with getPrevalence already as is.
the added value could come from a) implementing indices that are not readily accessible by getPrevalence and/or b) adding support to multiple subjects/groups that have time series. Then these indices could be drawn per subject and one would need to decide how to present (add to rowData or not?). But even in that case the fundamental building block is doing this per subject and the rest is just splitting the data per subject, doing calculations, and then reassembling the data. The issue split-apply-combine #30 is very relevant for this and should be solved on the same go, if not ready otherwise
I would start with one single index (e.g. CRT) and make sure that can be implemented properly according to the typical standard in mia package. After that one could add additional indices.

Daenarys8 · 2024-11-05T12:47:16Z

Do we have all these definitions in the indicated (or other) literature, for instance MT and CRAT, or did you come up with these definitions on your own? They might be useful but good to know.

@ginkgozh stated this article that mentioned the additional types with their definitions: link to literature

TuomasBorman · 2024-11-05T15:20:27Z

Can this be handled by calculating prevalence for each feature for each time point, i.e., by looping all the time points? Then we have X*N matrix where X is features and N time points. Based on this matrix, it should be easy to classify features to these categories.

The only time consuming part is looping the time points (usually this is not problem since the number of time points is so little); everything else should be efficient

antagomir · 2024-11-10T10:11:49Z

In fact, this could be done also for other than time series ("Conditionally rare taxa are defined for instance as taxa with a maximum relative abundance at least N times higher than their minimum value."). Although time series are an interesting application.

So essentially the user needs to decide how to split the data, and we could provide a standard function to calculate CRT (or other such measures) for the given data.

If this is done for time series, then user may like to calculate this per time series (ie. per subject) for each feature. Then one should not do it by looping over time points but rather by looping over time series.

It could be one function with different options (as in diversity calculation). Note that prevalence calculation is closely related and possibly could even be included among the options. But I am not sure if that is helpful as it is so common and might be useful as a dedicated function of its own.

Could we create a minimal function to calculate this for a given data vector and then matrix (unless it already exists in R ecosystem?) and then see if this could/should be converted into a full TreeSE wrapper?

Daenarys8 · 2024-11-14T10:28:37Z

we can continue this discussion in the pr #98

antagomir assigned Daenarys8 Nov 3, 2024

Daenarys8 mentioned this issue Nov 14, 2024

conditionally and persisitently rare taxa #98

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Conditionally & persistently rare taxa #83

Conditionally & persistently rare taxa #83

antagomir commented May 31, 2024

ginkgozh commented Jun 5, 2024

antagomir commented Jul 9, 2024

antagomir commented Nov 3, 2024

Daenarys8 commented Nov 4, 2024

antagomir commented Nov 5, 2024

Daenarys8 commented Nov 5, 2024

TuomasBorman commented Nov 5, 2024

antagomir commented Nov 10, 2024

Daenarys8 commented Nov 14, 2024 •

edited by TuomasBorman

Loading

Conditionally & persistently rare taxa #83

Conditionally & persistently rare taxa #83

Comments

antagomir commented May 31, 2024

ginkgozh commented Jun 5, 2024

antagomir commented Jul 9, 2024

antagomir commented Nov 3, 2024

Daenarys8 commented Nov 4, 2024

antagomir commented Nov 5, 2024

Daenarys8 commented Nov 5, 2024

TuomasBorman commented Nov 5, 2024

antagomir commented Nov 10, 2024

Daenarys8 commented Nov 14, 2024 • edited by TuomasBorman Loading

Daenarys8 commented Nov 14, 2024 •

edited by TuomasBorman

Loading