Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conditionally & persistently rare taxa #83

Open
antagomir opened this issue May 31, 2024 · 9 comments
Open

Conditionally & persistently rare taxa #83

antagomir opened this issue May 31, 2024 · 9 comments
Assignees

Comments

@antagomir
Copy link
Member

In microbiome time series analyses we have the definitions of abundant taxa, conditionally rare taxa, persistently rare taxa and other rare taxa (see e.g. the two refs below).

Conditionally rare taxa are defined for instance as taxa with a maximum relative abundance at least N times higher than their minimum value.

Persistently rare taxa are taxa whose maximum relative abundance never exceeds X times greater than the minimum.

https://doi.org/10.1093%2Ffemsec%2Ffix126
https://doi.org/10.1128/mbio.01371-14

It could be helpful to have a function or example showing how to fetch these taxa sets from microbiome time series.

@ginkgozh
Copy link

ginkgozh commented Jun 5, 2024

This article categorizes microorganisms based on threshold values:always AT (AAT) with a relative abundance of ≥1% in the dataset; conditionally AT (CAT) with ≥1% relative abundance in some samples, but never <0.01%; always RT (ART) with <0.01% relative abundance in all samples; conditionally RT (CRT) with <1% relative abundance in all initial samples, but <0.01% in some sample; moderate taxa (MT) with between 0.01% and 1% relative abundances in all data; and conditionally abundant and rare taxa (CRAT) whose relative abundances ranged from rare values of <0.01% to abundant values of ≥1%. As also previously described, CRAT, CAT, and AAT were collectively referred to as abundant taxa (AT), while ART and CRT comprised rare taxa (RT).(https://www.sciencedirect.com/science/article/pii/S0013935123028347#sec2).

However,` I am currently facing the same issue and cannot find the R code to classify them, nor do I know how to calculate their diversity indices after classification. If you have already solved this problem, please contact me (email: [email protected]). Below is the only information I have obtained, the R code to classify them.

`otu <- read.table("C:/Users/heng/Desktop/micro/BF.txt", header = TRUE, row.names = 1, sep = "\t")

#(i)稀有类群(rare taxa,RT),在所有样本中丰度均 ≤0.1% 的 OTU
otu_ART <- otu[apply(otu, 1, function(x) max(x) <= 0.0001), ]
#(ii)丰富类群(abundant taxa,AT),在所有样本中丰度均 ≥1% 的 OTU
otu_AAT <- otu[apply(otu, 1, function(x) min(x) >= 0.01), ]
#(iii)中间类群(moderate taxa,MT),在所有样本中丰度均 >0.1% 且 <1% 的 OTU
otu_MT <- otu[apply(otu, 1, function(x) min(x) > 0.0001 & max(x) < 0.01), ]
#(iv)条件稀有类群(conditionally rare taxa,CRT),在所有样本中丰度均 <1%,且仅在部分样本中丰度 <0.1% 的 OTU
otu_CRT <- otu[apply(otu, 1, function(x) min(x) < 0.0001 & max(x) < 0.01), ]
otu_CRT <- otu_CRT[which(! rownames(otu_CRT) %in% rownames(otu_ART)), ] #CRT 和 ART 是没有重叠的,在所有样本中丰度均 ≤0.1% 的 OTU 是不可取的
#(v)条件丰富类群(conditionally abundant taxa,CAT),在所有样本中丰度均 >0.1%,且仅在部分样本中丰度 >1% 的 OTU
otu_CAT <- otu[apply(otu, 1, function(x) min(x) > 0.0001 & max(x) > 0.01), ]
otu_CAT <- otu_CAT[which(! rownames(otu_CAT) %in% rownames(otu_AAT)), ] #CAT 和 AAT 是没有重叠的,在所有样本中丰度均 ≥1% 的 OTU 是不可取的

#(vi)条件稀有或丰富类群(conditionally rare or abundant taxa,CRAT),丰度跨越从稀有(最低丰度 ≤0.1%)到丰富(最高丰度 ≥1%)的 OTU
otu_CRAT <- otu[apply(otu, 1, function(x) min(x) <= 0.0001 & max(x) >= 0.01), ]

#备注:这 6 个类群没有重叠,总数即等于 OTU 表的总数,相对丰度总和 100%
otu[which(rownames(otu) %in% rownames(otu_ART)),'taxa'] <- 'ART'
otu[which(rownames(otu) %in% rownames(otu_AAT)),'taxa'] <- 'AAT'
otu[which(rownames(otu) %in% rownames(otu_MT)),'taxa'] <- 'MT'
otu[which(rownames(otu) %in% rownames(otu_CRT)),'taxa'] <- 'CRT'
otu[which(rownames(otu) %in% rownames(otu_CAT)),'taxa'] <- 'CAT'
otu[which(rownames(otu) %in% rownames(otu_CRAT)),'taxa'] <- 'CRAT'

library(openxlsx)
write.xlsx(otu_stat, file = "CK_otu_stat.xlsx",rowNames = TRUE)

for (i in 1:(ncol(otu)-1)) otu[[i]] <- ifelse(as.character(otu[[i]]) == '0', NA, otu[[ncol(otu)]])
otu_stat <- data.frame(apply(otu[-ncol(otu)], 2, table))
otu_stat$taxa <- rownames(otu_stat)
otu_stat <- reshape2::melt(otu_stat, id = 'taxa')

library(ggplot2)
ggplot(otu_stat, aes(variable, value, fill = taxa)) +
geom_col(position = 'fill', width = 0.6) +
theme(panel.grid = element_blank(), panel.background = element_rect(color = 'gray', fill = 'transparent')) +
scale_y_continuous(expand = c(0, 0)) +
labs(x = '样本', y = '不同丰富或稀有类群的占比')`

@antagomir
Copy link
Member Author

This might use addPrevalence functions from mia, and just add argument to define the conditional abundance.

@antagomir
Copy link
Member Author

This should be feasible measure to implement per each taxa, could you give a try @Daenarys8

@Daenarys8
Copy link
Contributor

we could have a wrapper function/method that classifies taxa into six distinct categories and return a list: always abundant taxa (AAT), conditionally abundant taxa (CAT), always rare taxa (ART), conditionally rare taxa (CRT), moderate taxa (MT), and conditionally rare and abundant taxa (CRAT).

classifyTaxa <- function(x, 
                                             abundant_threshold = 0.01, 
                                             rare_threshold = 0.0001, 
                                             assay.type = "relabundance", 
                                             ...) {
    aat_taxa <- getPrevalent(x, prevalence = 1, 
                              detection = abundant_threshold, 
                              assay.type = assay.type)
    
    art_taxa <- getRare(x, prevalence = 1, 
                        detection = rare_threshold, 
                        assay.type = assay.type)
    
    non_AAT_ART_taxa <- setdiff(rownames(rel_abund_mat), 
                                 c(aat_taxa, art_taxa))
    mt_taxa <- non_AAT_ART_taxa[apply(rel_abund_mat[non_AAT_ART_taxa, , 
                                                      drop = FALSE], 1, 
                                       function(x) min(x, na.rm = TRUE) > 
                                       rare_threshold & 
                                       max(x, na.rm = TRUE) < 
                                       abundant_threshold)]
    
    cat_candidates <- setdiff(getPrevalent(x, prevalence = 0.1, 
                                            detection = abundant_threshold, 
                                            assay.type = assay.type), 
                              aat_taxa)
    cat_taxa <- cat_candidates[apply(rel_abund_mat[cat_candidates, , 
                                                    drop = FALSE], 1, 
                                      min) > rare_threshold]
    
    crt_candidates <- setdiff(getRare(x, prevalence = 0.1, 
                                       detection = rare_threshold, 
                                       assay.type = assay.type), 
                              art_taxa)
    crt_taxa <- crt_candidates[apply(rel_abund_mat[crt_candidates, , 
                                                    drop = FALSE], 1, 
                                      max) < abundant_threshold]
    

    crat_taxa <- rownames(rel_abund_mat)[apply(rel_abund_mat, 1, 
        function(x) min(x, na.rm = TRUE) <= rare_threshold & 
                    max(x, na.rm = TRUE) >= abundant_threshold)]
    

    classified_taxa <- list(
        AAT = aat_taxa,
        CAT = cat_taxa,
        ART = art_taxa,
        CRT = crt_taxa,
        MT = mt_taxa,
        CRAT = crat_taxa,
        abundant_taxa = unique(c(aat_taxa, cat_taxa, crat_taxa)),
        rare_taxa = unique(c(art_taxa, crt_taxa))
    )

  return(classified_taxa)
}

for some example:

data(minimalgut)
tse <- transformAssay(minimalgut, method = "relabundance")

classified_taxa <- classifyTaxa(tse)

@antagomir @TuomasBorman

@antagomir
Copy link
Member Author

Do we have all these definitions in the indicated (or other) literature, for instance MT and CRAT, or did you come up with these definitions on your own? They might be useful but good to know.

Some suggestions:

  • this should be implemented in the same way than e.g. prevalence measures have been implemented in mia. In fact, if one assumes that all data comes from a single subject, then unless I am wrong, some of the definitions (like always rare, always abundant) can be directly obtained with getPrevalence, and perhaps some of the rest (CRT, CAT?) could be obtained as suitable complements of such sets. Hence, the first step is to check systematically what can be done with getPrevalence already as is.
  • the added value could come from a) implementing indices that are not readily accessible by getPrevalence and/or b) adding support to multiple subjects/groups that have time series. Then these indices could be drawn per subject and one would need to decide how to present (add to rowData or not?). But even in that case the fundamental building block is doing this per subject and the rest is just splitting the data per subject, doing calculations, and then reassembling the data. The issue split-apply-combine #30 is very relevant for this and should be solved on the same go, if not ready otherwise
  • I would start with one single index (e.g. CRT) and make sure that can be implemented properly according to the typical standard in mia package. After that one could add additional indices.

@Daenarys8
Copy link
Contributor

Do we have all these definitions in the indicated (or other) literature, for instance MT and CRAT, or did you come up with these definitions on your own? They might be useful but good to know.

@ginkgozh stated this article that mentioned the additional types with their definitions: link to literature

@TuomasBorman
Copy link
Collaborator

Can this be handled by calculating prevalence for each feature for each time point, i.e., by looping all the time points? Then we have X*N matrix where X is features and N time points. Based on this matrix, it should be easy to classify features to these categories.

The only time consuming part is looping the time points (usually this is not problem since the number of time points is so little); everything else should be efficient

@antagomir
Copy link
Member Author

In fact, this could be done also for other than time series ("Conditionally rare taxa are defined for instance as taxa with a maximum relative abundance at least N times higher than their minimum value."). Although time series are an interesting application.

So essentially the user needs to decide how to split the data, and we could provide a standard function to calculate CRT (or other such measures) for the given data.

If this is done for time series, then user may like to calculate this per time series (ie. per subject) for each feature. Then one should not do it by looping over time points but rather by looping over time series.

It could be one function with different options (as in diversity calculation). Note that prevalence calculation is closely related and possibly could even be included among the options. But I am not sure if that is helpful as it is so common and might be useful as a dedicated function of its own.

Could we create a minimal function to calculate this for a given data vector and then matrix (unless it already exists in R ecosystem?) and then see if this could/should be converted into a full TreeSE wrapper?

@Daenarys8
Copy link
Contributor

Daenarys8 commented Nov 14, 2024

we can continue this discussion in the pr #98

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants