Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiple markers/loci per cluster issue #53

Open
bpjburger opened this issue Feb 3, 2021 · 4 comments
Open

Multiple markers/loci per cluster issue #53

bpjburger opened this issue Feb 3, 2021 · 4 comments
Labels

Comments

@bpjburger
Copy link

Hello,

I ran PhylotaR according to the tutorial, modifying it as is shown in the code below.
To create fasta files for each cluster I ran a loop in R. But when checking the sequences in the resulting fasta-files I discovered that they contain different loci. The first cluster (and corresponding fasta) contained ITS and trnL. This issue is present in other clusters as well even in the pre-filtered phylota-object.

Is this the result of a mistake in my code or is this a PhylotaR issue? And how can I possibly fix it?

Regards,

Bart

The code:

library(phylotaR)
wd <- '[FILEPATH TO WORKING DIRECTORY]'
ncbi_dr <- '[FILEPATH TO COMPILED BLAST+ TOOLS]'
txid <- 4210  # Asteracea ID
setup(wd = wd, txid = txid, ncbi_dr = ncbi_dr)
run(wd=wd)
phylota=read_phylota(wd)

#Create a new Phylota-object that has one sequence per taxon.

phylota_sp <- drop_by_rank(phylota = phylota, rnk = 'species', n = 1)
cids=phylota_sp@cids

#Create a new Phylota-object, but with excluding clusters that only contain a few taxa. 

ntaxa=get_ntaxa(phylota = phylota_sp, cid = cids)
keep=cids[ntaxa>50]#new set of cluster IDs with more than 50 taxa
selected=drop_clstrs(phylota = phylota_sp, cid = keep)

#And now the loop.

cids_sel=selected@cids

#A for loop that goes over the cluster IDs
for (i in cids_sel){
  txids <- get_txids(phylota = selected, cid = i, rnk = 'species')#makes a list of taxon-IDs for the specific cluster
  
# look up name for Taxon IDs per cluster and create a list out of it.
scientific_names <- get_tx_slot(phylota = selected, txid = txids, slt_nm = 'scnm')

#formatting the names
scientific_names <- gsub('\\.', '', scientific_names)
scientific_names <- gsub('\\s+', '_', scientific_names)

#specify the sequence IDs per cluster
  sids <- reduced@clstrs[[i]]@sids
  
  # write out (one file per cluster)
  pad=[GENERAL PATH TO LOCATION]
  outfile= str_c(pad, i)#add the cluster ID to the filename
  write_sqs(phylota = selected, sid = sids, sq_nm = scientific_names,
            outfile = outfile)}


@DomBennett
Copy link
Contributor

Hi,

I've managed to re-run your analysis for Asteracea and I'm looking over the clusters. So that I can repeat your process, how are you determining that there is a mix of ITS and trnL in the first cluster?

The top thing I've noticed is ...

#specify the sequence IDs per cluster
sids <- reduced@clstrs[[i]]@sids

Should be:

#specify the sequence IDs per cluster
sids <- selected@clstrs[[i]]@sids

Where does the "reduced" phylota object originate? Mixing sids across different phylota objects could be causing the mixing of sequences in a cluster.

@Bunholi
Copy link

Bunholi commented May 20, 2021

Hi @bpjburger did you solve the problem following @DomBennett correction?
I am planning to do the same around here

@bpjburger
Copy link
Author

Hi @Bunholi,

Unfortunately, the corrections did no solve the problem.
I think that it is a software issue.

@maelle
Copy link
Member

maelle commented May 9, 2022

For info, we're looking for a new maintainer / a new maintainer team for this package, see #57 and feel free to volunteer, we'd be happy to help.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants