Mask fewer sites, the mask sites include lots of false positives #384

Open
corneliusroemer opened this issue Nov 12, 2024 · 3 comments

@corneliusroemer

UShER SARS-CoV-2 masks quite a lot of sites (I think around 270, i.e. almost 1% of the genome) based on this VCF: https://raw.githubusercontent.com/W-L/ProblematicSites_SARS-CoV2/master/problematic_sites_sarsCov2.vcf but I think that list includes quite a lot of sites that are no longer problematic.

The last update to that mask list was more than 3 years ago, so it's clearly no longer maintained. It might be worth transitioning away from it. Maybe turn the existing sites into branch-specific masks for old clades, but not for new, recent ones?

I noticed this when designating stuff within KS.1.1.1, trying to untangle what happened. Having these two sites masked really makes things more difficult to untangle: C2091T and C16887T.

These are the relevant lines:

MN908947.3	16887	.	C	T,Y	.	mask	SUB=NDM,RCD;EXC=highly_homoplasic;SRC_COUNTRY=.;SRC_LAB=.;GENE=gene-orf1ab;AA_POS=5541;AA_REF=Y;AA_ALT=I,X
MN908947.3	2091	.	C	T,Y	.	mask	SUB=NDM;EXC=highly_ambiguous,homoplasic,narrow_src;SRC_COUNTRY=India,UK;SRC_LAB=NCDC,NU-OMICS;GENE=gene-orf1ab;AA_POS=609;AA_REF=T;AA_ALT=I,X
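For anyone who wants to check the count or inspect the annotations, here is a minimal sketch of pulling the masked positions out of that VCF. It assumes the standard tab-separated VCF column order and that masked records carry `mask` in the FILTER column, as in the two lines above; the function name `parse_masked_sites` is made up for illustration and is not part of UShER.

```python
def parse_masked_sites(vcf_text):
    """Return {position: record} for VCF records whose FILTER column is 'mask'."""
    sites = {}
    for line in vcf_text.splitlines():
        # Skip blank lines and VCF header/meta lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 8:
            continue
        _chrom, pos, _id, ref, alt, _qual, filt, info = fields[:8]
        if filt != "mask":
            continue
        # INFO is a ';'-separated list of KEY=VALUE pairs; values may be ','-lists.
        info_dict = {}
        for item in info.split(";"):
            key, _, value = item.partition("=")
            info_dict[key] = value.split(",")
        sites[int(pos)] = {"ref": ref, "alt": alt.split(","), "info": info_dict}
    return sites


# The two records quoted above, rebuilt with explicit tabs.
example_vcf = "\n".join([
    "\t".join(["MN908947.3", "16887", ".", "C", "T,Y", ".", "mask",
               "SUB=NDM,RCD;EXC=highly_homoplasic;SRC_COUNTRY=.;SRC_LAB=.;"
               "GENE=gene-orf1ab;AA_POS=5541;AA_REF=Y;AA_ALT=I,X"]),
    "\t".join(["MN908947.3", "2091", ".", "C", "T,Y", ".", "mask",
               "SUB=NDM;EXC=highly_ambiguous,homoplasic,narrow_src;"
               "SRC_COUNTRY=India,UK;SRC_LAB=NCDC,NU-OMICS;"
               "GENE=gene-orf1ab;AA_POS=609;AA_REF=T;AA_ALT=I,X"]),
])

sites = parse_masked_sites(example_vcf)
print(sorted(sites))               # [2091, 16887]
print(sites[2091]["info"]["EXC"])  # reasons this site was flagged
```

Running the same function over the full file and taking `len(sites)` would give the number of masked positions.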
@russcd
Collaborator

russcd commented Nov 12, 2024

You are definitely right about this. Many of those recommendations have outlived their usefulness and it is something @AngieHinrichs and I have been thinking about how to clean up.

Briefly, a proposed solution is:

  1. Refactor so that samples are in MAPLE representation. We need this for other reasons, but it will be easiest to add as part of a big overhaul and will make some of this easier.
  2. Determine which subset of sites have remained potentially problematic (e.g., 11083 is still likely not usable?) and mask them. One way to do this is to compute the parsimony score for each currently problematic site without updating the topology; well-behaved sites should not stand out tremendously with respect to parsimony:allele_freq even if they were not used to infer the tree.
  3. Reoptimize the existing tree using the new, less-masked samples in MAPLE format.
  4. ???
  5. Profit.
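To make step 2 concrete, here is a toy sketch of scoring one site on a fixed topology. It is not UShER or matUtils code: it assumes a strictly binary tree stored as a plain dict (real UShER trees contain polytomies, which this does not handle), uses Fitch's algorithm for the parsimony score, and interprets "parsimony:allele_freq" as the parsimony score divided by the number of samples carrying a non-reference allele. The names `fitch_score` and `homoplasy_ratio` are hypothetical.

```python
def fitch_score(tree, root, leaf_alleles):
    """Minimum number of mutations at one site on a fixed binary tree (Fitch)."""
    score = 0

    def visit(node):
        nonlocal score
        children = tree.get(node, [])
        if not children:                  # leaf: its observed allele
            return {leaf_alleles[node]}
        left, right = (visit(child) for child in children)
        common = left & right
        if common:                        # children agree: no mutation needed here
            return common
        score += 1                        # children disagree: count one mutation
        return left | right

    visit(root)
    return score


def homoplasy_ratio(tree, root, leaf_alleles, ref):
    """Parsimony score per sample carrying a non-reference allele (assumed metric)."""
    alt_count = sum(1 for allele in leaf_alleles.values() if allele != ref)
    if alt_count == 0:
        return 0.0
    return fitch_score(tree, root, leaf_alleles) / alt_count


# Toy topology: root -> (A, B), A -> (s1, s2), B -> (s3, s4)
tree = {"root": ["A", "B"], "A": ["s1", "s2"], "B": ["s3", "s4"]}

# Well-behaved site: one mutation on the branch to A explains both T samples.
clean = {"s1": "T", "s2": "T", "s3": "C", "s4": "C"}
# Homoplasic site: the same two T samples, but each needs its own mutation.
noisy = {"s1": "T", "s2": "C", "s3": "T", "s4": "C"}

print(homoplasy_ratio(tree, "root", clean, ref="C"))  # 0.5
print(homoplasy_ratio(tree, "root", noisy, ref="C"))  # 1.0
```

Under this interpretation, a site whose ratio stays close to 1 at a non-trivial allele count is arising independently nearly every time it is seen, which is the kind of site that would stay masked.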

This will certainly break some stuff and we'll have to figure that out when we get there. @AngieHinrichs and @corneliusroemer what do you think?

I also think it may be a good time to operationalize the branch-specific screwy site detection approach I made.

@AngieHinrichs
Contributor

Sounds good. I'll try to get to the MAPLE-ification and matOptimize soon. Yes, for sure I expect 11083 and some others to still be problematic, at least in some major lineages, but let's find out!

> I also think it may be a good time to operationalize the branch-specific screwy site detection approach I made.

As in, recode in C++ in matUtils so it runs faster than its current Pythonic 19 hours? Or just run it every week or month or so, and mask accordingly?

@russcd
Collaborator

russcd commented Nov 13, 2024

Cool. Thanks, Angie.

Let's not bother with a recode until we decide we really like it and want to run it often. It is not clear to me that branch masking will be something we want to run more often than, say, monthly-ish?
