Machine-Readable Reference File #163

Draft
wants to merge 1 commit into
base: master

Conversation

emmahodcroft
Collaborator

@emmahodcroft emmahodcroft commented May 19, 2021

This is a draft PR to outline a JSON file format containing all information about the Variants & Mutations tracked on CoVariants, along with their defining mutations. The aim is a 'lookup' file that other apps and programs can read and link to automatically.

This is a Draft PR. I would love feedback.
I recognise that in the file there are comments which are not allowed - those are to provide clarity to the file structure. It also currently just includes 1 example of a Variant & Mutation, to settle on a good format. I will then write a script which generates this file from the existing files.

I'm not very familiar with JSON format and have found it restrictive - some parts I didn't even convert as I'm not sure if they're useful & I wasn't sure how to convert in a way that's concise.
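On the comments point: a minimal sketch, using Python's standard `json` module, of why the annotated draft will not parse as-is, plus one possible workaround. The `_comment` key is a hypothetical convention (not part of this PR), and the colour value is a placeholder.

```python
import json

# A JSON document with a //-style comment, as in the annotated draft.
with_comment = '{\n  "color": "#ff0000"  // display colour\n}'

try:
    json.loads(with_comment)
    comment_accepted = True
except json.JSONDecodeError:
    # Standard JSON has no comment syntax, so strict parsers reject this.
    comment_accepted = False

# One workaround (an assumption, not the PR's format): carry the note
# as ordinary data under a clearly named key that consumers can ignore.
documented = {
    "_comment": "color is the hex colour used for this variant on the site",
    "color": "#ff0000",
}
round_tripped = json.loads(json.dumps(documented, indent=2))
```

Keeping the explanations in a separate README would work equally well and keeps the data file itself free of non-data keys.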

@nodrogluap and @chaoran-chen I'd really appreciate your thoughts on this for what you have in mind to do!

Information:

  • I imagine alignment_defining mutations will be most useful, as these can be used to try to identify sequences from an alignment only. However, this will miss sequences that have reversions, miscalls, or missing coverage at these positions.
  • phylogenetic_defining mutations are what is used to put the 'labels' on Nextstrain trees. They mark the branch where all these mutations are present, and every sequence descending from that branch gets the label whether or not it carries all of the mutations itself, so reversions and missing coverage are handled.
  • build_name is what's used in file names & URLs as it's 'safe'. Sometimes it corresponds better to the display_name, sometimes not. If the discrepancy is big enough that it's problematic, then I could try to reconcile this within CoVariants.
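To illustrate how alignment_defining could be used, here is a rough sketch of a lookup. It assumes each entry stores alignment_defining as a list of short mutation strings; the variant name and mutations are made up, and `matches_variant` and `allow_missing` are hypothetical names, not anything that exists in CoVariants.

```python
def matches_variant(sample_mutations, defining_mutations, allow_missing=0):
    """Return True if the sample carries the defining mutations.

    allow_missing tolerates that many absent mutations, e.g. to cope
    with reversions, miscalls, or positions without coverage.
    """
    missing = set(defining_mutations) - set(sample_mutations)
    return len(missing) <= allow_missing


# Hypothetical entry; the mutation strings are placeholders.
variant = {
    "build_name": "ExampleVariant",
    "alignment_defining": ["A1234T", "C2345G"],
}

full_sample = {"A1234T", "C2345G", "G9999A"}
partial_sample = {"A1234T"}  # one defining position missing

strict_hit = matches_variant(full_sample, variant["alignment_defining"])
strict_miss = matches_variant(partial_sample, variant["alignment_defining"])
lenient_hit = matches_variant(
    partial_sample, variant["alignment_defining"], allow_missing=1
)
```

The strict check is exactly the case that misses sequences with reversions or missing coverage; the tolerance parameter is one simple way a consumer could relax it, at the cost of more false positives.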

Questions:

  • Is phylogenetic_defining useful? or just leave it?
  • Is color useful?
  • Is pango_name useful? This may not match 1:1 with running Pango. It's just taken from the name table on CoVariants
  • I would like to include the amino-acid mutations that correspond to defining mutations where possible - see comments on lines 21-23. @ivan-aksamentov Would you have a good suggestion for how to incorporate this?
  • I have included the list of 'complete' mutations (see lines 41-76) - I haven't converted to JSON format. This is what makes the 'side-sausage' on CoVariants pages. Is this useful? Or is this not so useful and just get rid of?

To do:

  • Finalize the format of the file & the most useful fields
  • Write script to generate file automatically

@vercel

vercel bot commented May 19, 2021

This pull request is being automatically deployed with Vercel (learn more).
To see the status of your deployment, click below or on the icon next to each commit.

🔍 Inspect: https://vercel.com/hodcroftlab/covariants/HtcC7A1zYMdjvnyCUeDqLxAKCCKU
✅ Preview: https://covariants-git-covariantsfile-hodcroftlab.vercel.app

@emmahodcroft emmahodcroft marked this pull request as draft May 19, 2021 16:01
@chaoran-chen
Contributor

Hi Emma! Thanks for creating this file, I believe that it will be very useful!

  • Is phylogenetic_defining useful? or just leave it?

I don't have a concrete use case for it right now, but it sounds useful.

  • Is color useful?

Not for me.

  • Is pango_name useful? This may not match 1:1 with running Pango. It's just taken from the name table on CoVariants

Yes, very much.

  • I have included the list of 'complete' mutations (see lines 41-76) - I haven't converted to JSON format. This is what makes the 'side-sausage' on CoVariants pages. Is this useful? Or is this not so useful and just get rid of?

I find a complete list of mutations useful.

Two suggestions:

  • The notation for the mutations should be consistent. For example, if it is not possible to add the reference base in all cases, then it would be better to remove it everywhere. I think that the short string format is easier and that there is no need for {left: ..., pos: ..., right: ...}.
  • The nucleotide-level and amino acid-level mutations should be distinguished clearly. For all mutations, it could be like this:
"all_mutations": {
	"nonsynonymous": [
		{ "amino_acid": "S:N123Y", "nucleotide": ["A1234T"] },
		...
	],
	"synonymous_nucleotide": ["A2345C", "C3456G"]
}
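As a sketch of how a consumer might read the short string format, the following parses both kinds of mutation string. It assumes nucleotide mutations look like `A1234T` (reference base, position, new base) and amino-acid mutations like `S:N123Y` (gene, reference residue, position, new residue); the function names and the returned keys are hypothetical, and strings without a reference base are not handled.

```python
import re

# Assumed formats: "A1234T" and "S:N123Y". Both patterns are illustrative.
NUC_RE = re.compile(r"^([ACGT])(\d+)([ACGT])$")
AA_RE = re.compile(r"^([A-Za-z0-9]+):([A-Z*])(\d+)([A-Z*-])$")


def parse_nucleotide(mutation):
    """Split a nucleotide mutation string into ref, pos, and alt."""
    match = NUC_RE.match(mutation)
    if match is None:
        raise ValueError(f"not a nucleotide mutation: {mutation!r}")
    ref, pos, alt = match.groups()
    return {"ref": ref, "pos": int(pos), "alt": alt}


def parse_amino_acid(mutation):
    """Split an amino-acid mutation string into gene, ref, pos, and alt."""
    match = AA_RE.match(mutation)
    if match is None:
        raise ValueError(f"not an amino-acid mutation: {mutation!r}")
    gene, ref, pos, alt = match.groups()
    return {"gene": gene, "ref": ref, "pos": int(pos), "alt": alt}


nuc = parse_nucleotide("A1234T")
aa = parse_amino_acid("S:N123Y")
```

Because the two formats are distinguishable by the gene prefix, keeping them in separate fields as suggested above mainly saves consumers from having to guess.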

@chaoran-chen
Contributor

chaoran-chen commented Aug 23, 2021

@emmahodcroft I would like to change my previous answer: having the colors field could be useful for me/cov-spectrum. It might be a good idea to use the same colors as covariants whenever possible because some reports use screenshots from both sites.
