T2T - gnomAD liftover contains duplicate entries #1215

Open
davmlaw opened this issue Dec 18, 2024 · 1 comment

davmlaw commented Dec 18, 2024

Split from issue #814

During import of the T2T VEP annotation, we get:

psycopg2.errors.InvalidTextRepresentation: invalid input syntax for type integer: ".&."
CONTEXT: COPY annotation_variantannotation_version_3, line 2096, column gnomad_hemi_count: ".&."

We are used to gnomAD returning only 1 entry per variant, but because there are duplicates in the T2T gnomAD file, the VEP annotation contains "&"-joined values:

[2024-12-18 11:32:51,887: INFO/ForkPoolWorker-4] NON PAR: {'gnomad_ac': '9&128', 'gnomad_popmax_ac': '8&30', 'gnomad_xy_ac': '.&.', 'gnomad_af': '0.000007&0.000105', 'gnomad_afr_af': '0.000000&0.000000', 'gnomad_amr_af': '0.000000&0.000270', 'gnomad_asj_af': '0.000000&0.000261', 'gnomad_eas_af': '0.000000&0.000173', 'gnomad_fin_af': '0.000000&0.000027', 'gnomad_popmax_af': '7.78697939175904e-06&0.0005847041397053091', 'gnomad_mid_af': '0.000000&0.000000', 'gnomad_nfe_af': '0.000008&0.000071', 'gnomad_oth_af': '0.000019&0.000197', 'gnomad_sas_af': '0.000000&0.000585', 'gnomad_xy_af': '.&.', 'gnomad_an': '1342928&1221882', 'gnomad_popmax_an': '1027356&51308', 'gnomad_xy_an': '.&.', 'gnomad_fafmax_faf95_max': '3.4979999327333644e-05&0.00042004999704658985', 'gnomad_fafmax_faf99_max': '2.3279999368241988e-05&0.000364909996278584', 'gnomad_filtered': '1&0', 'gnomad_popmax': 'NFE&SAS', 'gnomad_hom_alt': '0&0', 'gnomad_non_par': '.&.', 'repeat_masker': 'L1M1#LINE/L1&L1ME4b#LINE/L1', 'variant_class': 'DE', 'canonical': True, 'consequence': 'intron_variant', 'ensembl_protein': 'ENSP05220037437', 'hgvs_c': 'ENST05220083784.1:c.1254+51del', 'impact': '1', 'intron': '6/6', 'symbol': 'SMAD1', 'version_id': 3, 'annotation_run_id': 920, 'variant_id': 14504997, 'gene_id': 'ENSG05220022465', 'transcript_id': 'ENST05220083784', 'transcript_version_id': 1442872}

The Ensembl-provided gnomAD liftover files contain duplicate records (same CHROM, POS, REF and ALT), for example:

tabix gnomad.exomes.v4.1.sites.GCA_009914755.4.trimmed_liftover.vcf.gz chr4:148869903-148869903
chr4	148869903	.	GC	G	.	PASS	AC=3;AN=1221986;AF=2.45502e-06;grpmax=nfe;fafmax_faf95_max=8.2e-07;fafmax_faf95_max_gen_anc=nfe;AC_XX=1;AF_XX=1.60411e-06;AN_XX=623398;nhomalt_XX=0;AC_XY=2;AF_XY=3.3412e-06;AN_XY=598588;nhomalt_XY=0;nhomalt=0;AC_afr=0;AF_afr=0;AN_afr=26218;nhomalt_afr=0;AC_amr=0;AF_amr=0;AN_amr=25976;nhomalt_amr=0;AC_asj=0;AF_asj=0;AN_asj=19166;nhomalt_asj=0;AC_eas=0;AF_eas=0;AN_eas=34596;nhomalt_eas=0;AC_fin=0;AF_fin=0;AN_fin=37502;nhomalt_fin=0;AC_mid=0;AF_mid=0;AN_mid=4130;nhomalt_mid=0;AC_nfe=3;AF_nfe=3.08507e-06;AN_nfe=972424;nhomalt_nfe=0;AC_raw=79;AF_raw=5.45803e-05;AN_raw=1447410;nhomalt_raw=0;AC_remaining=0;AF_remaining=0;AN_remaining=50660;nhomalt_remaining=0;AC_sas=0;AF_sas=0;AN_sas=51314;nhomalt_sas=0;AC_grpmax=3;AF_grpmax=3.08507e-06;AN_grpmax=972424;nhomalt_grpmax=0;fafmax_faf99_max=2.3e-07;fafmax_faf99_max_gen_anc=nfe;age_hist_het_bin_freq=0|0|0|0|0|0|1|0|0|0;age_hist_het_n_smaller=0;age_hist_het_n_larger=0;age_hist_hom_bin_freq=0|0|0|0|0|0|0|0|0|0;age_hist_hom_n_smaller=0;age_hist_hom_n_larger=0;AS_VQSLOD=3.1304;allele_type=del;n_alt_alleles=20;variant_type=mixed;was_mixed;lcr
chr4	148869903	rs1408958407	GC	G	.	PASS	AC=128;AN=1221882;AF=0.000104756;grpmax=sas;fafmax_faf95_max=0.00042005;fafmax_faf95_max_gen_anc=sas;AC_XX=60;AF_XX=9.62541e-05;AN_XX=623350;nhomalt_XX=0;AC_XY=68;AF_XY=0.000113611;AN_XY=598532;nhomalt_XY=0;nhomalt=0;AC_afr=0;AF_afr=0;AN_afr=26218;nhomalt_afr=0;AC_amr=7;AF_amr=0.000269521;AN_amr=25972;nhomalt_amr=0;AC_asj=5;AF_asj=0.000260879;AN_asj=19166;nhomalt_asj=0;AC_eas=6;AF_eas=0.000173451;AN_eas=34592;nhomalt_eas=0;AC_fin=1;AF_fin=2.66667e-05;AN_fin=37500;nhomalt_fin=0;AC_mid=0;AF_mid=0;AN_mid=4130;nhomalt_mid=0;AC_nfe=69;AF_nfe=7.09631e-05;AN_nfe=972336;nhomalt_nfe=0;AC_raw=590;AF_raw=0.000407625;AN_raw=1447410;nhomalt_raw=0;AC_remaining=10;AF_remaining=0.000197394;AN_remaining=50660;nhomalt_remaining=0;AC_sas=30;AF_sas=0.000584704;AN_sas=51308;nhomalt_sas=0;AC_grpmax=30;AF_grpmax=0.000584704;AN_grpmax=51308;nhomalt_grpmax=0;fafmax_faf99_max=0.00036491;fafmax_faf99_max_gen_anc=sas;age_hist_het_bin_freq=0|0|2|0|4|4|2|2|0|0;age_hist_het_n_smaller=3;age_hist_het_n_larger=0;age_hist_hom_bin_freq=0|0|0|0|0|0|0|0|0|0;age_hist_hom_n_smaller=0;age_hist_hom_n_larger=0;AS_VQSLOD=3.0278;allele_type=del;n_alt_alleles=20;variant_type=mixed;was_mixed;lcr

davmlaw commented Dec 18, 2024

So I guess the options are:

Make the gnomAD liftover VCF unique

  • Easy way: call uniq on the final file
  • Hard way: take the lowest/highest AF from exomes/genomes (or the one that is not filtered) before the merge
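A minimal sketch of the key-based variant, assuming a coordinate-sorted VCF and that "highest AF wins" is the rule we want (both are assumptions, and `dedupe_sorted_vcf` / `info_af` are hypothetical helper names). Note that the two chr4:148869903 records shown above differ in their ID and INFO columns, so a whole-line `uniq` would not collapse them; deduplicating on (CHROM, POS, REF, ALT) would.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple


def info_af(info: str) -> float:
    """Return the AF value from a VCF INFO string, or -1.0 if absent/unparsable."""
    for field in info.split(";"):
        if field.startswith("AF="):  # does not match AF_XX=, AF_afr= etc.
            try:
                return float(field[3:])
            except ValueError:
                return -1.0
    return -1.0


def dedupe_sorted_vcf(lines: Iterable[str]) -> Iterator[str]:
    """Yield header lines unchanged and one record per (CHROM, POS, REF, ALT).

    Assumes the input is coordinate sorted, so all records for a position are
    contiguous. Among duplicates, the record with the highest AF is kept.
    """
    current_pos: Optional[Tuple[str, str]] = None
    best: Dict[Tuple[str, str], Tuple[float, str]] = {}  # (REF, ALT) -> (AF, line)
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        cols = line.rstrip("\n").split("\t")
        pos_key = (cols[0], cols[1])
        if pos_key != current_pos:
            # New position: flush the winners from the previous one
            for _, kept in best.values():
                yield kept
            best = {}
            current_pos = pos_key
        allele_key = (cols[3], cols[4])
        af = info_af(cols[7])
        if allele_key not in best or af > best[allele_key][0]:
            best[allele_key] = (af, line)
    for _, kept in best.values():
        yield kept
```

Swapping the comparison to prefer the lowest AF, or a PASS record over a filtered one, would only change the `if` that updates `best`.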

Modify the importer to pick the highest one

If one of the gnomAD values contains "&" then all of them will, so pick one entry to be the representative
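A rough sketch of what that importer change could look like, assuming the annotation arrives as a dict like the one logged above and that the entry with the highest `gnomad_af` is the representative. `pick_gnomad_representative` is a hypothetical name, not existing importer code, and mapping "." to `None` is also an assumption about how missing values should be stored.

```python
from typing import Any, Dict

GNOMAD_PREFIX = "gnomad_"
MULTI_VALUE_SEPARATOR = "&"  # separator seen in the logged VEP values


def _to_float(value: str) -> float:
    """Parse a float, treating '.' or junk as lower than any real AF."""
    try:
        return float(value)
    except ValueError:
        return float("-inf")


def pick_gnomad_representative(annotation: Dict[str, Any],
                               af_field: str = "gnomad_af") -> Dict[str, Any]:
    """Collapse '&'-joined gnomAD values to the entry with the highest AF.

    Returns a new dict; non-gnomAD fields (e.g. repeat_masker) are untouched.
    """
    af_raw = annotation.get(af_field)
    if not isinstance(af_raw, str) or MULTI_VALUE_SEPARATOR not in af_raw:
        return annotation  # single gnomAD entry, nothing to do

    afs = [_to_float(v) for v in af_raw.split(MULTI_VALUE_SEPARATOR)]
    index = max(range(len(afs)), key=afs.__getitem__)  # first max wins on ties

    result = dict(annotation)
    for key, value in annotation.items():
        if not key.startswith(GNOMAD_PREFIX) or not isinstance(value, str):
            continue
        parts = value.split(MULTI_VALUE_SEPARATOR)
        if len(parts) == len(afs):
            chosen = parts[index]
            # Assumption: '.' means missing and should become NULL
            result[key] = None if chosen == "." else chosen
    return result
```

For the logged SMAD1 variant this would keep the second entry (AC 128, popmax SAS), which is also the one with `gnomad_filtered` of 0.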

Actions:

I am going to do the simplest option, which is to filter with uniq and see what the difference is. It might only be a very tiny fraction. I am writing the line counts to /data/incoming/t2t_gnomad lines.txt etc
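To get the fraction independently of line counts, something like the following could count how many records repeat a (CHROM, POS, REF, ALT) key. This is only a sketch: `duplicate_stats` is a hypothetical helper, and it assumes a coordinate-sorted VCF so that only one position's alleles are held in memory at a time.

```python
from typing import Iterable, Optional, Set, Tuple


def duplicate_stats(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (total_records, extra_records) for a coordinate-sorted VCF.

    extra_records counts records whose (CHROM, POS, REF, ALT) was already
    seen, i.e. how many lines key-based deduplication would remove.
    """
    total = 0
    extra = 0
    current_pos: Optional[Tuple[str, str]] = None
    seen: Set[Tuple[str, str]] = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and header lines
        chrom, pos, _id, ref, alt = line.split("\t", 5)[:5]
        if (chrom, pos) != current_pos:
            current_pos = (chrom, pos)
            seen = set()
        total += 1
        if (ref, alt) in seen:
            extra += 1
        else:
            seen.add((ref, alt))
    return total, extra


# Usage (the path is a placeholder):
#   import gzip
#   with gzip.open("liftover.vcf.gz", "rt") as f:
#       total, extra = duplicate_stats(f)
#       print(f"{extra} of {total} records are duplicates")
```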

I'll also write the importer modification to pick the lowest/highest if the fraction is large.
