getRelativeGenomicPosition returning 'unknown chromosome' for large intervals because of unaccounted missing sequences #1022

thomcsmits · 2023-12-22T14:41:23Z

Problem
getRelativeGenomicPosition returns 'unknown' as the chromosome for a large interval at the end of the genome, because CHROM_SIZES does not account for some sequences that are present in the data.

Example used

{
  "title": "Visual Encoding",
  "subtitle": "Gosling provides diverse visual encoding methods",
  "layout": "linear",
  "assembly": "hg38",
  // "xDomain": {"chromosome": "chr1", "interval": [1, 3000500]},
  "views": [
    {
      "tracks": [
        {
          "id": "track-1",
          "data": {
            "url": "https://server.gosling-lang.org/api/v1/tileset_info/?d=cistrome-multivec",
            "type": "multivec",
            "row": "sample",
            "column": "position",
            "value": "peak",
            "categories": ["sample 1", "sample 2", "sample 3", "sample 4"],
            "binSize": 4
          },
          "mark": "rect",
          "x": {"field": "start", "type": "genomic", "axis": "top"},
          "xe": {"field": "end", "type": "genomic"},
          "row": {"field": "sample", "type": "nominal", "legend": true},
          "color": {"field": "peak", "type": "quantitative", "legend": true},
          "tooltip": [
            {"field": "start", "type": "genomic", "alt": "Start Position"},
            {"field": "end", "type": "genomic", "alt": "End Position"},
            {
              "field": "peak",
              "type": "quantitative",
              "alt": "Value",
              "format": ".2"
            },
            {"field": "sample", "type": "nominal", "alt": "Sample"}
          ],
          "width": 600,
          "height": 130
        }
      ]
    }
  ]
}

Examining the data and the chromosome mapping, the entire genomic interval from about 3,088,000,000 to 3,260,000,000 (about 172,000,000 positions) maps to 'unknown'.

A similarly large 'unknown' interval is observed for the basic bar example.

Why I think this is happening
Note: this behavior depends largely on the data and how it was assembled in the first place.
Gosling's chromosome sizes don't include unlocalized/unplaced sequences in the same way that IGV does.

Summing the lengths of Gosling (hg38) gives: 3088269832
Summing the lengths of IGV (hg38) gives: 3209286105

Why is this important? These extra sequences do not all go at the end of the genome. Leaving out e.g. chr11_KI270721v1_random after chr11 shifts every subsequent chromosome, so they are all mapped incorrectly (and it also leaves this large unknown area at the end).

The real problem
Not including these positions causes all chromosomes after chr1 to be mapped incorrectly!
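
To make the shift concrete, here is a minimal sketch of the lookup with and without an unlocalized sequence in the size table. The chromosome names and sizes are toy values, and `toRelative` is a hypothetical helper written for illustration; it is not Gosling's actual getRelativeGenomicPosition.

```typescript
// Toy sizes (NOT real hg38 values), in the order the data source laid them out.
type ChromSize = [string, number];

const fullOrder: ChromSize[] = [
  ["chrA", 100],
  ["chrA_random", 20], // unlocalized sequence placed right after chrA
  ["chrB", 80],
];

// The same table with the unlocalized sequence left out.
const trimmedOrder: ChromSize[] = fullOrder.filter(([name]) => !name.includes("_"));

// Hypothetical helper: map a 1-based absolute position to a chromosome-relative one
// by walking the cumulative sizes.
function toRelative(
  sizes: ChromSize[],
  absPos: number
): { chromosome: string; position: number } {
  let offset = 0;
  for (const [name, size] of sizes) {
    if (absPos <= offset + size) {
      return { chromosome: name, position: absPos - offset };
    }
    offset += size;
  }
  // Past the end of every known sequence.
  return { chromosome: "unknown", position: absPos - offset };
}

// With the full table, absolute position 130 is base 10 of chrB.
const correct = toRelative(fullOrder, 130);
// Without chrA_random, the same position is reported 20 bases too far along chrB.
const shifted = toRelative(trimmedOrder, 130);
// Positions beyond the trimmed total fall through to "unknown".
const lost = toRelative(trimmedOrder, 190);

console.log(correct, shifted, lost);
```

The omitted sequence produces both symptoms at once: an offset inside every later chromosome and an 'unknown' tail whose length equals the total size of what was left out.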

Suggested changes
Include unlocalized/unplaced sequences in CHROM_SIZES and account for them in getRelativeGenomicPosition.


etowahadams · 2023-12-22T20:46:36Z

I didn't realize that hg38 had all of these chromosome sub fragments. Interesting! I'm so fortunate to live in the telomere-to-telomere era of the human genome.

Looking into this, it looks like the higlass server gives the data in terms of absolute coordinates. The multivec tiles were probably created with something like this using clodius (also consistent with the IGV assembly you mentioned). So gosling maps the absolute coordinates to the wrong chromosome.

Given how random these chromosome names are, we definitely want to implement some notion of priority of different chromosomes, so that for example chr14_KI270725v1_random doesn't end up being the main chromosome name shown.

Change the hg38 assembly, have some way to represent priority of labels
Change the gosling-track-axis to respect this priority of labels
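
One possible shape for the priority idea, as a rough sketch only: the rule below assumes that sequence names without an underscore are primary chromosomes and that names ending in `_random` are unlocalized. `labelPriority` and `pickAxisLabels` are hypothetical names, not part of Gosling or gosling-track-axis.

```typescript
// Hypothetical priority rule based on the naming convention alone:
// 0 = primary chromosome, 1 = unlocalized (*_random), 2 = anything else.
function labelPriority(name: string): number {
  if (!name.includes("_")) return 0;
  if (name.endsWith("_random")) return 1;
  return 2;
}

// Keep only the names whose priority is at or below the given threshold.
function pickAxisLabels(names: string[], maxPriority: number = 0): string[] {
  return names.filter((name) => labelPriority(name) <= maxPriority);
}

const names = ["chr14", "chr14_KI270725v1_random", "chr15", "chrUn_KI270302v1"];

// By default only primary chromosomes would be shown as axis labels.
const axisLabels = pickAxisLabels(names);
// Raising the threshold also lets unlocalized sequences through.
const withRandom = pickAxisLabels(names, 1);

console.log(axisLabels, withRandom);
```

With something like this, the size table could still contain every sequence so that coordinates map correctly, while the axis only labels the entries that pass the threshold.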

thomcsmits added the bug🐛 Something isn't working label Dec 22, 2023

thomcsmits assigned thomcsmits and sehilyi and unassigned thomcsmits Dec 22, 2023
