vg rna can't parse bgzipped GFF3 #4459

faithokamoto · 2024-11-27T17:19:38Z

1. What were you trying to do?

Create a pantranscriptome from a GBZ graph and bgzipped GFF3 annotation set, following the transcriptomic analysis wiki.

vg rna --progress --transcripts annotations.gff3.gz --transcript-tag Parent --use-hap-ref --gbz-format input_graph.gbz > pantranscriptome.pg

(My exact command on the cluster was this:)

srun -c 16 --mem 40G --time 20:00 vg rna --progress --transcripts /private/groups/patenlab/fokamoto/lr-pan-rna/data/chm13v2.0_RefSeq_Liftoff_v5.2.chm13_prefix.gff3.gz --transcript-tag Parent --use-hap-ref --gbz-format /private/home/hickey/dev/work/hprc-v1.1-jul4/hprc-v1.1-mc-chm13/hprc-v1.1-mc-chm13.gbz > test.pg

2. What did you want to happen?

Automatic detection & handling of the bgzipped annotation file, or at least a graceful error message along the lines of ".gff3 file cannot be compressed".
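For illustration, the kind of detection I have in mind can be sketched in shell by checking for the gzip magic bytes (0x1f 0x8b) at the start of the file. bgzip output is valid gzip, so it begins with the same two bytes. The helper name `is_gzipped` is hypothetical, and this is only a sketch of the idea, not how vg handles (or would handle) its inputs.

```shell
# Hypothetical helper, not part of vg: succeeds if the file starts with the
# gzip magic bytes 0x1f 0x8b. bgzip files are valid gzip, so they match too.
is_gzipped() {
    # Read the first two bytes and render them as hex without separators.
    magic=$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')
    [ "$magic" = "1f8b" ]
}

# Example use (assumed file name):
#   if is_gzipped annotations.gff3.gz; then
#       echo "annotation file is compressed; decompress it first" >&2
#   fi
```

A check like this looks at file contents rather than the file name, so it would also catch a compressed file that lacks the .gz extension.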

3. What actually happened?

When vg rna tried to parse the annotation file to add transcript splice junctions, it errored on the compressed data. Note that it fails on the first line of the file, which should be the header ##gff-version 3.

[vg rna] Parsing graph file ...
[vg rna] Converting graph format ...
[vg rna] Graph and GBWT index parsed in 328.684 seconds, 14.7488 GB
[vg rna] Adding transcript splice-junctions and exon boundaries to graph ...
        ERROR: Chromosome path "�BCw�Z]o�8}��" not found in graph or haplotypes index (line 1).

4. If you got a line like Stack trace path: /somewhere/on/your/computer/stacktrace.txt, please copy-paste the contents of that file here:

No stacktrace, but the output of --progress is in #3.

5. What data and command can the vg dev team use to make the problem happen?

Annotation file - note that this uses the chr1 naming system, whereas the graph uses the CHM13#0#chr1 system, so you have to edit the file before using it successfully. However, that's not necessary to reproduce the error, which occurs before vg has a chance to look at any chromosome names.
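One way to do that edit on an uncompressed GFF3, as a rough sketch: prefix the seqid column (column 1) of every feature line. The function name `add_seqid_prefix` is hypothetical, and this is not necessarily how the edited file mentioned below was produced. It leaves directive lines such as `##sequence-region` untouched, so those would still carry the old names.

```shell
# Hypothetical helper: prefix the seqid column (column 1) of each feature line,
# e.g. "chr1" becomes "CHM13#0#chr1". Comment/directive lines (starting with
# "#") and empty lines are passed through unchanged.
add_seqid_prefix() {
    awk -v prefix="$1" 'BEGIN { FS = OFS = "\t" }
        /^#/ || NF == 0 { print; next }
        { $1 = prefix $1; print }'
}

# Example use (assumed file names):
#   add_seqid_prefix 'CHM13#0#' < annotations.chr.gff3 > annotations.gff3
```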
On the cluster, the chromosome-edited file is /private/groups/patenlab/fokamoto/lr-pan-rna/data/chm13v2.0_RefSeq_Liftoff_v5.2.chm13_prefix.gff3.gz. The same command works with the non-gzipped version of the file, which has the same name but without the .gz suffix.
I used a GBZ from the HPRC pangenome v1.1, on the cluster at /private/home/hickey/dev/work/hprc-v1.1-jul4/hprc-v1.1-mc-chm13/hprc-v1.1-mc-chm13.gbz. I suspect this will occur for any valid .gbz graph.
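As a workaround sketch, the compressed file can be unpacked to a plain-text copy first. `gzip -dc` reads bgzip output because bgzip files are valid multi-member gzip. The helper name `decompress_annotation` is hypothetical; the vg rna invocation in the comment is the command from section 1 with only the annotation path changed.

```shell
# Hypothetical helper: write a decompressed copy next to the original,
# dropping the .gz suffix. The compressed file is left in place.
decompress_annotation() {
    case "$1" in
        *.gz) ;;
        *) echo "expected a .gz file: $1" >&2; return 1 ;;
    esac
    gzip -dc "$1" > "${1%.gz}"
}

# Example use (assumed file names):
#   decompress_annotation annotations.gff3.gz
#   vg rna --progress --transcripts annotations.gff3 --transcript-tag Parent \
#       --use-hap-ref --gbz-format input_graph.gbz > pantranscriptome.pg
```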

6. What does running vg version say?

I built vg from source with the following code:

git clone --recursive https://github.com/vgteam/vg.git
cd vg 
git checkout lr-giraffe
git checkout autoindex-zipcodes
git submodule update --init --recursive
srun -c 16 --mem=80G --time=00:30:00 make -j16

Running vg version gives:

vg version v1.61.0-1291-g1c97950d1 "Plodio"
Compiled with g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 on Linux
Linked against libstd++ 20230528
Using HTSlib headers 101990, library 1.19.1-29-g3cfe8769
Built by [email protected]

If you want to run my exact installation, y'all can find it in /private/groups/patenlab/fokamoto/lr-pan-rna/vg :)
