vg rna can't parse bgzipped GFF3 #4459

Open
faithokamoto opened this issue Nov 27, 2024 · 0 comments

1. What were you trying to do?

Create a pantranscriptome from a GBZ graph and bgzipped GFF3 annotation set, following the transcriptomic analysis wiki.

vg rna --progress --transcripts annotations.gff3.gz --transcript-tag Parent --use-hap-ref --gbz-format input_graph.gbz > pantranscriptome.pg

(My exact command on the cluster was this:)

srun -c 16 --mem 40G --time 20:00 vg rna --progress --transcripts /private/groups/patenlab/fokamoto/lr-pan-rna/data/chm13v2.0_RefSeq_Liftoff_v5.2.chm13_prefix.gff3.gz --transcript-tag Parent --use-hap-ref --gbz-format /private/home/hickey/dev/work/hprc-v1.1-jul4/hprc-v1.1-mc-chm13/hprc-v1.1-mc-chm13.gbz > test.pg

2. What did you want to happen?

Automatic detection & handling of the bgzipped annotation file, or at least a graceful error message along the lines of ".gff3 file cannot be compressed".
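To illustrate the kind of automatic detection meant here, below is a minimal sketch in Python (not vg's actual C++ code). It assumes compression can be recognized from the two-byte gzip magic number, which bgzipped files also start with because BGZF is a series of concatenated gzip members. The helper names `is_gzipped` and `open_annotation` are hypothetical.

```python
import gzip

# Every gzip file, and therefore every bgzipped (BGZF) file, starts with these bytes.
GZIP_MAGIC = b"\x1f\x8b"


def is_gzipped(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def open_annotation(path):
    """Open a GFF3 file as text, decompressing on the fly when it is compressed."""
    if is_gzipped(path):
        # gzip.open reads concatenated gzip members, which is how BGZF stores its blocks.
        return gzip.open(path, "rt")
    return open(path, "rt")
```

Sniffing the content rather than trusting the `.gz` extension means a renamed file is still handled. If transparent decompression is not wanted, the same check could instead trigger the clearer error message.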

3. What actually happened?

When vg rna tried to parse the annotation file to add transcript splice junctions, it read the compressed bytes as if they were text and errored. Note that it fails on line 1 of the file, which in the uncompressed version is the header ##gff-version 3.

[vg rna] Parsing graph file ...
[vg rna] Converting graph format ...
[vg rna] Graph and GBWT index parsed in 328.684 seconds, 14.7488 GB
[vg rna] Adding transcript splice-junctions and exon boundaries to graph ...
        ERROR: Chromosome path "�BCw�Z]o�8}��" not found in graph or haplotypes index (line 1).

4. If you got a line like Stack trace path: /somewhere/on/your/computer/stacktrace.txt, please copy-paste the contents of that file here:

No stacktrace, but the output of --progress is in section 3 above.

5. What data and command can the vg dev team use to make the problem happen?

  • Annotation file - note that this uses the chr1 naming system, and the graph uses the CHM13#0#chr1 system, so you have to edit the file before using it successfully. However, that's not necessary to reproduce the error, which occurs before vg reaches any real chromosome name.
    On the cluster, the chromosome-edited file is /private/groups/patenlab/fokamoto/lr-pan-rna/data/chm13v2.0_RefSeq_Liftoff_v5.2.chm13_prefix.gff3.gz. The same command works with the uncompressed version of the file, which has the same path but without the .gz suffix.
  • I used a GBZ from the HPRC pangenome v1.1, on the cluster at /private/home/hickey/dev/work/hprc-v1.1-jul4/hprc-v1.1-mc-chm13/hprc-v1.1-mc-chm13.gbz. I suspect this will occur for any valid .gbz graph.
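Since the uncompressed file works, the workaround can be scripted. The following is a rough Python sketch under that assumption: it writes a plain-text copy of the annotation and builds the same vg rna argument list used in this report. The helper names `decompress_annotation` and `vg_rna_args` are hypothetical, and the sketch does not run vg itself.

```python
import gzip
import shutil


def decompress_annotation(src, dst):
    """Write a plain-text copy of a gzip- or bgzip-compressed file and return its path."""
    with gzip.open(src, "rb") as compressed, open(dst, "wb") as plain:
        # Stream in chunks so a large annotation file is never held fully in memory.
        shutil.copyfileobj(compressed, plain)
    return dst


def vg_rna_args(annotation, graph):
    """Build the vg rna argument list from this report for a plain-text annotation."""
    return [
        "vg", "rna", "--progress",
        "--transcripts", annotation,
        "--transcript-tag", "Parent",
        "--use-hap-ref",
        "--gbz-format", graph,
    ]
```

The resulting list could be passed to `subprocess.run` with stdout redirected to the output .pg file, mirroring the shell redirection in the original command.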

6. What does running vg version say?

I built vg from source with the following code:

git clone --recursive https://github.com/vgteam/vg.git
cd vg 
git checkout lr-giraffe
git checkout autoindex-zipcodes
git submodule update --init --recursive
srun -c 16 --mem=80G --time=00:30:00 make -j16

Running vg version gives:

vg version v1.61.0-1291-g1c97950d1 "Plodio"
Compiled with g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 on Linux
Linked against libstd++ 20230528
Using HTSlib headers 101990, library 1.19.1-29-g3cfe8769
Built by [email protected]

If you want to run my exact installation, y'all can find it in /private/groups/patenlab/fokamoto/lr-pan-rna/vg :)
