Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FeatureDB.update on disk db very slow #227

Closed
YeHW opened this issue Feb 27, 2024 · 2 comments
Closed

FeatureDB.update on disk db very slow #227

YeHW opened this issue Feb 27, 2024 · 2 comments

Comments

@YeHW
Copy link

YeHW commented Feb 27, 2024

I'm trying to use FeatureDB.update and FeatureDB.create_introns to add intron features to the database.
If the database is created in memory, the speed is very fast, but if created on disk, it appears to be very slow.

gffutils version: 0.12
python version: 3.12.0

The gtf file I'm using is from refseq ftp:

curl -s https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/annotation_releases/105.20220307/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_genomic.gtf.gz \
  | gunzip -c \
  | grep -E "^#|^NC_000024.9|^NC_000013.10" \
  | gzip > NC_000013.10-NC_000024.9.gtf.gz

It's a subset of GCF_000001405.25_GRCh37.p13_genomic.gtf.gz.

Code:

import gffutils
from typing import Dict

def _create_db_kwargs() -> Dict:
    return {
        "verbose": True,
        "disable_infer_genes": True,
        "disable_infer_transcripts": True,
        "merge_strategy": "error",
    }

def _make_db(
    gtf: str, out: str, create_db_kwargs: Dict = _create_db_kwargs()
) -> None:
    db: gffutils.FeatureDB = gffutils.create_db(
        data=gtf, dbfn=out, **create_db_kwargs
    )

    # TODO create intron features
    # Creating intron features is very fast
    db.create_introns()

    # Updating will be very slow if out is not ":memory:"
    db.update(db.create_introns(), **create_db_kwargs)

if __name__ == "__main__":
    # Very fast if db is created in memory
    _make_db(gtf="NC_000013.10-NC_000024.9.gtf.gz", out=":memory:")

    # Very slow if db is created on disk
    _make_db(gtf="NC_000013.10-NC_000024.9.gtf.gz", out="NC_000013.10-NC_000024.9.gtf.gz.db")

If db is created on disk, I observed that it hangs on this step:
Populating features table and first-order relations: 0 features
What could cause this? Thanks in advance for any insights!

@daler
Copy link
Owner

daler commented Apr 13, 2024

OK, figured out what the issue is here. This is effectively updating the database with itself, so it's reading and writing simultaneously.

That is, db.create_introns() iterates through a query that incrementally yields exons from the db and creates introns from them; db.update() writes them immediately back to the database they're being read-and-generated from.

Three quick fixes:

1. Convert to a list of introns first

Consume the create_introns() generator before writing. That is, instead of

db.update(db.create_introns(), **kwargs)

use

db.update(list(db.create_introns(), **kwargs)

This will increase memory usage, but it works.

2. Use WAL

WAL allows simultaneous reads/writes without blocking.

Warning, this does NOT work on a networked filesystem like those typically used on an HPC cluster!

db.set_pragmas({'journal_mode': 'WAL'})
db.update(db.create_introns(), **kwargs)

3. Write to intermediate file

If memory is an issue and you're using networked filesystem, then you can write out to file first:

with open('tmp.gtf', 'w') as fout:
    for intron in db.create_introns():
        fout.write(str(intron) + '\n')

db.update(gffutils.DataIterator('tmp.gtf'), **kwargs)

Summary

I'm not sure if anything in the code should be changed to address this. create_introns(), like most things throughout gffutils, is implemented as a generator to keep the memory footprint low. I don't think I want to return a list such that create_introns() suddenly uses way more memory than any other method. I don't want to use WAL by default because gffutils tends to be used on HPC clusters and those clusters tend to have networked filesystems. And I don't think I want to write to a temp file all the time, that seems messy.

These different solutions would each be useful in different situations. So I think the best thing to do is to add some explanatory text to both update() and create_introns() and update the docs as well.

daler added a commit that referenced this issue Apr 13, 2024
@daler
Copy link
Owner

daler commented Apr 13, 2024

Addressed in #231

@daler daler closed this as completed Apr 13, 2024
photocyte pushed a commit to photocyte/gffutils that referenced this issue Jul 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants