Looping over a VCF file seems to incur huge memory #26

Open
biona001 opened this issue Mar 29, 2020 · 0 comments
biona001 commented Mar 29, 2020

I'm writing a routine to import a VCF file as a numeric matrix, but memory usage is much larger than expected.

As a minimal working example, consider the code below, which loops over every genotype entry in a VCF file:

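using GeneticVariation
function loop_vcf()
    reader = VCF.Reader(open("target.vcf", "r"))
    s = 0  # counts genotype entries visited
    # Outer loop: one record per variant; inner loop: one entry per sample
    for record in reader, geno in record.genotype
        s += 1
    end
    close(reader)
    return s
end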

On a test dataset (target.vcf.gz, which must be decompressed first) with 3000 records and 100 samples, I get the following benchmark:

using BenchmarkTools
@benchmark loop_vcf()
BenchmarkTools.Trial:
  memory estimate:  98.64 MiB
  allocs estimate:  941005
  --------------
  minimum time:     62.249 ms (5.75% GC)
  median time:      63.186 ms (5.99% GC)
  mean time:        63.835 ms (6.75% GC)
  maximum time:     79.381 ms (5.22% GC)
  --------------
  samples:          79
  evals/sample:     1
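
For scale, here is a rough sketch of the per-record cost these numbers imply. It uses only the figures reported above; the variable names are purely illustrative. Note that BenchmarkTools' memory estimate is the cumulative number of bytes allocated during the run, not peak resident memory.

```julia
# Figures copied from the benchmark output and dataset description above.
total_mib    = 98.64     # "memory estimate" in MiB (cumulative allocations)
total_allocs = 941_005   # "allocs estimate"
n_records    = 3_000     # records in target.vcf
n_samples    = 100       # samples per record

# Average cost per record and per genotype entry.
kib_per_record      = total_mib * 1024 / n_records
allocs_per_record   = total_allocs / n_records
allocs_per_genotype = allocs_per_record / n_samples
bytes_per_alloc     = total_mib * 2^20 / total_allocs

println("KiB allocated per record:    ", round(kib_per_record, digits=1))
println("Allocations per record:      ", round(allocs_per_record, digits=1))
println("Allocations per genotype:    ", round(allocs_per_genotype, digits=1))
println("Average bytes per allocation: ", round(bytes_per_alloc, digits=1))
```

That works out to about 34 KiB and 314 allocations per record, or roughly 3 small allocations per genotype entry, which suggests the cost scales with the number of records times the number of samples rather than with the file size.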

Why is the memory estimate so large? My data file target.vcf is only 1.3 MB on disk, so this level of allocation looks highly suspicious.
