Issue with XSI Compression for Large UKBB Genotype Subsets #6

Open
Zhangliubin opened this issue Nov 16, 2024 · 12 comments

@Zhangliubin commented Nov 16, 2024

Hi,

I am currently benchmarking tools that balance "compression and accessibility" for genotype data and encountered an issue when using XSI with large datasets. I hope you can assist me in resolving it.

The dataset originates from UK Biobank's ukb24310_c21_b*_v1.vcf.gz, consisting of 2,336 files with a total size of 14.29 TB. My workflow includes quality control, format conversion, and merging using GBC, ensuring reliable genotypes while excluding quality information. From the resulting GTB files (generated by GBC), I created subsets of (1, 10, 100, 1,000, 5,000, 10,000, 50,000, 100,000, 200,000) samples, all containing 13,791,793 variants (excluding variants with AN=0). If you have access to UKBB data, you should be able to reproduce this, though the process can be time-consuming.

XSI works reliably for sample sizes between 1 and 50,000. However, for subsets with 100,000 and 200,000 samples, the following exception occurs:

Command executed:
"""
start=$(date +%s.%N)
xsqueezeit -c -f ~/data/UKBB_hg38/wgs/chr21.samples_200000.hg38.vcf.gz -o ~/tmp/xsi/sample_200000_xsi --zstd
end=$(date +%s.%N)
echo "Elapsed time: $(printf "%.3f" $(echo "$end - $start" | bc)) seconds"
"""

Error message:
"""
It seems the file /home/root/data/UKBB_hg38/wgs/chr21.samples_200000.hg38.vcf.gz is mostly unphased
header 256 bytes, total 256 bytes written
Size too big to be encoded with j
Failed to write compressed block
Failure occurred, exiting...
Elapsed time: 10299.386 seconds
"""

The error suggests an issue related to block size during compression. I’d appreciate any insights or suggestions for overcoming this limitation, especially for handling larger sample sizes efficiently.

Thank you for your time.

@rwk-unil (Owner) commented

Hello,
Thank you for reporting this issue. The problem comes from the size of a block compressed by ZSTD: XSI cannot directly encode (compressed) blocks that are bigger than 4 GB because it uses a uint32_t to represent the block size and the compressed block size; this value is required so that XSI knows how much space to allocate during decompression.

XSI compresses VCF/BCF in blocks of 8,192 VCF lines

So the requirement here is that the 200,000 samples x 8,192 variants matrix should be representable (post PBWT and WAH encoding) in under 4 GB and compress to under 4 GB.
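As a rough back-of-envelope (illustrative bounds only, not what XSI actually emits), this is what one default block can weigh at 200,000 samples:

"""
# One default block: 8,192 VCF lines, 200,000 samples = 400,000 haplotypes.
echo "400000 * 8192 / 8 / 2^20" | bc      # ~390 MiB if every allele costs 1 bit (ideal bi-allelic case)
echo "400000 * 8192 * 4 / 2^30" | bc -l   # ~12.2 GiB if every allele needed a 4-byte entry (worst case)
# Multi-allelic splitting adds extra binary lines per VCF line, pushing a block further towards the limit.
"""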

I had never encountered this, and I have compressed files with up to a million samples.
I wonder whether the uncompressed data is too big, the compressed block is, or there is a bug.

I have pushed eaf8feb; this patch adds the --variant-block-length option, which defines the number of VCF lines compressed together in a block. I also made the error message a bit more verbose so that it shows the offending size.

Could you try compressing again, maybe with --variant-block-length 4096, and see how it goes?
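For example, reusing the paths from your command above with the rebuilt binary:

"""
xsqueezeit -c -f ~/data/UKBB_hg38/wgs/chr21.samples_200000.hg38.vcf.gz \
           -o ~/tmp/xsi/sample_200000_xsi --zstd --variant-block-length 4096
"""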

Does this also occur with the original UKB file, prior to your workflow?
If so, I could reproduce the issue more easily.

Thank you for sharing this issue.
Best regards,
Rick

@Zhangliubin (Author) commented Nov 18, 2024

Hi, Rick,

Thank you for your prompt response and detailed explanation regarding the block size limitation in XSI.

The original UKBB-WGS dataset I am working with is extremely large (1.06 PB, containing 490,000 samples and 1 billion variants). To handle this, I preprocess the data using GBC, a faster tool with integrated QC features, to merge and prepare files. This preprocessing step generates a single GTB/PGEN/VCF.GZ file containing a large number of variants, which is subsequently used for benchmarking, including tests with XSI. I have not tested directly on the original UKBB dataset yet. My plan is to first identify tools that show promise for large-scale genomic datasets and then test these on the original files.

I understand the design of your variant matrix. However, I am uncertain how multi-allelic variants are handled in XSI. A common approach is to split multi-allelic variants into multiple bi-allelic sites, which could potentially result in a block containing more than the expected 8,192 sites. If this assumption is correct, my dataset might indeed contain such variants, with some loci having over 200 alleles (e.g., small INDEL fragments).

After recompiling XSI with your patch, I ran the following script:

"""
start=$(date +%s.%N)
xsqueezeit -c -f ~/data/UKBB_hg38/wgs/chr21.samples_200000.hg38.vcf.gz -o ~/tmp/xsi/sample_200000_xsi --zstd
end=$(date +%s.%N)
echo "Elapsed time: $(printf "%.3f" $(echo "$end - $start" | bc)) seconds"
"""
This resulted in the following error message:

"""
It seems the file /home/root/data/UKBB_hg38/wgs/chr21.samples_200000.hg38.vcf.gz is mostly unphased
header 256 bytes, total 256 bytes written
Block data size : 10971456540 Compressed block data size : 315414128
Uncompressed data size (10971456540) too big to be encoded with uint32_t
Failed to write compressed block

^Ctrl + Z
Elapsed time: 2161.529 seconds
"""

After receiving this error, the program did not exit, and I had to manually terminate it (the file size is approximately 200GB, and the process was estimated to take two days).

Given these results, I suspect the issue may stem from a combination of the large variant matrix and specific handling of multi-allelic sites. Please let me know if you have any additional suggestions or require further details for debugging.

Best regards,
Liubin Zhang

@Zhangliubin (Author) commented Nov 18, 2024

I encountered another issue during the decompression step using the following command:

xsqueezeit -x -O z -f <input> -o <output>

While the tool works as expected on a subset of 1-10,000 samples, I observed the following exception when running on a subset of 50,000 samples:

"""
Failed to decompress block
Error : Unknown frame descriptor.
"""

Additionally, when encoding the 50,000-sample subset, the file size did not scale as I expected. For reference, I anticipated a size of approximately 4.645 GB (based on 0.929 GB per 10,000 samples × 5), but the resulting file size was 7.898 GB.

I hope this discrepancy in the output size might provide some useful clues for debugging. Please let me know if further information is required.


@rwk-unil (Owner) commented

Hello Liubin,
Multi-allelic variants are handled as a single VCF line, but they are split into one binary line per alt allele in the binary representation. This results in bigger blocks and may be the issue. From the error message we can see the block size is 10,971,456,540 bytes, or about 10 GB.

As I said, you can change the block size with --variant-block-length <N>. The default value is 8,192; since this results in uncompressed blocks that are too big, you can try a smaller value, e.g., --variant-block-length 2048.

I understand this is not ideal, as compression could still fail if the block size is not small enough; I should patch XSI to encode block sizes with 64 bits so that this is no longer an issue.

The reason the program did not exit right away is that while it compresses, it also generates the binary indexing matrix in another thread (which traverses the whole VCF/BCF). That task is usually faster than the compression, but here, because of the massive file and the error, it had not yet finished, so you had to kill the program.

To save some time I would recommend using .bcf rather than .vcf.gz: .vcf.gz is textual VCF compressed with gzip, while .bcf is binary data compressed with gzip, so with .bcf the reader does not have to do all the "text to binary" conversions it does with .vcf.gz.
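For example, a one-time conversion with bcftools (assuming bcftools is installed; the file names are just taken from your earlier command and can be adjusted):

"""
bcftools view -Ob -o chr21.samples_200000.hg38.bcf ~/data/UKBB_hg38/wgs/chr21.samples_200000.hg38.vcf.gz
bcftools index chr21.samples_200000.hg38.bcf
xsqueezeit -c -f chr21.samples_200000.hg38.bcf -o ~/tmp/xsi/sample_200000_xsi --zstd
"""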

The size bump between 10,000 and 50,000 samples is because, when there are fewer than 65,536 haplotypes (about 32,000 samples), XSI uses uint16_t to represent the haplotype positions in PBWT order; with more samples it uses uint32_t, which takes double the space but can represent up to 4 billion haplotypes (2 billion samples).
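In other words, a quick check of where the threshold falls:

"""
echo "10000 * 2" | bc   # 20,000 haplotypes  -> fits in uint16_t (max 65,535)
echo "50000 * 2" | bc   # 100,000 haplotypes -> needs uint32_t, doubling the per-position cost
"""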

The "Unknown frame descriptor" error comes from ZSTD, it means ZSTD was passed a block that it doesn't recognise, so either the block is corrupted or a wrong location is given to ZSTD.

It might be a bug with multi-allelic sites; it is hard to say from this alone.
It could also be interesting to try splitting the multi-allelic sites with bcftools norm -a.

Also, does your dataset have a lot of missing data (. in VCF)? XSI does not compress missing data well (I assumed datasets would not have large amounts of missingness), and this might cause trouble. Depending on how you filter in your QC stages, you might introduce a large amount of missingness (many samples with . as one or both alleles at a position), and this might be why the size becomes so big. So if your QC removes genotypes at many loci for many samples (by remove I mean replaces them with ./.), it might be necessary to filter out the variants with high missingness, i.e., variants where only a few samples have data, because these variants will compress badly.
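One way to quantify this (a sketch using bcftools, assuming the fill-tags plugin is available in your build; file names and the 10% threshold are only examples):

"""
# Annotate each site with its fraction of missing genotypes, then inspect or filter:
bcftools +fill-tags input.bcf -Ob -o annotated.bcf -- -t F_MISSING
bcftools query -f '%CHROM\t%POS\t%INFO/F_MISSING\n' annotated.bcf | head
bcftools view -e 'F_MISSING > 0.1' annotated.bcf -Ob -o low_missing.bcf   # drop sites with >10% missing alleles
"""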

As the error during compression arrives quite early (after about 30 minutes of processing), it would be nice to isolate it. I'll have a look at making the error more verbose and see if I can print the region of the VCF file; that way we can extract it and save time by running compression only on the part that fails.

So for the moment, maybe try compressing with --variant-block-length 2048, or even smaller (1024), to see if you can at least compress the dataset, and check the missingness, because I think this might be the issue; for now it is hard for me to say what the problem is.

Regards,
Rick

@Zhangliubin (Author) commented Nov 18, 2024

Hi Rick,

Thank you for your detailed explanation and clarification.

Indeed, my dataset contains a substantial amount of missing genotype data. For example, for the file ukb24310_c1_b6089_v1.vcf.gz, the sizes before and after PLINK quality control are 341.3 MB and 249.84 MB, respectively (using the parameters --vcf-min-gq 20 --vcf-min-dp 8). This difference reflects the loss of many low-quality genotypes. My QC process includes additional steps beyond those mentioned, because the dataset loses much of its quality information after passing through several high-performance genotype compression methods; removing low-quality genotypes beforehand is necessary to avoid false positives in downstream analyses.

I agree that using .bcf as both input and output would save considerable time, and I will conduct additional tests on this format in the future. At this stage, however, to ensure compatibility across all software tools, I am using the more general and easily parsable .vcf.gz format.

It seems that adjusting the --variant-block-length parameter or allowing the program to automatically adjust the block size based on sample size could be an immediate solution. Using a 64-bit encoding for block sizes may resolve the issue, though I suspect it could lead to increased memory usage. As memory resources on cloud platforms like DNANexus can be quite expensive, it might be more practical to adjust the --variant-block-length parameter based on available or limited memory on my local machine.

I plan to restart the XSI tests later this week using the --variant-block-length option. I greatly appreciate your assistance, and I look forward to your further work on this.

Best regards,
Liubin Zhang

Note:
ukb24310_c1_b6089_v1.vcf.gz: An original file from the UK Biobank WGS dataset containing extensive site and genotype-level quality information. It includes 17,384 variants and 490,541 individuals, with a file size of 18.03 GB.

@rwk-unil (Owner) commented

Hello Liubin,

I have created a new branch "64bit" https://github.com/rwk-unil/xSqueezeIt/tree/64bit

This branch stores the block sizes as 64-bit numbers, thereby removing the 4 GB block limitation. You can try the version from that branch and see how it goes; this will remove the "Uncompressed data size (10971456540) too big to be encoded with uint32_t" issue. It may use more RAM than reducing the block size would, but memory is still bounded by the block length (default 8,192 lines), so it will not suddenly use massively more RAM than the ~10 GB we encountered. So feel free to try it.
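To try it, something along these lines should work (the build itself follows the repository README):

"""
git clone https://github.com/rwk-unil/xSqueezeIt.git
cd xSqueezeIt
git checkout 64bit
# build as described in the README (submodules / dependencies), then recompress, e.g.:
# xsqueezeit -c -f chr21.samples_200000.hg38.bcf -o sample_200000_xsi --zstd
"""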

I did not merge this into the main branch for the moment, as it breaks compatibility with older versions of XSI (the new version can read files compressed with the older one, but naturally not the reverse).

I'll have a look at the ukb24310_c1_b6089_v1.vcf.gz file; do you encounter issues with it?

Best regards,
Rick

@Zhangliubin (Author) commented

Hi Rick,

Thank you very much for the update and for addressing the 4 GB block size limitation with the new "64bit" branch. I have just finished some tasks and am ready to resume testing XSI. Based on your previous analysis, I believe the "64bit" version should work well with datasets containing up to 500,000 samples.

Regarding the ukb24310_c1_b6089_v1.vcf.gz file, I may not have explained my concern clearly before. My point was not about encountering issues with this specific dataset, but rather about the significant size difference between the quality-controlled and non-quality-controlled versions of such datasets. This is largely due to the presence of many low-depth genotypes, which are replaced with ./. during quality control. These replacements might be the main factor affecting XSI's compression efficiency.

Best regards,
Liubin

@Zhangliubin (Author) commented

Hi Rick,

I wanted to update you on my recent testing using the new XSI_64bit version. I've successfully run tests with datasets containing 50,000 and 100,000 samples. However, I noticed that the space consumption still does not match the expected growth pattern.

Here are the results from my tests:

| Samples | Variants | BGZF [GB] | PLINK v2.0 [GB] | BGT [GB] | GTC [GB] | XSI [GB] |
|--------:|---------:|----------:|----------------:|---------:|---------:|---------:|
| 1 | 13,791,793 | 0.058 | 0.322 | 0.296 | 0.119 | 0.082 |
| 10 | 13,791,793 | 0.065 | 0.323 | 0.333 | 0.128 | 0.084 |
| 100 | 13,791,793 | 0.136 | 0.339 | 0.354 | 0.139 | 0.091 |
| 1,000 | 13,791,793 | 1.099 | 0.513 | 0.465 | 0.185 | 0.151 |
| 5,000 | 13,791,793 | 5.241 | 1.217 | 0.873 | 0.351 | 0.453 |
| 10,000 | 13,791,793 | 10.384 | 2.089 | 1.334 | 0.545 | 0.929 |
| 50,000 | 13,791,793 | 50.322 | 9.151 | 4.883 | 2.036 | 7.898 |
| 100,000 | 13,791,793 | 100.085 | 18.089 | 9.184 | 3.778 | 41.952 |

As you can see, the file sizes for the XSI compression are still significantly higher than anticipated, particularly with the larger datasets.

Additionally, while the XSI file for the 50,000 sample dataset was successfully generated, I encountered issues when attempting to decompress the file. My decompression command was:

"""
xsqueezeit -x -f /home/root/tmp/xsi/sample_50000_xsi -O z -o ~/tmp/output.vcf.gz
"""
The terminal log displayed the following error:

"""
[E::bcf_hdr_add_sample_len] Empty sample name: trailing spaces/tabs in the header line?
[E::bcf_hdr_add_sample_len] Duplicated sample name ''
[E::bcf_hdr_add_sample_len] Duplicated sample name ''
[E::bcf_hdr_add_sample_len] Duplicated sample name ''
[E::bcf_hdr_add_sample_len] Duplicated sample name '�'
[E::bcf_hdr_add_sample_len] Empty sample name: trailing spaces/tabs in the header line?
[E::bcf_hdr_add_sample_len] Duplicated sample name ''
[E::bcf_hdr_add_sample_len] Duplicated sample name '�'
[E::bcf_hdr_add_sample_len] Duplicated sample name 'S'
[E::bcf_hdr_add_sample_len] Duplicated sample name 'O'
[E::bcf_hdr_add_sample_len] Duplicated sample name 'K'
[E::bcf_hdr_add_sample_len] Duplicated sample name '?'
[E::bcf_hdr_add_sample_len] Duplicated sample name ';'
[E::bcf_hdr_add_sample_len] Duplicated sample name '''
[E::bcf_hdr_add_sample_len] Duplicated sample name '#'
[E::bcf_hdr_add_sample_len] Duplicated sample name ''
[E::bcf_hdr_add_sample_len] Duplicated sample name '

E::bcf_hdr_add_sample_len] Duplicated sample name ''
[E::bcf_hdr_add_sample_len] Duplicated sample name ''
[E::bcf_hdr_add_sample_len] Duplicated sample name ''
[E::bcf_hdr_add_sample_len] Duplicated sample name '^'
[E::bcf_hdr_add_sample_len] Duplicated sample name '�'
[E::bcf_hdr_add_sample_len] Duplicated sample name 'n'
[E::bcf_hdr_add_sample_len] Duplicated sample name '�'
Failed to allocate memory to decompress block
Failed to allocate memory
Elapsed time: 0.777 seconds
"""

I would greatly appreciate any guidance you could provide regarding the space inflation issue and the decompression problem with the 50,000 sample dataset. Please let me know if there is additional information you need from my end to assist with troubleshooting.

Thank you once again for your continued support.

Best regards,
Liubin Zhang

@rwk-unil (Owner) commented

Hello Liubin,
Thank you very much for your detailed report.

There are two issues here: 1) XSI generates very large files in cases where it looks like it should not, and 2) XSI fails to decompress these files.

For the moment I suspect the problem comes from high missingness, as I know this leads to bad compression. I also have not tested extensively with datasets of this nature, so it might have exposed a bug I did not encounter before.

For 1)

The problem with missing data is that XSI encodes it simply as sparse positions, i.e., the positions of the haplotypes where an allele is missing. With fewer than 2^16 = 65,536 haplotypes this means one 2-byte number (uint16_t) per missing allele; with more than 65,536 haplotypes, a 4-byte number (uint32_t) is used per missing allele.

So compared to BCF, which uses a 4-byte number per allele, XSI might perform worse with high missingness because of this sparse representation. In BCF, a missing allele is always represented by the same value in the allele array (just at different positions), and such a repeating value compresses very well. In the sparse representation, each missing allele is instead encoded as the position (haplotype number in the allele array) where it appears, and since every position is unique, this is much less compressible than a repeating value in an array.
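As a rough illustration of the cost (hypothetical figures, just to show the order of magnitude):

"""
# One variant line, 200,000 samples (400,000 haplotypes), with 50% of alleles missing:
echo "400000 / 2 * 4" | bc   # 800,000 bytes of mostly-unique uint32_t positions for that single line
# In BCF the same alleles are a long run of an identical "missing" code, which gzip/zstd shrinks to almost nothing.
"""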

So I think the amount of missing alleles in the dataset is the reason for the poor compression rate, especially with more samples.

Do you have summary statistics on the number of missing alleles per variant? This could give some insight.

XSI is really poor at encoding/compressing missing data; that is a fact. It was developed under the assumption that missing data would be rare, because who would store large amounts of missing data (i.e., "no data", as it is missing)? But I understand that filtering on genotype quality (GQ) will create "missing" data in a population-scale dataset, possibly in massive quantities.

For 2)

I think I know where some of the issues you found are coming from; it is something I thought I had fixed a few years ago, but it seems that is not the case. There is a problem with how the file format stores the locations of the samples and data blocks. This causes the error during decompression (header problems with sample names), followed by the failure to allocate memory.

I think I could fix this issue so that XSI can compress/decompress these files, but it will not solve the size issue.


For the size issue, I think there is a lot of missing data and XSI is bad at compressing it, as there was no optimised strategy for compressing missing data; XSI only supports it so as not to lose it. So if you can get some summary statistics on the amount of missing alleles per variant, this would give some insight (e.g., the number of VCF lines that have missing data and the distribution of the quantity of missing alleles per line).

Another test you could do is to replace the missing values with reference (e.g., ./. with 0/0) and compress that file; if the output size differs substantially, it would confirm that the missing data is in fact what leads to the bad compression.
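A possible way to run that test (a sketch with the bcftools setGT plugin, assuming it is available in your bcftools build; -t . targets missing genotypes and -n 0 sets them to reference, and the file names are placeholders):

"""
bcftools +setGT input.bcf -Ob -o no_missing.bcf -- -t . -n 0
xsqueezeit -c -f no_missing.bcf -o no_missing_xsi --zstd
# If no_missing_xsi is far smaller than your current output, the missing alleles are indeed the cause.
"""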

Do PLINK/BGT/GTC keep the missing-value information? When you decompress their files, are the missing alleles still there? If those formats lose this information, it may be better to remove the missing alleles altogether.

And in the meantime, if you want to get compression/decompression working, you could split the input file at the variant level, e.g., into files of 1,000,000 variants. I know this is not ideal, but it would at least allow you to test for the moment.

Best regards,
Rick

@rwk-unil (Owner) commented

Hello Liubin,

I have pushed two patches to the 64bit branch

  1. 55ad8c7, which introduces the --wah-encode-missing option; this allows the data representation for missing alleles to be changed from sparse (default) to Word Aligned Hybrid (WAH), which should reduce the size of the missing-allele data and solve 1) above.

  2. c0906cf, which solves the issue that led to problem 2) above. The cause was that the offsets to sample names, data, etc. were 32-bit, which was a bit optimistic, as it did not allow for large files where data goes beyond the maximum offset.

Please try to compress with the --wah-encode-missing option.
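For example, reusing the paths from your earlier command:

"""
xsqueezeit -c -f ~/data/UKBB_hg38/wgs/chr21.samples_200000.hg38.vcf.gz \
           -o ~/tmp/xsi/sample_200000_xsi --zstd --wah-encode-missing
"""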

Please note that patch 2) 55ad8c7 breaks compatibility with the previous version in the 64-bit branch, so data compressed with that version will not be decompressible with this new version (but the 64-bit version will still be able to decompress the main version data).

Best regards
Rick

@Zhangliubin (Author) commented

Hi Rick,

Thank you for your thoughtful reply. I believe we have identified the key issue here. Indeed, XSI performs well within its intended range of applicability, and as you rightly pointed out, this range aligns with most research scenarios, rather than extreme conditions such as the ones I am testing.

From your description, I suspect that you are applying the PBWT transformation to genotype data where multi-allelic sites are coded with 0 for the reference genotype and 1 for the alternative genotype, and missing genotypes are handled by separately listing their indices. The key assumption behind this approach is that the number of missing genotypes is relatively small. If this assumption is violated, the overhead from recording indices could become significant. Would it be possible to explore a combined approach using varInt and differential encoding to store these indices more efficiently?

As for PBWT, BGT, GTC, and GTShark, they encode genotypes as 0/0 -> 00, 0/1 -> 01, 1/1 -> 11, and ./. -> 10 before applying further transformations. This strategy, with some variations in encoding details across methods, naturally accommodates missing genotypes. However, a common limitation of these methods is that they struggle to faithfully reconstruct multi-allelic genotypes, unlike PLINK or BCFTools. As you mentioned, multi-allelic variants are common in large cohorts, and this limitation could pose challenges.

While I can't fully evaluate whether this method is the most accurate or advanced, I can relate to your observations regarding the quality of genotype files in clinical and large-scale cohort data analysis. In our experience, FASTQ and VCF files from such datasets often have low quality, which results in significant false positives in downstream analyses. We have similar datasets, including 8,000 tumor samples, whole-genome and whole-exome data from thousands of psychiatric cases, and UKBB-WGS data. Due to cost limitations, not all samples are sequenced at deep coverage across the entire genome, which often leads to missing or low-quality genotypes. This could explain why advanced methods perform poorly on our data—after filtering out low-quality genotypes, the loss of LD information might degrade PBWT’s performance, thereby reducing compression efficiency. I suspect that half-calling genotypes such as ./. and ./0 contribute to this issue by disrupting the statistical properties of the sequences.

Thank you again for your insights. I hope our discussion has also provided you with some useful perspectives.

Best regards,
Liubin

@rwk-unil (Owner) commented Nov 22, 2024

Hello Liubin,
Thank you for your response.

Here the issue is that XSI represents missing data as sparse data, for which I added an option to change that to WAH. In XSI the missing data does not affect the PBWT ordering; also, PBWT is only applied to common variants, and for most common variants the data will usually not be missing, precisely because they are common. The rare variants are stored as sparse data, and this already compresses well because they are rare.

Fig. 2 in https://academic.oup.com/bioinformatics/article/38/15/3778/6617346 shows that PBWT is only applied to common variants, and with WGS data much of it will be rare variation.

For multi-allelic sites, I internally split them into one line per alt allele, and for all lines where the alt allele is rare (less than 0.001 MAF) no PBWT is applied, as it is only used for common variants; so for XSI, multi-allelic sites should not cause these problems.

Here what we see is really the effect of the missing alleles: they made the file size explode, which generated the other issues. The current version of XSI is simply bad at compressing missing data (there is no mystery here; it simply stores the missing data as sparse positions, nothing else).

Missing data will be introduced by QC, especially QC on genotype quality ("GQ"); this can introduce a lot of missing data points, especially for rare variants confirmed in only one or a few samples, where all the other samples end up with missing data. There is nothing wrong with this type of QC; it is simply that XSI is not optimised for missing data.


If you want, you can still try the 64bit version with the --wah-encode-missing option; it should perform much better than the normal version on missing data.

Thank you for the discussion. I hope you will find a file format that fits your needs, and in the future, if I have more time to spend on developing XSI, I will take more care with missing alleles; for the moment, in my own tests, the --wah-encode-missing option already made it better.

Best regards,
Rick
