Issue with XSI Compression for Large UKBB Genotype Subsets #6
Comments
Hello,

XSI compresses VCF/BCF in blocks of 8,192 VCF lines, so the requirement here is that the 200,000 samples x 8,192 variants matrix must be representable (post PBWT and WAH encoding) in under 4 GB and compress to under 4 GB. I had never encountered this limit before, having compressed files with up to a million samples.

I have pushed eaf8feb; this patch allows the block size to be changed (via the --variant-block-length option). Could you try compressing again, maybe with a smaller block size? Does this also occur with the original UKB file, prior to your workflow?

Thank you for sharing this issue.
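As a rough illustration of the constraint described above, the C++ sketch below works out the raw dimensions of one default block for 200,000 samples and compares them with the 4 GB (uint32_t) ceiling. The bytes-per-allele figures are assumptions for illustration only, not values taken from the XSI source.

"""
// Back-of-envelope sketch: a default block is 8,192 VCF lines, and with
// 200,000 diploid samples that is 400,000 haplotypes per line. Whether the
// encoded block fits under the uint32_t ceiling depends on how many bytes
// the PBWT/WAH/sparse encoding ends up spending per allele.
#include <cstdint>
#include <cstdio>

int main() {
    const uint64_t haplotypes  = 200000ULL * 2;   // 200,000 samples, diploid
    const uint64_t block_lines = 8192;            // default block length
    const uint64_t cells       = haplotypes * block_lines;
    std::printf("alleles per block: %llu\n", (unsigned long long)cells);
    std::printf("uint32_t ceiling : %llu bytes (~4.29 GB)\n",
                (unsigned long long)UINT32_MAX);
    // At ~1 byte per allele the block is ~3.3 GB and still fits; if many
    // alleles need a 4-byte sparse entry (e.g. missing data), it does not.
    std::printf("at 4 B per sparse entry, %llu entries already exceed it\n",
                (unsigned long long)(UINT32_MAX / 4ULL));
    return 0;
}
"""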
Hi Rick,

Thank you for your prompt response and detailed explanation regarding the block size limitation in XSI.

The original UKBB-WGS dataset I am working with is extremely large (1.06 PB, containing 490,000 samples and 1 billion variants). To handle this, I preprocess the data using GBC, a faster tool with integrated QC features, to merge and prepare the files. This preprocessing step generates a single GTB/PGEN/VCF.GZ file containing a large number of variants, which is subsequently used for benchmarking, including the tests with XSI. I have not yet tested directly on the original UKBB dataset; my plan is to first identify tools that show promise for large-scale genomic datasets and then test those on the original files.

I understand the design of your variant matrix. However, I am uncertain how multi-allelic variants are handled in XSI. A common approach is to split multi-allelic variants into multiple bi-allelic sites, which could result in a block containing more than the expected 8,192 sites. If this assumption is correct, my dataset might indeed contain such variants, with some loci having over 200 alleles (e.g., small INDEL fragments).

After recompiling XSI with your patch, I reran the timing script from my original report (starting with start=$(date +%s.%N)). It printed "It seems the file /home/root/data/UKBB_hg38/wgs/chr21.samples_200000.hg38.vcf.gz is mostly unphased" and then failed with a compression error. After this error, the program did not exit, and I had to terminate it manually (Ctrl + Z); the file size is approximately 200 GB, and the process was estimated to take two days.

Given these results, I suspect the issue may stem from a combination of the large variant matrix and specific handling of multi-allelic sites. Please let me know if you have any additional suggestions or require further details for debugging.

Best regards,
Hello Liubin,

As I said, you can change the block size with the --variant-block-length option. I understand this is not ideal, as compression can still fail if the block size is not small enough; I should patch XSI to encode the block sizes with 64 bits so that this is no longer an issue.

The reason the program did not finish directly is that while it is compressing, it also generates the binary indexing matrix in another thread (which traverses the whole VCF/BCF). This task is usually faster than the compression, but here, because of the massive file and the error, it had not yet finished, so you had to kill the program. To save some time I would recommend using BCF rather than VCF.GZ as input and output.

The size bump between 10,000 and 50,000 samples is because when there are fewer than 65,536 haplotypes (about 32,000 samples) XSI uses uint16_t to represent the haplotype positions in the PBWT order; with more samples it uses uint32_t, which takes double the space but can represent up to 4 billion haplotypes (2 billion samples).

The "Unknown frame descriptor" error comes from ZSTD; it means ZSTD was passed a block that it doesn't recognise, so either the block is corrupted or a wrong location was given to ZSTD. It might be a bug with multi-allelic sites, it is hard to say from this alone. Also, does your dataset have a lot of missing data?

As the error during compression arrives quite early (after about 30 minutes of processing), it would be nice to isolate it. I'll have a look at making the error more verbose and see if I can print the region of the VCF file, so that we can extract it and save time by running compression only on the part that fails.

So for the moment, maybe try to compress with a smaller --variant-block-length.

Regards,
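For reference, a minimal C++ sketch of the uint16_t/uint32_t threshold described above; this is only an illustration of the arithmetic, not code from XSI.

"""
// Illustration of the size jump: once there are more than 65,536 haplotypes,
// each position in the PBWT order needs a uint32_t instead of a uint16_t,
// so the raw index storage doubles.
#include <cstdint>
#include <cstdio>
#include <initializer_list>

int main() {
    for (uint64_t samples : {10000ULL, 50000ULL, 200000ULL}) {
        uint64_t haps = samples * 2;                        // diploid
        size_t bytes = (haps <= 65536) ? sizeof(uint16_t)   // fits in 16 bits
                                       : sizeof(uint32_t);  // needs 32 bits
        std::printf("%6llu samples -> %7llu haplotypes -> %zu bytes per PBWT index "
                    "(%.1f MB for one full permutation)\n",
                    (unsigned long long)samples, (unsigned long long)haps,
                    bytes, haps * bytes / 1e6);
    }
    return 0;
}
"""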
Hi Rick,

Thank you for your detailed explanation and clarification.

Indeed, my dataset contains a substantial amount of missing genotype data. For example, for the file ukb24310_c1_b6089_v1.vcf.gz, the sizes before and after PLINK quality control are 341.3 MB and 249.84 MB, respectively (using the parameters --vcf-min-gq 20 --vcf-min-dp 8). This difference reflects the loss of many low-quality genotypes. My QC process includes additional steps beyond those mentioned, because after passing through several high-performance genotype compression methods the dataset lost much of its quality information; this is a necessary step to avoid false positives in downstream analyses.

I agree that using .bcf.gz as both input and output would save considerable time, and I will conduct additional tests with this format in the future. At this stage, however, to ensure compatibility across all software tools, I am using the more general and easily parsable .vcf.gz format.

It seems that adjusting the --variant-block-length parameter, or allowing the program to automatically adjust the block size based on sample size, could be an immediate solution. Using a 64-bit encoding for block sizes may resolve the issue, though I suspect it could lead to increased memory usage. As memory on cloud platforms like DNAnexus can be quite expensive, it might be more practical to adjust the --variant-block-length parameter based on the memory available on my local machine.

I plan to restart the XSI tests later this week using the --variant-block-length option. I greatly appreciate your assistance, and I look forward to your further work on this.

Best regards,
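A back-of-envelope C++ sketch of the kind of memory-based calculation mentioned above. The 4 GB budget and the 4-bytes-per-allele figure are assumptions for illustration; real XSI memory use depends on the encoder, not just the raw matrix.

"""
// Rough helper: pick a --variant-block-length so that
// haplotypes x lines x assumed bytes-per-allele stays under a chosen budget.
#include <cstdint>
#include <cstdio>

int main() {
    const double budget_bytes     = 4e9;   // e.g. a 4 GB working budget (assumed)
    const double bytes_per_allele = 4.0;   // pessimistic: one sparse uint32_t entry
    const uint64_t haplotypes     = 200000ULL * 2;
    uint64_t lines = (uint64_t)(budget_bytes / (haplotypes * bytes_per_allele));
    std::printf("candidate --variant-block-length: %llu (default is 8192)\n",
                (unsigned long long)lines);
    return 0;
}
"""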
Hello Liubin,

I have created a new branch "64bit": https://github.com/rwk-unil/xSqueezeIt/tree/64bit

This branch stores the block sizes as 64-bit numbers, removing the 4 GB block limitation. You can try the version from that branch and see how it goes; it removes the "Uncompressed data size (10971456540) too big to be encoded with uint32_t" issue. It may use more RAM than reducing the block size would, but memory is still bounded by the block size (default 8,192), so it will not suddenly use massively more RAM than the ~10 GB we encountered. So feel free to try it.

I did not merge this into the main branch for the moment, as it breaks compatibility with older versions of XSI (the new version can read files compressed with the older one, but naturally not the reverse).

I'll have a look at the ukb24310_c1_b6089_v1.vcf.gz file; do you encounter issues with it?

Best regards,
Hi Rick,

Thank you very much for the update and for addressing the 4 GB block size limitation with the new "64bit" branch. I have just finished some other tasks and am ready to resume testing XSI. Based on your previous analysis, I believe the "64bit" version should work well with datasets containing up to 500,000 samples.

Regarding the ukb24310_c1_b6089_v1.vcf.gz file, I may not have explained my concern clearly before. My point was not about encountering issues with this specific dataset, but rather about the significant size difference between the quality-controlled and non-quality-controlled versions of such datasets. This is largely due to the presence of many low-depth genotypes, which are replaced with ./. during quality control. These replacements might be the main factor affecting XSI's compression efficiency.

Best regards,
Hi Rick,

I wanted to update you on my recent testing with the new 64-bit version of XSI. I have successfully run tests with datasets containing 50,000 and 100,000 samples. However, I noticed that the space consumption still does not match the expected growth pattern. Here are the results from my tests:

As you can see, the file sizes for the XSI compression are still significantly higher than anticipated, particularly for the larger datasets.

Additionally, while the XSI file for the 50,000-sample dataset was generated successfully, I encountered issues when attempting to decompress it. My decompression command was:

"""
xsqueezeit -x -f /home/root/tmp/xsi/sample_50000_xsi -O z -o ~/tmp/output.vcf.gz
"""

which produced:

"""
[E::bcf_hdr_add_sample_len] Empty sample name: trailing spaces/tabs in the header line?
[E::bcf_hdr_add_sample_len] Duplicated sample name ''
"""

I would greatly appreciate any guidance you could provide regarding the space inflation issue and the decompression problem with the 50,000-sample dataset. Please let me know if there is additional information you need from my end to assist with troubleshooting.

Thank you once again for your continued support.

Best regards,
Hello Liubin,

There are two issues here: 1) XSI generates very large files in cases where it looks like it should not, and 2) XSI fails to decompress these files. For the moment I suspect the problem comes from possibly high missingness, as I know this leads to bad compression. I also did not test extensively with datasets of this nature, so this may have exposed a bug I had not encountered before.

For 1): The problem with missing data is that XSI encodes it simply as sparse positions, i.e., the position of each haplotype where an allele is missing. For fewer than 2^16 = 65,536 haplotypes this means one 2-byte number (uint16_t) per missing allele; with more than 65,536 haplotypes, a 4-byte number (uint32_t) is used per missing allele. So compared with BCF, which uses a 4-byte number per allele, XSI can perform worse when there is a lot of missing data, because of this sparse representation. The sparse data also compresses much worse: in BCF the missing allele is always the same value in the allele array, just at different positions, whereas in the sparse representation each missing value is encoded as the position (haplotype number in the allele array) where it appears. Since each position is unique, this is much less compressible than a repeating value in an array. So I think the amount of missing alleles in the dataset is the reason for the poor compression rate, especially with more samples. Do you have summary statistics on the number of missing alleles per variant? This could give some insight. XSI is, frankly, poor at encoding and compressing missing data; it was developed under the assumption that missing data would be rare, because who would store large amounts of missing data (i.e., "no data", as it is missing...)? But I understand that filtering on genotype quality (GQ) will create missing data in a population-scale dataset, possibly in massive quantities.

For 2): I think I know where some of the issues you found are coming from; it is something I thought I had fixed a few years ago, but it seems that is not the case. There is a problem with how the file format stores the location of the samples and data blocks. This creates the error during decompression (header problems with sample names) and the subsequent failure to allocate memory. I think I can fix this so that XSI can compress/decompress these files, but it will not solve the size issue.

For the size issue, I think there is a lot of missing data and XSI is bad at compressing it: there was no optimised strategy for the compression of missing data, XSI only supports it so as not to lose it. So if you can get summary statistics on the amount of missing alleles per variant, this would give some insight (e.g., the number of VCF lines that have missing data, and the distribution of the quantity of missing alleles per line). Another test you could do is to replace the missing values with the reference allele (e.g., recoding ./. as 0/0) and see how that compresses.

Do PLINK/BGT/GTC keep the missing-value information? When you decompress their files, are the missing alleles still there? If those formats lose this information, it may be better to remove the missing alleles altogether.

And for the moment, if you want to have compression/decompression working, you could split the input file at the variant level, e.g., into files of 1,000,000 variants. I know this is not ideal, but it would at least allow you to test for now.

Best regards,
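The compressibility argument above can be reproduced with a small toy program. The C++ sketch below (using zlib rather than ZSTD, with an assumed 20% missingness) compares a dense BCF-style allele array, where missing calls are a repeated sentinel, with a sparse list of the missing haplotype positions. The absolute numbers are not meaningful, but the gap between the two illustrates why per-position sparse storage of very common missingness inflates the file.

"""
// Build with: g++ sparse_vs_dense.cpp -lz
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>
#include <zlib.h>

static uLongf deflated_size(const void* data, uLong len) {
    uLongf out_len = compressBound(len);
    std::vector<Bytef> out(out_len);
    compress(out.data(), &out_len, static_cast<const Bytef*>(data), len);
    return out_len;
}

int main() {
    const uint32_t n_haplotypes = 400000;        // 200,000 samples
    std::mt19937 rng(42);
    std::bernoulli_distribution is_missing(0.2); // assumed 20% missing calls

    std::vector<int32_t>  dense(n_haplotypes, 0); // BCF-like dense allele array
    std::vector<uint32_t> sparse;                 // sparse list of missing positions
    for (uint32_t h = 0; h < n_haplotypes; ++h) {
        if (is_missing(rng)) {
            dense[h] = -1;                        // repeated "missing" sentinel
            sparse.push_back(h);                  // unique position, 4 bytes each
        }
    }

    std::printf("dense : %zu B raw -> %lu B compressed\n",
                dense.size() * sizeof(int32_t),
                (unsigned long)deflated_size(dense.data(), dense.size() * sizeof(int32_t)));
    std::printf("sparse: %zu B raw -> %lu B compressed\n",
                sparse.size() * sizeof(uint32_t),
                (unsigned long)deflated_size(sparse.data(), sparse.size() * sizeof(uint32_t)));
    return 0;
}
"""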
Hello Liubin,

I have pushed two patches to the 64bit branch:

Please try to compress with the patched version and see how it goes. Please note that patch 2) 55ad8c7 breaks compatibility with the previous version of the 64bit branch.

Best regards
Hi Rick,

Thank you for your thoughtful reply. I believe we have identified the key issue here. Indeed, XSI performs well within its intended range of applicability, and as you rightly pointed out, this range aligns with most research scenarios rather than extreme conditions such as the ones I am testing.

From your description, I gather that you apply the PBWT transformation to genotype data in which multi-allelic sites are coded with 0 for the reference genotype and 1 for the alternative genotype, and missing genotypes are handled by separately listing their indices. The key assumption behind this approach is that the number of missing genotypes is relatively small; if this assumption is violated, the overhead from recording indices becomes significant. Would it be possible to explore a combined approach using varint and differential encoding to store these indices more efficiently?

As for PBWT, BGT, GTC, and GTShark, they encode genotypes as 0/0 -> 00, 0/1 -> 01, 1/1 -> 11, and ./. -> 10 before applying further transformations. This strategy, with some variations in encoding details across methods, naturally accommodates missing genotypes. However, a common limitation of these methods is that they struggle to faithfully reconstruct multi-allelic genotypes, unlike PLINK or BCFtools. As you mentioned, multi-allelic variants are common in large cohorts, and this limitation could pose challenges.

While I cannot fully evaluate whether this method is the most accurate or advanced, I can relate to your observations regarding the quality of genotype files in clinical and large-scale cohort data analysis. In our experience, FASTQ and VCF files from such datasets often have low quality, which results in significant false positives in downstream analyses. We have similar datasets, including 8,000 tumor samples, whole-genome and whole-exome data from thousands of psychiatric cases, and the UKBB-WGS data. Due to cost limitations, not all samples are sequenced at deep coverage across the entire genome, which often leads to missing or low-quality genotypes. This could explain why advanced methods perform poorly on our data: after filtering out low-quality genotypes, the loss of LD information might degrade PBWT's performance and thereby reduce compression efficiency. I suspect that half-called genotypes such as ./. and ./0 contribute to this issue by disrupting the statistical properties of the sequences.

Thank you again for your insights. I hope our discussion has also provided you with some useful perspectives.

Best regards,
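To make the suggestion above concrete, here is a small hypothetical C++ sketch of delta-plus-varint (LEB128-style) encoding of sorted missing-haplotype indices. It illustrates the proposed idea only; it is not an existing XSI feature, and the example missingness pattern is assumed.

"""
// Store missing-genotype haplotype indices as deltas between consecutive
// sorted positions, then encode each delta as a varint, so dense stretches
// of missingness cost ~1 byte per index instead of 4.
#include <cstdint>
#include <cstdio>
#include <vector>

std::vector<uint8_t> encode_positions(const std::vector<uint32_t>& sorted_pos) {
    std::vector<uint8_t> out;
    uint32_t prev = 0;
    for (uint32_t p : sorted_pos) {
        uint32_t delta = p - prev;      // positions are sorted, so delta >= 0
        prev = p;
        while (delta >= 0x80) {         // 7 payload bits per byte, high bit = "more"
            out.push_back(uint8_t(delta) | 0x80);
            delta >>= 7;
        }
        out.push_back(uint8_t(delta));
    }
    return out;
}

int main() {
    // Assumed example: every 3rd haplotype out of 400,000 is missing.
    std::vector<uint32_t> missing;
    for (uint32_t h = 0; h < 400000; h += 3) missing.push_back(h);
    std::vector<uint8_t> enc = encode_positions(missing);
    std::printf("%zu positions: %zu B as uint32_t, %zu B as delta+varint\n",
                missing.size(), missing.size() * sizeof(uint32_t), enc.size());
    return 0;
}
"""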
Hello Liubin,

The issue here is that XSI represents missing data as sparse positions, for which I have now added an option to switch to WAH encoding instead. In XSI the missing data does not affect the PBWT ordering. Also, PBWT is only applied to common variants, and for most common variants the data will usually not be missing, precisely because they are common. Rare variants are stored as sparse data, and this already compresses well because they are rare. Fig. 2 in https://academic.oup.com/bioinformatics/article/38/15/3778/6617346 shows that PBWT is only applied to common variants, and with WGS data much of the variation will be rare.

As for multi-allelic sites, I internally split them into one line per alt allele, and for all lines where the alt allele is rare (less than 0.001 MAF) no PBWT is applied, as PBWT is only used for common variants. So for XSI, multi-allelic sites should not cause these problems. What we see here is really the effect of the missing alleles: they made the file size explode, which generated the other issues. The current version of XSI is simply bad at compressing missing data (there is no mystery here; it simply stores the missing data as sparse positions, nothing else).

Missing data will be introduced by QC, especially filters on genotype quality "GQ"; this can introduce a lot of missing data points, especially for rare variants only confirmed in one or a few samples, where all the other samples end up with missing data. There is nothing wrong with this type of QC; it is simply that XSI is not optimised for missing data.

If you want, you can still try the new option that encodes missing data with WAH instead of sparse positions.

Thank you for the discussion. I hope you will find a file format that fills your needs, and if in the future I have more time to develop XSI, I will take more care with missing alleles.

Best regards,
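For readers unfamiliar with the encoding mentioned above, here is a generic word-aligned hybrid (WAH) bitmap encoder sketch in C++. It shows the general technique only and is not XSI's actual on-disk format.

"""
// Generic WAH: input bits are grouped into 31-bit chunks; runs of all-0 or
// all-1 chunks become one "fill" word, mixed chunks are stored verbatim as
// "literal" words.
#include <cstdint>
#include <cstdio>
#include <vector>

std::vector<uint32_t> wah_encode(const std::vector<bool>& bits) {
    std::vector<uint32_t> out;
    const size_t n_groups = (bits.size() + 30) / 31;
    for (size_t g = 0; g < n_groups; ++g) {
        uint32_t lit = 0;
        for (size_t b = 0; b < 31; ++b) {
            size_t idx = g * 31 + b;
            if (idx < bits.size() && bits[idx]) lit |= 1u << (30 - b);
        }
        bool is_fill = (lit == 0u) || (lit == 0x7FFFFFFFu);
        uint32_t fill_bit = (lit != 0u);
        if (is_fill && !out.empty() &&
            (out.back() >> 31) == 1u &&                 // previous word is a fill
            ((out.back() >> 30) & 1u) == fill_bit &&    // ...of the same bit value
            (out.back() & 0x3FFFFFFFu) < 0x3FFFFFFFu) { // ...with room in the counter
            ++out.back();                               // extend the run by one group
        } else if (is_fill) {
            out.push_back((1u << 31) | (fill_bit << 30) | 1u); // new fill word, run = 1
        } else {
            out.push_back(lit);                         // literal word (MSB = 0)
        }
    }
    return out;
}

int main() {
    // A mostly-zero haplotype column: long runs collapse into a few fill words.
    std::vector<bool> column(400000, false);
    column[10] = true;
    column[123456] = true;
    column[399999] = true;
    std::printf("%zu bits -> %zu WAH words\n", column.size(), wah_encode(column).size());
    return 0;
}
"""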
Hi,
I am currently benchmarking tools that balance "compression and accessibility" for genotype data and encountered an issue when using XSI with large datasets. I hope you can assist me in resolving it.
The dataset originates from UK Biobank's ukb24310_c21_b*_v1.vcf.gz, consisting of 2,336 files with a total size of 14.29 TB. My workflow includes quality control, format conversion, and merging using GBC, ensuring reliable genotypes while excluding quality information. From the resulting GTB files (generated by GBC), I created subsets of (1, 10, 100, 1,000, 5,000, 10,000, 50,000, 100,000, 200,000) samples, all containing 13,791,793 variants (excluding variants with AN=0). If you have access to UKBB data, you should be able to reproduce this, though the process can be time-consuming.
XSI works reliably for sample sizes between 1 and 50,000. However, for subsets with 100,000 and 200,000 samples, the following exception occurs:
Command executed:
"""
start=$(date +%s.%N)
xsqueezeit -c -f ~/data/UKBB_hg38/wgs/chr21.samples_200000.hg38.vcf.gz -o ~/tmp/xsi/sample_200000_xsi --zstd
end=$(date +%s.%N)
echo "Elapsed time: $(printf "%.3f" $(echo "$end - $start" | bc)) seconds"
"""
Error message:
"""
It seems the file /home/root/data/UKBB_hg38/wgs/chr21.samples_200000.hg38.vcf.gz is mostly unphased
header 256 bytes, total 256 bytes written
Size too big to be encoded with j
Failed to write compressed block
Failure occurred, exiting...
Elapsed time: 10299.386 seconds
"""
The error suggests an issue related to block size during compression. I’d appreciate any insights or suggestions for overcoming this limitation, especially for handling larger sample sizes efficiently.
Thank you for your time.