Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is long vector support necessary? #3

Open
ahl27 opened this issue Dec 20, 2024 · 0 comments
Open

Is long vector support necessary? #3

ahl27 opened this issue Dec 20, 2024 · 0 comments

Comments

@ahl27
Copy link

ahl27 commented Dec 20, 2024

Noting down here a potential issue that may arise in the future that I encountered while working on Biostrings.

A current limitation of Biostrings is the width field of XStringSet I/O functions being limited to 200,002 due to internal I/O buffer sizing. Users have expressed a desire to have unlimited line widths when reading/writing FASTA files, so I'm working on removing it.

The internal functions rely on variants of Ocopy functions to copy values in the XStringSet (XVector) object to the I/O buffer. It seems like these functions expect int offsets as input, eg:

void _vector_Ocopy(SEXP out, int out_offset, SEXP in, int in_offset, int nelt,
SEXP lkup, int reverse, int Omode)

On most systems that should autocompile to int32_t, which is limited to a max size of 2^31 (~2 billion). I feel like I could envision a user having a 2GBp sequence -- it would be extremely rare, but not outside the realm of possibility. Attempting to write such a sequence using writeXStringSet would cause integer overflow errors due to int offset indexing.

Upon looking into it more, it actually turns out that this vector would be unlikely to be supported in the first place, since all the XVector derived functions seem to use int indexing internally:

library(Biostrings)
y <- paste(sample(DNA_BASES, 128, replace=TRUE), collapse='')
x <- DNAString(y)
for(i in seq_len(23))
  x <- xscat(x, x)
x ## length 1,073,741,824

 ## would be length 2,147,483,648, but fails because of overflow
xscat(x, x)

## Error in .Call2("XString_xscat", args, PACKAGE = "Biostrings") : 
##  negative length vectors are not allowed

And this extends to all XVector classes:

XInteger(3e9)
## Error in SharedInteger(length = length, val = val) : 
##   'length' must be a single non-negative integer
## In addition: Warning message:
## In .local(.Object, ...) : NAs introduced by coercion to integer range

Note that R itself does support long vectors via double indexing (I believe):

## this uses a lot of RAM, be careful
x <- rep(1L, 3e9)
length(x)
# > 3e+9 

The obvious solution is to move to use the more standard size_t or long (or (u)int64_t). size_t and uint64_t are unsigned, which may change other processing if negative values are used for internal checks in other XVector/IRanges/Biostrings methods. It does seem like some functions already use long values internally, e.g.:

if (lkup == R_NilValue && reverse == 0 && Omode == 0) {
copy_vector_block(out, (long long int) out_offset,
in, (long long int) in_offset,
(long long int) nelt);
return;
}

So that makes me think it may be as simple as just converting int to long int, eg:

void _vector_Ocopy(SEXP out, long int out_offset, SEXP in, long int in_offset, long int nelt,
		SEXP lkup, int reverse, int Omode)

But there would be a ton of places to change this, and I'm not sure what impact it would have downstream.

My larger question is, is this something that's actually needed? If so, how urgently? It's not clear to me how often users are using extremely large vector objects, and if the frequency of their use will increase in the future. At the very least, it's probably worth adding a warning to the documentation somewhere and/or including some more informative error messages to catch the XInteger(3e9) case. Might also be worth inquiring on the Bioconductor slack.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant