Noting down here a potential issue I encountered while working on Biostrings that may arise in the future.
A current limitation of Biostrings is that the `width` field of the `XStringSet` I/O functions is capped at 200,002 due to internal I/O buffer sizing. Users have expressed a desire for unlimited line widths when reading/writing FASTA files, so I'm working on removing that cap.
The internal functions rely on variants of the `Ocopy` functions to copy values from the `XStringSet` (`XVector`) object into the I/O buffer. It seems like these functions expect `int` offsets as input, e.g. XVector/src/vector_copy.c, lines 20 to 21 at 3fa081b:
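(The embedded snippet doesn't copy over here; the signature below is reconstructed from the proposed `long int` variant quoted later in this issue, so treat the exact layout as approximate.)

```c
void _vector_Ocopy(SEXP out, int out_offset, SEXP in, int in_offset, int nelt,
                   SEXP lkup, int reverse, int Omode)
```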
On most platforms `int` compiles to a 32-bit type (`int32_t`), which caps values at 2^31 - 1 (~2.1 billion). I could envision a user having a 2 Gbp sequence -- it would be extremely rare, but not outside the realm of possibility. Attempting to write such a sequence using `writeXStringSet` would cause integer overflow errors due to the `int` offset indexing.
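As a minimal standalone illustration (not XVector code; the sequence length is made up), an offset past `INT_MAX` simply cannot round-trip through an `int`:

```c
#include <stdio.h>
#include <stdint.h>
#include <limits.h>

int main(void)
{
    int64_t seq_len = 2500000000LL;  /* hypothetical ~2.5 Gbp sequence */
    /* Converting an out-of-range value to int is implementation-defined;
     * on typical two's-complement platforms it wraps to a negative number. */
    int offset = (int) seq_len;
    printf("true offset: %lld, as int: %d (INT_MAX = %d)\n",
           (long long) seq_len, offset, INT_MAX);
    return 0;
}
```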
Upon looking into it more, it actually turns out that this vector would be unlikely to be supported in the first place, since all the XVector derived functions seem to use int indexing internally:
```r
library(Biostrings)
y <- paste(sample(DNA_BASES, 128, replace=TRUE), collapse='')
x <- DNAString(y)
for (i in seq_len(23))
  x <- xscat(x, x)
x
## length 1,073,741,824

## would be length 2,147,483,648, but fails because of overflow
xscat(x, x)
## Error in .Call2("XString_xscat", args, PACKAGE = "Biostrings") :
##   negative length vectors are not allowed
```
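The "negative length" message is consistent with the new length being computed in 32-bit `int` arithmetic: doubling 2^30 lands exactly on -2^31 under two's-complement wraparound. A standalone sketch (not the actual Biostrings arithmetic):

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    int32_t len = 1073741824;  /* 2^30: the length of x in the R example above */
    /* Signed overflow is undefined behavior in standard C, so compute in a
     * wider type and convert; on typical platforms this is the wrapped value
     * the int arithmetic would produce. */
    int32_t wrapped = (int32_t) ((int64_t) len + (int64_t) len);
    printf("len + len wraps to %d\n", wrapped);  /* -2147483648, a "negative length" */
    return 0;
}
```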
And this extends to all `XVector` classes:
```r
XInteger(3e9)
## Error in SharedInteger(length = length, val = val) :
##   'length' must be a single non-negative integer
## In addition: Warning message:
## In .local(.Object, ...) : NAs introduced by coercion to integer range
```
Note that R itself does support long vectors via double indexing (I believe):
```r
## this uses a lot of RAM, be careful
x <- rep(1L, 3e9)
length(x)
## [1] 3e+09
```
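That's right: since R 3.0.0, base R vectors can exceed 2^31 - 1 elements, and C code opts in by using `R_xlen_t`/`XLENGTH()` instead of `int`/`LENGTH()`. A minimal sketch of what long-vector-aware C code looks like (hypothetical function, not from XVector):

```c
#include <Rinternals.h>

/* Hypothetical: report a vector's length without truncating long vectors.
 * LENGTH() returns int and is not reliable past 2^31 - 1 elements;
 * XLENGTH() returns R_xlen_t (ptrdiff_t: 64-bit on most platforms). */
SEXP true_length(SEXP x)
{
    R_xlen_t n = XLENGTH(x);
    Rprintf("length = %lld\n", (long long) n);
    return ScalarReal((double) n);  /* doubles hold lengths up to 2^53 exactly */
}
```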
The obvious solution is to move to the more standard `size_t` or `long` (or `(u)int64_t`). `size_t` and `uint64_t` are unsigned, which may change other processing if negative values are used for internal checks in other XVector/IRanges/Biostrings methods. It does seem like some functions already use `long` values internally, e.g. XVector/src/vector_copy.c, lines 31 to 36 at 3fa081b:
```c
	copy_vector_block(out, (long long int) out_offset,
	                  in, (long long int) in_offset,
	                  (long long int) nelt);
	return;
}
```
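One caveat worth noting (my observation, not from the XVector source): those casts widen the values only after they have already passed through `int` parameters, so they can't recover an offset that overflowed at the interface. Illustrative standalone C:

```c
#include <stdio.h>

/* The int parameter is the bottleneck: widening inside the callee
 * preserves whatever (possibly wrapped) value arrived. */
static void take_offset(int offset)
{
    long long widened = (long long) offset;  /* too late to help */
    printf("callee sees: %lld\n", widened);
}

int main(void)
{
    long long true_offset = 3000000000LL;  /* hypothetical ~3 Gbp offset */
    /* Out-of-range conversion to int is implementation-defined; on typical
     * platforms the callee sees a wrapped negative value. */
    take_offset((int) true_offset);
    return 0;
}
```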
So that makes me think it may be as simple as converting `int` to `long int`, e.g.:
```c
void _vector_Ocopy(SEXP out, long int out_offset, SEXP in, long int in_offset, long int nelt,
                   SEXP lkup, int reverse, int Omode)
```
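One portability note to weigh here (mine, not from the issue): on 64-bit Windows (LLP64), `long int` is still 32 bits, so it wouldn't lift the limit there. `long long`, `int64_t`, or R's own `R_xlen_t` (the type base R uses for long-vector lengths, and signed, so negative-value sentinel checks still work) would. A hypothetical sketch of the signature with `R_xlen_t`:

```c
#include <Rinternals.h>

/* Hypothetical variant, not the actual XVector API: R_xlen_t is ptrdiff_t,
 * i.e. 64-bit on all common 64-bit platforms including Windows, and signed. */
void _vector_Ocopy_xlen(SEXP out, R_xlen_t out_offset,
                        SEXP in, R_xlen_t in_offset, R_xlen_t nelt,
                        SEXP lkup, int reverse, int Omode);
```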
But there would be a ton of places to change this, and I'm not sure what impact it would have downstream.
My larger question is: is this something that's actually needed? If so, how urgently? It's not clear to me how often users work with extremely large vector objects, or whether their use will become more frequent in the future. At the very least, it's probably worth adding a warning to the documentation somewhere and/or including more informative error messages to catch the `XInteger(3e9)` case. Might also be worth inquiring on the Bioconductor Slack.
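For the error-message piece, a sketch of the kind of guard I mean (hypothetical helper, not existing XVector code): validate the requested length against `INT_MAX` up front and say why it failed, instead of letting the coercion produce an NA and a confusing downstream error.

```c
#include <Rinternals.h>
#include <limits.h>

/* Hypothetical guard: convert a user-supplied length to int, failing with a
 * message that names the actual limit rather than a coercion warning.
 * The !(len >= 0) test also catches NaN/NA lengths. */
static int check_length_fits_int(double len)
{
    if (!(len >= 0) || len > (double) INT_MAX)
        error("'length' (%.0f) must be between 0 and %d: "
              "XVector objects currently use int-based indexing",
              len, INT_MAX);
    return (int) len;
}
```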