Noting down here a potential issue I encountered while working on Biostrings that may arise in the future.
A current limitation of Biostrings is that the `width` field of the `XStringSet` I/O functions is capped at 200,002 due to internal I/O buffer sizing. Users have expressed a desire for unlimited line widths when reading/writing FASTA files, so I'm working on removing that cap.
The internal functions rely on variants of the `Ocopy` functions to copy values from the `XStringSet` (`XVector`) object into the I/O buffer. It seems like these functions expect `int` offsets as input, e.g. XVector/src/vector_copy.c, lines 20 to 21 at 3fa081b:
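(The embedded snippet doesn't copy over here; the signature below is reconstructed from the proposed `long int` variant quoted later in this issue, so treat the exact layout as approximate.)

```c
void _vector_Ocopy(SEXP out, int out_offset, SEXP in, int in_offset, int nelt,
                   SEXP lkup, int reverse, int Omode)
```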
On most platforms `int` compiles to a 32-bit type (`int32_t`), which caps values at 2^31 - 1 (~2.1 billion). I could envision a user having a 2 Gbp sequence -- it would be extremely rare, but not outside the realm of possibility. Attempting to write such a sequence using `writeXStringSet` would cause integer overflow errors due to the `int` offset indexing.
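As a minimal standalone illustration (not XVector code; the sequence length is made up), an offset past `INT_MAX` simply cannot round-trip through an `int`:

```c
#include <stdio.h>
#include <stdint.h>
#include <limits.h>

int main(void)
{
    int64_t seq_len = 2500000000LL;  /* hypothetical ~2.5 Gbp sequence */
    /* Converting an out-of-range value to int is implementation-defined;
     * on typical two's-complement platforms it wraps to a negative number. */
    int offset = (int) seq_len;
    printf("true offset: %lld, as int: %d (INT_MAX = %d)\n",
           (long long) seq_len, offset, INT_MAX);
    return 0;
}
```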
Upon looking into it more, it actually turns out that this vector would be unlikely to be supported in the first place, since all the XVector derived functions seem to use int indexing internally:
```r
library(Biostrings)
y <- paste(sample(DNA_BASES, 128, replace=TRUE), collapse='')
x <- DNAString(y)
for (i in seq_len(23))
  x <- xscat(x, x)
x
## length 1,073,741,824

## would be length 2,147,483,648, but fails because of overflow
xscat(x, x)
## Error in .Call2("XString_xscat", args, PACKAGE = "Biostrings") :
##   negative length vectors are not allowed
```
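The "negative length" message is consistent with the new length being computed in 32-bit `int` arithmetic: doubling 2^30 lands exactly on -2^31 under two's-complement wraparound. A standalone sketch (not the actual Biostrings arithmetic):

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    int32_t len = 1073741824;  /* 2^30: the length of x in the R example above */
    /* Signed overflow is undefined behavior in standard C, so compute in a
     * wider type and convert; on typical platforms this is the wrapped value
     * the int arithmetic would produce. */
    int32_t wrapped = (int32_t) ((int64_t) len + (int64_t) len);
    printf("len + len wraps to %d\n", wrapped);  /* -2147483648, a "negative length" */
    return 0;
}
```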
And this extends to all `XVector` classes:
```r
XInteger(3e9)
## Error in SharedInteger(length = length, val = val) :
##   'length' must be a single non-negative integer
## In addition: Warning message:
## In .local(.Object, ...) : NAs introduced by coercion to integer range
```
Note that R itself does support long vectors via double indexing (I believe):
```r
## this uses a lot of RAM, be careful
x <- rep(1L, 3e9)
length(x)
## [1] 3e+09
```
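That's right: since R 3.0.0, base R vectors can exceed 2^31 - 1 elements, and C code opts in by using `R_xlen_t`/`XLENGTH()` instead of `int`/`LENGTH()`. A minimal sketch of what long-vector-aware C code looks like (hypothetical function, not from XVector):

```c
#include <Rinternals.h>

/* Hypothetical: report a vector's length without truncating long vectors.
 * LENGTH() returns int and is not reliable past 2^31 - 1 elements;
 * XLENGTH() returns R_xlen_t (ptrdiff_t: 64-bit on most platforms). */
SEXP true_length(SEXP x)
{
    R_xlen_t n = XLENGTH(x);
    Rprintf("length = %lld\n", (long long) n);
    return ScalarReal((double) n);  /* doubles hold lengths up to 2^53 exactly */
}
```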
The obvious solution is to move to the more standard `size_t` or `long` (or `(u)int64_t`). `size_t` and `uint64_t` are unsigned, which may change other processing if negative values are used for internal checks in other XVector/IRanges/Biostrings methods. It does seem like some functions already use `long` values internally, e.g. XVector/src/vector_copy.c, lines 31 to 36 at 3fa081b:
```c
	copy_vector_block(out, (long long int) out_offset,
	                  in, (long long int) in_offset,
	                  (long long int) nelt);
	return;
}
```
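One caveat worth noting (my observation, not from the XVector source): those casts widen the values only after they have already passed through `int` parameters, so they can't recover an offset that overflowed at the interface. Illustrative standalone C:

```c
#include <stdio.h>

/* The int parameter is the bottleneck: widening inside the callee
 * preserves whatever (possibly wrapped) value arrived. */
static void take_offset(int offset)
{
    long long widened = (long long) offset;  /* too late to help */
    printf("callee sees: %lld\n", widened);
}

int main(void)
{
    long long true_offset = 3000000000LL;  /* hypothetical ~3 Gbp offset */
    /* Out-of-range conversion to int is implementation-defined; on typical
     * platforms the callee sees a wrapped negative value. */
    take_offset((int) true_offset);
    return 0;
}
```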
So that makes me think it may be as simple as converting `int` to `long int`, e.g.:
```c
void _vector_Ocopy(SEXP out, long int out_offset, SEXP in, long int in_offset, long int nelt,
                   SEXP lkup, int reverse, int Omode)
```
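One portability note to weigh here (mine, not from the issue): on 64-bit Windows (LLP64), `long int` is still 32 bits, so it wouldn't lift the limit there. `long long`, `int64_t`, or R's own `R_xlen_t` (the type base R uses for long-vector lengths, and signed, so negative-value sentinel checks still work) would. A hypothetical sketch of the signature with `R_xlen_t`:

```c
#include <Rinternals.h>

/* Hypothetical variant, not the actual XVector API: R_xlen_t is ptrdiff_t,
 * i.e. 64-bit on all common 64-bit platforms including Windows, and signed. */
void _vector_Ocopy_xlen(SEXP out, R_xlen_t out_offset,
                        SEXP in, R_xlen_t in_offset, R_xlen_t nelt,
                        SEXP lkup, int reverse, int Omode);
```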
But there would be a ton of places to change this, and I'm not sure what impact it would have downstream.
My larger question is: is this something that's actually needed? If so, how urgently? It's not clear to me how often users work with extremely large vector objects, or whether their use will become more frequent in the future. At the very least, it's probably worth adding a warning to the documentation somewhere and/or including more informative error messages to catch the `XInteger(3e9)` case. Might also be worth inquiring on the Bioconductor Slack.
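For the error-message piece, a sketch of the kind of guard I mean (hypothetical helper, not existing XVector code): validate the requested length against `INT_MAX` up front and say why it failed, instead of letting the coercion produce an NA and a confusing downstream error.

```c
#include <Rinternals.h>
#include <limits.h>

/* Hypothetical guard: convert a user-supplied length to int, failing with a
 * message that names the actual limit rather than a coercion warning.
 * The !(len >= 0) test also catches NaN/NA lengths. */
static int check_length_fits_int(double len)
{
    if (!(len >= 0) || len > (double) INT_MAX)
        error("'length' (%.0f) must be between 0 and %d: "
              "XVector objects currently use int-based indexing",
              len, INT_MAX);
    return (int) len;
}
```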