-
Notifications
You must be signed in to change notification settings - Fork 264
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PHPLIB-1022: Codec implementation for better BSON decoding/encoding #1059
Conversation
$this->attachEncoder($item); | ||
} | ||
|
||
// Yes, we'll silently discard everything. Please let me already have union types... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may want to replace this with an InvalidArgumentException
- for now this is fine.
|
||
$return = []; | ||
foreach ($value as $offset => $offsetValue) { | ||
$return[$offset] = $this->getLibrary()->canEncode($offsetValue) ? $this->getLibrary()->encode($offsetValue) : $offsetValue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently codecs are designed to throw an exception if encode
is called with an unsupported object. However, we may want to consider a encodeOrReturn
kind of method that contains this logic to avoid repeating this all over the place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
encodeOrReturn()
seems sensible.
* | ||
* @template-extends IteratorIterator<TKey, TValue, TIterator> | ||
*/ | ||
final class AsListIterator extends IteratorIterator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This iterator is used when serialising LazyBSONArray
instances to ensure that we return a packed array. The psalm template annotations may need some work (or silencing of errors), but for the time being this works.
@@ -49,7 +49,7 @@ public function __construct(Traversable $traversable, Closure $callback) | |||
#[ReturnTypeWillChange] | |||
public function current() | |||
{ | |||
return ($this->callback)($this->iterator->current()); | |||
return ($this->callback)($this->iterator->current(), $this->iterator->key()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed that our CallbackIterator
only passes the value to the callback, but not the key. This change brings the iterator in line with CallbackFilterIterator
which also passes both. Note that the latter also passes the whole iterator object as well, which I can add for the sake of feature parity, but it isn't required for this PR to work properly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Historically, CallbackIterator has only ever been used by ListCollectionNames (2e39482), which had no need of key processing.
Is it feasible to document the callback type using Psalm annotations?
Also, I would be in favor of relaxing its type from Closure to callable
and using call_user_func()
here if you think that's more consistent with what we do elsewhere. I believe this class is the only place in the library we actually use Closure directly.
$codec = new LazyBSONArrayCodec(); | ||
$codec->attachLibrary($this->getLibrary()); | ||
|
||
// @psalm-suppress InvalidReturnStatement |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
toPHP
returns array|object
, so we specifically have to ignore psalm errors here as we know that a PackedArray::toPHP()
will always return an array in the absence of a type map.
if ($offset === null) { | ||
$this->readEntirePackedArray(); | ||
|
||
$offset = max(array_merge( | ||
array_keys($this->read), | ||
array_keys($this->set), | ||
)) + 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This piece of logic is necessary to allow appending to the list, in which case $offset
will be null
. In that case, we grab the maximum offset that we have read or written and write the next offset. This is consistent with how PHP handles arrays with missing offsets. Note that for this to work we have to ensure that the entire BSON packed array has been read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you test this for appending to an empty array? In that case, I'd expect $this->read
and $this->set
might both be empty, in which case you pass an empty array to max()
, which is an error.
8e27a3c
to
d2d950d
Compare
d2d950d
to
cf85d8c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took a first pass through the source files but not tests.
* @psalm-template B | ||
* @psalm-template T |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume "T" refers to a PHP value and "B" is for BSON. Can we use more descriptive names (e.g. TBson) like you did in AsListIterator?
namespace MongoDB\Codec; | ||
|
||
/** | ||
* @api |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this imply the class is something we want documented?
I know we use @api
elsewhere, but I've forgotten the motivation behind that. When we're thinking about preserving BC, I know everything in the library is implicitly part of the API unless we've annotated it with @internal
, so that might lead one to think @api
is redundant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be superfluous, in which case I'd go through and remove instances of @api
across the board. 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/** @param Decoder|Encoder $items */ | ||
public function __construct(...$items) | ||
{ | ||
array_map( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason to use this over foreach
? You're not actually modifying the array elements, so the callback doesn't seem necessary.
} | ||
} | ||
|
||
return $value === null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does null
get special treatment? Likewise for canEncode()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea was for each codec to decide how to handle a null
value. For example, an ODM may decide that for a list of embedded documents it may not want to encode null
as an empty array. If a single codec can't deal with a null
value, it should return false
in its canEncode
method.
On second thought, the special handling for null
seems misguided. If null
deserves special treatment, other scalar values should probably enjoy the same. By that logic, it would make more sense to include a separate ScalarValueCodec
that handles all scalar values that PHP offers, including null
and returns them as-is. This would also do away with the encodeOrReturn
topic in other places. Same goes for array, where a separate codec should support any array and traverse the array to encode all items within the array. I think that would make it more explicit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would ScalarValueCodec require additional changes in PHPC? If so, this may be a good opportunity to try and provide a way to differentiate 32-bit and 64-bit integers.
if (! $decoder->canDecode($value)) { | ||
continue; | ||
} | ||
|
||
return $decoder->decode($value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (! $decoder->canDecode($value)) { | |
continue; | |
} | |
return $decoder->decode($value); | |
if ($decoder->canDecode($value)) { | |
return $decoder->decode($value); | |
} |
IMO, the guard statement makes sense if we were doing something else here that would need a break
afterwards, but that's not the case with return
. This just seems more concise unless you expect the body of this loop to get more complicated down the line.
$value = $this->bson->get($offset); | ||
|
||
// Decode value using the codec library if a codec exists | ||
$this->read[$offset] = $this->decodeBSONValue($value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the logic in offsetGet()
, $this->set
is preferred over $this->read
.
When offsetSet()
is called, it clears out the corresponding offset in $this->unset
and $this->notFound
. Would it make sense for it to also clear out $this->read
as well? As-is, you're leaving behind a decoded value that won't be accessed.
|
||
private function getLibrary(): CodecLibrary | ||
{ | ||
return $this->library ?? $this->library = new CodecLibrary(new LazyBSONDocumentCodec(), new LazyBSONArrayCodec()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to my earlier comment, I think this is more readable with a conditional to initialize $this->library
followed by an unconditional return.
return; | ||
} | ||
|
||
$this->entirePackedArrayRead = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be safer to defer this until after iteration? I'm considering some edge case where decoding BSON fails (e.g. during a call to offsetSet()
) but the user catches and ignores the exception. That might leave LazyBSONArray in an inconsistent state.
$this->entirePackedArrayRead = true; | ||
|
||
foreach ($this->bson as $key => $value) { | ||
if (isset($this->read[$key])) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should you use array_key_exists()
here? $this->read
stores decoded BSON values, and null
is a valid field value.
If this was an oversight, it'd be worth adding a regression test.
Although, if this is just an optimization to cut down on additional calls to decodeBSONValue()
maybe it isn't worth the trouble, since most things won't decode to null
. In that case, a comment explaining the logic here would still be helpful.
if (is_object($value)) { | ||
$this->set[$key] = clone $value; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to my earlier comment in LazyBSONArray, I think you may want recursive_copy()
here. And perhaps a regression test for both that fails with this original code.
Closing this prototype PR in favour of actual work: #1125 |
This pull request adds the concept of codecs to the library. A codec consists of an encoder and decoder (currently separated, but we could consolidate these into a single interface).
A decoder is responsible for decoding BSON data into PHP data, while the encoder converts PHP data into BSON data. This PR comes with two default codecs,
LazyBSONDocumentCodec
andLazyBSONArrayCodec
. These work with new lazy classes that mimic the existingBSONArray
andBSONDocument
, but are backed by the new BSONDocument
andPackedArray
classes provided in the new extension version. This allows us to defer reading and decoding BSON until data is actually accessed. Furthermore, only accessed fields are ever read, meaning that any BSON data that isn't actively used is never decoded and thus does not have a considerable memory or time impact.For the time being, the new lazy classes do not extend the existing
BSONArray
andBSONDocument
classes, but for backward compatibility it might be preferable to do so: this would allow us to enable this lazy behaviour by default. As is, this functionality is opt-in due to the BC break involved in returning classes that don't extend the current classes.The next step is to change the
Collection
class to take acodec
option, which would allow said class to decode and encode data in the various methods that would return modelled data. Apart from decoding BSON documents into specific PHP objects lazily, the codec functionality would also allow users to change the way BSON types are returned. For example, aDateTimeCodec
could decode anyMongoDB\BSON\UTCDateTime
instances to aDateTime
instance and vice-versa, with the collection class automatically converting those instances toUTCDateTime
.This is a first draft, and as such there are a number of items left to do:
Array
vs.PackedArray
)Collection
instanceClient
aware of a codec library containing codecs that should always be used, e.g. to transform BSON typescodec
option whenever fetching aCollection
instance.