Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PHPLIB-1220: Support codec option for GridFS buckets #1149

Merged
merged 7 commits into from
Sep 11, 2023

Conversation

alcaeus
Copy link
Member

@alcaeus alcaeus commented Sep 5, 2023

PHPLIB-1220

Note that the Bucket class does not forward the codec option to the underlying collections, as they can't both work with the same codec (it's only applicable to file documents). Setting a codec on the file collection also wouldn't work, as the collection would expect to apply the codec when inserting a file document. To work around these issues, the codec is manually applied in the findOne, find, and getFileDocumentForStream methods.

@alcaeus alcaeus requested review from jmikola and GromNaN September 5, 2023 12:20
@alcaeus alcaeus self-assigned this Sep 5, 2023
src/GridFS/WritableStream.php Outdated Show resolved Hide resolved
Copy link
Member

@GromNaN GromNaN left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test codec is a good simple example of object mapping 👏🏻.

LGTM. It's up to you to allow null to unset the codec option.

src/GridFS/Bucket.php Outdated Show resolved Hide resolved
src/GridFS/WritableStream.php Outdated Show resolved Hide resolved
@alcaeus alcaeus requested a review from GromNaN September 6, 2023 08:57
Copy link
Member

@GromNaN GromNaN left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@alcaeus alcaeus force-pushed the phplib-1220-gridfs-codec branch from c8dd067 to d6dce41 Compare September 6, 2023 10:15
src/GridFS/Bucket.php Outdated Show resolved Hide resolved
@@ -136,6 +136,8 @@ public function __construct(CollectionWrapper $collectionWrapper, string $filena
'_id' => $options['_id'],
'chunkSize' => $this->chunkSize,
'filename' => $filename,
'length' => 0,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not initialized until fileCollectionInsert(). Is there any reason to use 0 here instead of null like you do for uploadDate below? null more clearly communicates that it's unset vs. being a potentially empty file, just as a null upload date infers that we're in a pre-upload state.

The only reason I can think otherwise might be if you were defining an array shape on this, but I don't see any evidence of that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I discovered this when calling getFileDocumentForStream, which tried to operate on the partial data available. Adding null values here allowed me to call $document->get('...') in the codec. While GridFS files are expected to always have a length and uploadDate, during the state of uploading (i.e. when the file hasn't been persisted to the database yet) this data is missing entirely, making the previous assumption false. I'm not sure what the better course of action is: ensuring that the data always has all fields, or making the codec accept partial documents that have some fields missing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noted that you changed length to null, which is actually all I was curious about.

The explanation is very helpful, though ✌️

@@ -166,6 +175,7 @@ public function __construct(Manager $manager, string $databaseName, array $optio
$this->databaseName = $databaseName;
$this->bucketName = $options['bucketName'];
$this->chunkSizeBytes = $options['chunkSizeBytes'];
$this->codec = $options['codec'] ?? null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand your explanation for not injecting codec into $collectionOptions and only applying on-demand in specific methods, but now I'm curious if we're handling typeMap incorrectly. That is set directly into $collectionOptions and ends up applying to most calls on the files and chunks collections.

Bucket uses getRawFileDocumentForStream() internally as needed to access a raw document (without the type map). Does that not also cover us for a codec? Various CollectionWrapper methods explicitly set a simple typeMap when we need consistent access to a document, and that should avoid a codec being inherited.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main problem with the codec is that it also applies on incoming data for operations like insertOne, while typeMap only applies to outgoing data. Applying a codec on the collection broke inserting GridFS files, which is when I noticed that something was up.

In an ideal world, we'd be handling everything as raw BSON data internally and only decoding it using typeMap or codec when necessary to do so. However, I believe this goes beyond the scope of what we want to do here. I'd be in favour of revisiting this when we make changes to our GridFS API to accommodate the recent request to make the stream protocol available to consumers. The refactoring necessary to accommodate this would also allow us to change the Bucket to internally always work with BSON documents and removing this special handling entirely. The collections would then receive a ['root' => 'bson'] type map, and all handling regarding codecs and type maps for the file document would be handled by the Bucket class.

src/GridFS/Bucket.php Show resolved Hide resolved
$fileObject->filename = $value->get('filename');

$uploadDate = $value->get('uploadDate');
$fileObject->uploadDate = $uploadDate ? DateTimeImmutable::createFromMutable($uploadDate->toDateTime()) : null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did this prompt PHPC-2286?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prompted by #1154 (comment).

$fileObject = new TestFile();
$fileObject->id = $value->get('_id');
$fileObject->length = (int) $value->get('length');
$fileObject->chunkSize = (int) $value->get('chunkSize');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the purpose of the int casts here? I notice you don't do casts for filename, but perhaps that's handled by PHP enforcing TestFile's property type.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're operating on raw BSON, we may get Int64 instances, however unlikely they may be (especially for chunkSize). I decided to be prudent and cast these to int to avoid that issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was my assumption. Thanks for confirming!

$fileObject->uploadDate = $uploadDate ? DateTimeImmutable::createFromMutable($uploadDate->toDateTime()) : null;

if ($value->has('metadata')) {
$fileObject->metadata = $value->get('metadata');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this assert that metadata is a document? I don't think any other type is expected.

Or is that possibly a can of worms because then we'd want a codec for that as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assertion added so I can call toPHP() on it, see comment below.

public int $chunkSize;
public ?DateTimeImmutable $uploadDate = null;
public string $filename;
public $metadata;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per my earlier question, should this be typed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a ?stdClass doctype and called toPHP on the metadata document to show that the "old" way of decoding data is still available.


final class TestFile
{
public $id;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted that GridFS permits custom _id values, so we'd be wrong to force ObjectId here.

@alcaeus alcaeus force-pushed the phplib-1220-gridfs-codec branch 2 times, most recently from fca2389 to a326ea4 Compare September 8, 2023 08:56
@alcaeus alcaeus force-pushed the phplib-1220-gridfs-codec branch from a326ea4 to f3a3be3 Compare September 11, 2023 06:40
@@ -381,6 +404,10 @@ public function getFileDocumentForStream($stream)
{
$file = $this->getRawFileDocumentForStream($stream);

if ($this->codec) {
return $this->codec->decode(Document::fromPHP($file));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here you re-create a BSON document that have already been decoded into PHP?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, yes: the document may not have come from the database but may be stored by the WritableStream instance as a hash.

@alcaeus alcaeus merged commit 7da7372 into mongodb:master Sep 11, 2023
@alcaeus alcaeus deleted the phplib-1220-gridfs-codec branch September 11, 2023 07:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants