Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PHPLIB-1254 Create Yaml for all operators and stages #1180

Closed
wants to merge 37 commits into from

Conversation

GromNaN
Copy link
Member

@GromNaN GromNaN commented Oct 4, 2023

Fix PHPLIB-1254

Operators and stages specifications

To create Yaml for all the stages and operators (query, pipeline), I started by filling a Google Sheet manually, from online documentation. Even if that took hours, it was a good learning. This document is temporary and will not be used after this PR.

The Yaml files are generated with the scrape command, that will be removed before merging.

The schema.json applies to Yaml files.

All operators and stages have an "encoding" type, that describes how to convert the object into BSON.

"$comment": "Specifies how operator parameters are encoded.",
"$comment": "array: parameters are encoded as an array of values in the order they are defined by the spec",
"$comment": "object: parameters are encoded as an object with keys matching the parameter names",
"$comment": "single: get the single parameter value",
"$comment": "group: specific for $group stage",

Types

The types will be reworked in PHPLIB-1251.

  • Add AccumulatorInterface to type hint Group Accumulator Operators. It might be necessary to distinct the accumulators for each stage ($group, $setWindowFields...)
  • Add QueryInterface distinct query operators from expressions marked with ExpressionInterface.

Add the MongoDB\object function to create stdClass from variadic named arguments.

/**
* Create a new stdClass instance with the provided properties.
*
* @psalm-suppress MoreSpecificReturnType
* @psalm-suppress LessSpecificReturnStatement
*/
function object(mixed ...$values): stdClass
{
return (object) $values;
}

Copy link
Member Author

@GromNaN GromNaN left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very WIP

$filename = $dirname . '/' . $name . '.yaml';

// Add a schema reference to the top of the file
$schema = '# $schema: ../schema.json' . PHP_EOL;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PHPStorm will support it soon.

Will be available in 2023.3 Release.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I committed .idea/jsonSchemas.xml for the time being.


public function execute(InputInterface $input, OutputInterface $output): int
{
$index = file_get_contents('https://docs.google.com/spreadsheets/d/e/2PACX-1vROpGTJGXAKf2SVuSZaw16NwMVtzMVGH9b-YiMtddgZRZOjOO6jK2YLbTUZ0N_qe74nxGY9hYhUe-l2/pubhtml');
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I published the Google Sheet publicly as an HTML document because csv and tsv doesn't have correct support for newline. Also, I can edit in Google and run my command without all the hassle of Google authentication.


use const PHP_EOL;

final class ScrapeCommand extends Command
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll probably remove this command before merging. The google sheet is a step for generating hundreds of Yaml file. But it is not suitable for day-to-day maintenance and versioning.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it potentially come in handy down the line to track necessary changes to YAML files? Drivers typically do not get notified of all updates to MQL (query and aggregation) syntax, so I think it's likely we'll fall behind on this and may want some tooling to compare docs against our own mappings.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's all done by hand. Unless you use AI, I don't see how you can recreate this file simply to detect changes.

The form in which things are documented varies greatly and some informations where extracted from the server source code. I'd better maintain Yaml files and get notified of changes.

@GromNaN GromNaN marked this pull request as ready for review October 6, 2023 07:39
@GromNaN GromNaN requested review from alcaeus and jmikola October 6, 2023 07:41
.github/workflows/benchmark.yml Outdated Show resolved Hide resolved
@@ -20,14 +24,44 @@ function typeFieldPath(string $resolvesTo): array
}

return [
'mixed' => ['scalar' => true, 'types' => ['mixed']],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mixed is very PHP-ish. Should we rename this to any to make it more applicable to other programming languages?

Comment on lines 37 to 40
'Regex' => ['scalar' => true, 'types' => [BSON\Regex::class]],
'Constant' => ['scalar' => true, 'types' => ['mixed']],
'Binary' => ['scalar' => true, 'types' => ['string', BSON\Binary::class]],

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason these are capitalised? If we're supporting all BSON types here, there's no reason to change the capitalisation of names.

],
DateFieldPath::class => typeFieldPath(ResolvesToDate::class),
ResolvesToTimestamp::class => [
'implements' => [ResolvesToInt::class],
'types' => ['int', BSON\Int64::class],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should include the BSON\Timestamp classs.

@@ -32,7 +32,10 @@ class BuilderEncoder implements Encoder
*/
public function canEncode($value): bool
{
return $value instanceof Pipeline || $value instanceof StageInterface || $value instanceof ExpressionInterface;
return $value instanceof Pipeline
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should definitely consider splitting this class.

- "7.4"
- "8.0"
#- "7.4"
#- "8.0"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be reverted by PHPLIB-1268

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you come to a solution to still run PHPLIB against 7.4+ but limit the aggregation builder code to 8.1+?

@@ -18,6 +18,9 @@
"mongodb/mongodb": "@dev",
"nette/php-generator": "^4",
"symfony/console": "^6.3",
"symfony/css-selector": "^6.3",
"symfony/dom-crawler": "^6.3",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove when the scrape command is removed.

};

// Convert arguments to ArgumentDefinition objects
// Optional arguments must be after required arguments
Copy link
Member Author

@GromNaN GromNaN Oct 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The optional argument have to move to the end, as PHP doesn't support optional arguments before required ones.

That doesn't matter for "object" encoding, but this is an issue for "array" encoding since the order of arguments matter.


$slice have an optional argument in the middle of required arguments.

To be reworked in PHPLIB-1274

Copy link
Member

@jmikola jmikola Oct 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$slice have an optional argument in the middle of required arguments.

This sadly dates back to server version 3.2. I hope we've learned not to make such mistakes in the future.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we ask the server team to allow an object here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They certainly can't change the array order due to BC reasons.

Argument lists for operators are typically lists, so I don't imagine using an object would be consistent. Even if such a change was made, we'd still have to support the legacy array format for older server versions, which in turns means we'd need to implement version checking here. AFAIK, all of the query syntax is modeled irrespective of the server version, so I'd rather not open that can of worms.

I expect $slice will just need to be handled specially.

-
name: config
type:
- object
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not modelize the full object. Let's use object function for that. People will have to read the documentation anyway.
I could have made it a variadic map.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aye, this agrees with what we do elsewhere for certain operations (e.g. ModifyCollection).

-
name: range
type:
- Range
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

range model is a plain object for now. Not modelized.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For cases where we are choosing not to model the object, can we incorporate links in the YAML so that we get @see references in the generated code? Then, if folks happen to view the generated operator file they can be directed to the server documentation.

Copy link
Member Author

@GromNaN GromNaN Oct 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a @see comment with the link to the doc on each operator class header, and factory method.

@@ -0,0 +1,23 @@
# $schema: ../schema.json
name: $firstN
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a name conflict with $firstN that is both aggregation accumulator and array operator. Same for $lastN.

Copy link
Member

@jmikola jmikola Oct 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both of these fall under aggregation, correct? And not the query namespace.

The API for both of these operators looks similar, but I assume they are distinct and there's nothing to say other operators and accumulators with the same name might differ more significantly. In this case, would it make sense to organize accumulators under their own namespace?

I actually notice that there are two groupings for accumulators in the docs:

It's hard to predict how we might encounter naming conflicts in the future, but do you think it'd be worth organizing the many aggregation operators by separate namespaces to group them logically?

We could also do the same for Query and Projection operators, which have far fewer categories, if you like.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we over-categorize, it could be confusing for developers, who will have to guess which category we've put the operator they want to use.
Also, I don't want to duplicate Yaml files for operator what would be in multiple categories, like $avg or $count.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this file modeling the array operator or accumulator? The link and description fields below suggest the array operator , but I see type: [ Accumulator ] as well.

Copy link
Member

@jmikola jmikola Oct 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this file modeling the array operator or accumulator? The link and description fields below suggest the array operator , but I see type: [ Accumulator ] as well.

I can see how over-categorization could be problematic (assuming users need to reference class names and don't use factory methods). What is your proposed solution?

Comment on lines +161 to +177
private function recursiveEncode(mixed $value): mixed
{
if (is_array($value)) {
foreach ($value as $key => $val) {
$value[$key] = $this->recursiveEncode($val);
}

return $value;
}

if ($value instanceof stdClass) {
foreach (get_object_vars($value) as $key => $val) {
$value->{$key} = $this->recursiveEncode($val);
}
}

return $this->encodeIfSupported($value);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is all about the ArrayCodec and ObjectCodec that were described in the codec architecture doc.
https://github.com/mongodb/mongo-php-library/pull/1125/files#diff-8bbcff6724826a3ca2b160e0718dbee5520987bbaa22e78eda96d0d448cad5e8R27

- "7.4"
- "8.0"
#- "7.4"
#- "8.0"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you come to a solution to still run PHPLIB against 7.4+ but limit the aggregation builder code to 8.1+?

@@ -0,0 +1,27 @@
<?xml version="1.0" encoding="UTF-8"?>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interestingly, this got picked up as a generated file in GitHub:

Some generated files are not rendered by default.

I realize it's temporary, but that doesn't seem right.

@@ -48,15 +50,15 @@ function toJSON(object $document): string
totalCount: Aggregation::sum(1),
evenCount: Aggregation::sum(
Aggregation::mod(
Expression::fieldPath('randomValue'),
Expression::numberFieldPath('randomValue'),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is numberFieldPath required over fieldPath since the code would expect ResolvesToNumber here? Do users still have the choice of passing an arbitrary field path as a string?

Looking at this now, I can see how these might be annoying if users must use type in the query builder.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a request from @alcaeus in the Technical Design doc.

To complete the type experience, there are also typed fieldPath helpers (e.g. stringFieldPath) which communicate to the type system that the field path resolves to the given type. This can again be used by an ORM to verify that the field will actually resolve to the given type (if known).

);

$expected = [
[
'$match' => [
'$or' => [
['score' => ['$gt' => 70, '$lt' => 90]],
// same as ['score' => ['$gt' => 70, '$lt' => 90]],
['score' => [['$gt' => 70], ['$lt' => 90]]],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are these equivalent? One is matching a field against an array, which I expect would be an exact match. Consider:

> db.foo.insertOne({score:10})
> db.foo.insertOne({score:80})
> db.foo.insertOne({score:[{$gt:70},{$lt:90}]})

> db.foo.aggregate([{$match:{$or:[{score:[{$gt:70},{$lt:90}]}]}}])
[
  {
    _id: ObjectId("6525b7d5f411a41fea45873b"),
    score: [ { '$gt': 70 }, { '$lt': 90 } ]
  }
]

> db.foo.aggregate([{$match:{$or:[{score:{$gt:70,$lt:90}}]}}])
[ { _id: ObjectId("6525b859f411a41fea45873c"), score: 80 } ]

The document with {score: 10} is never matched, as it doesn't satisfy both range operators. In the case where the range operators are passed separately in an array, the server performs an equality match.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was too easy... to be correct. I reworked the implementation to generate the expected result.

src/Builder/Type/QueryInterface.php Outdated Show resolved Hide resolved
*/
public function __construct(FieldPath|string ...$field)
{
if (\count($field) < 1) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it expected that generated files reference global functions this way? It's inconsistent with what we do in the hand-written sources, so if it's easily configurable it may be worth changing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added $namespace->addUseFunction calls to add this. The code generator don't know about the function body, so it cannot be detected.

@GromNaN
Copy link
Member Author

GromNaN commented Oct 16, 2023

Replaced by mongodb/mongo-php-builder#1

@GromNaN GromNaN closed this Oct 16, 2023
@GromNaN GromNaN deleted the PHPLIB-1254 branch October 16, 2023 15:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants