Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extract vocabularies from the specs #1510

Merged
merged 12 commits into from
Aug 17, 2024
Merged

Conversation

gregsdennis
Copy link
Member

@gregsdennis gregsdennis commented May 22, 2024

Relates to https://github.com/orgs/json-schema-org/discussions/724 and #1505.

Depends on #1518
Depends on #1512
(will likely need rebase after those merge)

The Core spec was a lot, but it was fairly straightforward.

This PR will make a best effort to include vocabularies as they are in 2020-12 as a start. Some redesign is inevitible, but this this is a good starting point.

@gregsdennis gregsdennis changed the title remove vocabularies from the specs Extract vocabularies from the specs May 22, 2024
Copy link
Member

@jdesrosiers jdesrosiers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've only skimmed a little so far, but one thing that stood out is that we don't necessarily need to remove the word or the concept of vocabularies generally as collection of keywords. We only need to remove the mechanism ($vocabulary) and anything that refers to the behavior of that mechanism. The spec used the term "vocabulary" before the vocabulary system was introduced and I think it would be fine if we leave a few places that use the term in the same way. It could help make this a smaller change.

@gregsdennis
Copy link
Member Author

Yes, the word was present, but it was used very loosely. Later we added it with a very specific (and vaguely different) meaning but didn't address the existing usages. (See https://json-schema.org/draft-07/json-schema-validation.)

I figure it's better for now to just remove it completely. I don't think that the text suffers.

@@ -295,7 +284,7 @@ the name of a property in the instance.

Omitting this keyword has the same behavior as an empty object.

## Vocabularies for Semantic Content With `format` {#format}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are considerable impactful changes for format. Specifically, schema authors will no longer have the ability to specify that format should be validated. Tool configuration is the only way to validate format when we remove vocabularies.

See discussion.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Might not be an issue if #1520 goes through.)

@gregsdennis gregsdennis force-pushed the gregsdennis/extract-vocabs branch from 0e9d125 to a3d2698 Compare May 22, 2024 23:26
@jdesrosiers
Copy link
Member

I figure it's better for now to just remove it completely. I don't think that the text suffers.

That's fine with me. I didn't think the text suffered, I was just trying to make this a smaller job 😄.

@gregsdennis gregsdennis force-pushed the gregsdennis/extract-vocabs branch from a3d2698 to c96c349 Compare June 10, 2024 01:10
@gregsdennis gregsdennis marked this pull request as ready for review June 11, 2024 01:16
@gregsdennis gregsdennis requested a review from a team June 11, 2024 22:45
@gregsdennis gregsdennis added this to the stable-release milestone Jun 11, 2024
@gregsdennis gregsdennis self-assigned this Jun 11, 2024
@Relequestual Relequestual added the agenda For the OCWM agenda label Jun 18, 2024
@gregsdennis
Copy link
Member Author

Reminder (to me): open a new issue to discuss the future of the keyword and link all of the issues I've been linking ☝️

Copy link
Member

@jdesrosiers jdesrosiers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is big, so I'm going to review in several parts. This first part is for jsonschema-core.md.

I like the wording changes. I especially like that works whether the vocabulary system is a thing or not which will make it easier to merge back in at some point if it ever reaches a stable state.

jsonschema-core.md Outdated Show resolved Hide resolved
jsonschema-core.md Outdated Show resolved Hide resolved
Comment on lines +806 to +834
Meta-schemas are used to inform an implementation how to interpret a schema.
Every schema has a meta-schema, which can be explicitly declared using the
`$schema` keyword.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this section was effectively the introduction to and definition of the term "dialect". That ends up getting lost in the changes. Do we want to do something to include that defines "dialect"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure. I wonder if that's something that needs to be re-introduced with the proposal.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the concept of a dialect should stay. The concept isn't dependent on the vocabulary system. I explain further in my latest review.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you want to define "dialect"? Is it merely the collection of keywords? Open to suggestions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I usually define "dialect" as any distinct flavor of JSON Schema whether it's defined by us (drafts), a third party (OpenAPI, MongoDB), or individuals for private use. That's probably not formal enough for the spec, but I like that definition specifically because it's somewhat vague. There are several things out there that call themselves JSON Schema, and we need a name for that. I see the vocabulary system as a way to formally define a dialect, but there are informal or otherwise formally defined dialects out there that don't use the vocabulary system.

Copy link
Member

@jdesrosiers jdesrosiers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's another partial review for the jsonschema-validation.md changes.

I think the format section still needs a little more work. I bet we can reduce the amount of changes necessary here by introducing an additional "mode" for evaluating format that people can configure. I'm thinking something like annotation-only, best-effort-assertion, and full-assertion. annotation-only would be equivalent to the format-annotation vocabulary. best-effort-assertion would be equivalent to the current configuration enabled behavior. full-assertion would be equivalent to the format-annotation vocabulary.

Comment on lines 356 to +357
parties, schema authors SHALL NOT expect a peer implementation to support such
custom format attributes. An implementation MUST NOT fail to collect unknown
formats as annotations. When the Format-Assertion vocabulary is specified,
implementations MUST fail upon encountering unknown formats.
custom format attributes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that given the changes in this section, we would always expect implementations to "support" custom format attributes, we just wouldn't expect them to validate to anything other than true. Since partial and even no validation is allowed for any format attribute, always returning true is a valid way to "support" that attribute.

I think the point of this sentence is to say that you can't expect it to assert the way you expect, but that ends up being the case for any format attribute including the ones defined in the spec. So, I'm not sure what we should do with this.

Copy link
Member Author

@gregsdennis gregsdennis Jun 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point might be moot with #1520.

Maybe it's enough to say that for unknown formats, an implementation must collect the value as an annotation and provide a passing validation result.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does #1520 makes this irrelevant? Isn't that just about what the default behavior is? We still have multiple behavior that we have to define in consistent way.

Maybe it's enough to say that for unknown formats, an implementation must collect the value as an annotation and provide a passing validation result.

That would work.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this is the solution here, too?

this keyword, it is RECOMMENDED to define additional keywords in a custom
vocabulary rather than additional format attributes if interoperability is
desired.
An implementation MUST NOT fail to collect unknown formats as annotations.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've been qualifying this with "(if the implementation supports annotation collection)" everywhere. Should we be doing that here as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly. I think this is a candidate for "things that need cleaning up later".

Comment on lines +361 to +362
When configured for assertion behavior for `format`, implementations MUST fail
upon encountering unknown formats.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We say that even when assertion behavior is enabled, implementations might still just return true, so why must an implementation fail when encountering an unknown format rather than returning true? Or should they fail if they don't have a best-effort implementation for the format whether it's known or not?

I think this requirement made sense applied to the format-assertion vocabulary, but doesn't really work in this context.

Copy link
Member Author

@gregsdennis gregsdennis Jun 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The answer to the question is: which is more acceptable, a false negative or a false positive? I landed on the side of false negative being more acceptable.

I think it's better for an implementation to default to failing a schema that contains a format it doesn't know rather than to pass it. Maybe it's one of those "refuse to process" scenarios?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't advocating for any particular behavior. I was just pointing out that the text is contradictory.

But, my opinion on the behavior is that in assert mode it should either:

  • Allow partial/no validation and unknown formats return true; OR
  • Require full validation and refuse to process unknown formats

Allowing no validation and refusing to process unknown formats doesn't make sense.

Copy link
Member Author

@gregsdennis gregsdennis Jun 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we just remove assertion completely, then, and let implementations figure out what that mode means? (Then separately we handle #1520.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as we define something before we release, I'm fine with leaving this somewhat undefined for now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I want to leave this for #1520 to figure out. For now, I just want to pull out the vocabulary language. We'll definitely address it before release, though.

@@ -969,40 +883,6 @@ draft-bhutton-json-schema-01, June 2022,
Hoehrmann, B., "Scripting Media Types", RFC 4329, DOI 10.17487/RFC4329, April
2006, <<https://www.rfc-editor.org/info/rfc4329>>.

## [Appendix] Keywords Moved from Validation to Core
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure removing this section is related to extracting vocabularies, but I fully agree with removing it.

Copy link
Member Author

@gregsdennis gregsdennis Jun 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm... I'm not sure. It probably got caught up from one of my more aggressive change experiments I have locally.

Still, I agree it can be removed. I'm not fussed about removing it in this PR if you're not.

Copy link
Member

@jdesrosiers jdesrosiers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's the last part of my review including the adr and proposal.

proposals/vocabularies-adr.md Outdated Show resolved Hide resolved
proposals/vocabularies-adr.md Outdated Show resolved Hide resolved
proposals/vocabularies.md Outdated Show resolved Hide resolved
proposals/vocabularies.md Outdated Show resolved Hide resolved
proposals/vocabularies.md Outdated Show resolved Hide resolved
proposals/vocabularies.md Show resolved Hide resolved
Comment on lines +128 to +131
schema is being validated by a meta-schema that includes the vocabulary. (A
vocabulary schema is not itself a meta-schema since it does not validate entire
schemas.) To facilitate this extra validation, when a vocabulary schema is
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've always defined a meta-schema to be a schema that describes a schema. These vocabulary schemas do fit that definition. I understand what you're trying to say here, but I don't think saying it's not a meta-schema is the right approach. Didn't Henry update the spec at some point with a way to describe this behavior without having to say it's ignored or it's not a meta-schema?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

He didn't update the spec or anything. This actually came out of reviewing @jviotti's book and trying to rework my vocab schemas. I wrote about it here, and we pulled out $vocabulary from them in #1460, #1461, and #1462.

They're not meta-schemas because they don't themselves describe full schemas; they are used by meta-schemas as components.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we're quite talking about the same thing. I entirely agree that vocabulary meta-schemas shouldn't include the $vocabulary keyword.

They're not meta-schemas because they don't themselves describe full schemas they are used by meta-schemas as components.

This distinction is what doesn't sit right with me. The way I see it, a component of a meta-schema is still a meta-schema. Section 8.1.2.2 is what I thought you were referring to here. Is that correct?

It does seem like the wording there is not considering schemas referenced by a meta-schema a meta-schema, but that's never how I've understood the word or how we've defined the word. I've always used the terms dialect meta-schema and vocabulary meta-schema. The $vocabulary keyword is only meaningful when the meta-schema is used as a dialect meta-schema.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I wasn't referring to that section.

I'm looking at Core 4.3.4 which defines "meta-schema":

A schema that itself describes a schema is called a meta-schema.

A vocabulary schema doesn't describe a schema, therefore it's not a meta-schema.

You wouldn't use the Meta-data vocab schema as a meta-schema; you reference it from a meta-schema.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. We're using the same definition, but interpreting it differently. The way I see it, a vocabulary schema is validating a schema. It validates the syntax of keywords in a schema. I don't think the definition implies that a schema is only a meta-schema if it describes the entire dialect used by the schema.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to publish a canonical vocab metaschema schema?

I'm not sure what you're leading toward. Are you questioning the current practice of defining a vocab schema? Or are you saying that we should have a vocab schema that can function as a meta-schema?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I got confused by this, others doubtlessly will (or have already) as well.

I 100% agree that we need to explain this better and using more descriptive terms is a big part of doing that well. But, the way you're currently expressing this is confusing to me because this doesn't mesh with my (and I'm sure others) understanding of the term "meta-schema". That's why I'd prefer to address this by introducing new terms that are more specific. Earlier I mention that I use the terms "dialect meta-schema" and "vocabulary meta-schema", but since the term "meta-schema" appears to be inconsistently understood, perhaps "dialect schema" and "vocabulary schema" is better.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to publish a canonical vocab metaschema schema?

I'm not sure what you're leading toward. Are you questioning the current practice of defining a vocab schema? Or are you saying that we should have a vocab schema that can function as a meta-schema?

I'm not questioning the current practice. But I would not have made the mistake if we'd had a "vocab metaschema" schema that clearly defined what was allowed in a vocab metaschema. It would have been apparent that we were talking about a very similar, but different entity.

And I am tending towards the idea of using the terms dialect schema and vocabulary schema

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not have made the mistake if we'd had a "vocab metaschema" schema that clearly defined what was allowed in a vocab metaschema. It would have been apparent that we were talking about a very similar, but different entity.

What would be the difference between these two meta-schemas? Everything that's allowed in a vocab schema is allowed in a dialect schema. Everything that's allowed in a dialect schema is allowed in a vocab schema. The only difference I can see would be their identifier.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @jdesrosiers on this: syntactically they're just schemas.

The only difference is intent. A meta-schema is intended to describe/validate a schema. It can use vocab schemas (by reference) to accomplish that task. A vocab schema on its own does not describe a schema; it's merely a building block in a meta-schema. This is the distinction I'm making. (One could create a single-vocab meta-schema.)

I don't think having a separate meta-schema for vocab schemas is needed since anything you do in a general schema should technically be allowed in a vocab schema. I'm not sure how useful all of those features would be, but I don't see a reason to forbid any features.

proposals/vocabularies.md Outdated Show resolved Hide resolved
proposals/vocabularies.md Outdated Show resolved Hide resolved
@gregsdennis gregsdennis force-pushed the gregsdennis/extract-vocabs branch from 8faa17e to bc542ee Compare June 21, 2024 04:51
@gregsdennis
Copy link
Member Author

I've rebased this on top of #1512. Starting to address comments now.

@gregsdennis gregsdennis force-pushed the gregsdennis/extract-vocabs branch from cb49d5e to d23ec05 Compare July 17, 2024 00:21
@gregsdennis gregsdennis requested review from mwadams, jdesrosiers and a team August 1, 2024 21:12
@gregsdennis
Copy link
Member Author

@jdesrosiers would you please re-review this? I've made changes according to our conversations.

Also, would you be satisfied with the proposal doc in its current state as a starting point? I fully expect further issues and PRs to be created to edit the proposal.

@gregsdennis
Copy link
Member Author

[Just some housekeeping]

Issues to be opened and addressed in follow-up PRs:

Copy link
Member

@jdesrosiers jdesrosiers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still not happy with some parts. I feel like this is a change in the meaning of "meta-schema" and I'm not sure I'm comfortable with that. But, I'm ok with merging this as is as an iteration. The proposal will need to evolve or die anyway. This is not the final version in any case.

proposals/vocabularies.md Outdated Show resolved Hide resolved
Co-authored-by: Jason Desrosiers <[email protected]>
@gregsdennis
Copy link
Member Author

I feel like this is a change in the meaning of "meta-schema" and I'm not sure I'm comfortable with that.

Regarding this, I feel we've been misusing "meta-schema" by allowing it to refer to schemas that describe vocabs, and I want us to return to using it to refer to schemas that describes schemas.

It's not a change in definition; it's a return to correct usage.

But, I'm ok with merging this as is as an iteration.

Definitely, this PR is focused on getting vocabs out of the spec so that we can start iterating on the spec again. This is the last piece that we agreed to remove.

The proposal is definitely unfinished. The approach I took was to just make it as close to the 2020-12 behavior as I could. I recognize that it won't work with the next spec, but we need a starting point. The proposal will continue to evolve and will likely not look much like its current state when it's done. Maybe it'll just be wholesale replaced by something better.

@gregsdennis gregsdennis merged commit 51285b3 into main Aug 17, 2024
3 checks passed
@gregsdennis gregsdennis deleted the gregsdennis/extract-vocabs branch August 17, 2024 00:06

- MUST still collect the keyword's value as an annotation (if the implementation
supports annotation collection),
- MUST provide a configuration option to enable assertion behavior, defaulting to
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default is surely up to the tool? We don't usually specify what the tooling does in that regard.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. In order to ensure interoperability, we need all implementations to have the same out-of-the-box behavior. Implementations can do whatever, but they need non-spec behavior hidden behind configuration so that the user must explicitly opt in. This applies in general.

This particular text will be revisited with the resolution of #1520.

this keyword, it is RECOMMENDED to define additional keywords in a custom
vocabulary rather than additional format attributes if interoperability is
desired.
An implementation MUST NOT fail to collect unknown formats as annotations.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might it be better to avoid the double negative? This then becomes "must collect format annotations (if it supports annotation collection)" like all the other cases where we express this idea.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, probably. This was the text in previous versions. I just reverted it to that. Can be revisited.

proposals/vocabularies.md Show resolved Hide resolved
Comment on lines +128 to +131
schema is being validated by a meta-schema that includes the vocabulary. (A
vocabulary schema is not itself a meta-schema since it does not validate entire
schemas.) To facilitate this extra validation, when a vocabulary schema is
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to publish a canonical vocab metaschema schema?

I'm not sure what you're leading toward. Are you questioning the current practice of defining a vocab schema? Or are you saying that we should have a vocab schema that can function as a meta-schema?

I'm not questioning the current practice. But I would not have made the mistake if we'd had a "vocab metaschema" schema that clearly defined what was allowed in a vocab metaschema. It would have been apparent that we were talking about a very similar, but different entity.

And I am tending towards the idea of using the terms dialect schema and vocabulary schema

@gregsdennis gregsdennis removed the agenda For the OCWM agenda label Aug 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

4 participants