Specifying Data Profiles #33
In response to the questions above:
Great, thanks.
Great - do you want to start a new repository and put all the schema files in a directory "data-profiles/NAME/" there?
@thill-odi I've been working with the core schema in https://github.com/openactive/conformance-profiles and have come across a wrinkle. In the email chat and this issue we agreed that I'd split any sub-types of data out and validate each type against its relevant type schema separately. However, you've used refs in the core profile. E.g. in https://github.com/openactive/conformance-profiles/blob/master/core/Event.json#L61 the Event object has a "location" field and you've put a $ref there. So in the code I just wrote, I start with an Event object, split out the "location" Place object and validate them separately. But now, when validating the root Event object, I get an incorrect validation error of:
I don't think we can do both - either we use $ref links or I do this splitting thing. I can see an argument for choosing the refs way: you can mark a "location" field as required for an Event (which indeed you have done - https://github.com/openactive/conformance-profiles/blob/master/core/Event.json#L13 ), and that's not possible with the splitting way. So shall we just allow $refs, and I'll stop separating out sub-objects?
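For reference, a minimal sketch of how the $ref approach might be wired up with ajv (the library referenced in the linked commits below); the file names, schema registration keys and sample data are assumptions for illustration, not the repository's actual layout:

```js
// Minimal sketch (not the project's actual code): register the profile's
// schema files with ajv so that $refs between them resolve, then validate
// the root Event object in a single pass.
const Ajv = require('ajv');
const fs = require('fs');

const ajv = new Ajv({ allErrors: true });

// Register every schema in the profile so Event.json's $ref for "location"
// (a Place) can be resolved during validation.
for (const file of ['Event.json', 'Place.json']) {
  const schema = JSON.parse(fs.readFileSync(`core/${file}`, 'utf8'));
  ajv.addSchema(schema, file);
}

const validate = ajv.getSchema('Event.json');
const eventData = {
  '@type': 'Event',
  name: 'Tuesday Netball',
  location: { '@type': 'Place' } // deliberately missing required Place fields
};

if (!validate(eventData)) {
  // Errors for nested objects come back with dataPaths like ".location.address",
  // so the sub-objects no longer need to be split out and validated separately.
  console.log(validate.errors);
}
```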
The question that comes to mind (and pinging @robredpath, as this relates to a conversation we had earlier) is whether the splitting strategy allows finer-grained and more accurate reporting of conformance problems. The difficulty is that, if the JSON Schema is treated as a monolith, it's hard to get a precise sense of how closely a data item cleaves to the Schema: complete absence of a
If an Event's location is totally missing:
If an Event has a location but 2 attributes are missing:
Okay, great. Then let's continue with the $refs ....
The crucial realisation that I had about this during the course of the week was that if we're ultimately aiming for a % figure, then we need two input numbers:
Because of arrays and nesting, both of these are non-trivial to calculate. One of the reasons this is absent from the OCDS Data Review Tool is exactly this issue!

In order to calculate A, you need the schema and the data, and you need to carry out some logic that's beyond what standard JSON Schema manipulation libraries offer. I think the logic you'd need is to walk the schema and, every time you hit an array, count its expected fields once for each entry the array has in the data. The best thing about recursion is recursion, and all that.

In order to calculate B, we'd need to go through the normal validation process, except that we wouldn't discard the info from when a field does validate. Because of nesting, we can't just do A minus errors, because not enough errors get generated.

None of this is necessarily hard, although we're rapidly running out of time to implement it, and it's very prone to edge-case complexity. I'm also not sure that it's what we want, if we're trying to use the scores to motivate people. Especially for arrays where the individual objects have lots of fields, publishers might just opt not to publish them at all, rather than publish them in an incomplete way. Presumably that's not what we want.

We could look at programmatic solutions that don't involve validation. For example, we might want to look at parsing the JSON Schema and producing a score based on the top-level (well, second-level) objects only. Separately, we might assess particular types - for example, locations.

I'll have a chat with @odscjames and a couple of the other team members here and see if there's anything that jumps out at us.
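To illustrate the kind of schema walk meant for A, here is a rough sketch; handling of $ref, oneOf, etc. is deliberately omitted, and the traversal rules are assumptions rather than settled logic:

```js
// Rough sketch of calculating "A": walk the schema alongside the data, and
// every time an array is hit, count the item schema's fields once per entry
// that actually appears in the data. $ref, oneOf, anyOf etc. are not handled.
function countExpectedFields(schema, data) {
  if (!schema || typeof schema !== 'object') return 0;

  if (schema.type === 'object' && schema.properties) {
    let count = 0;
    for (const [key, propSchema] of Object.entries(schema.properties)) {
      count += 1; // this field could have been present
      if (data && typeof data === 'object' && key in data) {
        count += countExpectedFields(propSchema, data[key]);
      }
    }
    return count;
  }

  if (schema.type === 'array' && schema.items) {
    const entries = Array.isArray(data) ? data : [];
    // Count the item schema's expected fields once per entry in the data.
    return entries.reduce(
      (sum, entry) => sum + countExpectedFields(schema.items, entry),
      0
    );
  }

  return 0;
}
```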
Less of a programmatic solution than a programmatic hack, but ....
?
@thill-odi I've just talked with @odscjames and @Bjwebb and we like your hack! Just to make the caveats explicit:
We're thinking that we should apply the logic at a per-item level and then take the mean across a feed to establish a score. That way, if someone's riding off the back of good old data, their score will go down over time, and if someone makes just their newer data better, their score will go up over time. Does that work for you?
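For concreteness, the per-item-then-mean calculation might look something like this minimal sketch (scoreItem is a placeholder for whatever per-item logic we settle on):

```js
// Sketch: score each item in a feed individually, then take the mean as the
// feed's score. scoreItem() stands in for the agreed per-item logic and is
// assumed to return a value between 0 and 1.
function scoreFeed(items, scoreItem) {
  if (items.length === 0) return 0;
  const total = items.reduce((sum, item) => sum + scoreItem(item), 0);
  return total / items.length;
}
```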
I've just been looking at this, specifically:
And this is very difficult due to the use of oneOf and anyOf in the schema files, e.g. Event->"organiser" can be a Person or an Organisation. So when generating the list of fields we expect for an Event, which do we include? And if the event has no "organiser" key at all, do we punish its score by the number of required fields in the Person type, or the Organisation type? We'd at least have to move to generating the list of possible fields each time for each piece of input data. I'm not sure yet whether it's possible to do that well; I think I would just have to try it and see. Or give this more thought generally.
If all of the oneOfs are quite a long way down in the schema, then maybe we could just ignore them? Unfortunately, they're probably quite important. We could generate field lists for every permutation and choose a feed's best score? That's a lot more computing to do, but it would be the closest we can get to fair in almost every case. It wouldn't catch oneOf violations, but presumably they'd fail validation at the source stage anyway.
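For illustration, a rough sketch of the "every permutation" idea: expand each oneOf into one concrete schema per option, then score the data against every expansion and keep the best result. The merge semantics here are simplified, and the scoring function is a placeholder:

```js
// Sketch: flatten oneOf choices into a list of concrete schemas. Merging an
// option into its parent via object spread is a simplification of JSON
// Schema semantics; anyOf and arrays would need the same treatment.
function expandOneOfs(schema) {
  if (Array.isArray(schema.oneOf)) {
    return schema.oneOf.flatMap(option =>
      expandOneOfs({ ...schema, ...option, oneOf: undefined })
    );
  }
  if (schema.type === 'object' && schema.properties) {
    // Expand each property's permutations and take the cartesian product.
    let variants = [schema];
    for (const [key, propSchema] of Object.entries(schema.properties)) {
      const propVariants = expandOneOfs(propSchema);
      variants = variants.flatMap(v =>
        propVariants.map(pv => ({
          ...v,
          properties: { ...v.properties, [key]: pv }
        }))
      );
    }
    return variants;
  }
  return [schema];
}

// scoreAgainst(schema, data) is a placeholder for the field-list comparison.
function bestScore(schema, data, scoreAgainst) {
  return Math.max(...expandOneOfs(schema).map(s => scoreAgainst(s, data)));
}
```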
One suggestion is to maybe avoid this issue entirely by re-framing it and just not having a score. Instead, we just JSON Schema validate the data, and then for all the items of data in a feed we group all the errors and present a report:
Then there is no direct number to compare easily between publishers, which isn't what we want to encourage anyway - we want a publisher to look at their feed and improve it for their own feed's sake. And if they add something and the list gets longer, that's not as explicitly bad a signal as a score getting worse. This also means that when anyone adds other validation rules to the schema later (e.g. content rules like enums or lengths), there won't be any additional code to write - those errors will show up nicely in the report straight away.
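To make the report idea concrete, here is a minimal sketch of grouping errors across a feed, assuming ajv-style error objects; the function name and report shape are hypothetical:

```js
// Sketch: validate every item in a feed, group identical errors and count
// how many times each occurs, and present that as the report instead of a
// score. Assumes ajv-style error objects ({ dataPath, message }).
function buildFeedReport(items, validate) {
  const report = new Map();
  for (const item of items) {
    if (validate(item)) continue;
    for (const err of validate.errors) {
      const key = `${err.dataPath} ${err.message}`;
      report.set(key, (report.get(key) || 0) + 1);
    }
  }
  // e.g. Map { ".location should have required property 'name'" => 134, ... }
  return report;
}
```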
Okay, I'm on a hack spree ... Amending the above proposal so that it includes
For example, supposing we had a data model consisting only of an
Then we again count the intersection of
Reviewing the above, I'm not sure how much less computationally expensive this is than @robredpath's 'just generate all the arrays and compare' approach. But hey, it uses regexes.
@thill-odi so, the issue with the algorithm you proposed is that oneOf is also used for fields with differing combinations rather than differing types, which means they wouldn't flatten out well (see https://github.com/openactive/conformance-profiles/blob/master/core/Event.json#L109 for example). Given that time is starting to get tight, how about this:
This is simple, and means that scores never go backwards if your data improves. It does mean that the effort required to get a better score can be disproportionate, but that's the trade-off! It also provides some of the scaffolding for further levels, weightings, etc. in future - if you wanted to do that at some point. We'd also then be able to get access to the validation errors relatively easily, so that we could give feedback on what people need to do to improve. It should be pretty straightforward to link to that from the status page, at least in a machine-friendly way.
Also ajv for actually doing the work #33
We previously had an email conversation about how to do these, and I've started working out what that looks like.
Each Data Profile exists in its own directory:
For each type of data, there is a separate JSON Schema file:
There is a helper function, apply_data_profile(). You pass it the actual data and the profile name.
The process is:
So here's an example in a test:
https://github.com/openactive/conformance-services/blob/data-profiles/test/test-util-data-profile.js#L18
results.results[1].dataPath == "location"
to reflect its place in the original data.
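For illustration, a rough sketch of how apply_data_profile() might look under the splitting approach described above; the schema-loading path, the use of @type to pick a sub-schema, and the error shape are assumptions rather than the actual implementation:

```js
// Sketch only: split known sub-objects (e.g. "location") out of the item,
// validate each part against the JSON Schema for its type within the named
// profile, and prefix each error's dataPath so it reflects the value's place
// in the original data.
const Ajv = require('ajv');
const ajv = new Ajv({ allErrors: true });

function loadProfileSchema(profileName, typeName) {
  // Hypothetical layout: one JSON Schema file per type, per profile directory.
  return require(`../data-profiles/${profileName}/${typeName}.json`);
}

function apply_data_profile(data, profileName) {
  const results = [];

  // Validate the root object against its own type's schema.
  validatePart(data, data['@type'] || 'Event', '', profileName, results);

  // Split out sub-objects and validate them against their own type's schema.
  for (const [key, value] of Object.entries(data)) {
    if (value && typeof value === 'object' && value['@type']) {
      validatePart(value, value['@type'], key, profileName, results);
    }
  }
  return { results };
}

function validatePart(value, typeName, pathPrefix, profileName, results) {
  const validate = ajv.compile(loadProfileSchema(profileName, typeName));
  if (!validate(value)) {
    for (const err of validate.errors) {
      // Rewrite dataPath so errors point back into the original item.
      results.push({ ...err, dataPath: pathPrefix + err.dataPath });
    }
  }
}
```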
Qs:

Does this method sound good to you?
Is OpenActive happy to work on the actual JSON Schema files while we work on the mechanism described in this ticket to use them?
Would O.A. like the JSON Schema files to exist in a new repository and be included in this one via a git submodule, so that they can be reused elsewhere easily?
(I have one possible follow up Q, but let's see about these ones first)