Specifying Data Profiles #33
In response to the questions above:
Great, thanks.
Great - do you want to start a new repository and put all the schema files in a directory "data-profiles/NAME/" there?
@thill-odi I've been working with the core schema in https://github.com/openactive/conformance-profiles and have come across a wrinkle. In the email chat and this issue we agreed that I'd split any sub-types of data out and validate each type against its relevant type schema separately. However, you've used refs in the core profile. E.g. in https://github.com/openactive/conformance-profiles/blob/master/core/Event.json#L61 the Event object has a "location" field and you've put a $ref there. So in the code I just wrote, I start with an Event object, split out the "location" Place object and validate them separately. But now, when validating the root Event object, I get an incorrect validation error of:
I don't think we can do both - either we use $ref links or I do this splitting thing. I can see an argument for choosing the refs way: you can mark a "location" field as required for an Event (which indeed you have done - https://github.com/openactive/conformance-profiles/blob/master/core/Event.json#L13 ), and that's not possible with the splitting way. So shall we just allow $refs, and I'll stop separating out sub-objects?
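For reference, a minimal sketch of how the $ref approach might be wired up with ajv (the library referenced in the linked commits below); the file names, schema registration keys and sample data are assumptions for illustration, not the repository's actual layout:

```js
// Minimal sketch (not the project's actual code): register the profile's
// schema files with ajv so that $refs between them resolve, then validate
// the root Event object in a single pass.
const Ajv = require('ajv');
const fs = require('fs');

const ajv = new Ajv({ allErrors: true });

// Register every schema in the profile so Event.json's $ref for "location"
// (a Place) can be resolved during validation.
for (const file of ['Event.json', 'Place.json']) {
  const schema = JSON.parse(fs.readFileSync(`core/${file}`, 'utf8'));
  ajv.addSchema(schema, file);
}

const validate = ajv.getSchema('Event.json');
const eventData = {
  '@type': 'Event',
  name: 'Tuesday Netball',
  location: { '@type': 'Place' } // deliberately missing required Place fields
};

if (!validate(eventData)) {
  // Errors for nested objects come back with dataPaths like ".location.address",
  // so the sub-objects no longer need to be split out and validated separately.
  console.log(validate.errors);
}
```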
The question that comes to mind (and pinging @robredpath, as this relates to a conversation we had earlier) is whether the splitting strategy allows finer-grained and more accurate reporting of conformance problems. The difficulty is that, if the JSON Schema is treated as a monolith, it's hard to get a precise sense of how closely a data item cleaves to the Schema: complete absence of a
If an Event's location is totally missing:
If an Event has a location but 2 attributes are missing:
Okay, great. Then let's continue with the $refs ....
The crucial realisation that I had about this during the course of the week was that if we're ultimately aiming for a % figure, then we need two input numbers:
Because of arrays and nesting, both of these are non-trivial to calculate. One of the reasons this is absent from the OCDS Data Review Tool is exactly this issue!

In order to calculate A, you need the schema and the data, and you need to carry out some logic that's beyond what standard JSON Schema manipulation libraries offer. I think the logic you'd need is to walk the schema and, every time you hit an array, count its expected fields once for each entry the array has in the data. The best thing about recursion is recursion, and all that.

In order to calculate B, we'd need to go through the normal validation process, except that we wouldn't discard the info from when a field does validate. Because of nesting, we can't just do A minus errors, because not enough errors get generated.

None of this is necessarily hard, although we're rapidly running out of time to implement it, and it's very prone to edge-case complexity. I'm also not sure that it's what we want, if we're trying to use the scores to motivate people. Especially for arrays where the individual objects have lots of fields, publishers might just opt not to publish them at all, rather than publish them in an incomplete way. Presumably that's not what we want.

We could look at programmatic solutions that don't involve validation. For example, we might want to look at parsing the JSON Schema and producing a score based on the top-level (well, second-level) objects only. Separately, we might assess particular types - for example, locations.

I'll have a chat with @odscjames and a couple of the other team members here and see if there's anything that jumps out at us.
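To illustrate the kind of schema walk meant for A, here is a rough sketch; handling of $ref, oneOf, etc. is deliberately omitted, and the traversal rules are assumptions rather than settled logic:

```js
// Rough sketch of calculating "A": walk the schema alongside the data, and
// every time an array is hit, count the item schema's fields once per entry
// that actually appears in the data. $ref, oneOf, anyOf etc. are not handled.
function countExpectedFields(schema, data) {
  if (!schema || typeof schema !== 'object') return 0;

  if (schema.type === 'object' && schema.properties) {
    let count = 0;
    for (const [key, propSchema] of Object.entries(schema.properties)) {
      count += 1; // this field could have been present
      if (data && typeof data === 'object' && key in data) {
        count += countExpectedFields(propSchema, data[key]);
      }
    }
    return count;
  }

  if (schema.type === 'array' && schema.items) {
    const entries = Array.isArray(data) ? data : [];
    // Count the item schema's expected fields once per entry in the data.
    return entries.reduce(
      (sum, entry) => sum + countExpectedFields(schema.items, entry),
      0
    );
  }

  return 0;
}
```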
Less of a programmatic solution than a programmatic hack, but ....
?
@thill-odi I've just talked with @odscjames and @Bjwebb and we like your hack! Just to make the caveats explicit:
We're thinking that we should apply the logic at a per-item level and then take the mean across a feed to establish a score. That way, if someone's riding off the back of good old data, their score will go down over time, and if someone makes just their newer data better, their score will go up over time. Does that work for you?
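For concreteness, the per-item-then-mean calculation might look something like this minimal sketch (scoreItem is a placeholder for whatever per-item logic we settle on):

```js
// Sketch: score each item in a feed individually, then take the mean as the
// feed's score. scoreItem() stands in for the agreed per-item logic and is
// assumed to return a value between 0 and 1.
function scoreFeed(items, scoreItem) {
  if (items.length === 0) return 0;
  const total = items.reduce((sum, item) => sum + scoreItem(item), 0);
  return total / items.length;
}
```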
I've just been looking at this, specifically:
And this is very difficult due to the use of oneOf and anyOf in the schema files, e.g. Event->"organiser" can be a Person or an Organisation. So when generating the list of fields we expect for an Event, which do we include? And if the event has no "organiser" key at all, do we punish its score by the number of required fields in the Person type, or the Organisation type? We'd at least have to move to generating the list of possible fields each time for each piece of input data. I'm not sure yet whether it's possible to do that well; I think I would just have to try it and see. Or give this more thought generally.
If all of the oneOfs are quite a long way down in the schema, then maybe we could just ignore them? Unfortunately, they're probably quite important. We could generate field lists for every permutation and choose a feed's best score? That's a lot more computing to do, but it would be the closest we can get to fair in almost every case. It wouldn't catch oneOf violations, but presumably they'd fail validation at the source stage anyway.
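For illustration, a rough sketch of the "every permutation" idea: expand each oneOf into one concrete schema per option, then score the data against every expansion and keep the best result. The merge semantics here are simplified, and the scoring function is a placeholder:

```js
// Sketch: flatten oneOf choices into a list of concrete schemas. Merging an
// option into its parent via object spread is a simplification of JSON
// Schema semantics; anyOf and arrays would need the same treatment.
function expandOneOfs(schema) {
  if (Array.isArray(schema.oneOf)) {
    return schema.oneOf.flatMap(option =>
      expandOneOfs({ ...schema, ...option, oneOf: undefined })
    );
  }
  if (schema.type === 'object' && schema.properties) {
    // Expand each property's permutations and take the cartesian product.
    let variants = [schema];
    for (const [key, propSchema] of Object.entries(schema.properties)) {
      const propVariants = expandOneOfs(propSchema);
      variants = variants.flatMap(v =>
        propVariants.map(pv => ({
          ...v,
          properties: { ...v.properties, [key]: pv }
        }))
      );
    }
    return variants;
  }
  return [schema];
}

// scoreAgainst(schema, data) is a placeholder for the field-list comparison.
function bestScore(schema, data, scoreAgainst) {
  return Math.max(...expandOneOfs(schema).map(s => scoreAgainst(s, data)));
}
```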
One suggestion is to maybe avoid this issue entirely by re-framing it and just not having a score. Instead, we just JSON Schema validate the data, and then for all the items of data in a feed we group all the errors and present a report:
Then there is no direct number to compare easily between publishers, which isn't what we want to encourage anyway - we want a publisher to look at their feed and improve it for their own feed's sake. And if they add something and the list gets longer, that's not as explicitly bad a signal as a score getting worse. This also means that when anyone adds other validation rules to the schema later (e.g. content rules like enums or lengths), there won't be any additional code to write - those errors will show up nicely in the report straight away.
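To make the report idea concrete, here is a minimal sketch of grouping errors across a feed, assuming ajv-style error objects; the function name and report shape are hypothetical:

```js
// Sketch: validate every item in a feed, group identical errors and count
// how many times each occurs, and present that as the report instead of a
// score. Assumes ajv-style error objects ({ dataPath, message }).
function buildFeedReport(items, validate) {
  const report = new Map();
  for (const item of items) {
    if (validate(item)) continue;
    for (const err of validate.errors) {
      const key = `${err.dataPath} ${err.message}`;
      report.set(key, (report.get(key) || 0) + 1);
    }
  }
  // e.g. Map { ".location should have required property 'name'" => 134, ... }
  return report;
}
```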
Okay, I'm on a hack spree ... Amending the above proposal so that it includes
For example, supposing we had a data model consisting only of an
Then we again count the intersection of
Reviewing the above, I'm not sure how much less computationally expensive this is than @robredpath's 'just generate all the arrays and compare' approach. But hey, it uses regexes.
@thill-odi so, the issue with the algorithm you proposed is that oneOf is also used for fields with differing combinations rather than differing types, which means they wouldn't flatten out well (see https://github.com/openactive/conformance-profiles/blob/master/core/Event.json#L109 for example). Given that time is starting to get tight, how about this:
This is simple, and means that scores never go backwards if your data improves. It does mean that the effort required to get a better score can be disproportionate, but that's the trade-off! It also provides some of the scaffolding for further levels, weightings, etc. in future - if you wanted to do that at some point. We'd also then be able to get access to the validation errors relatively easily, so that we could give feedback on what people need to do to improve. It should be pretty straightforward to link to that from the status page, at least in a machine-friendly way.
Also ajv for actually doing the work #33
We previously had an email conversation about how to do these, and I've started working out what that looks like.
Each Data Profile exists in its own directory:
For each type of data, there is a separate JSON Schema file:
There is a helper function, apply_data_profile(). You pass it the actual data and the profile name.
The process is:
So here's an example in a test:
https://github.com/openactive/conformance-services/blob/data-profiles/test/test-util-data-profile.js#L18
results.results[1].dataPath == "location"
to reflect its place in the original data.
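For illustration, a rough sketch of how apply_data_profile() might look under the splitting approach described above; the schema-loading path, the use of @type to pick a sub-schema, and the error shape are assumptions rather than the actual implementation:

```js
// Sketch only: split known sub-objects (e.g. "location") out of the item,
// validate each part against the JSON Schema for its type within the named
// profile, and prefix each error's dataPath so it reflects the value's place
// in the original data.
const Ajv = require('ajv');
const ajv = new Ajv({ allErrors: true });

function loadProfileSchema(profileName, typeName) {
  // Hypothetical layout: one JSON Schema file per type, per profile directory.
  return require(`../data-profiles/${profileName}/${typeName}.json`);
}

function apply_data_profile(data, profileName) {
  const results = [];

  // Validate the root object against its own type's schema.
  validatePart(data, data['@type'] || 'Event', '', profileName, results);

  // Split out sub-objects and validate them against their own type's schema.
  for (const [key, value] of Object.entries(data)) {
    if (value && typeof value === 'object' && value['@type']) {
      validatePart(value, value['@type'], key, profileName, results);
    }
  }
  return { results };
}

function validatePart(value, typeName, pathPrefix, profileName, results) {
  const validate = ajv.compile(loadProfileSchema(profileName, typeName));
  if (!validate(value)) {
    for (const err of validate.errors) {
      // Rewrite dataPath so errors point back into the original item.
      results.push({ ...err, dataPath: pathPrefix + err.dataPath });
    }
  }
}
```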
Qs:

Does this method sound good to you?
Is OpenActive happy to work on the actual JSON Schema files while we work on the mechanism described in this ticket to use them?
Would O.A. like the JSON Schema files to exist in a new repository and be included in this one via a git submodule, so that they can be reused elsewhere easily?
(I have one possible follow up Q, but let's see about these ones first)