Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Specifying Data Profiles #33

Open
ghost opened this issue Jun 24, 2020 · 18 comments
Open

Specifying Data Profiles #33

ghost opened this issue Jun 24, 2020 · 18 comments
Assignees

Comments

@ghost
Copy link

ghost commented Jun 24, 2020

We previously had an email conversation about how to do these, and I've started working out how that looks like.

Each Data Profile exists in it's own directory:

For each type of data, there is a separate JSON Schema file:

There is a helper function, apply_data_profile(). You pass the actual data and the profile name.

The process is:

  • Break apart data by separating out all sub types. EG, if you are given Event data with a location key and that location key is a Place object, separate that out
  • For the main data type you are given, run against that's types JSON Schema and record all errors
  • For each sub type you have separated out, run against that's types JSON Schema and record all errors
  • Put all errors in one list, making sure the data path information is corrected as you do so

So here's an example in a test:

https://github.com/openactive/conformance-services/blob/data-profiles/test/test-util-data-profile.js#L18

Qs:

Does this method sound good to you?

Is Open Active happy to work on the actual JSON Schema files while we work on the mechanism to use them described in this ticket?

Would O.A. like the JSON Schema files to exist in a new repository and included in this one by a git submodule, so that they can be reused elsewhere easily?

(I have one possible follow up Q, but let's see about these ones first)

@ghost ghost assigned thill-odi and nickevansuk Jun 25, 2020
@robredpath
Copy link
Collaborator

@thill-odi
Copy link
Contributor

In response to the questions above:

  1. I'll write up the JSON Schema files in accordance with the pattern already established here; but
  2. Yes, having this in a separate repo and pulled in as a submodule is a good idea

@ghost
Copy link
Author

ghost commented Jul 10, 2020

I'll write up the JSON Schema files in accordance with the pattern already established here; but

Great, thanks.

Yes, having this in a separate repo and pulled in as a submodule is a good idea

Great - do you want to start a new repository and put all the schema files in a directory "data-profiles/NAME/" there?

@robredpath
Copy link
Collaborator

@ghost
Copy link
Author

ghost commented Jul 16, 2020

@thill-odi I've been working with the core schema in https://github.com/openactive/conformance-profiles and come across a winkle.

In the email chat and this issue we agreed that I'd split any sub types of data out and validate each type against it's relevant type schema separately.

However, you've used refs in the core profile. EG: In https://github.com/openactive/conformance-profiles/blob/master/core/Event.json#L61 the Event object has a "location" field and you've put a $ref

So in the code I just wrote I start with an Event object, split out the "location" Place object and validate them separately! But now when validating the root Event object I get a incorrect validation error of:

  {
    keyword: 'required',
    dataPath: '',
    schemaPath: '#/required',
    params: { missingProperty: 'location' },
    message: "should have required property 'location'"
  },

I don't think we can do both - either we use ref links or I do this splitting thing.

I can see an argument to choosing the refs way; you can now mark a "location" field required for an Event (which indeed you have done - https://github.com/openactive/conformance-profiles/blob/master/core/Event.json#L13 ) and that's not possible with the splitting way.

So shall we just allow Refs, and I'll stop separating out sub objects?

@thill-odi
Copy link
Contributor

The question that comes to mind (and pinging @robredpath, as this relates to a conversation we had earlier) is whether the splitting strategy allows finer-grained and more accurate reporting of conformance problems.

The difficulty is that, if the JSON Schema is treated as a monolith, it's hard to get a precise sense of how closely a data item cleaves to the Schema: complete absence of a Location will register as a single error, while its presence, but with two attributes missing, will register as two. Does the splitting strategy solve or mitigate this?

@ghost
Copy link
Author

ghost commented Jul 16, 2020

If Event with location totally missing:

  • splitting strategy - no errors recorded. We never split data out and we don't know it should be there.
  • ref - one error recorded.

If Event with location but 2 attributes missing:

  • splitting strategy - 2 errors recorded
  • ref - 2 errors recorded

@thill-odi
Copy link
Contributor

Okay, great. Then let's continue with the $refs ....

@robredpath
Copy link
Collaborator

The crucial realisation that I had about this during the course of the week was that if we're ultimately aiming for a % figure, then we need two input numbers:

  • The total number of correct fields that should have been in a data item in order to match the profile (lets call this variable A)
  • The number of correct fields that actually did exist (let's call this B)

Because of arrays and nesting, both of these are non-trivial to calculate. One of the reasons this is absent from the OCDS Data Review Tool is exactly this issue!

In order to calculate A, you need to have the schema and the data, and you need to carry out some logic that's beyond what standard JSON Schema manipulation libraries offer. I think the logic you'd need would be to run through the schema, and every time you hit an array, count its contents the number of times that the array has entries in the data. The best thing about recursion is recursion, and all that.

In order to calculate B, we'd need to go through the normal validation process except that we wouldn't discard info from when a field does validate. Because of nesting, we can't just do A minus errors, because we don't get enough errors generated.

None of this is necessarily hard, although we're rapidly running out of time to implement it, and it's very prone to edge case complexity.

I'm also not sure that it's what we want, if we're trying to use the scores to motivate people. Especially for arrays where the individual objects have lots of fields, publishers might just opt to not publish them at all, rather than publish them in an incomplete way. Presumably that's not what we want.

We could look at programmatic solutions that don't involve validation. For example, we might want to look at parsing the JSON Schema and producing a score based on the top-level (well, second-level) objects only. Separately, we might assess particular types - for example, locations.

I'll have a chat with @odscjames and a couple of the other team members here and see if there's anything that jumps out to us.

@thill-odi
Copy link
Contributor

Less of a programmatic solution than a programmatic hack, but ....

  1. Turn JSON Schema files into a flat array of strings representing a hierarchy with points (e.g. ["event.location.postaladdress.postalcode", "event.location.postaladdress.streetaddress" ....].
  2. Flatten all attributes in each data item that have a non-null value similarly
  3. Count intersection of Array 1 with Array 2.
  4. Intersection / Array1.length = % profile satisfied

?

@robredpath
Copy link
Collaborator

@thill-odi I've just talked with @odscjames and @Bjwebb and we like your hack!

Just to make the caveats explicit:

  • If there's an array then it'll consider any field that exists in any array item to be present for all array items
  • This won't directly allow expansion into further checking (eg regex) but we've got ideas for that, when the time comes

We're thinking that we should apply the logic at a per-item level and then determine the mean across a feed to establish a score. That way, if someone's riding off the back of good, old data their score will go down over time, and if someone makes just their newer data better, then their score will go up over time.

Does that work for you?

@ghost
Copy link
Author

ghost commented Jul 17, 2020

I've just been looking at this, specifically:

Turn JSON Schema files into a flat array of strings representing a hierarchy with points (e.g. ["event.location.postaladdress.postalcode", "event.location.postaladdress.streetaddress" ....].

And this is very difficult due to the use of oneOf and anyOf in the schema files. eg Event->"organiser" can be a Person or a Organisation. So when generating the list of fields we expect for an event, which do we include? And if the event has no "organiser" key at all, do we punish it's score by how many required fields are in the Person type, or the Organisation type?

We at least have to move to generating the list of possible fields each time for each piece of input data. I'm not sure yet if it's possible to do that well; I'll think I would just have to try that and see. Or give this more thought generally.

@robredpath
Copy link
Collaborator

If all of the oneOfs are quite a long way down in the schema, then maybe we could just ignore them? Unfortunately, they're probably quite important.

We could generate field lists for every permutation and choose a feed's best score? That's a lot more computing to do, but would the closest we can get to fair in almost every case. It wouldn't catch oneOf violations, but presumably they'd fail validation at the source stage anyway.

@ghost
Copy link
Author

ghost commented Jul 17, 2020

One suggestion is to maybe avoid this issue entirely by re-framing this and just not having a score? Instead if we just JSON Schema validate data, and then for all the items of data in a feed group all the errors and instead present a report:

Publisher X Accessiblitiy Profile:

Field Missing - Example (first 3)
/x/z    -      show, show, show
/z      -       show, show

Then there is no direct number to compare easily between publishers, which isn't what we want to encourage anyway - we want a publisher to look at their feed and improve it for their own feed's sake. And if they add something and the list gets longer, that's not as explicitly a bad a signal as a score number going up.

This also means that when anyone adds other validation rules to the schema later (eg content rules like enums or length), there won't be any additional code to write - those errors will show up nicely in the report straight away.

@thill-odi
Copy link
Contributor

thill-odi commented Jul 17, 2020

Okay, I'm on a hack spree ...

Amending the above proposal so that it includes @type information where appropriate:

  1. Turn JSON Schema files into a flat array of strings representing a hierarchy with points (e.g. ["event.location.postaladdress.postalcode", "event.location.postaladdress.streetaddress" ....].
    1a. In cases where anyOf is found, all type possibilities are generated with type information postpended all the way down the tree. e.g. Note that (a) For the sake of simplicity all non-complex objects are represented with the suffix @simple, and (b) we do not record @type as a separate datapoint to avoid double-counting). E.g.
    ["event.level@simple", "event.level@Concept", "[email protected]", "[email protected]", "event.organizer@Organization.postalAddress@PostalAddress" ...]
    Let's call this the schema_array
  2. Flatten all attributes in each data item that have a non-null value similarly. Let's call this the data_item_array.
    2a. Foreach value in data_item_array that ends with @[\w]+\ (that is to say, that really only indicates type)
    2ai. Loop through a copy ofschema_array and
    2aii. Boot out any values that do match prior to the final @ sign, but not the sequence of @[\w]+ after it
    This yields a third array, pruned_schema_array
  3. Count intersection of data_item_array with pruned_schema_array.
  4. Intersection / pruned_schema_array.length = % profile satisfied

For example, supposing we had a data model consisting only of an Event with three attributes: a name (string), a level (string or skos:Concept), and an organiser (schema:Person or schema:Organization)

schema_array: ["event.name", "event.level@simple", "event.level@Concept", "[email protected]", "event.organiser@Person", "[email protected]", "[email protected]", "event.organiser@Place", "[email protected]@PostalAddress", "[email protected]@PostalAddress.streetAddress" ...]

data_item_array: ["event.name", "event.level@simple", ""event.organiser@Person", "[email protected]"]

pruned_schema_array: [ "event.name", "event.level@simple", ""event.organiser@Person", "[email protected]", "[email protected]"]

Then we again count the intersection of pruned_schema_array and data_item_array to get a raw score, over the total length of pruned_schema_array.

@thill-odi
Copy link
Contributor

Reviewing the above, I'm not sure how much less computationally-expensive this is than @robredpath's 'just generate all the arrays and compare' approach. But hey, it uses regexes.

https://xkcd.com/1171/

@robredpath
Copy link
Collaborator

@thill-odi so, the issue with the algorithm you proposed is that oneOf is also used when fields of differing combination rather than differing types, which means that they wouldn't flatten out well. (see https://github.com/openactive/conformance-profiles/blob/master/core/Event.json#L109 for example).

Given that time is starting to get tight, how about this:

  • We take the required array from each object (value A)
  • We count the number of those fields which appear in the errors list from JSON Schema validation (value B). This would be the 'top-level' fields - ie, if you have an array then you need to get the array totally right for it to count.
  • B / A = your score

This is simple, and means that scores never go backward if your data improves. It does mean that the effort required to get a better score can be disproportionate, but that's the trade-off! It also provides some of the scaffolding for future further levels, weightings, etc - if you wanted to do that at some point.

We'd also then be able to get access to the validation errors relatively easily, so that we could give feedback on what people need to do to improve. It should be pretty straightforward to link to that from the status page, at least in a machine-friendly way.

ghost pushed a commit that referenced this issue Jul 22, 2020
Also ajv for actually doing the work

#33
ghost pushed a commit that referenced this issue Jul 22, 2020
Also ajv for actually doing the work

#33
ghost pushed a commit that referenced this issue Jul 22, 2020
@robredpath
Copy link
Collaborator

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants