
After validating normalised data, how should we produce aggregate stats for the status site? #14

Open
ghost opened this issue Jun 19, 2020 · 10 comments
@ghost

ghost commented Jun 19, 2020

We put normalised data through https://github.com/openactive/data-model-validator and store the results in the database.

Then, how should we produce aggregate stats for the status site for each publisher?

In openactive/data-model-validator#349 I noted there can be different values of "severity" for instance - should we filter some of those out?

Ultimately, what does the user want to see on the status page when considering validation stats?

Thanks

ghost assigned ghost and thill-odi on Jun 19, 2020
@thill-odi
Contributor

We'll go for four categories: 'Conformant', 'Core', 'Accessibility', and 'Social Prescribing'. The profiles for each of these consist essentially of a list of attributes; testing for these will involve

(i) establishing whether an attribute is populated, by
(ii) a value of the correct datatype (or matching a particular regex, if we're feeling fancy).

We can check this using JSON-LD.

The end output will be a percentage value for the number of records that satisfy these conditions. In other words, we'll want to display four columns on the status page. If there are 100 records, and:

  • all of them possess all required attributes and these attributes are correctly populated
  • 80 of them possess all the 'core' attributes, and these are correctly populated
  • 60 of them possess all the correctly-populated 'social prescribing' attributes
  • 3 of them possess correctly-populated accessibility attributes

... then we should end up with a series of columns next to the dataset link with '100', '80', '60', '3'

I don't think there's any value in weighting particular attributes or counting partial satisfaction of the profiles. That way lies madness.
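
A minimal sketch of how those checks and percentages might be computed, assuming each profile is simply a list of required attributes with an expected type or pattern. The attribute names and profile contents below are illustrative only, not the real profile definitions:

```js
// Illustrative profile definitions: each profile is a list of required attributes.
const profiles = {
  core: [
    { attribute: 'name', type: 'string' },
    { attribute: 'startDate', type: 'string', pattern: /^\d{4}-\d{2}-\d{2}/ },
  ],
  // accessibility, socialPrescribing, conformant ... defined the same way
};

// (i) is the attribute populated, and (ii) is it populated by a value of the
// correct datatype (optionally matching a regex)?
function satisfiesProfile(record, profile) {
  return profile.every(({ attribute, type, pattern }) => {
    const value = record[attribute];
    if (value === undefined || value === null) return false;
    if (typeof value !== type) return false;
    return pattern ? pattern.test(value) : true;
  });
}

// Percentage of records satisfying each profile, e.g. { core: 80, socialPrescribing: 60 }
function profilePercentages(records) {
  const stats = {};
  for (const [name, profile] of Object.entries(profiles)) {
    const passing = records.filter((record) => satisfiesProfile(record, profile)).length;
    stats[name] = records.length ? Math.round((passing / records.length) * 100) : 0;
  }
  return stats;
}
```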

@thill-odi
Contributor

thill-odi commented Jun 22, 2020

Sorry, have just realised after discussion with @nickevansuk that this really only deals with items after normalisation. Stats should ideally also be kept for items failing validation prior to normalisation - again, expressed as a percentage, and with warnings left out of the count.

In an ideal world, a list of the individual items failing validation would also be kept and linked from the validation page as an aid to data users.
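
For illustration only, the kind of per-feed record this could produce (the field names here are made up, not a defined schema):

```js
// Hypothetical shape of the stats kept per feed: a percentage of items failing
// validation before normalisation, plus the individual failing items.
const feedValidationStats = {
  feedUrl: 'https://example.org/feeds/sessions', // hypothetical feed URL
  totalItems: 100,                               // non-deleted items checked
  itemsFailing: 20,
  percentFailing: 20,
  failingItems: [{ rpdeId: 'abc-123' /* plus a link to the item in the validator, etc. */ }],
};
```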

@nickevansuk
Contributor

To add some further detail to this, you'll want to filter on something like severity === "failure" so that you only count the validation errors (and ignore the warnings).

See validator integration from test suite for more info: https://github.com/openactive/openactive-test-suite/blob/master/packages/openactive-integration-tests/test/shared-behaviours/validation.js#L94
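
A minimal sketch of that filter, assuming the validator returns an array of result objects that each carry a severity field (as in the linked test-suite integration):

```js
// Given the array of results the validator returns for one item, keep only the
// errors; warning, notice and suggestion severities are ignored.
function onlyFailures(validationResults) {
  return validationResults.filter((result) => result.severity === 'failure');
}

const itemIsConformant = (validationResults) => onlyFailures(validationResults).length === 0;
```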

@nickevansuk
Contributor

nickevansuk commented Jun 22, 2020

As @thill-odi mentions above, "a list of the individual items failing validation would also be kept and linked from the validation page as an aid to data users". One option for this is to construct a specific link to the validator that includes the item in the feed, as follows:
https://validator.openactive.io/?url={url}&rpdeId={rpdeId}

Note that the validator only validates the first 10 non-deleted items in any RPDE page that's provided, so the rpdeId parameter is required to ensure the item in question is validated by the online validator.

When you're using the validator programmatically, RPDE items should be validated individually (i.e. the data of each item should be validated, rather than the whole RPDE page), to ensure that every item is covered.
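
A sketch of per-item validation along those lines. The require path, the validate() call, and the item/state field names are assumptions based on the data-model-validator README and the RPDE spec, not verified against a specific version:

```js
// Hypothetical helper: validate each non-deleted item in an RPDE page individually
// and build a deep link to the online validator for anything that fails.
const { validate } = require('@openactive/data-model-validator'); // assumed import

async function validateFeedPage(pageUrl, rpdePage) {
  const failingItems = [];
  for (const item of rpdePage.items) {
    if (item.state === 'deleted') continue;               // deleted items carry no data
    const results = await validate(item.data);            // validate the item's data, not the whole page
    const failures = results.filter((r) => r.severity === 'failure');
    if (failures.length > 0) {
      failingItems.push({
        rpdeId: item.id,
        failures,
        validatorLink: `https://validator.openactive.io/?url=${encodeURIComponent(pageUrl)}&rpdeId=${encodeURIComponent(item.id)}`,
      });
    }
  }
  return failingItems;
}
```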

@ghost
Author

ghost commented Jun 23, 2020

Thanks for the many replies - this touches on a lot and is interesting. I'm going to move a lot of things out to other issues though, and be strict about keeping this on track with the original question. Hope that's ok.

> We'll go for four categories: 'Conformant', 'Core', 'Accessibility', and 'Social Prescribing'. The profiles for each of these consist essentially of a list of attributes; testing for these will involve

So Conformant is the results from the validation library and the other 3 are data profiles?

Because these come from different mechanisms I'd like to deal with them differently - I'll deal with data profiles in another ticket soon.

> In an ideal world, a list of the individual items failing validation would also be kept and linked from the validation page as an aid to data users.

> one option is that a specific link to the validator which includes the item in the feed can be constructed as follows: https://validator.openactive.io/?url={url}&rpdeId={rpdeId}

Moved to openactive-archive/conformance-status-page#4

> Stats should ideally also be kept for items failing validation prior to normalisation - again, expressed as a percentage.

To be clear:

We should be running the validation library against the raw data we download, the un-normalised data? And calculating stats for that.

So, we take the results and filter ...

Can you be clear which one it is?

Then on the status page, show a % of how many records pass - i.e. have no validation library results against them after filtering.

When calculating the % and counting the total records, should it be total all records or just total of records that aren't deletes? Probably the latter I assume.

@nickevansuk
Contributor

nickevansuk commented Jun 23, 2020

Filter severity === "failure" is the one. By removing "warnings", Tim meant removing warning, notice, and suggestion (which are all classed as "warnings" in the OpenActive Test Suite).

@nickevansuk
Contributor

nickevansuk commented Jun 23, 2020

Also on the other points:

> We should be running the validation library against the raw data we download, the un-normalised data? And calculating stats for that.

Yes, so that publishers can fix issues

> When calculating the % and counting the total records, should it be total all records or just total of records that aren't deletes? Probably the latter I assume.

Suggest ignoring deleted records
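
Pulling those answers together, a sketch of the headline stat: run the validator over the raw (un-normalised) items, skip deleted items, and report the percentage with no 'failure' results. The `validateItem` helper and field names here are hypothetical:

```js
// Percentage of non-deleted raw items with no 'failure' results.
// `validateItem` is a hypothetical wrapper around the data-model-validator.
async function percentConformant(rawItems, validateItem) {
  const liveItems = rawItems.filter((item) => item.state !== 'deleted');
  if (liveItems.length === 0) return 100;

  let passing = 0;
  for (const item of liveItems) {
    const results = await validateItem(item.data);
    if (!results.some((r) => r.severity === 'failure')) passing += 1;
  }
  return Math.round((passing / liveItems.length) * 100);
}
```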

@ghost
Author

ghost commented Jun 23, 2020

Q: In the case where a publisher has multiple feeds (e.g. https://onlinebooking.1610.org.uk/OpenActive/ Slot, FacilityUse, ...) should we calculate the stat per publisher, or per feed for that publisher?

@nickevansuk
Contributor

nickevansuk commented Jun 23, 2020

Not sure how it's presented - I guess it depends on the UI. "% of data published that is conformant" would work per-publisher, but as in openactive-archive/conformance-status-page#4 they need to get to the detail of which feeds have errors (so a % conformance per-feed could be useful?) and example pages/items within the feeds that exhibit the errors.

Ideally we want the headline of every publisher being 100% conformant (though this is unlikely to be the case on day 1 of this tool going live)
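
One way that could look in code: keep the percentage per feed for the drill-down, and derive the publisher headline from the feed totals. A sketch with illustrative field names, not a defined schema:

```js
// feedStats: one entry per feed for a publisher, e.g. { feedUrl, totalItems, passingItems }
function publisherConformance(feedStats) {
  // Per-feed percentages, useful for showing which feeds have errors
  const perFeed = feedStats.map((feed) => ({
    feedUrl: feed.feedUrl,
    percent: feed.totalItems ? Math.round((feed.passingItems / feed.totalItems) * 100) : 100,
  }));

  // Headline "% of data published that is conformant" across all the publisher's feeds
  const totals = feedStats.reduce(
    (acc, feed) => ({ total: acc.total + feed.totalItems, passing: acc.passing + feed.passingItems }),
    { total: 0, passing: 0 }
  );
  const headline = totals.total ? Math.round((totals.passing / totals.total) * 100) : 100;

  return { headline, perFeed };
}
```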
