
After validating normalised data, how should we produce aggregate stats for the status site? #14

Open
ghost opened this issue Jun 19, 2020 · 10 comments
@ghost

ghost commented Jun 19, 2020

We put normalised data through https://github.com/openactive/data-model-validator and store the results in the database.

Then, how should we produce aggregate stats for the status site for each publisher?

In openactive/data-model-validator#349 I noted there can be different values of "severity" for instance - should we filter some of those out?

Ultimately, what does the user want to see on the status page when considering validation stats?

Thanks

ghost assigned ghost and thill-odi on Jun 19, 2020
@thill-odi
Contributor

We'll go for four categories: 'Conformant', 'Core', 'Accessibility', and 'Social Prescribing'. The profiles for each of these consist essentially of a list of attributes; testing for these will involve

(i) establishing whether an attribute is populated, by
(ii) a value of the correct datatype (or matching a particular regex, if we're feeling fancy).

We can check this using JSON-LD.

The end output will be a percentage value for the number of records that satisfy these conditions. In other words, we'll want to display four columns on the status page. If there are 100 records, and:

  • all of them possess all required attributes and these attributes are correctly populated
  • 80 of them possess all the 'core' attributes, and these are correctly populated
  • 60 of them possess all the correctly-populated 'social prescribing' attributes
  • 3 of them possess correctly-populated accessibility attributes

... then we should end up with a series of columns next to the dataset link with '100', '80', '60', '3'

I don't think there's any value in weighting particular attributes or counting partial satisfaction of the profiles. That way lies madness.
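
A minimal sketch of how those checks and percentages might be computed, assuming each profile is simply a list of required attributes with an expected type or pattern. The attribute names and profile contents below are illustrative only, not the real profile definitions:

```js
// Illustrative profile definitions: each profile is a list of required attributes.
const profiles = {
  core: [
    { attribute: 'name', type: 'string' },
    { attribute: 'startDate', type: 'string', pattern: /^\d{4}-\d{2}-\d{2}/ },
  ],
  // accessibility, socialPrescribing, conformant ... defined the same way
};

// (i) is the attribute populated, and (ii) is it populated by a value of the
// correct datatype (optionally matching a regex)?
function satisfiesProfile(record, profile) {
  return profile.every(({ attribute, type, pattern }) => {
    const value = record[attribute];
    if (value === undefined || value === null) return false;
    if (typeof value !== type) return false;
    return pattern ? pattern.test(value) : true;
  });
}

// Percentage of records satisfying each profile, e.g. { core: 80, socialPrescribing: 60 }
function profilePercentages(records) {
  const stats = {};
  for (const [name, profile] of Object.entries(profiles)) {
    const passing = records.filter((record) => satisfiesProfile(record, profile)).length;
    stats[name] = records.length ? Math.round((passing / records.length) * 100) : 0;
  }
  return stats;
}
```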

@thill-odi
Contributor

thill-odi commented Jun 22, 2020

Sorry, have just realised after discussion with @nickevansuk that this really only deals with items after normalisation. Stats should ideally also be kept for items failing validation prior to normalisation - again, expressed as a percentage, and with warnings left out of the count.

In an ideal world, a list of the individual items failing validation would also be kept and linked from the validation page as an aid to data users.
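
For illustration only, the kind of per-feed record this could produce (the field names here are made up, not a defined schema):

```js
// Hypothetical shape of the stats kept per feed: a percentage of items failing
// validation before normalisation, plus the individual failing items.
const feedValidationStats = {
  feedUrl: 'https://example.org/feeds/sessions', // hypothetical feed URL
  totalItems: 100,                               // non-deleted items checked
  itemsFailing: 20,
  percentFailing: 20,
  failingItems: [{ rpdeId: 'abc-123' /* plus a link to the item in the validator, etc. */ }],
};
```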

@nickevansuk
Contributor

To add some further detail to this, you'll want to filter on something like severity === "failure" so that you only count the validation errors (and ignore the warnings).

See validator integration from test suite for more info: https://github.com/openactive/openactive-test-suite/blob/master/packages/openactive-integration-tests/test/shared-behaviours/validation.js#L94
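
A minimal sketch of that filter, assuming the validator returns an array of result objects that each carry a severity field (as in the linked test-suite integration):

```js
// Given the array of results the validator returns for one item, keep only the
// errors; warning, notice and suggestion severities are ignored.
function onlyFailures(validationResults) {
  return validationResults.filter((result) => result.severity === 'failure');
}

const itemIsConformant = (validationResults) => onlyFailures(validationResults).length === 0;
```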

@nickevansuk
Contributor

nickevansuk commented Jun 22, 2020

As @thill-odi mentions above, "a list of the individual items failing validation would also be kept and linked from the validation page as an aid to data users". One option for this is to construct a specific link to the validator that includes the item in the feed, as follows:
https://validator.openactive.io/?url={url}&rpdeId={rpdeId}

Note that the validator only validates the first 10 non-deleted items in any RPDE page that's provided, so the rpdeId parameter is required to ensure the item in question is validated by the online validator.

When you're using the validator programmatically, RPDE items should be validated individually (i.e. the data of each item should be validated, rather than the whole RPDE page), to ensure that every item is covered.
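
A sketch of per-item validation along those lines. The require path, the validate() call, and the item/state field names are assumptions based on the data-model-validator README and the RPDE spec, not verified against a specific version:

```js
// Hypothetical helper: validate each non-deleted item in an RPDE page individually
// and build a deep link to the online validator for anything that fails.
const { validate } = require('@openactive/data-model-validator'); // assumed import

async function validateFeedPage(pageUrl, rpdePage) {
  const failingItems = [];
  for (const item of rpdePage.items) {
    if (item.state === 'deleted') continue;               // deleted items carry no data
    const results = await validate(item.data);            // validate the item's data, not the whole page
    const failures = results.filter((r) => r.severity === 'failure');
    if (failures.length > 0) {
      failingItems.push({
        rpdeId: item.id,
        failures,
        validatorLink: `https://validator.openactive.io/?url=${encodeURIComponent(pageUrl)}&rpdeId=${encodeURIComponent(item.id)}`,
      });
    }
  }
  return failingItems;
}
```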

@ghost
Author

ghost commented Jun 23, 2020

Thanks for the many replies - this touches on a lot and is interesting. I'm going to move a lot of things out to other issues though, and be strict about keeping this on track with the original question. Hope that's ok.

> We'll go for four categories: 'Conformant', 'Core', 'Accessibility', and 'Social Prescribing'. The profiles for each of these consist essentially of a list of attributes; testing for these will involve

So Conformant is the results from the validation library and the other 3 are data profiles?

Because these come from different mechanisms I'd like to deal with them differently - I'll deal with data profiles in another ticket soon.

> In an ideal world, a list of the individual items failing validation would also be kept and linked from the validation page as an aid to data users.

> one option is that a specific link to the validator which includes the item in the feed can be constructed as follows: https://validator.openactive.io/?url={url}&rpdeId={rpdeId}

Moved to openactive-archive/conformance-status-page#4

> Stats should ideally also be kept for items failing validation prior to normalisation - again, expressed as a percentage.

To be clear:

We should be running the validation library against the raw data we download, the un-normalised data? And calculating stats for that.

So, we take the results and filter ...

Can you be clear which one it is?

Then on the status page, show a % of how many records pass - i.e. have no validation library results against them after filtering.

When calculating the % and counting the total records, should it be total all records or just total of records that aren't deletes? Probably the latter I assume.

@nickevansuk
Contributor

nickevansuk commented Jun 23, 2020

Filter severity === "failure" is the one. By removing "warnings", Tim meant removing warning, notice, and suggestion (which are all classed as "warnings" in the OpenActive Test Suite).

@nickevansuk
Contributor

nickevansuk commented Jun 23, 2020

Also on the other points:

> We should be running the validation library against the raw data we download, the un-normalised data? And calculating stats for that.

Yes, so that publishers can fix issues

> When calculating the % and counting the total records, should it be total all records or just total of records that aren't deletes? Probably the latter I assume.

Suggest ignoring deleted records
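
Pulling those answers together, a sketch of the headline stat: run the validator over the raw (un-normalised) items, skip deleted items, and report the percentage with no 'failure' results. The `validateItem` helper and field names here are hypothetical:

```js
// Percentage of non-deleted raw items with no 'failure' results.
// `validateItem` is a hypothetical wrapper around the data-model-validator.
async function percentConformant(rawItems, validateItem) {
  const liveItems = rawItems.filter((item) => item.state !== 'deleted');
  if (liveItems.length === 0) return 100;

  let passing = 0;
  for (const item of liveItems) {
    const results = await validateItem(item.data);
    if (!results.some((r) => r.severity === 'failure')) passing += 1;
  }
  return Math.round((passing / liveItems.length) * 100);
}
```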

@ghost
Author

ghost commented Jun 23, 2020

Q: In the case where a publisher has multiple feeds (e.g. https://onlinebooking.1610.org.uk/OpenActive/ Slot, FacilityUse, ...) should we calculate the stat per publisher, or per feed for that publisher?

@nickevansuk
Contributor

nickevansuk commented Jun 23, 2020

Not sure how it's presented - I guess it depends on the UI. "% of data published that is conformant" would work per-publisher, but as in openactive-archive/conformance-status-page#4 they need to get to the detail of which feeds have errors (so a % conformance per-feed could be useful?) and example pages/items within the feeds that exhibit the errors.

Ideally we want the headline of every publisher being 100% conformant (though this is unlikely to be the case on day 1 of this tool going live)
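
One way that could look in code: keep the percentage per feed for the drill-down, and derive the publisher headline from the feed totals. A sketch with illustrative field names, not a defined schema:

```js
// feedStats: one entry per feed for a publisher, e.g. { feedUrl, totalItems, passingItems }
function publisherConformance(feedStats) {
  // Per-feed percentages, useful for showing which feeds have errors
  const perFeed = feedStats.map((feed) => ({
    feedUrl: feed.feedUrl,
    percent: feed.totalItems ? Math.round((feed.passingItems / feed.totalItems) * 100) : 100,
  }));

  // Headline "% of data published that is conformant" across all the publisher's feeds
  const totals = feedStats.reduce(
    (acc, feed) => ({ total: acc.total + feed.totalItems, passing: acc.passing + feed.passingItems }),
    { total: 0, passing: 0 }
  );
  const headline = totals.total ? Math.round((totals.passing / totals.total) * 100) : 100;

  return { headline, perFeed };
}
```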
