
Add redundant/missing legs to STAR assertions #30

Open
tla opened this issue Jan 23, 2024 · 9 comments

@tla
Member

tla commented Jan 23, 2024

Many of the things that get claimed (statements) require more than one triple/STAR object (assertion) in the data models we are using. Each of these assertions will have the same authority and source. In order to preserve the sanity of people using WissKI, I configured it so that the authority and source only get specified once per statement, which means that all but one of the assertions will technically be incomplete. We need a maintenance script that completes the missing 'legs' of the STAR assertions.

@lu-pl

lu-pl commented Jan 23, 2024

Quick info concerning redundant/missing legs to STAR assertions:

I implemented r11cli which I intend to be a general Command Line Interface for running commands on the R11 triplestore.
Once published on PyPI, the tool can be installed with pipx (pipx install r11cli); the script can then run, e.g., in a pipeline or as a cron job.

The tool now has a subcommand 'starlegs' which runs a set of SPARQL construct queries (so far only for the gender assertions) to produce the missing/redundant assertions. See output.
Note that for the gender query only P14s are generated because the P17s are not asserted for gender assignment.

Options:

  • r11cli starlegs serializes ttl to stdout;
  • r11cli starlegs --output <folder> expects a folder and saves the assertions to files in that folder;
  • r11cli starlegs --insert (not yet implemented) writes the generated triples directly back to the triplestore.

The insert command will be implemented using named graphs.

@lu-pl

lu-pl commented Jan 29, 2024

Update: I implemented the starlegs --insert flag, which writes the generated triples directly back to the triplestore, into the named graph https://r11.eu/rdf/resource/r11cli_starlegs.

The graph is already in the store: r11cli_starlegs (still with just the gender assertions).

My proposal for named graph metadata would be this:

<graph_uri> a rdfg:Graph, sd:NamedGraph, crmdig:D9_Data_Object .

[a crmdig:D10_Software_Execution] crmdig:L11_had_output <graph_uri> ;
    crm:P82_begin_of_the_begin "<time value>" ;
    crmdig:L23_used_software_or_firmware [
        a crmdig:D14_Software ;
        crm:P1_is_identified_by [
            a crm:E42_Identifier ;
            crm:P190_has_symbolic_content <script_uri>
        ]
    ] ;
    crmdig:L12_happened_on_device [
        a crmdig:D8_Digital_Device ;
        crm:P129i_is_subject_of [
            a crm:E73_Information_Object ;
            crm:P2_has_type
                <https://vocabs.sshopencloud.eu/browse/media-type/en/page/applicationslashjson> ;
            crm:P190_has_symbolic_content "<json system info>"
        ]
    ] .

The system info will be extracted dynamically, e.g. on my machine it would be

{
  "system": "Linux",
  "node": "arch-e14",
  "release": "6.7.1-arch1-1",
  "version": "#1 SMP PREEMPT_DYNAMIC Sun, 21 Jan 2024 22:14:10 +0000",
  "machine": "x86_64",
  "python_implementation": "CPython",
  "python_version": "3.11.6"
}
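A minimal sketch of how such a record could be collected with Python's standard `platform` module. The function name `collect_system_info` is hypothetical and not necessarily how r11cli does it; the keys simply mirror the example above.

```python
import json
import platform


def collect_system_info() -> dict[str, str]:
    """Collect the fields from the example above using the stdlib platform module."""
    uname = platform.uname()
    return {
        "system": uname.system,
        "node": uname.node,
        "release": uname.release,
        "version": uname.version,
        "machine": uname.machine,
        "python_implementation": platform.python_implementation(),
        "python_version": platform.python_version(),
    }


# Serialized form, suitable as the literal value of P190_has_symbolic_content.
system_info_json = json.dumps(collect_system_info(), indent=2)
```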

I will implement the metadata generation as soon as the model is approved.

Todo:

  • Implement more/all construct queries.
  • Write tests.

@tla
Member Author

tla commented Feb 13, 2024

Thanks - the metadata schema looks fine, though we might need to think about back-porting the generation metadata for the original PBW script, and for the death / location factoid generation, to this model.

Concerning the STAR legs you have generated in the named graph, many of them are duplicates of triples that already exist in the main data graph (probably because these triples were generated by the original PBW script instead of via WissKI). So it would be a good idea to check whether these triples already exist before creating them in the second graph.
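One way such a check could look on the client side, sketched in plain Python with triples modelled as tuples. The function name and the example identifiers are hypothetical; a real implementation would more likely compare RDF graphs or filter inside the SPARQL query itself.

```python
Triple = tuple[str, str, str]


def new_triples_only(constructed: set[Triple], main_graph: set[Triple]) -> set[Triple]:
    """Keep only the constructed triples that the main data graph does not already contain."""
    return constructed - main_graph


# Hypothetical example data: one constructed leg already exists in the main graph.
main_graph = {("e13_a", "crm:P14_carried_out_by", "agent_1")}
constructed = {
    ("e13_a", "crm:P14_carried_out_by", "agent_1"),
    ("e13_b", "crm:P14_carried_out_by", "agent_1"),
}
to_insert = new_triples_only(constructed, main_graph)
```

Only the triples in `to_insert` would then be written to the second graph.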

@lu-pl

lu-pl commented Mar 25, 2024

Update: Starleg construct requests are now generated dynamically using a simple (and hopefully sufficiently generic) query builder based on a revised construct query template.

I also implemented tests for Gender assertions, see tests_starlegs_queries. The tests work by first building a set of gender graphs with different constellations of missing legs (using combinatorics) and running SHACL constraints against every graph, expecting SHACL validation to fail. Then every data graph is updated with the results of the respective construct query and the SHACL validation is run again; this time validation is expected to pass.
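The shape of that test strategy can be sketched with `itertools`. This is a simplified illustration, not the actual test code: assertions are plain dict entries, `is_valid` stands in for SHACL validation, `complete_legs` stands in for the construct query, and all names are hypothetical.

```python
from itertools import combinations

ASSERTIONS = ("e13_a", "e13_b", "e13_c")  # hypothetical assertion names
AGENT = "agent_1"  # hypothetical P14 value


def missing_leg_constellations(assertions):
    """Yield every variant in which at least one, but not every, assertion lacks P14."""
    for size in range(1, len(assertions)):
        for missing in combinations(assertions, size):
            yield {name: (None if name in missing else AGENT) for name in assertions}


def is_valid(graph):
    """Stand-in for SHACL validation: every assertion must carry a P14 value."""
    return all(agent is not None for agent in graph.values())


def complete_legs(graph):
    """Stand-in for the construct query: copy the known agent to assertions lacking it."""
    known = next(agent for agent in graph.values() if agent is not None)
    return {name: agent or known for name, agent in graph.items()}


variants = list(missing_leg_constellations(ASSERTIONS))
```

Each variant is expected to fail `is_valid` before `complete_legs` is applied and to pass afterwards, which mirrors the fail-then-pass structure described above.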

@lu-pl

lu-pl commented Sep 24, 2024

Generated assertions for

  • E13_sdhss_P13,
  • E13_sdhss_P26,
  • E13_sdhss_P36,
  • E13_sdhss_P38,
  • E13_crm_P41

See starlegs output

@lu-pl

lu-pl commented Oct 8, 2024

Quick sketch for a very (probably overly) generic starlegs query:

PREFIX crm: <http://www.cidoc-crm.org/cidoc-crm/>
PREFIX star: <https://r11.eu/ns/star/>

construct {
    ?o crm:P14_carried_out_by ?agent .
    ?o crm:P17_was_motivated_by ?source .
}
where {
    ?e13_initial a crm:E13_Attribute_Assignment ;
    	crm:P14_carried_out_by ?agent ;
    	crm:P17_was_motivated_by ?source ;
    	crm:P140_assigned_attribute_to | crm:P141_assigned ?common .


    ?common ^crm:P140_assigned_attribute_to ?o .
    filter (?o != ?e13_initial)

    minus { ?o crm:P14_carried_out_by ?_agent . }
    minus { ?o crm:P17_was_motivated_by ?_source . }
}

This finds star nodes connected to an initial E13 and asserts the initial P14/P17 statements if the connected nodes miss P14/P17 assertions altogether.
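To make the matching logic easier to follow, here is the same pattern re-expressed over an in-memory set of triples in plain Python. This is only an illustration of what the query selects, not how r11cli runs it; the prefixed names are plain strings and the example identifiers are hypothetical.

```python
P14 = "crm:P14_carried_out_by"
P17 = "crm:P17_was_motivated_by"
P140 = "crm:P140_assigned_attribute_to"
P141 = "crm:P141_assigned"
RDF_TYPE = "rdf:type"
E13 = "crm:E13_Attribute_Assignment"


def construct_starlegs(triples):
    """Return the P14/P17 triples that the construct query above would produce."""

    def objects(s, p):
        return {o for (s_, p_, o) in triples if s_ == s and p_ == p}

    def subjects(p, o):
        return {s for (s, p_, o_) in triples if p_ == p and o_ == o}

    constructed = set()
    for initial in subjects(RDF_TYPE, E13):
        agents, sources = objects(initial, P14), objects(initial, P17)
        if not agents or not sources:
            continue  # the query requires both P14 and P17 on the initial E13
        for common in objects(initial, P140) | objects(initial, P141):
            for other in subjects(P140, common) - {initial}:
                # Mirrors the two MINUS clauses: skip nodes that have any P14 or P17.
                if objects(other, P14) or objects(other, P17):
                    continue
                constructed |= {(other, P14, agent) for agent in agents}
                constructed |= {(other, P17, source) for source in sources}
    return constructed


EXAMPLE = {
    ("e13_top", RDF_TYPE, E13),
    ("e13_top", P14, "agent_1"),
    ("e13_top", P17, "source_1"),
    ("e13_top", P140, "c23_1"),
    ("e13_bare", P140, "c23_1"),
    ("e13_partial", P140, "c23_1"),
    ("e13_partial", P14, "agent_2"),
}
constructed = construct_starlegs(EXAMPLE)
```

In this example only `e13_bare` receives new legs; `e13_partial` already has a P14 assertion and is therefore left untouched, including its missing P17.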

@lu-pl

lu-pl commented Oct 22, 2024

The status of the Starlegs problem is roughly this: The construct query for generating the missing assertions is actually rather simple, especially since the query does not have to use the OPTIONAL clauses (as Tara pointed out); the difficulty is to reliably identify the classes that actually need the missing legs constructed.

My new approach for doing this is to extract the initial star pattern classes ("TopStars") from the pathbuilder XML dump obtained from the WissKI API and apply the construct query to those classes.


Quick digression/rant

The inconvenience of expressing graph patterns with paths is that edge adjacency is highly verbose and repetitive.

E.g. for stating that instances of E13_sdhss_P36 shall have P14 and P17 asserted about them, the WissKI path expression looks like this:

<path_array>
        <x>http://www.cidoc-crm.org/cidoc-crm/E21_Person</x>
        <y>^http://www.cidoc-crm.org/cidoc-crm/P141_assigned</y>
        <x>https://r11.eu/ns/star/E13_sdhss_P36</x>
</path_array>
<!-- ... -->
<path_array>
        <x>http://www.cidoc-crm.org/cidoc-crm/E21_Person</x>
        <y>^http://www.cidoc-crm.org/cidoc-crm/P141_assigned</y>
        <x>https://r11.eu/ns/star/E13_sdhss_P36</x>
        <y>http://www.cidoc-crm.org/cidoc-crm/P140_assigned_attribute_to</y>
        <x>https://r11.eu/ns/prosopography/C23</x>
        <y>^http://www.cidoc-crm.org/cidoc-crm/P140_assigned_attribute_to</y>
        <x>https://r11.eu/ns/star/E13_sdhss_P35</x>
        <y>http://www.cidoc-crm.org/cidoc-crm/P141_assigned</y>
        <x>https://r11.eu/ns/prosopography/C24</x>
</path_array>
<!-- ... -->
<path_array>
        <x>http://www.cidoc-crm.org/cidoc-crm/E21_Person</x>
        <y>^http://www.cidoc-crm.org/cidoc-crm/P141_assigned</y>
        <x>https://r11.eu/ns/star/E13_sdhss_P36</x>
        <y>http://www.cidoc-crm.org/cidoc-crm/P14_carried_out_by</y>
        <x>http://www.cidoc-crm.org/cidoc-crm/E21_Person</x>
</path_array>
<!-- ... -->
<path_array>
        <x>http://www.cidoc-crm.org/cidoc-crm/E21_Person</x>
        <y>^http://www.cidoc-crm.org/cidoc-crm/P141_assigned</y>
        <x>https://r11.eu/ns/star/E13_sdhss_P36</x>
        <y>http://www.cidoc-crm.org/cidoc-crm/P14_carried_out_by</y>
        <x>https://r11.eu/ns/spec/Author_Group</x>
</path_array>
<!-- ... -->
<path_array>
        <x>http://www.cidoc-crm.org/cidoc-crm/E21_Person</x>
        <y>^http://www.cidoc-crm.org/cidoc-crm/P141_assigned</y>
        <x>https://r11.eu/ns/star/E13_sdhss_P36</x>
        <y>http://www.cidoc-crm.org/cidoc-crm/P17_was_motivated_by</y>
        <x>http://www.cidoc-crm.org/cidoc-crm/E73_Information_Object</x>
</path_array>

Basically the same thing expressed in Turtle:

[a star:E13_sdhss_P36] 
        crm:P140 [a C23] ;
        # inferred: crm:p177 P36 ;
        crm:P141 [a crm:E21] ;
        crm:P14 [a crm:E21] ;
        crm:P17 [a crm:E73] .

Paths are good at expressing depth but very bad at expressing breadth. RDF graphs usually exhibit high edge adjacency, i.e. breadth.

Anyway.


So what I am doing now is this: XPath-extract every path_array whose first y node is either a P140 inverse or a P141 inverse, and collect the second x node of each of those path_arrays into a set. That should give me all initial E13 classes of a star chain. These classes can then be used to interpolate a construct query template, which is run against the Releven GraphDB store to construct the missing triples.
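A rough sketch of that extraction step, assuming the dump wraps the path_array elements in some root element (here a hypothetical `<paths>`). It uses `xml.etree.ElementTree` with simple child lookups rather than full XPath, and the `TopStar` dataclass is an illustrative stand-in.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

INVERSE_CONNECTORS = {
    "^http://www.cidoc-crm.org/cidoc-crm/P140_assigned_attribute_to",
    "^http://www.cidoc-crm.org/cidoc-crm/P141_assigned",
}


@dataclass(frozen=True)
class TopStar:
    cls: str
    connector: str


def extract_top_stars(xml_text: str) -> set[TopStar]:
    """Collect the second x node of every path_array whose first y node is an inverse P140/P141."""
    root = ET.fromstring(xml_text)
    top_stars = set()
    for path_array in root.iter("path_array"):
        xs = [(el.text or "").strip() for el in path_array.findall("x")]
        ys = [(el.text or "").strip() for el in path_array.findall("y")]
        if ys and len(xs) >= 2 and ys[0] in INVERSE_CONNECTORS:
            top_stars.add(TopStar(cls=xs[1], connector=ys[0]))
    return top_stars


# Two of the path_arrays shown above, wrapped in an assumed root element.
SAMPLE = """
<paths>
  <path_array>
    <x>http://www.cidoc-crm.org/cidoc-crm/E21_Person</x>
    <y>^http://www.cidoc-crm.org/cidoc-crm/P141_assigned</y>
    <x>https://r11.eu/ns/star/E13_sdhss_P36</x>
  </path_array>
  <path_array>
    <x>http://www.cidoc-crm.org/cidoc-crm/E21_Person</x>
    <y>^http://www.cidoc-crm.org/cidoc-crm/P141_assigned</y>
    <x>https://r11.eu/ns/star/E13_sdhss_P36</x>
    <y>http://www.cidoc-crm.org/cidoc-crm/P14_carried_out_by</y>
    <x>http://www.cidoc-crm.org/cidoc-crm/E21_Person</x>
  </path_array>
</paths>
"""
top_stars = extract_top_stars(SAMPLE)
```

Because the results go into a set of frozen dataclasses, the many path_arrays that share the same prefix collapse into a single entry per class and connector.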

The advantage of using the WissKI API information for finding the applicable E13s is obviously that Starlegs constructors will always be up to date with the paths that are defined in WissKI.

@lu-pl

lu-pl commented Oct 23, 2024

Result of a quick and dirty run of the logic explained above:

WissKITopStar(cls='https://r11.eu/ns/star/E13_lrmoo_R24', connector='^http://www.cidoc-crm.org/cidoc-crm/P141_assigned')
WissKITopStar(cls='https://r11.eu/ns/star/E13_crm_P108', connector='^http://www.cidoc-crm.org/cidoc-crm/P141_assigned')
WissKITopStar(cls='https://r11.eu/ns/star/E13_crm_P1', connector='^http://www.cidoc-crm.org/cidoc-crm/P140_assigned_attribute_to')
WissKITopStar(cls='https://r11.eu/ns/star/E13_crm_P92', connector='^http://www.cidoc-crm.org/cidoc-crm/P141_assigned')
WissKITopStar(cls='https://r11.eu/ns/star/E13_crm_P2', connector='^http://www.cidoc-crm.org/cidoc-crm/P140_assigned_attribute_to')
WissKITopStar(cls='https://r11.eu/ns/star/E13_sdhss_P17', connector='^http://www.cidoc-crm.org/cidoc-crm/P141_assigned')
WissKITopStar(cls='https://r11.eu/ns/star/E13_crm_P89', connector='^http://www.cidoc-crm.org/cidoc-crm/P140_assigned_attribute_to')
WissKITopStar(cls='https://r11.eu/ns/star/E13_crm_P65', connector='^http://www.cidoc-crm.org/cidoc-crm/P140_assigned_attribute_to')
WissKITopStar(cls='https://r11.eu/ns/star/E13_lrmoo_R15', connector='^http://www.cidoc-crm.org/cidoc-crm/P141_assigned')
WissKITopStar(cls='https://r11.eu/ns/star/E13_crm_P51', connector='^http://www.cidoc-crm.org/cidoc-crm/P141_assigned')
WissKITopStar(cls='https://r11.eu/ns/star/E13_sdhss_P36', connector='^http://www.cidoc-crm.org/cidoc-crm/P141_assigned')
WissKITopStar(cls='http://www.cidoc-crm.org/cidoc-crm/E15_Identifier_Assignment', connector='^http://www.cidoc-crm.org/cidoc-crm/P140_assigned_attribute_to')
WissKITopStar(cls='https://r11.eu/ns/star/E13_lrmoo_R17', connector='^http://www.cidoc-crm.org/cidoc-crm/P141_assigned')
WissKITopStar(cls='https://r11.eu/ns/star/E13_crm_P100', connector='^http://www.cidoc-crm.org/cidoc-crm/P141_assigned')
WissKITopStar(cls='https://r11.eu/ns/star/E13_crm_P128', connector='^http://www.cidoc-crm.org/cidoc-crm/P141_assigned')
WissKITopStar(cls='https://r11.eu/ns/star/E13_crm_P196', connector='^http://www.cidoc-crm.org/cidoc-crm/P140_assigned_attribute_to')
WissKITopStar(cls='https://r11.eu/ns/star/E13_spec_L1', connector='^http://www.cidoc-crm.org/cidoc-crm/P140_assigned_attribute_to')
WissKITopStar(cls='https://r11.eu/ns/star/E13_crm_P56', connector='^http://www.cidoc-crm.org/cidoc-crm/P140_assigned_attribute_to')
WissKITopStar(cls='https://r11.eu/ns/star/E13_sdhss_P13', connector='^http://www.cidoc-crm.org/cidoc-crm/P141_assigned')
WissKITopStar(cls='https://r11.eu/ns/star/E13_sdhss_P26', connector='^http://www.cidoc-crm.org/cidoc-crm/P141_assigned')
WissKITopStar(cls='https://r11.eu/ns/star/E13_crm_P107', connector='^http://www.cidoc-crm.org/cidoc-crm/P141_assigned')
WissKITopStar(cls='https://r11.eu/ns/star/E13_lrmoo_R5', connector='^http://www.cidoc-crm.org/cidoc-crm/P140_assigned_attribute_to')
WissKITopStar(cls='https://r11.eu/ns/star/E13_crm_P41', connector='^http://www.cidoc-crm.org/cidoc-crm/P141_assigned')
WissKITopStar(cls='https://r11.eu/ns/star/E13_crm_P98', connector='^http://www.cidoc-crm.org/cidoc-crm/P141_assigned')
WissKITopStar(cls='https://r11.eu/ns/star/E13_crm_P45', connector='^http://www.cidoc-crm.org/cidoc-crm/P140_assigned_attribute_to')
WissKITopStar(cls='https://r11.eu/ns/star/E13_sdhss_P38', connector='^http://www.cidoc-crm.org/cidoc-crm/P140_assigned_attribute_to')
WissKITopStar(cls='https://r11.eu/ns/star/E13_crm_P46', connector='^http://www.cidoc-crm.org/cidoc-crm/P140_assigned_attribute_to')

The WissKITopStar data objects can then be used to interpolate the query template.
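As an illustration of that interpolation step, here is a sketch using `string.Template`. The template is an assumption: it is the generic query sketched on Oct 8 with the E13 class made a parameter, not the revised template the query builder actually uses, and the `connector` value is ignored here.

```python
from string import Template

# Assumed template: the generic starlegs sketch, restricted to one top star class.
QUERY_TEMPLATE = Template("""\
PREFIX crm: <http://www.cidoc-crm.org/cidoc-crm/>

construct {
    ?o crm:P14_carried_out_by ?agent .
    ?o crm:P17_was_motivated_by ?source .
}
where {
    ?e13_initial a <$cls> ;
        crm:P14_carried_out_by ?agent ;
        crm:P17_was_motivated_by ?source ;
        crm:P140_assigned_attribute_to | crm:P141_assigned ?common .

    ?common ^crm:P140_assigned_attribute_to ?o .
    filter (?o != ?e13_initial)

    minus { ?o crm:P14_carried_out_by ?_agent . }
    minus { ?o crm:P17_was_motivated_by ?_source . }
}
""")


def build_query(cls: str) -> str:
    """Interpolate the class IRI of one top star into the template."""
    return QUERY_TEMPLATE.substitute(cls=cls)


query = build_query("https://r11.eu/ns/star/E13_sdhss_P36")
```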

Todos:

  • properly define the WissKITopStar logic
  • adapt the Starlegs query template
  • adapt the Starlegs query infrastructure for the new approach (e.g. starlegs loggers)
  • define a Runner for running the required construct queries; queries should run asynchronously
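For the last item, an asynchronous runner could be shaped roughly like this. This is a sketch only: `run_construct` is a placeholder for the actual request to the triplestore, and the function names and the concurrency limit are assumptions.

```python
import asyncio


async def run_construct(query: str) -> str:
    """Placeholder for a real request to the SPARQL endpoint."""
    await asyncio.sleep(0)  # yields control, standing in for network I/O
    return f"result of: {query}"


async def run_all(queries: list[str], limit: int = 4) -> list[str]:
    """Run all queries concurrently, with at most `limit` in flight at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(query: str) -> str:
        async with semaphore:
            return await run_construct(query)

    # gather returns the results in the order of the input queries.
    return await asyncio.gather(*(guarded(query) for query in queries))


results = asyncio.run(run_all(["query_1", "query_2", "query_3"]))
```

The semaphore keeps the number of simultaneous requests bounded, so that a long list of construct queries does not hit the store all at once.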

@lu-pl

lu-pl commented Nov 5, 2024

The initial problem outline was roughly this: In a "star chain", if only the "top star" has crm:P14 and crm:P17 asserted about it,
all other stars connected to that top star will need those exact P14/P17 assertions constructed.

I thought that by identifying the "top stars" using the WissKI API XML data, the problem could be solved by finding the P14/P17 assertions of a given top star and all connected stars in the chain, and running a SPARQL construct query to generate those assertions.

I looked at the WissKI pathbuilder data a bit more closely and noticed that the above approach would almost certainly be insufficient.

First, not all top stars in the pathbuilder chains even have P14/P17 assertions.

Secondly, most top stars don't have both P14 and P17 assertions.

Thirdly, not only top stars have P14 or P17 assertions, but also stars in the chain.

E.g. I translated the (completely incomprehensible) WissKI path representation for the https://r11.eu/ns/spec/Text_Expression shape into (pseudo) Turtle, the model looks like this:

[a E13_lrmoo_R17]
    p140 [a F28] := f28 ;
    p141 [a Text_Expression] := text_expression ;
    p14 [a E21] ;
    p14 [a Author_Group] .

[a E13_crm_p4]
    p140 f28 ;
    p141 [a E52] ;
    p14 [a E21] ;
    p14 [a Author_Group] ;
    p17 [a E73] .

[a E13_crm_p14]
    p140 f28 ;
    p141 [a E21] ;
    p141 [a Author_Group] ;
    p14 [a E21] ;
    p14 [a Author_Group] .

This shows much better what is actually going on: Three E13 assertions are connected to the same F28.

However, it remains unclear which entities need which P14/P17 assertions constructed.
One possibility would be to assert the set of P14/P17 of all E13s for all E13s in the chain.

This cannot be generalized though, which becomes clear if one looks e.g. at the https://r11.eu/ns/spec/Boulloterion model:

[a E15] p140 [a Boulloterion] := boulloterion ;
    p37 [a E42] ;
    p14 [a F11] .

[a E13_spec_L1] p140 boulloterion ;
    p141 [a Lead_Seal] ;
    p14 [a E21] ;
    p14 [a Author_Group] ;
    p17 [a E73] .

The first P14 assertion is meant to have an F11 object, the second P14 assertion is meant to have either an E21 or Author_Group object.
So I don't think that all the P14/P17 of all E13s are valid for all E13s.


Generally, I feel like I do not have enough information to come up with a generic solution.

Looking at the WissKI pathbuilder data showed that (my) previous assumptions about the data shapes and the actual problem at hand were faulty.

So I think we need to discuss how the actual task can be exactly defined.
