Import death factoids to RDF data #1

Open
tla opened this issue Oct 12, 2022 · 19 comments
@tla
Member

tla commented Oct 12, 2022

No description provided.

@tla tla self-assigned this Oct 12, 2022
@tla tla added the People label Oct 12, 2022
@tla
Member Author

tla commented Aug 28, 2023

Factoid data attached

c11deaths-AA.xlsx
c11deaths-MR.xlsx

@tla
Member Author

tla commented Aug 28, 2023

The deaths themselves are already in the database; here we are parsing and adding the date information. We can discuss the details further on Wednesday and make notes in this issue.

@tla tla changed the title from "Import death factoids to Neo4J data" to "Import death factoids to RDF data" Aug 28, 2023
@Aaleks93

Also related to issue #2: the revised version of the death factoids is complete. You can access the updated spreadsheet through this link.

@Aaleks93

Aaleks93 commented Jan 9, 2024

I have updated the death records spreadsheet with the sources on which I based the datings where my name is given as the authority. The file from 21.11.2023 is therefore replaced by the file named "C11 PBW Death records, AA_revised version_09.01.2024.xlsx", accessible here: https://ucloud.univie.ac.at/index.php/f/797833040

@tla
Member Author

tla commented Jan 23, 2024

Report from @lu-pl 💯
I implemented the table conversion for the editor rows, see example output.
The P14 assertion for assigning Aleks or Marton is still missing, will add it today (+ some minor fixes).

Note that some SPARQL queries return empty, in which case no RDF is generated. See the logs.
I haven't really looked into that (yet) because I think you said you would like to investigate the empty queries yourself.

@lu-pl

lu-pl commented Jan 29, 2024

Update: Implemented the missing P14 assertions, see output.

@tla
Member Author

tla commented Feb 13, 2024

> Note that some SPARQL queries return empty, in which case no RDF is generated. See the logs. I haven't really looked into that (yet) because I think you said you would like to investigate the empty queries yourself.

Some of these are expected (where they are based on sources that we ended up not using), but others have to do with the fact that the Name column has something added in parentheses. So for example Ioannes (Smbat) 106 should just be queried as Ioannes 106. I don't know where the parenthetical text came from, but it needs to be stripped / ignored in all cases.

For sanity-checking purposes, it might be helpful to keep a list of the sources we aren't using; these include at least: Council of 1157; Italikos; Niketas Choniates, Historia; Pantokrator Typikon; Prodromos, Historische Gedichte; Tzetzes, Letters. If you could implement these as exclusions (i.e. if the Source canonical name is one of these, just skip the row) and output in the log what the source was every time a query returns nothing, this would help me audit a new run.
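
A minimal sketch of the two requests above: strip parenthetical text from the Name column, skip rows from unused sources, and log the source whenever a query comes back empty. It is not the project's conversion script. `normalize_name`, `convert_rows`, the `run_query` callback and the row keys `Name` and `Source` are hypothetical names, and the grouping of the source names that contain commas is a reading of the list above.

```python
import logging
import re

logger = logging.getLogger("deaths")

# Assumption: grouping of the unused sources named above (some names contain commas).
UNUSED_SOURCES = frozenset({
    "Council of 1157",
    "Italikos",
    "Niketas Choniates, Historia",
    "Pantokrator Typikon",
    "Prodromos, Historische Gedichte",
    "Tzetzes, Letters",
})

_PARENTHETICAL = re.compile(r"\([^)]*\)")


def normalize_name(name):
    """Drop any parenthetical text and collapse the leftover whitespace."""
    return " ".join(_PARENTHETICAL.sub(" ", name).split())


def convert_rows(rows, run_query):
    """Yield (row, result) pairs; skip unused sources and log empty queries.

    `rows` are dicts with the hypothetical keys "Name" and "Source";
    `run_query` is a hypothetical callback that takes the cleaned name
    and returns a (possibly empty) list of results.
    """
    for row in rows:
        source = row["Source"].strip()
        if source in UNUSED_SOURCES:
            logger.info("Skipping row with unused source: %s", source)
            continue
        name = normalize_name(row["Name"])
        result = run_query(name)
        if not result:
            # Record the source so an empty query can be audited later.
            logger.warning("Empty query for %r (source: %s)", name, source)
            continue
        yield row, result
```

With this, `normalize_name("Ioannes (Smbat) 106")` gives `"Ioannes 106"`, which is the form the query expects.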

@lu-pl

lu-pl commented Feb 19, 2024

Update:

Parenthetical text in Name fields gets ignored now and unused Source values are skipped (see the log).

The script now generates a trig file deaths.trig with a named graph for every table partition.

I also investigated the empty queries; some of those were caused by typos or incomplete PBW strings in the tables.
I queried the store for the correct PBW strings and manually updated the tables in the r11tab/tables/xlsx folder.

For most of the remaining empty queries, the PBW data is missing from the triplestore, so I don't really know what to do about that.

@lu-pl

lu-pl commented Feb 19, 2024

Note: I would like to/will port the metadata schema used in the r11cli application to the table conversion at some point, if that is alright.

@tla
Member Author

tla commented Feb 20, 2024

I've now looked at the empty queries, which have three causes:

  • They are about Basileios 2 (Basil II), who is not in our database except insofar as he was kin to others.
  • The source should have been Pantokrator Typikon but the string was modified.
  • The source is not exactly a primary source (in this case, it is Christos Philanthropos, note every time) and so has a slightly different modeling structure (we didn't create a Text Expression for this publication, but instead we created a Manifestation Creation event whose authority is the publication author, i.e. the editor of the text). The following query should work.
PREFIX crm: <http://www.cidoc-crm.org/cidoc-crm/>
PREFIX star: <https://r11.eu/ns/star/>

select ?pub ?d ?a4 ?e
where { 
    ?a1 a star:E13_crm_P3 ;
        crm:P140_assigned_attribute_to ?d ;
        crm:P141_assigned """She died on a November 1 [shortly after 1100, a year before <Isaakios 61>]"""@en ;
        crm:P14_carried_out_by ?authority ;
        crm:P17_was_motivated_by ?source .
    ?d a crm:E69_Death .
    ?a2 a star:E13_crm_P100 ;
        crm:P140_assigned_attribute_to ?d ;
        crm:P141_assigned ?p .
    ?p a crm:E21_Person .
    ?id a crm:E15_Identifier_Assignment ;
        crm:P140_assigned_attribute_to ?p ;
        crm:P37_assigned ?e42 .
    ?e42 a crm:E42_Identifier ;
         crm:P190_has_symbolic_content "Anna 61" .
    ?a3 a star:E13_lrmoo_R15 ;
        crm:P140_assigned_attribute_to ?pub ;
        crm:P141_assigned ?source .
    ?a4 a star:E13_lrmoo_R24 ;
        crm:P140_assigned_attribute_to ?pubcreation ;
        crm:P141_assigned ?pub ;
        crm:P14_carried_out_by ?e .
    ?e crm:P3_has_note ?editor . 
} limit 1
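
If the query above is to be run once per table row, the two literals (the person identifier and the P141 text) need to be filled in safely. The following is only a sketch of one way to do that in Python: `sparql_literal` and `build_death_query` are hypothetical helpers, not part of the existing script, and the graph pattern is copied from the query as posted.

```python
from string import Template

_ESCAPES = {"\\": "\\\\", '"': '\\"', "\n": "\\n", "\r": "\\r", "\t": "\\t"}


def sparql_literal(text):
    """Return `text` as a double-quoted SPARQL string literal with escapes."""
    return '"' + "".join(_ESCAPES.get(ch, ch) for ch in text) + '"'


# The pattern is the one posted above; only the two literals are parameters.
_QUERY = Template("""\
PREFIX crm: <http://www.cidoc-crm.org/cidoc-crm/>
PREFIX star: <https://r11.eu/ns/star/>

select ?pub ?d ?a4 ?e
where {
    ?a1 a star:E13_crm_P3 ;
        crm:P140_assigned_attribute_to ?d ;
        crm:P141_assigned $description@en ;
        crm:P14_carried_out_by ?authority ;
        crm:P17_was_motivated_by ?source .
    ?d a crm:E69_Death .
    ?a2 a star:E13_crm_P100 ;
        crm:P140_assigned_attribute_to ?d ;
        crm:P141_assigned ?p .
    ?p a crm:E21_Person .
    ?id a crm:E15_Identifier_Assignment ;
        crm:P140_assigned_attribute_to ?p ;
        crm:P37_assigned ?e42 .
    ?e42 a crm:E42_Identifier ;
         crm:P190_has_symbolic_content $identifier .
    ?a3 a star:E13_lrmoo_R15 ;
        crm:P140_assigned_attribute_to ?pub ;
        crm:P141_assigned ?source .
    ?a4 a star:E13_lrmoo_R24 ;
        crm:P140_assigned_attribute_to ?pubcreation ;
        crm:P141_assigned ?pub ;
        crm:P14_carried_out_by ?e .
    ?e crm:P3_has_note ?editor .
} limit 1
""")


def build_death_query(identifier, description):
    """Fill the person identifier and the death description into the query."""
    return _QUERY.substitute(
        identifier=sparql_literal(identifier),
        description=sparql_literal(description),
    )
```

Escaping the literals rather than pasting them into a triple-quoted string keeps the query valid even when a description contains quotation marks or line breaks.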

@tla
Member Author

tla commented Feb 20, 2024

I forgot the fourth case, which was a death record for Symbatios 101 from Iveron 2.178.5; this is from a document in the Iveron archive that was produced in 1098, which is past our cutoff point of 1095.

@lu-pl

lu-pl commented Mar 11, 2024

All empty query cases are handled now (see logs), and I updated the script to the new metadata schema.

As implemented now, a named graph plus metadata is generated for every table partition, see deaths.trig. Another option would be to merge all graphs into a single named graph and generate metadata only for that graph.

@lu-pl

lu-pl commented Mar 11, 2024

note: Metadata of course gets generated only once for every software execution, but every named graph is registered as being an output of that software execution, see the metadata graph.

@lu-pl

lu-pl commented Mar 13, 2024

The script now produces a single Turtle file with all subgraphs merged, see deaths.ttl.

I had to slightly modify the metadata schema: metadata assertions now point to E13 subject nodes instead of named graphs along L11_had_output. Since the range of L11 is D1_Digital_Object, this implies (and a reasoner would infer) that E13 assertions are D1s, i.e. E73_Information_Objects, which is not wrong but maybe worth pointing out.

@laletuver1

Meeting notes: Lukas has changed the metadata schema, which Tara will put on the graph database. A new issue might be necessary for converting all old metadata into the new metadata schema.

@lu-pl

lu-pl commented May 21, 2024

Ingested deaths data to https://r11.eu/rdf/resource/deaths.

@lu-pl

lu-pl commented May 21, 2024

Note: Consolidation/merging of named graphs into another named graph can be automated using SPARQL update (INSERT) requests.

This should be implemented in r11cli.

edit: DROPing a named graph would not be reflected in the merged graph though, so one would need to SPARQL the merged triples out of the target graph before deleting the named graph!

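# Remove the triples that <named_graph> contributed. Without a GRAPH or WITH
# clause, the delete template applies to the default graph.
delete { ?s ?p ?o . }
where {
    graph <named_graph> {
        ?s ?p ?o .
    }
} ;

# Only then drop the named graph itself.
drop graph <named_graph>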
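
A sketch of how the merge and the delete-then-drop sequence could be generated when the merged triples live in a named target graph. This is not r11cli code: the helper names are hypothetical and the graph IRIs are supplied by the caller.

```python
import re

# Conservative check: reject characters that may not appear inside <...>.
_BAD_IRI_CHARS = re.compile(r'[\x00-\x20<>"{}|^`\\]')


def iri(value):
    """Wrap `value` in angle brackets after a basic safety check."""
    if not value or _BAD_IRI_CHARS.search(value):
        raise ValueError(f"not a safe IRI: {value!r}")
    return f"<{value}>"


def merge_update(source_graph, target_graph):
    """INSERT request copying every triple of source_graph into target_graph."""
    return (
        f"insert {{ graph {iri(target_graph)} {{ ?s ?p ?o . }} }}\n"
        f"where {{ graph {iri(source_graph)} {{ ?s ?p ?o . }} }}"
    )


def unmerge_and_drop_update(source_graph, target_graph):
    """Remove source_graph's triples from target_graph, then drop source_graph."""
    return (
        f"delete {{ graph {iri(target_graph)} {{ ?s ?p ?o . }} }}\n"
        f"where {{ graph {iri(source_graph)} {{ ?s ?p ?o . }} }} ;\n"
        f"drop graph {iri(source_graph)}"
    )
```

One caveat with this approach: the delete removes a triple from the target graph even if another named graph contributed the same triple, so it is only safe when the merged graphs do not overlap.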

@tla
Member Author

tla commented Jul 2, 2024

Hi @lu-pl , concerning the metadata schema, I've just noticed a problem with the timestamps...

star:cd81994d8e a crmdig:D10_Software_Execution ;
    crm:P82_begin_of_the_begin "2024-03-25T08:07:23.267077"^^xsd:dateTime ;

The first issue is that begin_of_the_begin is actually P82a, not P82 itself; the second issue is that a crmdig:D10_Software_Execution is a subclass of E7, not E52, which is what the domain of P82* is supposed to be. So this would need to be rewritten to something like

star:cd81994d8e a crmdig:D10_Software_Execution ;
    crm:P4_has_time-span [ crm:P82a_begin_of_the_begin "2024-03-25T08:07:23.267077"^^xsd:dateTime ] ;
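
For illustration, a small Python sketch that emits the corrected pattern, with the timestamp attached to a time-span node rather than to the execution itself. `software_execution_turtle` is a hypothetical helper, not the project's metadata code, and the prefix declarations are omitted as in the snippet above.

```python
from datetime import datetime


def software_execution_turtle(node, started):
    """Return Turtle for a D10 Software Execution with a P4 time-span.

    `node` is a prefixed name such as "star:cd81994d8e"; `started` is a
    datetime whose ISO 8601 form is used as the xsd:dateTime value.
    """
    stamp = started.isoformat()
    return (
        f"{node} a crmdig:D10_Software_Execution ;\n"
        f"    crm:P4_has_time-span "
        f'[ crm:P82a_begin_of_the_begin "{stamp}"^^xsd:dateTime ] .'
    )


print(software_execution_turtle(
    "star:cd81994d8e", datetime(2024, 3, 25, 8, 7, 23, 267077)
))
```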

@lu-pl

lu-pl commented Jul 8, 2024

hi @tla, the metadata issue should be fixed, see deaths.ttl.

LODKit now has a feature for Ontology derived ClosedNamespaces, so at least typos won't be an issue anymore.
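
To illustrate the general idea of a closed namespace (this is a stdlib-only sketch, not LODKit's actual API): term lookup fails loudly for any name that is not in a known vocabulary. The class below and its hard-coded term list are assumptions made for the example; in practice the terms would be derived from the ontology.

```python
class ClosedNamespace:
    """Namespace that only resolves terms from a fixed vocabulary."""

    def __init__(self, base, terms):
        self._base = base
        self._terms = frozenset(terms)

    def __getitem__(self, name):
        # Item access also covers terms that are not valid Python
        # identifiers, such as "P4_has_time-span".
        if name not in self._terms:
            raise KeyError(f"{name!r} is not a term of {self._base}")
        return self._base + name

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return self[name]
        except KeyError as exc:
            raise AttributeError(str(exc)) from None


# Assumption: a tiny hand-written term list, for the example only.
crm = ClosedNamespace(
    "http://www.cidoc-crm.org/cidoc-crm/",
    ["P4_has_time-span", "P82a_begin_of_the_begin"],
)

print(crm.P82a_begin_of_the_begin)
print(crm["P4_has_time-span"])
```

With such a namespace, writing `crm.P82_begin_of_the_begin` raises an `AttributeError` instead of silently producing a property that is not in the listed vocabulary.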

4 participants