Ghost records #162

Open
kkdavis14 opened this issue Nov 25, 2024 · 1 comment
Labels
bug The code does not behave as expected / designed

Comments

@kkdavis14
Contributor

The pipeline is losing some Agent records: they are being re-identified but not linked together properly.

Example:
This object:
https://lux.collections.yale.edu/view/object/ccca43ea-1fd7-4449-9f3f-fb026edf7b07

was published by Martinus van den Enden:
(YCBA record, as vended)
https://ycba-lux.s3.amazonaws.com/v3/person/a4/a4d1963c-d3cc-4f57-bb49-0204574106ca.json
(LUX record, which returns a 404):
https://lux.collections.yale.edu/data/person/0133a1e2-998e-447b-bd33-657d36941876

There's a live Martinus van den Enden in LUX:
https://lux.collections.yale.edu/view/person/e2990454-a285-4b92-bb4f-dcd8b62a344b

which doesn't have the YCBA as a contributor.

Brent to attach a list of 65 unique missing agents to this issue.

kkdavis14 added the bug label (The code does not behave as expected / designed) on Nov 25, 2024
@brent-hartwig

brent-hartwig commented Nov 25, 2024

dt-162-ghost-agents-report.xlsx contains three tabs:

  1. Unique Producers (item producers and work creators): The "Unique: Combined" column contains the unique values of the other two visible columns. The other two visible columns are the unique producers/creators from the other two tabs.
  2. Started with Items Report: provides the unique producer, item, set, curator, and unit combinations. The same producer may appear in multiple rows.
  3. Started with Works Report: same as above but also identifies the work.
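As a rough illustration of how a "Unique: Combined" column can be derived from the other two, the sketch below takes the union of two lists and drops duplicates. The function name `combineUnique` and the agent values are hypothetical placeholders, not data from the attached report.

```javascript
// Union of the unique values from two columns, preserving first-seen order.
// combineUnique and the sample values are illustrative only.
function combineUnique(itemProducers, workCreators) {
  return [...new Set([...itemProducers, ...workCreators])];
}

const itemProducers = ['agent-a', 'agent-b', 'agent-b'];
const workCreators = ['agent-b', 'agent-c'];

console.log(combineUnique(itemProducers, workCreators));
// [ 'agent-a', 'agent-b', 'agent-c' ]
```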

Due to the amount of data in play, dt-162-ghost-agents-query.js.txt had to be run in three modes. The numbers in the list below do not correspond to the tab numbers above.

  1. Set startWithItems to true.
  2. Set startWithItems to false, worksOffset to 0, and worksLimit to 10000000.
  3. Set startWithItems to false, worksOffset to 10000000, and worksLimit to 11000000. There were about 20.7m rows.
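The three modes can be written down as parameter objects, which also makes it easy to sanity-check that the two works runs cover all rows. This is only a sketch: `startWithItems`, `worksOffset`, and `worksLimit` are the names used above, but the helper functions are hypothetical, and the check assumes `worksLimit` is a row count rather than an end index (the attached query is the authority on that).

```javascript
// The three run modes listed above, as plain parameter objects.
const runModes = [
  { startWithItems: true },
  { startWithItems: false, worksOffset: 0, worksLimit: 10000000 },
  { startWithItems: false, worksOffset: 10000000, worksLimit: 11000000 },
];

// ASSUMPTION: worksLimit is a row count (like SQL LIMIT), not an end index.
// Returns the half-open row window [start, end) for each works-based mode.
function worksWindows(modes) {
  return modes
    .filter((m) => !m.startWithItems)
    .map((m) => ({ start: m.worksOffset, end: m.worksOffset + m.worksLimit }));
}

// True when the windows cover rows [0, totalRows) without a gap.
function coversAllRows(windows, totalRows) {
  let covered = 0;
  for (const w of [...windows].sort((a, b) => a.start - b.start)) {
    if (w.start > covered) return false; // gap before this window
    covered = Math.max(covered, w.end);
  }
  return covered >= totalRows;
}

const approxTotalRows = 20700000; // "about 20.7m rows"
console.log(coversAllRows(worksWindows(runModes), approxTotalRows)); // true
```

Under that assumption the second and third modes together span rows 0 through 21,000,000, which is enough for roughly 20.7m rows.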

@clarkepeterf and @azaroth42, below is the technique that was used to find the IRIs that appear in the triple store but have no matching document URI. The starter plan (starterPlan) includes a producer column holding either the item's agent of production or the work's agent of creation.

starterPlan
  .notExistsJoin(
    op.fromLexicons({ iri: cts.iriReference() }),
    op.on(producer, op.col('iri'))
  )

Because the above does not also incorporate the URI lexicon, I'm left to believe the IRI lexicon is populated by the IRIs of the documents in the database, as opposed to all IRIs in the triple store.
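For readers unfamiliar with the join type, the semantics of a "not exists" join (an anti-join) can be modeled in plain JavaScript outside MarkLogic. This is a simplified stand-in, not the attached query: the local `notExistsJoin` function, the sample rows, and the `example.org` IRIs are all hypothetical, and the array of IRIs merely plays the role of the lexicon values.

```javascript
// Plain-JavaScript model of a "not exists" join (anti-join):
// keep only the rows whose value in `column` has no match in knownIris.
function notExistsJoin(rows, knownIris, column) {
  const known = new Set(knownIris);
  return rows.filter((row) => !known.has(row[column]));
}

// Hypothetical stand-in for the starter plan's rows.
const rows = [
  { producer: 'https://example.org/person/a', item: 'item-1' },
  { producer: 'https://example.org/person/b', item: 'item-2' },
  { producer: 'https://example.org/person/b', item: 'item-3' },
];

// Hypothetical stand-in for the IRIs that have a document in the database.
const documentIris = ['https://example.org/person/a'];

const ghosts = notExistsJoin(rows, documentIris, 'producer');
const uniqueGhostProducers = [...new Set(ghosts.map((r) => r.producer))];

console.log(uniqueGhostProducers);
// [ 'https://example.org/person/b' ]
```

The same producer can survive the join in several rows, which is why a separate de-duplication step is needed to arrive at a list of unique missing agents.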

See the attached query for additional context/details.
