Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Page total count does not work for grouped models #98

Closed
lu-pl opened this issue Oct 18, 2024 · 6 comments · Fixed by #99
Closed

Page total count does not work for grouped models #98

lu-pl opened this issue Oct 18, 2024 · 6 comments · Fixed by #99
Assignees
Labels
bug Something isn't working

Comments

@lu-pl
Copy link
Contributor

lu-pl commented Oct 18, 2024

Given the following query/model constellation

query = """
select ?x ?y ?z
where {
    values (?x ?y ?z) {
        (1 2 3)
        (1 22 33)
        (2 22 33)
    }
}
"""

class MyModel(BaseModel):
    x: int
    y: int

the total count in a Page object will be 3, which is correct:

{

      "items": [
            {
                  "x": 1,
                  "y": 2
            },
            {
                  "x": 1,
                  "y": 22
            },
            {
                  "x": 2,
                  "y": 22
            }
      ],
      "page": 1,
      "size": 100,
      "total": 3,
      "pages": 1

}

If the model uses grouping however, the count will still be 3, which is NOT correct:

class MyModel(BaseModel):
    model_config = ConfigDict(group_by="x")

    x: int
    y: list[int]

Result:

{

      "items": [
            {
                  "x": 1,
                  "y": [
                        2,
                        22
                  ]
            },
            {
                  "x": 2,
                  "y": [
                        22
                  ]
            }
      ],
      "page": 1,
      "size": 100,
      "total": 3,
      "pages": 1
}

Obviously this happens because the count query constructor is too dumb to recognize grouping and simply counts all the SPARQL result rows.

A remedy for the problem is ergo to have an alternative code path in the count query constructor in case the outermost model defines a group_by config.

@lu-pl lu-pl self-assigned this Oct 18, 2024
@lu-pl
Copy link
Contributor Author

lu-pl commented Oct 18, 2024

Now, how to test the fix implemented in #99?

Imo ideally this should have end2end testing for SPARQLModelAdapter -- for which I don't have a solution because I haven't found the time to implement a proper mock facility for SPARQLWrapper yet and coming up with an approach for testing against a real (containerized) triplestore needs time and tinkering as well.

Testing against a remote endpoint is not an option for me.

@b1rger
Copy link
Collaborator

b1rger commented Oct 18, 2024

Now, how to test the fix implemented in #99?

mock urllib.

@lu-pl
Copy link
Contributor Author

lu-pl commented Oct 18, 2024

Now, how to test the fix implemented in #99?

mock urllib.

Yes, but this is easier said than done, simply mocking urllib and telling SPARQLWrapper what to return won't do the trick.

The mocking solution I was trying to implement would allow to generically mock SPARQLWrapper and run a query against a local rdflib.Graph target which would be ideal for testing, because the tests could exercise control over everything that is happening.

graph = Graph().parse("some_actual_rdf.ttl") 

def code_under_test():
    s = SPARQLWrapper("https://some.inexistent.endpoint")
    s.setQuery("select * where {?s ?p ?o .}")

    result = s.query()
    return result

# target rdflib.Graph instead of remote endpoint for code under test
with SPARQLWrapperLocalTarget(graph) as graph:
    result = code_under_test() 

See https://github.com/lu-pl/sparqwrapper_mock_draft.
Unfortunately I ran into problems with that and I need time to come up with something.

@lu-pl
Copy link
Contributor Author

lu-pl commented Oct 18, 2024

I can implement very basic tests for rdfproxy.utils.sparql_utils.construct_count_query that manually target a local rdflib.Graph instance, but this is NOT proper testing for the behavior of #99. 😐

@lu-pl
Copy link
Contributor Author

lu-pl commented Oct 18, 2024

See https://github.com/acdh-oeaw/rdfproxy/blob/lupl/fix-group-counting/tests/test_construct_count_query.py

Something like this might not even be wildly insufficient for properly testing the behavior implemented in #99. Opinions?

@lu-pl
Copy link
Contributor Author

lu-pl commented Oct 19, 2024

NOTE: SPARQL counting and thus the implemented rdfproxy count mechanism does not work for Wikidata queries using the label service. E.g. for models defining a wikidata label variable as a group_by value, the count will always be 0.

The problem might be caused by how the label service binds label variables, it seems like those variables are not processed by the SPARQL engine directly but bolted on to the query result set as part of a post-processing step, which could explain why the COUNT is not behaving as expected.

This post serves the purpose of recording this issue. Nothing more.

This is clearly a Wikidata specific problem, all COUNTs of label service bindings are affected.
I don't feel responsible for working around Wikidata's secret custom Blazegraph wizardry.

@lu-pl lu-pl added the bug Something isn't working label Oct 22, 2024
@lu-pl lu-pl closed this as completed in #99 Oct 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants