Idea for server-side SPARQL over a Solid Pod #7
Comments
Agreed (emphasis mine)
…or authenticated interfaces, since the problems are inherent to the public nature of endpoints. However, let us clarify a couple of things:
We very likely want @timbl's input on this; the document-oriented nature of Solid was chosen deliberately, and we might want to stick to graphs for representing those documents. (That said, it is an open problem how to deal with quad-based formats and pods, e.g., how to treat a TriG document on a Solid pod.)
That's interesting; I guess simple things should be simple, and hard things possible.
I like that idea, however, spelling out the obvious here:
So we need to contrast this to alternatives.
Check, but:
We'll need to do it somewhere 🙂
You probably want to give or point to a reminder of
You say that very casually, but this is a major bottleneck 🙂
Does it?
Well, only if the WAC is the same for the entire graph, no? |
Indeed!
True.
Yes, but that is an orthogonal problem, as I see it. Apart from the academic work on federation, what we can do short/medium term is turn the query engine inside out, i.e. expose cardinality and cache info.
Aha, but this is actually the key: the proposal preserves the document-oriented nature, and also the triple-oriented nature, of Solid; it just enables a limited query interface over it.
Absolutely!
Indeed, that's by design.
Ah, but conceptually, I think this is the same as enumerating all the "folders" documents as
It will not need to be named, as it will only be part of a virtual default graph.
Then, it breaks down. Which is kinda by design on my part, since I think we want to keep Solid triple-oriented. The thing is that SPARQL is really a quad-oriented language, so, it is more about how we reconcile the two, in a way that brings the most triple-feel to it. :-) Now, it could be fixed by simply saying that quad-based documents can't be queried from any default graph, they would always need to be named. So, it isn't such a big deal, I think.
:-)
Oh, yeah. So, SPARQL queries are always evaluated over an RDF Dataset; you'll find that in the RDF 1.1 specs too. An RDF Dataset is a collection of graphs, so you will always operate over quads in SPARQL. It does, however, offer the option of simplifying that back to the classical RDF triple through the notion of the default graph. The complexity of figuring out what a reasonable default graph is falls on us, because getting that right is crucial to making SPARQL simple. There are various ways to add a graph to the default graph: the query engine could itself define its unnamed default graph, and then the most visible way is to use a FROM clause. But now, I just found one thing in the SPARQL spec that poses a problem for this proposal, concerning the presence of a FROM clause. Anyway, regarding …
Oh, indeed, very much, but also, as I said, an orthogonal problem. :-)
I'm pretty sure it is. It would be nice to explore in Attean; I think it is just a matter of referencing the resources.
Ah, but that referred to the idea of using the Request-URI as the graph name of RDF Documents. I wrote a partial implementation of this on the plane home from Boston last time. What I do there is that whenever the planner encounters a quad, it checks the WAC for the graph name of that quad. So, it will work in general if the graph name can be matched with the WAC, not just for entire graphs.
It is; just that such an enumeration might be expensive if I have a lot of folders: both generating the enumeration (since it is recursive) and then querying across a large number of graphs.
(dropping in here without any context, so keep that in mind if I've misunderstood the conversation)
I think conceptually this is a simple thing, but many SPARQL implementations might have worse performance for custom-defined union graphs than their defaults. My guess would be that the more performant the store, the more discrepancy there might be between querying the default graph(s) and querying a custom-defined union.
Why would that happen? Indexes that stop at graph boundaries? Statistics that are computed only for one default graph? Hmmm, right, I can see stuff like that being the case... It does not seem like an insurmountable problem, though, if this is something that is helpful to users... Now, I understand more what you mean by "document-oriented", @RubenVerborgh. I have always thought of it in abstract terms: the document interface that is exposed through LDP is not an optimization, rather a classical way of doing things, and once you start working with the data, it would be merely an interface, and also one that wouldn't be used a lot. So, I have just thought that I'd use a quad store behind it, set the graph to the Request-URI (but have something like your …
@kjetilk I think I tried to explain the same point to @timbl a while ago: https://gitter.im/linkeddata/chat?at=5b3e3c537b811a6d63d26581 TL;DR: container hierarchies can be virtual, they don't have to live in the physical filesystem as files and folders. This page uses such hierarchy, for example: https://linkeddatahub.com/docs/
Would also be my guess.
It never is, but it could mean that it won't work well with off-the-shelf SPARQL endpoints.
It's a matter of how common "select from all in folder (recursive)" queries are; and I expect they are not that common, even though the logical abstraction is an attractive one.
Absolutely! Indeed, that is how I'd be implementing things, if I were. However, it is about the performance characteristics that you would see on the surface.
Possibly, only empiricism can tell for sure. However, I think it also means that we can and should focus on how a SPARQL server-side implementation can be most helpful to users, even if it means that we have to address those problems.
So, the thing is that I suspect that the "folder" structure is going nowhere in terms of knowledge organisation... It will be a practical division for apps, but the queries that are needed to derive anything useful will be criss-crossing this structure along all kinds of axes, independently of folder structure, and therefore some are likely to work up and down the tree too. But given that we probably can't have just that endpoint for everything, a reasonable partitioning is something we might want to have.
Ah, getting further down that discussion (which happened while I was on holiday :-) ), I see that you point out that using the graph name for the RDF Document URL is indeed what @timbl has done too. Good, then that is uncontroversial. :-) So, then, the proposal is about how that can be used in constructing a default graph in the case where query evaluation is done on the server side.
Graph name as document URL is uncontroversial; that is exactly what the SPARQL 1.1 Graph Store HTTP Protocol manages.
Right, and on the topic of multiple SPARQL endpoints, we already have that, since every RDF Document has one for patching. I believe it is uncontroversial that they have a default graph that is identical to their URL, so the question is whether we can do something empowering for containers, and that's where I think the union graph of members is an interesting choice.
Why the default graph? Only named graphs have names (URLs in this case). |
Because then people don't need to consider the fact that SPARQL is a quad-oriented language, and so wisely choosing the default graph simplifies the use of SPARQL greatly.
Hmmmm, I'm not sure what you mean by that. In SPARQL, you use graph names to add graphs to the default graph (through FROM clauses).
But GSP does what you want here :) You supply a graph name (directly or indirectly), and you get an RDF graph back: triples, not quads. What you describe is merging named graphs into a default graph at query time. But in an RDF dataset, the default graph does not have a name by definition:
Yes I am aware of that (I was amongst those who started the GSP work in the WG), but we're not implementing it, and in some cases, LDP and GSP are at odds (to great loss :-) )
Yes, I am aware of that too, but you're misunderstanding my intention, and the relationship between the RDF specification and the SPARQL specification, where named graphs can both be added to the default graph at query time, and be named so as to be addressed separately. And, importantly, the query service is free to construct the RDF Dataset from any graph it pleases, see Section 13.1.
Relevant to this discussion... w3c/sparql-dev#43
Introduction
There are numerous problems with having public SPARQL endpoints, stemming mainly from the very power of SPARQL: it is very expressive. Therefore, more lightweight interfaces should be the first concern, but since there may be use cases where SPARQL makes sense, it is also worth discussing relatively simple approaches to enabling it. The more elaborate approaches to enabling SPARQL endpoints limit its expressiveness, and there are various ways to do that. This proposal, however, focuses on limiting the amount of data that would be queried, and thereby limiting the impact on the server.
We also note that SPARQL has the notion of quad patterns, not just triple patterns. However, we can ensure that most queries stay rather simple (to write) by wisely choosing what is known as the default graph. This will enable most queries to use just simple triple patterns, without entering the complexity of named graphs. This proposal therefore deals with two problems: 1) ensuring server-side SPARQL is evaluated over reasonably sized graphs, and 2) defining graphs so that most queries are simple to write, which also helps with problem 1.
Quad Semantics
An RDF graph is a set of triples, where each triple contains a subject, a predicate and an object. The Turtle serialization (and others) serializes graphs consisting of just such triples. Typically, resources on Solid are just a bunch of triples.
With graph names, the triple is extended to a quad. A graph name can be used to name a set of triples, which is useful for grouping triples and for partitioning the dataset for various purposes. In a Solid context where users may be authorized to write data, one might, for example, want to partition the data so that data from unverified users is kept in a different graph than data that has been verified by some party.
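As a sketch of what that partitioning could look like (the graph names, vocabulary and triples below are purely illustrative, not part of the proposal), SPARQL Update can write into distinct named graphs:

```sparql
# Hypothetical example: partitioning data by trust level using named graphs.
# The graph URIs and triples are made up for illustration.
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

INSERT DATA {
  GRAPH <https://alice.dev.inrupt.net/graphs/verified> {
    <https://alice.dev.inrupt.net/profile#me> foaf:name "Alice" .
  }
  GRAPH <https://alice.dev.inrupt.net/graphs/unverified> {
    <https://alice.dev.inrupt.net/profile#me> foaf:knows <https://bob.example/profile#me> .
  }
}
```

A query could then scope itself to one graph or the other, depending on how much it trusts unverified contributions.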
The default graph
The SPARQL 1.1 Query Language specification defines the RDF Dataset, and SPARQL queries will be executed over the data in that dataset. What comprises the dataset may be influenced in the query itself or in the protocol.
Moreover,
In other words, the default graph is what is queried if nothing else is defined, and these queries are then quite simple in that they only use triple patterns.
How can we limit the amount of data that is queried in this case in Solid? The Linked Data Platform provides a clear partitioning of data: the Container. My proposal is therefore:
A Container may expose a SPARQL endpoint, and if so, the RDF Dataset of that endpoint must have a default graph that is given by the RDF Merge of all of its contained RDF Documents.
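Under this rule, a query sent to a container's endpoint needs only plain triple patterns, since the default graph already merges the contained documents. A minimal sketch (the endpoint URL and vocabulary are assumed for illustration):

```sparql
# Sketch: a query sent to the (hypothetical) endpoint of the container
# https://alice.dev.inrupt.net/bar/ — its default graph is the RDF Merge
# of all RDF Documents contained in /bar/, so plain triple patterns
# suffice; no GRAPH keyword is needed.
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?person ?name WHERE {
  ?person foaf:name ?name .
}
```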
Further named graphs
Although it is not needed for the above, I think it is interesting to discuss what could reasonably be considered a "named graph" in a Solid context:
Since any data can be written to a resource, not just data where the Request-URI matches the URIs in the data, I might, for example, PUT arbitrary triples to https://alice.dev.inrupt.net/foo/bar.ttl
In effect, that makes https://alice.dev.inrupt.net/foo/bar.ttl similar to the graph names in that it groups certain triples. I think we should simply formalize this intuition: Systems using quad semantics should use the Request-URI of a resource as the graph name of that RDF Document.
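With the Request-URI as graph name, a quad-aware query could address a single document's triples explicitly. A sketch, assuming the document URL from the example above is used as its graph name:

```sparql
# Sketch: selecting only the triples stored in one RDF Document,
# assuming the document's Request-URI serves as its graph name.
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?s ?name WHERE {
  GRAPH <https://alice.dev.inrupt.net/foo/bar.ttl> {
    ?s foaf:name ?name .
  }
}
```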
Discussing all implications of this is beyond the scope of this issue. The advantage is that the RDF Dataset can be amended with some data from outside the container in the cases where the query writer needs some data beyond the container's graph.
So, for example, a FROM clause can be used in the query. Say that the endpoint, and thus the default graph, is https://alice.dev.inrupt.net/bar/ ; then a query naming /foo/bar.ttl in its dataset clause would select from everything in /bar/ as well as /foo/bar.ttl . For advanced queries, the entire repertoire of the dataset section can be used, thus making simple things easy and hard things possible.

With this, a Pod would have many SPARQL endpoints, each with a different default graph, but they could query all documents in the Pod by naming them. Cross-pod queries would still require SERVICE or some client-side federation, though; FROM would only be for queries within Pods.

This feature is a bit dangerous, since in principle, the client might add all resources in a Pod to the dataset. However, some defenses against such things should be added to SPARQL endpoint implementations anyway.
Considerations for internals
While this feature could make it harder to use named graphs for more advanced purposes, I note that it greatly simplifies the implementation of quad-store-based storage layers under Solid: Web Access Control can be computed over the graph names, and thus integration with some existing SPARQL implementations is greatly simplified. Also, the resourceStore interface can simply use the graph name of a backend quad store for the concrete implementation of the interface.