This repository has been archived by the owner on Jul 6, 2020. It is now read-only.

Idea for server-side SPARQL over a Solid Pod #7

Open
kjetilk opened this issue Jun 3, 2019 · 17 comments

Comments

@kjetilk

kjetilk commented Jun 3, 2019

Introduction

There are numerous problems with having public SPARQL endpoints, stemming mainly from the very power of SPARQL: it is highly expressive. Therefore, more lightweight interfaces should be the first concern, but since there might be use cases where SPARQL would make sense, it is also worth discussing relatively simple approaches to enable it. The more elaborate way to enable SPARQL endpoints is to limit their expressiveness, and there are various ways to do that. However, this proposal instead focuses on limiting the amount of data that would be queried, and thereby tries to limit the impact on the server.

We also note that SPARQL has the notion of quad patterns, not just triple patterns. However, we can ensure that most queries stay rather simple to write by wisely choosing what is known as the default graph. This will enable most queries to use simple triple patterns only, without entering the complexity of named graphs. This proposal deals with these two problems: 1) ensuring server-side SPARQL is evaluated over reasonably sized graphs, and 2) defining graphs to make most queries simple to write, which also helps with problem 1.

Quad Semantics

An RDF graph is a set of triples, where each triple contains a subject, a predicate and an object. The Turtle serialization (and others) serializes graphs consisting of just such triples. Typically, resources on Solid are just a bunch of triples.

With graph names, the triple is extended to a quad. The graph names can be used to name a set of triples, and may be useful to group triples, and to partition the dataset for various purposes. In a Solid context where users may be authorized to write data, one might, for example, want to partition the data so that data from users that are unverified are kept in a different graph than data that have been verified by some party.
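
As a minimal sketch of such a partitioning, here is a hypothetical TriG document (all names invented for illustration) that keeps verified and unverified contributions in separate named graphs:

```trig
@prefix ex: <https://example.org/> .

# Triples that some party has verified
ex:verified {
  ex:alice ex:knows ex:bob .
}

# Triples written by users that are not yet verified
ex:unverified {
  ex:mallory ex:claims ex:nonsense .
}
```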

The default graph

The SPARQL 1.1 Query Language specification defines the RDF Dataset, and SPARQL queries will be executed over the data in that dataset. What comprises the dataset may be influenced in the query itself or in the protocol.

Moreover,

An RDF Dataset comprises one graph, the default graph, which does not have a name, and zero or more named graphs, where each named graph is identified by an IRI.

In other words, the default graph is what is queried if nothing else is defined, and these queries are then quite simple in that they only use triple patterns.

How can we limit the amount of data that is queried in this case in Solid? With the Linked Data Platform, there is a clear partitioning of data: the Container. My proposal is therefore:

A Container may expose a SPARQL endpoint, and if so, the RDF Dataset of that endpoint must have a default graph that is given by the RDF Merge of all of its contained RDF Documents.
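
Under this rule, a query sent to a container's endpoint could use plain triple patterns only; a sketch (assuming the container https://alice.dev.inrupt.net/bar/ used later in this issue):

```sparql
# Sent to the endpoint of https://alice.dev.inrupt.net/bar/ ;
# the default graph is the merge of all RDF Documents contained in /bar/ .
SELECT ?resource ?type WHERE {
  ?resource a ?type .
}
```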

Further named graphs

Although it is not needed for the above, I think it is interesting to discuss what could reasonably be considered a "named graph" in a Solid context:

Since any data can be written to a resource (not just data where the Request-URI matches the URIs in the data), I might, for example, PUT

<https://example.org/foo> a <https://example.com/Bar> .

to https://alice.dev.inrupt.net/foo/bar.ttl

In effect, that makes https://alice.dev.inrupt.net/foo/bar.ttl similar to the graph names in that it groups certain triples. I think we should simply formalize this intuition: Systems using quad semantics should use the Request-URI of a resource as the graph name of that RDF Document.
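
Assuming a dataset where contained documents appear as named graphs under their Request-URIs, the triples of a single document could then be addressed with the GRAPH keyword; a sketch:

```sparql
# Query only the triples that were written to /foo/bar.ttl
SELECT ?s ?p ?o WHERE {
  GRAPH <https://alice.dev.inrupt.net/foo/bar.ttl> {
    ?s ?p ?o .
  }
}
```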

Discussing all implications of this is beyond the scope of this issue. The advantage is that the RDF Dataset can be amended with some data from outside the container in cases where the query writer needs data beyond the container's graph.

So, for example, a FROM clause can be used in the query. Say that the endpoint, and thus the default graph, is https://alice.dev.inrupt.net/bar/ ; then

SELECT ?foo
FROM <https://alice.dev.inrupt.net/foo/bar.ttl>
WHERE {
  ?foo a [] .
}

would select from everything in /bar/ as well as /foo/bar.ttl. For advanced queries, the entire repertoire of the dataset section can be used, thus making simple things easy and hard things possible.

With this, a Pod would have many SPARQL endpoints, each with different default graphs, but they could query all documents in the Pod by naming them. Cross-pod queries would still require SERVICE or some client side federation though, FROM would only be for queries within Pods.
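
A cross-pod query would then look something like the following sketch, where the endpoint of a second, hypothetical pod (bob.dev.inrupt.net) is contacted via SERVICE:

```sparql
# Local default graph: Alice's container; remote data: Bob's endpoint
SELECT ?friend ?name WHERE {
  ?me <http://xmlns.com/foaf/0.1/knows> ?friend .
  SERVICE <https://bob.dev.inrupt.net/bar/> {
    ?friend <http://xmlns.com/foaf/0.1/name> ?name .
  }
}
```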

This feature is a bit dangerous, since in principle the client might add all resources in a Pod to the dataset. However, SPARQL endpoint implementations should have some defenses against such things anyway.

Considerations for internals

While this feature could make it harder to use named graphs for more advanced purposes, I note that it greatly simplifies the implementation of quad-store-based storage layers under Solid: Web Access Control can be computed over the graph names, which greatly simplifies integration with some existing SPARQL implementations. Moreover, the resourceStore interface can simply use the graph name of a backend quad store in the concrete implementation of the interface.

@RubenVerborgh

There are numerous problems with having public SPARQL endpoints

Agreed (emphasis mine)

Therefore, more lightweight interfaces should be the first concern

…or authenticated interfaces, since the problems are inherent to the public nature of endpoints.

However, let us clarify a couple of things:

  • We are talking small data here, in many cases. "Small" being any number of triples less than a million, which I expect to be the case for many personal pods. So might not be that much of an issue in general.
  • Additional concern: data has different access levels (which might make things more expensive again).
  • Furthermore, SPARQL endpoints might not expose sufficient primitives to speed up federated querying, which is very likely what we will need.

However, we can ensure that most queries stay rather simple (to write), by wisely choosing what is known as the default graph.

We very likely want @timbl's input on this; the document-oriented nature of Solid was chosen deliberately, and we might want to stick to graphs for representing those documents. (That said, it is an open problem how to deal with quad-based formats and pods, e.g., how to treat a TriG document on a Solid pod.)

defining graphs to make most queries simple to write

That's interesting; I guess simple things should be simple, and hard things possible.

My proposal is therefore:

A Container may expose a SPARQL endpoint, and if so, the RDF Dataset of that endpoint must have a default graph that is given by the RDF Merge of all of its contained RDF Documents.

I like that idea, however, spelling out the obvious here:

  • This means multiple endpoints per data pod. (By itself, not that bad.)
  • Those endpoints would query over a virtual graph. These virtual graphs could be rather expensive to construct (from a traditional triple store perspective) for arbitrary folder structures. (Imagine 3 levels of folders with 10 subfolders each.)

So we need to contrast this to alternatives.

Systems using quad semantics should use the Request-URI of a resource as the graph name of that RDF Document.

Check, but:

  • If /foo is a container, what will be the name of the Merge Graph of that container?
  • What about documents that are quad-based themselves?

Discussing all implications of this is beyond the scope of this issue

We'll need to do it somewhere 🙂

So, for example, FROM clause can be used in the query […]
would select from everything in /bar/ as well as /foo/bar.ttl.

You probably want to give or point to a reminder of FROM, FROM NAMED and what not, because the above can be very confusing (it is to me).

Cross-pod queries would still require SERVICE or some client side federation though,

You say that very casually, but this is a major bottleneck 🙂

I note that it simplifies the implementation of quad store based storage layers under Solid greatly:

Does it?
Not sure if constructing the Merge Graph will be that cheap.

Web Access Control can be computed over the graph names

Well, only if the WAC is the same for the entire graph, no?

@kjetilk

kjetilk commented Jun 4, 2019

Therefore, more lightweight interfaces should be the first concern

…or authenticated interfaces, since the problems are inherent to the public nature of endpoints.

Indeed!

However, let us clarify a couple of things:

  • We are talking small data here, in many cases. "Small" being any number of triples less than a million, which I expect to be the case for many personal pods. So might not be that much of an issue in general.

True.

  • Furthermore, SPARQL endpoints might not expose sufficient primitives to speed up federated querying, which is very likely what we will need.

Yes, but that is an orthogonal problem, as I see it. Apart from the academic work on federation, what we can do short/medium term is turn the query engine inside out, i.e. expose cardinality and cache info.

However, we can ensure that most queries stay rather simple (to write), by wisely choosing what is known as the default graph.

We very likely want @timbl's input on this; the document-oriented nature of Solid was chosen deliberately, and we might want to stick to graphs for representing those documents. (That said, it is an open problem how to deal with quad-based formats and pods, e.g., how to treat a TriG document on a Solid pod.)

Aha, but this is actually the key: the proposal preserves the document-oriented nature, and also the triple-oriented nature, of Solid; it just enables a limited query interface over it.

defining graphs to make most queries simple to write

That's interesting; I guess simple things should be simple, and hard things possible.

Absolutely!

My proposal is therefore:
A Container may expose a SPARQL endpoint, and if so, the RDF Dataset of that endpoint must have a default graph that is given by the RDF Merge of all of its contained RDF Documents.

I like that idea, however, spelling out the obvious here:

  • This means multiple endpoints per data pod. (By itself, not that bad.)

Indeed, that's by design.

  • Those endpoints would query over a virtual graph. These virtual graphs could be rather expensive to construct (from a traditional triple store perspective) for arbitrary folder structures. (Imagine 3 levels of folders with 10 subfolders each.)

Ah, but conceptually, I think this is the same as enumerating all the "folders" documents as FROM clauses. RDF Merge is what SPARQL endpoints do, and doing that virtually shouldn't be alien to them. So, I don't think there is anything special about this. I'd run this by @kasei to be sure, but my hunch is that this is going to be easy, both from a performance perspective and an implementation perspective.

So we need to contrast this to alternatives.

Systems using quad semantics should use the Request-URI of a resource as the graph name of that RDF Document.

Check, but:

  • If /foo is a container, what will be the name of the Merge Graph of that container?

It will not need to be named, as it will only be part of a virtual default graph.

  • What about documents that are quad-based themselves?

Then, it breaks down. Which is kinda by design on my part, since I think we want to keep Solid triple-oriented. The thing is that SPARQL is really a quad-oriented language, so, it is more about how we reconcile the two, in a way that brings the most triple-feel to it. :-)

Now, it could be fixed by simply saying that quad-based documents can't be queried from any default graph, they would always need to be named. So, it isn't such a big deal, I think.

Discussing all implications of this is beyond the scope of this issue

We'll need to do it somewhere 🙂

:-)

So, for example, FROM clause can be used in the query […]
would select from everything in /bar/ as well as /foo/bar.ttl.

You probably want to give or point to a reminder of FROM, FROM NAMED and what not, because the above can be very confusing (it is to me).

Oh, yeah.

So, SPARQL queries are always evaluated over an RDF Dataset. You'll find that in the RDF 1.1 specs too. The RDF Dataset is a collection of graphs, so you always operate over quads in SPARQL. Now, it offers the option to simplify that to the classical RDF triple through the notion of the default graph. The complexity of figuring out what a reasonable default graph is falls on us, because getting that right is crucial to making SPARQL simple.

There are various ways to add a graph to the default graph. The query engine could itself define its unnamed default graph. Then, the most visible way is to use a FROM clause, which will add the triples from the referenced document to the default graph (but the server is not obliged to dereference it, and may refuse a query if it doesn't allow a certain graph to be added). Which means you could legitimately create a SPARQL engine that operates only on triples, but you'd have to refuse all queries that try to build a different dataset.

But now, I just found one thing in the SPARQL spec that poses a problem for this proposal: the presence of a FROM clause overrides any predefined default graph. So, you can't just add to it like I proposed... Hmmmm, I think I'll play that into SPARQL 1.2.

Anyway, regarding FROM NAMED: the RDF Dataset can contain zero or more named graphs. With this clause, you declare that you will query only specific subsets of the data later in the query, and then, around some Basic Graph Pattern, you name the graph using the GRAPH keyword. This is good for partitioning data. The graph named like this is called the active graph while it is being queried.
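
To illustrate, reusing the document from earlier in the thread: FROM NAMED adds it to the dataset as a named graph, and GRAPH then makes it the active graph for the enclosed basic graph pattern:

```sparql
SELECT ?s ?p ?o
FROM NAMED <https://alice.dev.inrupt.net/foo/bar.ttl>
WHERE {
  GRAPH <https://alice.dev.inrupt.net/foo/bar.ttl> {
    ?s ?p ?o .
  }
}
```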

Cross-pod queries would still require SERVICE or some client side federation though,

You say that very casually, but this is a major bottleneck 🙂

Oh, indeed, very much, but also, as I said, an orthogonal problem. :-)

I note that it simplifies the implementation of quad store based storage layers under Solid greatly:

Does it?
Not sure if constructing the Merge Graph will be that cheap.

I'm pretty sure it is. It would be nice to explore in Attean; I think it is just a matter of referencing the resources.

Web Access Control can be computed over the graph names

Well, only if the WAC is the same for the entire graph, no?

Ah, but that referred to the idea of using the Request-URI as the graph name of RDF Documents. I wrote a partial implementation of this on the plane home from Boston last time. What I do there is that whenever the planner encounters a quad, it checks the WAC for the graph name of that quad:
https://github.com/kjetilk/p5-web-access-control/blob/master/lib/Web/Access/Control/AccessPlan.pm#L16
This works if the Request-URI of the RDF Document is the graph name. If WAC says you do not have access to the graph, it replaces the original quad plan with a different restricted quad plan, and then, in the cost planner, I give this restricted plan an infinite cost. I'd love to take a couple of days to make this code actually run.

So, it will work in general if the graph name can be matched with the WAC, not just for entire graphs.

@RubenVerborgh

Ah, but conceptually, I think this is the same as enumerating all the "folders" documents as FROM clauses.

It is; just that such an enumeration might be expensive if I have a lot of folders; both generating the enumeration (since it is recursive) and then querying from a large number of graphs.

@kasei

kasei commented Jun 4, 2019

(dropping in here without any context, so keep that in mind if I've misunderstood the conversation)

Ah, but conceptually, I think this is the same as enumerating all the "folders" documents as FROM clauses. RDF Merge is what SPARQL endpoints do, and doing that virtually shouldn't be alien to them. So, I don't think there is anything special about this. I'd run this by @kasei to be sure, but my hunch is that this is going to be easy, both from a performance perspective and an implementation perspective.

I think conceptually this is a simple thing, but many SPARQL implementations might have worse performance for custom-defined union graphs than their defaults. My guess would be that the more performant the store, the more discrepancy there might be between querying the default graph(s) and querying a custom-defined union.

@kjetilk

kjetilk commented Jun 4, 2019

I think conceptually this is a simple thing, but many SPARQL implementations might have worse performance for custom-defined union graphs than their defaults. My guess would be that the more performant the store, the more discrepancy there might be between querying the default graph(s) and querying a custom-defined union.

Why would that happen? Indexes that stop at graph boundaries? Statistics that are computed just for one default graph? Hmmm, right, I can see stuff like that being the case... It does not seem like an insurmountable problem though, if this is something that is helpful to users...

Now, I understand more what you mean by "document-oriented", @RubenVerborgh. I have always thought of it in abstract terms: the document interface exposed through LDP is not an optimization but a classical way of doing things, one that, once you start working with the data, would be merely an interface, and one that wouldn't be used a lot. So, I had just thought that I'd use a quad store behind it, set the graph name to the Request-URI (with something like your resourceStore as an interface between LDP and the quad store), and it would have the performance characteristics of a quad store (both externally and internally), not of a hierarchical file system, unless it was a quad store implemented on top of a hierarchical file system... If that assumption falls, that the performance characteristics of a hierarchical document system are preferable to those of a quad store, then indeed it is hard to implement this idea over it, but so would be any triple/quad pattern-based query language. I'm not sure I see the compelling use cases where the document-oriented approach would excel...

@namedgraph

@kjetilk I think I tried to explain the same point to @timbl a while ago: https://gitter.im/linkeddata/chat?at=5b3e3c537b811a6d63d26581
A long thread follows.

TL;DR: container hierarchies can be virtual, they don't have to live in the physical filesystem as files and folders. This page uses such hierarchy, for example: https://linkeddatahub.com/docs/

@RubenVerborgh

Indexes that stop at graph boundaries? Statistics that is just computed for one default graph? Hmmm, right, I can see stuff like that being the case...

Would also be my guess.

It does not seem like an insurmountable problem though

It never is, but it could mean that it won't work well with off-the-shelf SPARQL endpoints.

If that assumption falls, that the performance characteristics of a hierarchical document system are preferable to those of a quad store

It's a matter of how common "select from all in folder (recursive)" queries are; and I expect they are not that common, even though the logical abstraction is an attractive one.

@kjetilk

kjetilk commented Jun 4, 2019

TL;DR: container hierarchies can be virtual, they don't have to live in the physical filesystem as files and folders. This page uses such hierarchy, for example: https://linkeddatahub.com/docs/

Absolutely! Indeed, that is how I'd be implementing things, if I were. However, it is about the performance characteristics that you would see on the surface.

@kjetilk

kjetilk commented Jun 4, 2019

It does not seem like an insurmountable problem though

It never is, but it could mean that it won't work well with off-the-shelf SPARQL endpoints.

Possibly, only empiricism can tell for sure. However, I think it also means that we can and should focus on how a SPARQL server-side implementation can be most helpful to users, even if it means that we have to address those problems.

If that assumption falls, that the performance characteristics of a hierarchical document system are preferable to those of a quad store

It's a matter of how common "select from all in folder (recursive)" queries are; and I expect they are not that common, even though the logical abstraction is an attractive one.

So, the thing is that I suspect the "folder" structure is going nowhere in terms of knowledge organisation... It will be a practical division for apps, but the queries needed to derive anything useful will criss-cross this structure along all kinds of axes, independently of folder structure, and therefore some are likely to work up and down the tree too. But given that we probably can't have just one endpoint for everything, a reasonable partitioning is something we might want to have.

@kjetilk

kjetilk commented Jun 4, 2019

TL;DR: container hierarchies can be virtual, they don't have to live in the physical filesystem as files and folders. This page uses such hierarchy, for example: https://linkeddatahub.com/docs/


Ah, getting further down that discussion (which happened while I was on holiday :-) ), I see that you point out that using the RDF Document URL as the graph name is indeed what @timbl has done too. Good, then that is uncontroversial. :-)

So, then the proposal is how that can be used in constructing a default graph in the case where query evaluation is done on the server side.

@namedgraph

Graph name as document URL is uncontroversial, that is exactly what SPARQL 1.1 Graph Store HTTP Protocol manages.

@kjetilk

kjetilk commented Jun 5, 2019

Right, and on the topic of multiple SPARQL endpoints, we already have that, since every RDF Document has one for patching. I believe it is uncontroversial that each has a default graph given by the document at its URL, so the question is whether we can do something empowering for containers, and that's where I think the union graph of members is an interesting choice.

@namedgraph

Why the default graph? Only named graphs have names (URLs in this case).

@kjetilk

kjetilk commented Jun 5, 2019

Why the default graph?

Because then people don't need to consider the fact that SPARQL is a quad-oriented language, and so wisely choosing the default graph simplifies the use of SPARQL greatly.

Only named graphs have names (URLs in this case).

Hmmmm, I'm not sure what you mean by that. In SPARQL, you use graph names to add graphs to the default graph (through FROM), not only when using named graphs (through FROM NAMED and GRAPH).

@namedgraph

But GSP does what you want here :) You supply a graph name (directly or indirectly), you get an RDF graph back - triples, not quads.
Quad semantics for graph stores exist as well, but they are not standard - at least not yet, SPARQL 1.2 might address that: w3c/sparql-dev#56

What you describe is merging named graphs into a default graph at query time. But in an RDF dataset, the default graph does not have a name by definition:

An RDF dataset is a collection of RDF graphs. All but one of these graphs have an associated IRI or blank node. They are called named graphs, and the IRI or blank node is called the graph name. The remaining graph does not have an associated IRI, and is called the default graph of the RDF dataset.

@kjetilk

kjetilk commented Jun 5, 2019

But GSP does what you want here :) You supply a graph name (directly or indirectly), you get an RDF graph back - triples, not quads.

Yes I am aware of that (I was amongst those who started the GSP work in the WG), but we're not implementing it, and in some cases, LDP and GSP are at odds (to great loss :-) )

Quad semantics for graph stores exist as well, but they are not standard - at least not yet, SPARQL 1.2 might address that: w3c/sparql-dev#56

What you describe is merging named graphs into a default graph at query time. But in an RDF dataset, the default graph does not have a name by definition:

An RDF dataset is a collection of RDF graphs. All but one of these graphs have an associated IRI or blank node. They are called named graphs, and the IRI or blank node is called the graph name. The remaining graph does not have an associated IRI, and is called the default graph of the RDF dataset.

Yes, I am aware of that too, but you're misunderstanding my intention, and the relationship between the RDF specification and the SPARQL specification, where named graphs can both be added to the default graph at query time, and they can be named to be addressed separately. And, importantly, the query service is free to construct the RDF Dataset from any graph it pleases, see Section 13.1.

@TallTed

TallTed commented Jun 13, 2019

Relevant to this discussion... w3c/sparql-dev#43

@kjetilk kjetilk transferred this issue from solid/solid-spec Jan 17, 2020