Computed relationships - First-class computed/virtual columns - a generalization of foreign-key-based resource embedding #2144
Comments
First-class computed/virtual columns are a sure win!
Likely better to discuss this on a different issue, but would that |
Yes, yes and yes. |
Another thing that's good about this one is that it could be used to override our detected relationships as well, so it provides a way forward from breaking changes in embedding. For example, if we start detecting one-to-one relationships, users could go back to redefining them as one-to-many. It's much better than doing some PostgREST-specific stuff (like in this proposal) to override relationships, because it reduces user cognitive load.
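A minimal sketch of what such an override could look like, assuming the function-based computed relationships discussed below and hypothetical `users`/`addresses` tables:

```sql
-- Sketch (hypothetical tables and names): even if PostgREST were to start
-- auto-detecting users->addresses as one-to-one, defining the relationship
-- explicitly as a SETOF function would let users keep it as one-to-many.
CREATE FUNCTION test.addresses(test.users) RETURNS SETOF test.addresses AS $$
  SELECT * FROM test.addresses WHERE user_id = $1.id;
$$ LANGUAGE sql STABLE;
```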
So a function that
IIUC, this would not need hints because the virtual column would be unambiguous, as it's explicitly defining the join with the table and columns.
data Cardinality
= O2M {relConnection :: Either FKConstraint ComputedColumn}
-- ^ one-to-many
| M2O {relConnection :: Either FKConstraint ComputedColumn}
-- ^ many-to-one
| M2M Junction
-- ^ many-to-many
deriving (Eq, Ord, Generic, JSON.ToJSON)
type ComputedColumn = QualifiedIdentifier
data FKConstraint = FKConstraint Text [(FieldName, FieldName)]

-- | Relationship between two tables.
data Relationship = Relationship
{ relTable :: QualifiedIdentifier
, relForeignTable :: QualifiedIdentifier
  , relIsSelf :: Bool -- ^ Whether it is a self-relationship
, relCardinality :: Cardinality
}
deriving (Eq, Ord, Generic, JSON.ToJSON)
The target name, the relTable and relForeignTable can be inferred from the function as: create function target_name(relTable) returns relForeignTable. Kinda makes me think that |
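For example, taking a definition that appears later in this thread, the inference would read each piece off the signature (the mapping shown is my reading of the proposal, not settled behavior):

```sql
CREATE FUNCTION test.clients(test.projects) RETURNS SETOF test.clients AS $$
  SELECT * FROM test.clients WHERE id = $1.client_id;
$$ LANGUAGE sql STABLE;
-- target name     -> "clients"      (usable as ?select=*,clients(*))
-- relTable        -> test.projects  (the argument type)
-- relForeignTable -> test.clients   (the return type)
```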
Came up with this query:

-- Adding this on the fixtures:
-- CREATE FUNCTION manual_client_rel(test.projects) RETURNS SETOF test.clients AS $$
--   SELECT * FROM test.clients WHERE id = $1.client_id;
-- $$ LANGUAGE sql STABLE ROWS 1;
with
all_relations as (
select reltype
from pg_class
where relkind in ('v','r','m','f','p')
)
select
p.pronamespace::regnamespace as schema,
p.proname as name,
p.proargtypes[0]::regtype as rel_table,
p.prorettype::regtype as rel_ftable,
p.prorows = 1 as single_row
FROM pg_proc p
WHERE
p.pronargs = 1
and p.proargtypes[0] in (select reltype from all_relations)
and p.prorettype in (select reltype from all_relations);
-[ RECORD 1 ]-+------------------
schema | test
name | manual_client_rel
rel_table | test.projects
rel_ftable | test.clients
single_row | t |
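For the one-to-many side, a hedged sketch (hypothetical function name): a SETOF function without ROWS 1 keeps pg_proc.prorows at its default estimate of 1000, so the query above would report single_row = false for it:

```sql
-- Sketch: the one-to-many counterpart of the fixture above.
CREATE FUNCTION manual_projects_rel(test.clients) RETURNS SETOF test.projects AS $$
  SELECT * FROM test.projects WHERE client_id = $1.id;
$$ LANGUAGE sql STABLE;  -- no ROWS 1: prorows stays at the 1000 default
```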
Perf is great as mentioned above; I get a cheaper plan when calling the computed relationship:

-- our generated query
explain analyze WITH pgrst_source AS (
SELECT "test"."projects".*, row_to_json("projects_clients".*) AS "clients" FROM "test"."projects"
LEFT JOIN LATERAL (
SELECT "test"."clients".*
FROM "test"."clients" WHERE "test"."projects"."client_id" = "test"."clients"."id" ) AS "projects_clients" ON TRUE)
SELECT
pg_catalog.count(_postgrest_t) AS page_total,
coalesce(json_agg(_postgrest_t), '[]')::character varying AS body
FROM ( SELECT * FROM pgrst_source ) _postgrest_t;
Aggregate (cost=75.74..75.76 rows=1 width=112) (actual time=0.114..0.117 rows=1 loops=1)
-> Hash Left Join (cost=38.58..63.74 rows=1200 width=72) (actual time=0.045..0.054 rows=5 loops=1)
Hash Cond: (projects.client_id = clients.id)
-> Seq Scan on projects (cost=0.00..22.00 rows=1200 width=40) (actual time=0.013..0.015 rows=5 loops=1)
-> Hash (cost=22.70..22.70 rows=1270 width=36) (actual time=0.019..0.020 rows=2 loops=1)
Buckets: 2048 Batches: 1 Memory Usage: 17kB
-> Seq Scan on clients (cost=0.00..22.70 rows=1270 width=36) (actual time=0.010..0.013 rows=2 loops=1)
Planning Time: 0.416 ms
Execution Time: 0.198 ms
-- using a computed relationship
CREATE FUNCTION manual_client_rel(test.projects) RETURNS test.clients AS $$
SELECT * FROM test.clients WHERE id = $1.client_id;
$$ LANGUAGE sql STABLE;
explain analyze WITH pgrst_source AS (
SELECT x.*, row_to_json("projects_clients".*) AS "clients" FROM "test"."projects" x
LEFT JOIN LATERAL (
SELECT "manual_client_rel".*
FROM "manual_client_rel"(x)) AS "projects_clients" ON TRUE)
SELECT
pg_catalog.count(_postgrest_t) AS page_total,
coalesce(json_agg(_postgrest_t), '[]')::character varying AS body
FROM ( SELECT * FROM pgrst_source ) _postgrest_t;
Aggregate (cost=58.25..58.28 rows=1 width=112) (actual time=0.359..0.361 rows=1 loops=1)
-> Nested Loop Left Join (cost=0.25..46.25 rows=1200 width=72) (actual time=0.209..0.290 rows=5 loops=1)
-> Seq Scan on projects x (cost=0.00..22.00 rows=1200 width=104) (actual time=0.021..0.029 rows=5 loops=1)
-> Function Scan on manual_client_rel (cost=0.25..0.26 rows=1 width=32) (actual time=0.050..0.050 rows=1 loops=5)
Planning Time: 0.426 ms
Execution Time: 0.437 ms

Of course the row samples are too small (and somehow the execution time is always higher for the computed relationships), but it still shows the direction is good. |
Wait a moment, you're mixing a few things up. What ultimately matters for performance is execution time. And that is clearly worse with the computed relationship. Costs are always only an estimation.

Performance is worse because the query does not inline the function body at the moment. You can tell this from the explain for the second query - it has a "function scan on manual_client_rel" in it. As long as you can see the function name in the explain output, it's not inlined, afaik. So it's not surprising that the cost is estimated lower here, because PG does not look inside the function at all and just assumes the default cost for a function.

According to the inlining requirements, the reason why it's currently not doing that is:
Currently your helper function only

What do the query plans look like with that change? |
Ah, didn't know that was an inlining requirement. Hm, so that means we cannot use
Did you think of using SETOF for multiple rows? Or is there another way?
explain analyze WITH pgrst_source AS (
SELECT x.*, row_to_json("projects_clients".*) AS "clients" FROM "test"."projects" x
LEFT JOIN LATERAL (
SELECT "manual_client_rel".*
FROM "manual_client_rel"(x)) AS "projects_clients" ON TRUE)
SELECT
pg_catalog.count(_postgrest_t) AS page_total,
coalesce(json_agg(_postgrest_t), '[]')::character varying AS body
FROM ( SELECT * FROM pgrst_source ) _postgrest_t;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=75.74..75.75 rows=1 width=40) (actual time=0.105..0.108 rows=1 loops=1)
-> Hash Left Join (cost=38.58..63.74 rows=1200 width=72) (actual time=0.043..0.052 rows=5 loops=1)
Hash Cond: (x.client_id = clients.id)
-> Seq Scan on projects x (cost=0.00..22.00 rows=1200 width=40) (actual time=0.012..0.014 rows=5 loops=1)
-> Hash (cost=22.70..22.70 rows=1270 width=36) (actual time=0.018..0.019 rows=2 loops=1)
Buckets: 2048 Batches: 1 Memory Usage: 17kB
-> Seq Scan on clients (cost=0.00..22.70 rows=1270 width=36) (actual time=0.010..0.013 rows=2 loops=1)
Planning Time: 0.480 ms
Execution Time: 0.182 ms

Execution time now looks about the same as the query without the computed relationship. "Function scan" is also gone. |
There's the ROWS parameter. Maybe we can use a
Edit: That's it. It works fine with inlining too. I should correct the pg_proc query above to use |
So far this has been working great. I mean it even works for

CREATE FUNCTION test.clients(test.projects) RETURNS SETOF test.clients AS $$
  SELECT * FROM test.clients WHERE id = $1.client_id;
$$ LANGUAGE sql STABLE ROWS 1;

curl 'localhost:3000/rpc/getallprojects?select=*,clients(id,name)'

[{"id":1,"name":"Windows 7","client_id":1,"clients":{"id":1,"name":"Microsoft"}},
{"id":2,"name":"Windows 10","client_id":1,"clients":{"id":1,"name":"Microsoft"}},
{"id":3,"name":"IOS","client_id":2,"clients":{"id":2,"name":"Apple"}},
{"id":4,"name":"OSX","client_id":2,"clients":{"id":2,"name":"Apple"}},

Generated query:

WITH pgrst_source AS (
  SELECT "getallprojects".* FROM "test"."getallprojects"())
SELECT
  coalesce(json_agg(_postgrest_t), '[]')::character varying AS body
FROM (
  SELECT "projects".*, row_to_json("projects_clients".*) AS "clients"
  FROM "pgrst_source" AS "projects"
  LEFT JOIN LATERAL (
    SELECT "clients"."id", "clients"."name"
    FROM "test"."clients"("projects")
  ) AS "projects_clients" ON TRUE
) _postgrest_t;

Other thing I noticed is that, thanks to overloaded functions, we can use the same function name to define relationships for many tables:

CREATE FUNCTION test.clients(test.projects) RETURNS SETOF test.clients
-- curl 'localhost:3000/projects?select=*,clients(id,name)'
CREATE FUNCTION test.clients(test.companies) RETURNS SETOF test.clients
-- curl 'localhost:3000/companies?select=*,clients(id,name)'
CREATE FUNCTION test.clients(test.others) RETURNS SETOF test.clients
-- curl 'localhost:3000/others?select=*,clients(id,name)' |
There is one problem though. It's not working for mutations.

curl 'localhost:3000/projects?select=id,name,clients(id,name)' -H "Prefer: return=representation" -H "Content-Type: application/json" -d @- <<JSON
{"id":6,"name":"New Project","client_id":2}
JSON
{"code":"42846","details":"Input has too few columns.","hint":null,"message":"cannot cast type record to projects"}

WITH pgrst_source AS (
WITH
pgrst_payload AS (SELECT '{"id":6,"name":"New Project","client_id":2}'::json AS json_data),
pgrst_body AS ( SELECT CASE WHEN json_typeof(json_data) = 'array' THEN json_data ELSE json_build_array(json_data) END AS val FROM pgrst_payload)
INSERT INTO "test"."projects"("client_id", "id", "name")
SELECT "client_id", "id", "name" FROM json_populate_recordset (null::"test"."projects", (SELECT val FROM pgrst_body)) _
RETURNING "test"."projects"."id", "test"."projects"."name")
SELECT
coalesce(json_agg(_postgrest_t), '[]')::character varying AS body
FROM (
SELECT
"projects"."id",
"projects"."name",
row_to_json("projects_clients".*) AS "clients"
FROM "pgrst_source" AS "projects"
LEFT JOIN LATERAL (
SELECT "clients"."id", "clients"."name"
FROM "test"."clients"("projects")
) AS "projects_clients" ON TRUE
) _postgrest_t;
ERROR: cannot cast type record to test.projects
LINE 18: FROM "test"."clients"("projects")
^
DETAIL: Input has too few columns.

At first I thought it was because of the CTE, but the RPC query above proves it's not. It's actually because the
Found this trick that allows casting like |
You already have that in the outer
So it's just a matter of making it
I suggest only making it that way if any computed relationship is used... otherwise, we'd lose the ability to do mutations with select privileges on only a subset of columns, right? |
Yeah, though we'd lose the same ability for computed relationships. Say if you want to insert a row of "projects" but don't have privilege on a new column "cost" and you'd do

I found another way to fix the type, but it would require the schema cache; basically we need to fill the missing columns in the RETURNING with

WITH pgrst_source AS (
WITH
pgrst_payload AS (SELECT '{"id":6,"name":"New Project","client_id":2}'::json AS json_data),
pgrst_body AS ( SELECT CASE WHEN json_typeof(json_data) = 'array' THEN json_data ELSE json_build_array(json_data) END AS val FROM pgrst_payload)
INSERT INTO "test"."projects"("client_id", "id", "name")
SELECT "client_id", "id", "name" FROM json_populate_recordset (null::"test"."projects", (SELECT val FROM pgrst_body)) _
RETURNING "test"."projects"."id", "test"."projects"."name", null::int)
SELECT
coalesce(json_agg(_postgrest_t), '[]')::character varying AS body
FROM (
SELECT
"projects"."id",
"projects"."name",
row_to_json("projects_clients".*) AS "clients"
FROM "pgrst_source" AS "projects"
LEFT JOIN LATERAL (
SELECT "clients"."id", "clients"."name"
FROM "test"."clients"("projects")
) AS "projects_clients" ON TRUE
) _postgrest_t; |
An alternative that doesn't require the schema cache could also be:

WITH pgrst_source AS (
WITH
pgrst_payload AS (SELECT '{"id":6,"name":"New Project","client_id":2}'::json AS json_data),
pgrst_body AS ( SELECT CASE WHEN json_typeof(json_data) = 'array' THEN json_data ELSE json_build_array(json_data) END AS val FROM pgrst_payload)
INSERT INTO "test"."projects"("client_id", "id", "name")
SELECT "client_id", "id", "name" FROM json_populate_recordset (null::"test"."projects", (SELECT val FROM pgrst_body)) _
RETURNING json_build_object('id', "test"."projects"."id", 'name', "test"."projects"."name"))
SELECT
coalesce(json_agg(_postgrest_t), '[]')::character varying AS body
FROM (
SELECT
"projects"."id",
"projects"."name",
row_to_json("projects_clients".*) AS "clients"
FROM json_populate_record(null::test.projects, (SELECT * FROM "pgrst_source")) AS "projects"
LEFT JOIN LATERAL (
SELECT "clients"."id", "clients"."name"
FROM "test"."clients"("projects")
) AS "projects_clients" ON TRUE
) _postgrest_t;

But it doesn't feel right to use it for all returnings; the json casting could mess up some types. Using json functions on the body (defined by Content-Type) and the response (defined by Accept) is fine, but not for the generic RETURNING. |
Besides the type casting, there's another problem with mutations. For our detected relationships, we enhance the RETURNING with the foreign keys so an embedding can succeed:

postgrest/src/PostgREST/Request/DbRequestBuilder.hs, lines 350 to 377 in 3950b36
However, with computed relationships the join conditions are basically a private detail; we cannot know for sure which conditions (FKs included) we need for the embedding to succeed. That leaves the

It should be noted that computed relationships do work on mutations when specifying

curl 'localhost:3000/projects?select=*,clients(id,name)' -H "Prefer: return=representation" -H "Content-Type: application/json" -d @- <<JSON
{"id":6,"name":"New Project","client_id":2}
JSON
[{"id":6,"name":"nom","client_id":2,"clients":{"id":2,"name":"Apple"}}] |
Since |
Yes, came to the same conclusion.
Please don't. I can easily see computed relationships that could be marked |
Note: computed relationships will not work for "relational inserts" later, but we have another way on #818 (comment). |
Problem
This builds on some ideas we had earlier:
@steve-chavez understood resource embedding as functions:
And I gave a practical example, showing the similarities:
@steve-chavez then suggested this as the way forward to do custom embeddings.
However, the current support for virtual columns is only implicit through field notation. This has several drawbacks for embedding, i.e. the following can only be done through fk-based embedding right now:
Additionally, it's currently impossible to use a virtual column that has the same name as a real column of the same relation - field notation will always choose the latter.
Basic implementation
To solve these issues, I propose to introduce first-class support for virtual columns: When creating the schema cache, we should parse function definitions to detect virtual columns and then make them callable the same way as resource embedding is currently done.
The good thing: This is actually much simpler than it sounds. We need to add relationships based on function definitions in:
postgrest/src/PostgREST/DbStructure.hs
Line 96 in 8934190
Those relationships would have a cardinality of M2O or O2M, depending on whether the function returns multiple rows or not. This would generate the right query for aggregation here:

postgrest/src/PostgREST/Query/QueryBuilder.hs, lines 52 to 81 in 8934190
We would then extend the Cardinality type to hold either the FKConstraint or a QualifiedIdentifier pointing to the virtual column's function. This would allow addRels to add the type of embedding (virtual vs fk-based) to the ReadRequest in:

postgrest/src/PostgREST/Request/DbRequestBuilder.hs, lines 133 to 137 in 8934190
And based on this, we can then make sure that the actual SELECT from the embedded table is changed to call the function instead:

postgrest/src/PostgREST/Query/QueryBuilder.hs, lines 36 to 42 in 8934190
This currently creates a subquery like the following:
and we'd need to make this:
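The two snippets were lost here; as a sketch reconstructed from the generated queries shown earlier in this thread, the change would be roughly:

```sql
-- current, fk-based subquery:
LEFT JOIN LATERAL (
  SELECT "test"."clients".*
  FROM "test"."clients"
  WHERE "test"."projects"."client_id" = "test"."clients"."id"
) AS "projects_clients" ON TRUE

-- proposed, virtual-column subquery: call the function instead
LEFT JOIN LATERAL (
  SELECT "clients".*
  FROM "test"."clients"("projects")
) AS "projects_clients" ON TRUE
```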
The result would be embedding based on virtual columns, with full support for the features we currently have: nesting, selecting subfields, inlining for performance; there is basically no difference, afaict. It should even be possible to use hints, if we fill the Relationship fields in a smart way.

Consequences for Embedding in general
This will solve #591, #1179 and #1230 immediately and would provide a workaround solution for #1643, #2566, #1907 and #1984 - basically any issue with disambiguation, because you can always roll your own in a more explicit way.
Furthermore, we could easily remove some of the auto-detected fk-based relationships that have recently posed problems with disambiguation in general. We could restrict M2M relationships to simple junction tables with only FK columns, and we might decide to remove self-references completely, because those create a lot of challenges on their own. Those can all be replaced with explicit virtual column embedding.

Taking this one step further, we could add a config option db-fk-embedding, which defaults to true. Setting this to false would disable fk-based embedding entirely. This would remove our dependency on constraint names, at least partially, getting us one step closer to true schema isolation. We'd still expose those in error messages for failed constraints, I think, but at least our query API wouldn't depend on them. This would also allow users to be very explicit about which embeddings they want to allow, and would also guarantee that further improvements in our view-parsing will not break existing queries when updating PostgREST due to new embedding possibilities resulting in a "can't disambiguate" response. This would solve #2070.

Further generalization for function calls
At this stage, ?select=*,func(*) is essentially a function call, which takes the current row as the only argument and allows selecting partial output.

Scalars
The next step would be to allow ?select=*,func() to be used to call functions which return a scalar instead of a composite type. This would solve the problem mentioned above, regarding real and virtual columns with the same name on a given table.

Aggregates
Given that we have a lot of similarities with those virtual columns and the field notation call syntax, we can think of both of the following to be the same (pseudo-code):
We don't need the table prefix, because we're calling the function in a better way, but we can take a similar-looking approach to pass columns instead of rows as the only argument to a function:
This would be the same as SELECT sum(employees.salary) FROM employees, essentially.

This is very similar to this suggestion on how to call aggregate functions:
However, the new proposal has less boilerplate, less indirection and is more consistent with how we treat functions as virtual columns. It also avoids potential name conflicts with column filters on the operator etc.
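As an illustration (the request syntax below is my guess at the proposal's lost pseudo-code, not an implemented feature), the mapping would be roughly:

```sql
-- Hypothetical request: GET /employees?select=salary.sum()
-- would translate to, essentially:
SELECT sum(employees.salary) FROM employees;
```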
It would still be possible to implement the FILTER or HAVING clauses:

Even more
It should be possible to extend these concepts to ?select=*,col.func1().func2(), which looks like a syntax that could be used for "Multiple hops for embed (Unnest/flatten embeds)" #1233, too.