Make the Null Filtering query more performant #2961

Open
laurenceisla opened this issue Sep 21, 2023 · 1 comment

@laurenceisla
Member

After the discussion that started in #2951 (comment), the conclusion is that changing table IS DISTINCT FROM NULL to table.join_column IS NOT NULL should be more performant, because PostgreSQL would treat the latter as an INNER JOIN.
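
Stripped of the PostgREST-specific machinery (the json projection and the CTE wrapper), the two filter shapes look roughly like this. This is only a minimal sketch against the projects/clients fixture tables used in the examples below; the exact plans depend on the PostgreSQL version and statistics:

-- Current form: the filter tests the whole row produced by the embedding subquery.
EXPLAIN
SELECT p.id
FROM projects p
LEFT JOIN LATERAL
  (SELECT c.name FROM clients c WHERE c.id = p.client_id) AS pc ON TRUE
WHERE pc IS DISTINCT FROM NULL;

-- Proposed form: the filter tests only the join column, a plain
-- null-rejecting predicate on the nullable side of the left join.
EXPLAIN
SELECT p.id
FROM projects p
LEFT JOIN LATERAL
  (SELECT c.id, c.name FROM clients c WHERE c.id = p.client_id) AS pc ON TRUE
WHERE pc.id IS NOT NULL;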

Perhaps we need to change how we build the embedding subqueries to allow this. The IS DISTINCT FROM filter is generated here:

pgFmtFilter _ (CoercibleFilterNullEmbed hasNot fld) = pgFmtIdent fld <> " IS " <> (if not hasNot then "NOT " else mempty) <> "DISTINCT FROM NULL"

@wolfgangwalther
Member

Some examples. I'm adding this to my fixtures/data.sql file:

INSERT INTO test.clients (id, name) SELECT n+2, 'Test ' || n FROM generate_series(1,1000) AS n;

INSERT INTO test.projects (id, name, client_id) SELECT (n-1)*100+m+5, 'Test ' || m, n+2 FROM generate_series(1,1000) AS n, LATERAL generate_series(1,100) AS m;

CREATE INDEX projects_client_id ON projects (client_id);

Not a lot of data (roughly 1,000 clients and 100,000 projects), but it's enough to show the difference.

IS DISTINCT FROM NULL for a single row

Query
EXPLAIN ANALYZE
WITH pgrst_source AS
  (SELECT projects.id, "test"."projects"."name",
          row_to_json("projects_clients_1".*) AS "clients"
   FROM "test"."projects"
   LEFT JOIN LATERAL
     (SELECT "clients_1"."name"
      FROM "test"."clients" AS "clients_1"
      WHERE "clients_1"."id" = "test"."projects"."client_id") AS "projects_clients_1" ON TRUE
   WHERE "projects_clients_1" IS DISTINCT FROM NULL)
SELECT NULL::bigint AS total_result_set,
       pg_catalog.count(_postgrest_t) AS page_total,
       coalesce(json_agg(_postgrest_t), '[]') AS body,
       nullif(current_setting('response.headers', TRUE), '') AS response_headers,
       nullif(current_setting('response.status', TRUE), '') AS response_status
FROM
  (SELECT *
   FROM pgrst_source WHERE id=1000) _postgrest_t;
                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=16.63..16.65 rows=1 width=112) (actual time=0.028..0.029 rows=1 loops=1)
   ->  Nested Loop  (cost=0.57..16.62 rows=1 width=43) (actual time=0.014..0.016 rows=1 loops=1)
         ->  Index Scan using projects_pkey on projects  (cost=0.29..8.31 rows=1 width=15) (actual time=0.007..0.008 rows=1 loops=1)
               Index Cond: (id = 1000)
         ->  Index Scan using clients_pkey on clients clients_1  (cost=0.28..8.29 rows=1 width=36) (actual time=0.005..0.005 rows=1 loops=1)
               Index Cond: (id = projects.client_id)
               Filter: (ROW(name) IS DISTINCT FROM NULL)
 Planning Time: 0.219 ms
 Execution Time: 0.056 ms
(9 rows)

"join column is not null" for a single row

Query
EXPLAIN ANALYZE WITH pgrst_source AS
  (SELECT projects.id, "test"."projects"."name",
          "projects_clients_1"."clients_1_json" AS "clients"
   FROM "test"."projects"
   LEFT JOIN LATERAL
     (SELECT row_to_json("clients_1_row".*) AS "clients_1_json",
             "clients_1"."id"
      FROM "test"."clients" AS "clients_1",
      LATERAL (SELECT "clients_1"."name") AS "clients_1_row") AS "projects_clients_1"
     ON "projects_clients_1"."id" = "test"."projects"."client_id"
   WHERE "projects_clients_1"."id" IS NOT NULL)
SELECT NULL::bigint AS total_result_set,
       pg_catalog.count(_postgrest_t) AS page_total,
       coalesce(json_agg(_postgrest_t), '[]') AS body,
       nullif(current_setting('response.headers', TRUE), '') AS response_headers,
       nullif(current_setting('response.status', TRUE), '') AS response_status
FROM
  (SELECT *
   FROM pgrst_source WHERE id=1000) _postgrest_t;
                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=16.63..16.65 rows=1 width=112) (actual time=0.021..0.021 rows=1 loops=1)
   ->  Nested Loop  (cost=0.57..16.62 rows=1 width=43) (actual time=0.013..0.013 rows=1 loops=1)
         ->  Index Scan using projects_pkey on projects  (cost=0.29..8.31 rows=1 width=15) (actual time=0.005..0.006 rows=1 loops=1)
               Index Cond: (id = 1000)
         ->  Index Scan using clients_pkey on clients clients_1  (cost=0.28..8.30 rows=1 width=36) (actual time=0.005..0.005 rows=1 loops=1)
               Index Cond: ((id = projects.client_id) AND (id IS NOT NULL))
 Planning Time: 0.650 ms
 Execution Time: 0.044 ms
(8 rows)

IS DISTINCT FROM NULL for all rows

Query
EXPLAIN ANALYZE
WITH pgrst_source AS
  (SELECT projects.id, "test"."projects"."name",
          row_to_json("projects_clients_1".*) AS "clients"
   FROM "test"."projects"
   LEFT JOIN LATERAL
     (SELECT "clients_1"."name"
      FROM "test"."clients" AS "clients_1"
      WHERE "clients_1"."id" = "test"."projects"."client_id") AS "projects_clients_1" ON TRUE
   WHERE "projects_clients_1" IS DISTINCT FROM NULL)
SELECT NULL::bigint AS total_result_set,
       pg_catalog.count(_postgrest_t) AS page_total,
       coalesce(json_agg(_postgrest_t), '[]') AS body,
       nullif(current_setting('response.headers', TRUE), '') AS response_headers,
       nullif(current_setting('response.status', TRUE), '') AS response_status
FROM
  (SELECT *
   FROM pgrst_source) _postgrest_t;
                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2829.22..2829.24 rows=1 width=112) (actual time=158.937..158.939 rows=1 loops=1)
   ->  Hash Join  (cost=28.48..1834.16 rows=99506 width=43) (actual time=0.533..25.425 rows=100004 loops=1)
         Hash Cond: (projects.client_id = clients_1.id)
         ->  Seq Scan on projects  (cost=0.00..1542.05 rows=100005 width=15) (actual time=0.006..4.866 rows=100005 loops=1)
         ->  Hash  (cost=16.02..16.02 rows=997 width=36) (actual time=0.515..0.516 rows=1002 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 73kB
               ->  Seq Scan on clients clients_1  (cost=0.00..16.02 rows=997 width=36) (actual time=0.007..0.337 rows=1002 loops=1)
                     Filter: (ROW(name) IS DISTINCT FROM NULL)
 Planning Time: 0.287 ms
 Execution Time: 158.977 ms
(10 rows)

"join column is not null" for all rows

Query
EXPLAIN ANALYZE WITH pgrst_source AS
  (SELECT projects.id, "test"."projects"."name",
          "projects_clients_1"."clients_1_json" AS "clients"
   FROM "test"."projects"
   LEFT JOIN LATERAL
     (SELECT row_to_json("clients_1_row".*) AS "clients_1_json",
             "clients_1"."id"
      FROM "test"."clients" AS "clients_1",
      LATERAL (SELECT "clients_1"."name") AS "clients_1_row") AS "projects_clients_1"
     ON "projects_clients_1"."id" = "test"."projects"."client_id"
   WHERE "projects_clients_1"."id" IS NOT NULL)
SELECT NULL::bigint AS total_result_set,
       pg_catalog.count(_postgrest_t) AS page_total,
       coalesce(json_agg(_postgrest_t), '[]') AS body,
       nullif(current_setting('response.headers', TRUE), '') AS response_headers,
       nullif(current_setting('response.status', TRUE), '') AS response_status
FROM
  (SELECT *
   FROM pgrst_source) _postgrest_t;
                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2336.75..2336.77 rows=1 width=112) (actual time=75.254..75.255 rows=1 loops=1)
   ->  Hash Join  (cost=31.05..1836.73 rows=100005 width=39) (actual time=0.493..21.285 rows=100004 loops=1)
         Hash Cond: (projects.client_id = clients_1.id)
         ->  Seq Scan on projects  (cost=0.00..1542.05 rows=100005 width=11) (actual time=0.004..4.369 rows=100005 loops=1)
         ->  Hash  (cost=18.52..18.52 rows=1002 width=36) (actual time=0.485..0.486 rows=1002 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 63kB
               ->  Seq Scan on clients clients_1  (cost=0.00..18.52 rows=1002 width=36) (actual time=0.007..0.376 rows=1002 loops=1)
                     Filter: (id IS NOT NULL)
 Planning Time: 0.241 ms
 Execution Time: 75.278 ms
(10 rows)

Twice as fast compared to the IS DISTINCT approach.

Note that until now, all plans were basically the same. But consider this:

IS NOT DISTINCT FROM NULL for all rows (anti join)

Query
EXPLAIN ANALYZE
WITH pgrst_source AS
  (SELECT projects.id, "test"."projects"."name",
          row_to_json("projects_clients_1".*) AS "clients"
   FROM "test"."projects"
   LEFT JOIN LATERAL
     (SELECT "clients_1"."name"
      FROM "test"."clients" AS "clients_1"
      WHERE "clients_1"."id" = "test"."projects"."client_id") AS "projects_clients_1" ON TRUE
   WHERE "projects_clients_1" IS NOT DISTINCT FROM NULL)
SELECT NULL::bigint AS total_result_set,
       pg_catalog.count(_postgrest_t) AS page_total,
       coalesce(json_agg(_postgrest_t), '[]') AS body,
       nullif(current_setting('response.headers', TRUE), '') AS response_headers,
       nullif(current_setting('response.status', TRUE), '') AS response_status
FROM
  (SELECT *
   FROM pgrst_source) _postgrest_t;
                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1839.22..1839.24 rows=1 width=112) (actual time=24.212..24.214 rows=1 loops=1)
   ->  Hash Left Join  (cost=28.55..1834.22 rows=500 width=43) (actual time=0.347..24.203 rows=1 loops=1)
         Hash Cond: (projects.client_id = clients_1.id)
         Filter: ((ROW(clients_1.name)) IS NOT DISTINCT FROM NULL)
         Rows Removed by Filter: 100004
         ->  Seq Scan on projects  (cost=0.00..1542.05 rows=100005 width=15) (actual time=0.006..6.062 rows=100005 loops=1)
         ->  Hash  (cost=16.02..16.02 rows=1002 width=36) (actual time=0.334..0.335 rows=1002 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 73kB
               ->  Seq Scan on clients clients_1  (cost=0.00..16.02 rows=1002 width=36) (actual time=0.006..0.191 rows=1002 loops=1)
 Planning Time: 0.215 ms
 Execution Time: 24.242 ms
(11 rows)

Note that we still have a left join here for the distinct approach...

"join column is null" for all rows (anti join)

Query
EXPLAIN ANALYZE WITH pgrst_source AS
  (SELECT projects.id, "test"."projects"."name",
          "projects_clients_1"."clients_1_json" AS "clients"
   FROM "test"."projects"
   LEFT JOIN LATERAL
     (SELECT row_to_json("clients_1_row".*) AS "clients_1_json",
             "clients_1"."id"
      FROM "test"."clients" AS "clients_1",
      LATERAL (SELECT "clients_1"."name") AS "clients_1_row") AS "projects_clients_1"
     ON "projects_clients_1"."id" = "test"."projects"."client_id"
   WHERE "projects_clients_1"."id" IS NULL)
SELECT NULL::bigint AS total_result_set,
       pg_catalog.count(_postgrest_t) AS page_total,
       coalesce(json_agg(_postgrest_t), '[]') AS body,
       nullif(current_setting('response.headers', TRUE), '') AS response_headers,
       nullif(current_setting('response.status', TRUE), '') AS response_status
FROM
  (SELECT *
   FROM pgrst_source) _postgrest_t;
                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1948.13..1948.15 rows=1 width=112) (actual time=22.187..22.188 rows=1 loops=1)
   ->  Hash Anti Join  (cost=31.05..1948.12 rows=1 width=39) (actual time=0.736..22.178 rows=1 loops=1)
         Hash Cond: (projects.client_id = clients_1.id)
         ->  Seq Scan on projects  (cost=0.00..1542.05 rows=100005 width=11) (actual time=0.004..6.207 rows=100005 loops=1)
         ->  Hash  (cost=18.52..18.52 rows=1002 width=36) (actual time=0.726..0.726 rows=1002 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 63kB
               ->  Seq Scan on clients clients_1  (cost=0.00..18.52 rows=1002 width=36) (actual time=0.008..0.562 rows=1002 loops=1)
 Planning Time: 0.217 ms
 Execution Time: 22.217 ms
(9 rows)

... but a hash anti join node for the is null approach.

Of course, the performance of this query is roughly the same, because there is only one project without a client. But this shows that PostgreSQL can rewrite and optimize the query better when not using the distinct approach.
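
For reference, the is null form is the classic anti-join pattern, which is what lets the planner pick the Hash Anti Join node above. A minimal sketch of the same idea against the fixture tables, without the embedding machinery (plan shape may vary with version and statistics):

-- Keep only projects that have no matching client. Filtering the left join
-- on the join column being NULL is the pattern the planner can turn into an
-- anti join, whereas the whole-row IS NOT DISTINCT FROM NULL test keeps the
-- left join plus a post-join filter, as the plans above show.
EXPLAIN
SELECT p.id
FROM projects p
LEFT JOIN clients c ON c.id = p.client_id
WHERE c.id IS NULL;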
