Make the Null Filtering query more performant #2961

Open
laurenceisla opened this issue Sep 21, 2023 · 1 comment

@laurenceisla
Member

After the discussion that started in #2951 (comment), the conclusion is that changing table IS DISTINCT FROM NULL to table.join_column IS NOT NULL should be more performant, because PostgreSQL would treat the latter as an INNER JOIN.
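
Stripped of the PostgREST-specific machinery (the json projection and the CTE wrapper), the two filter shapes look roughly like this. This is only a minimal sketch against the projects/clients fixture tables used in the examples below; the exact plans depend on the PostgreSQL version and statistics:

-- Current form: the filter tests the whole row produced by the embedding subquery.
EXPLAIN
SELECT p.id
FROM projects p
LEFT JOIN LATERAL
  (SELECT c.name FROM clients c WHERE c.id = p.client_id) AS pc ON TRUE
WHERE pc IS DISTINCT FROM NULL;

-- Proposed form: the filter tests only the join column, a plain
-- null-rejecting predicate on the nullable side of the left join.
EXPLAIN
SELECT p.id
FROM projects p
LEFT JOIN LATERAL
  (SELECT c.id, c.name FROM clients c WHERE c.id = p.client_id) AS pc ON TRUE
WHERE pc.id IS NOT NULL;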

Perhaps we need to change how we build the embedding subqueries to allow this. The IS DISTINCT FROM filter is generated here:

pgFmtFilter _ (CoercibleFilterNullEmbed hasNot fld) = pgFmtIdent fld <> " IS " <> (if not hasNot then "NOT " else mempty) <> "DISTINCT FROM NULL"

@wolfgangwalther
Member

Some examples. I'm adding this to my fixtures/data.sql file:

INSERT INTO test.clients (id, name) SELECT n+2, 'Test ' || n FROM generate_series(1,1000) AS n;

INSERT INTO test.projects (id, name, client_id) SELECT (n-1)*100+m+5, 'Test ' || m, n+2 FROM generate_series(1,1000) AS n, LATERAL generate_series(1,100) AS m;

CREATE INDEX projects_client_id ON projects (client_id);

Not a lot of data (roughly 1,000 clients and 100,000 projects), but it's enough to show the difference.

IS DISTINCT FROM NULL for a single row

Query
EXPLAIN ANALYZE
WITH pgrst_source AS
  (SELECT projects.id, "test"."projects"."name",
          row_to_json("projects_clients_1".*) AS "clients"
   FROM "test"."projects"
   LEFT JOIN LATERAL
     (SELECT "clients_1"."name"
      FROM "test"."clients" AS "clients_1"
      WHERE "clients_1"."id" = "test"."projects"."client_id") AS "projects_clients_1" ON TRUE
   WHERE "projects_clients_1" IS DISTINCT FROM NULL)
SELECT NULL::bigint AS total_result_set,
       pg_catalog.count(_postgrest_t) AS page_total,
       coalesce(json_agg(_postgrest_t), '[]') AS body,
       nullif(current_setting('response.headers', TRUE), '') AS response_headers,
       nullif(current_setting('response.status', TRUE), '') AS response_status
FROM
  (SELECT *
   FROM pgrst_source WHERE id=1000) _postgrest_t;
                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=16.63..16.65 rows=1 width=112) (actual time=0.028..0.029 rows=1 loops=1)
   ->  Nested Loop  (cost=0.57..16.62 rows=1 width=43) (actual time=0.014..0.016 rows=1 loops=1)
         ->  Index Scan using projects_pkey on projects  (cost=0.29..8.31 rows=1 width=15) (actual time=0.007..0.008 rows=1 loops=1)
               Index Cond: (id = 1000)
         ->  Index Scan using clients_pkey on clients clients_1  (cost=0.28..8.29 rows=1 width=36) (actual time=0.005..0.005 rows=1 loops=1)
               Index Cond: (id = projects.client_id)
               Filter: (ROW(name) IS DISTINCT FROM NULL)
 Planning Time: 0.219 ms
 Execution Time: 0.056 ms
(9 rows)

"join column is not null" for a single row

Query
EXPLAIN ANALYZE WITH pgrst_source AS
  (SELECT projects.id, "test"."projects"."name",
          "projects_clients_1"."clients_1_json" AS "clients"
   FROM "test"."projects"
   LEFT JOIN LATERAL
     (SELECT row_to_json("clients_1_row".*) AS "clients_1_json",
             "clients_1"."id"
      FROM "test"."clients" AS "clients_1",
      LATERAL (SELECT "clients_1"."name") AS "clients_1_row") AS "projects_clients_1"
     ON "projects_clients_1"."id" = "test"."projects"."client_id"
   WHERE "projects_clients_1"."id" IS NOT NULL)
SELECT NULL::bigint AS total_result_set,
       pg_catalog.count(_postgrest_t) AS page_total,
       coalesce(json_agg(_postgrest_t), '[]') AS body,
       nullif(current_setting('response.headers', TRUE), '') AS response_headers,
       nullif(current_setting('response.status', TRUE), '') AS response_status
FROM
  (SELECT *
   FROM pgrst_source WHERE id=1000) _postgrest_t;
                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=16.63..16.65 rows=1 width=112) (actual time=0.021..0.021 rows=1 loops=1)
   ->  Nested Loop  (cost=0.57..16.62 rows=1 width=43) (actual time=0.013..0.013 rows=1 loops=1)
         ->  Index Scan using projects_pkey on projects  (cost=0.29..8.31 rows=1 width=15) (actual time=0.005..0.006 rows=1 loops=1)
               Index Cond: (id = 1000)
         ->  Index Scan using clients_pkey on clients clients_1  (cost=0.28..8.30 rows=1 width=36) (actual time=0.005..0.005 rows=1 loops=1)
               Index Cond: ((id = projects.client_id) AND (id IS NOT NULL))
 Planning Time: 0.650 ms
 Execution Time: 0.044 ms
(8 rows)

IS DISTINCT FROM NULL for all rows

Query
EXPLAIN ANALYZE
WITH pgrst_source AS
  (SELECT projects.id, "test"."projects"."name",
          row_to_json("projects_clients_1".*) AS "clients"
   FROM "test"."projects"
   LEFT JOIN LATERAL
     (SELECT "clients_1"."name"
      FROM "test"."clients" AS "clients_1"
      WHERE "clients_1"."id" = "test"."projects"."client_id") AS "projects_clients_1" ON TRUE
   WHERE "projects_clients_1" IS DISTINCT FROM NULL)
SELECT NULL::bigint AS total_result_set,
       pg_catalog.count(_postgrest_t) AS page_total,
       coalesce(json_agg(_postgrest_t), '[]') AS body,
       nullif(current_setting('response.headers', TRUE), '') AS response_headers,
       nullif(current_setting('response.status', TRUE), '') AS response_status
FROM
  (SELECT *
   FROM pgrst_source) _postgrest_t;
                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2829.22..2829.24 rows=1 width=112) (actual time=158.937..158.939 rows=1 loops=1)
   ->  Hash Join  (cost=28.48..1834.16 rows=99506 width=43) (actual time=0.533..25.425 rows=100004 loops=1)
         Hash Cond: (projects.client_id = clients_1.id)
         ->  Seq Scan on projects  (cost=0.00..1542.05 rows=100005 width=15) (actual time=0.006..4.866 rows=100005 loops=1)
         ->  Hash  (cost=16.02..16.02 rows=997 width=36) (actual time=0.515..0.516 rows=1002 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 73kB
               ->  Seq Scan on clients clients_1  (cost=0.00..16.02 rows=997 width=36) (actual time=0.007..0.337 rows=1002 loops=1)
                     Filter: (ROW(name) IS DISTINCT FROM NULL)
 Planning Time: 0.287 ms
 Execution Time: 158.977 ms
(10 rows)

"join column is not null" for all rows

Query
EXPLAIN ANALYZE WITH pgrst_source AS
  (SELECT projects.id, "test"."projects"."name",
          "projects_clients_1"."clients_1_json" AS "clients"
   FROM "test"."projects"
   LEFT JOIN LATERAL
     (SELECT row_to_json("clients_1_row".*) AS "clients_1_json",
             "clients_1"."id"
      FROM "test"."clients" AS "clients_1",
      LATERAL (SELECT "clients_1"."name") AS "clients_1_row") AS "projects_clients_1"
     ON "projects_clients_1"."id" = "test"."projects"."client_id"
   WHERE "projects_clients_1"."id" IS NOT NULL)
SELECT NULL::bigint AS total_result_set,
       pg_catalog.count(_postgrest_t) AS page_total,
       coalesce(json_agg(_postgrest_t), '[]') AS body,
       nullif(current_setting('response.headers', TRUE), '') AS response_headers,
       nullif(current_setting('response.status', TRUE), '') AS response_status
FROM
  (SELECT *
   FROM pgrst_source) _postgrest_t;
                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2336.75..2336.77 rows=1 width=112) (actual time=75.254..75.255 rows=1 loops=1)
   ->  Hash Join  (cost=31.05..1836.73 rows=100005 width=39) (actual time=0.493..21.285 rows=100004 loops=1)
         Hash Cond: (projects.client_id = clients_1.id)
         ->  Seq Scan on projects  (cost=0.00..1542.05 rows=100005 width=11) (actual time=0.004..4.369 rows=100005 loops=1)
         ->  Hash  (cost=18.52..18.52 rows=1002 width=36) (actual time=0.485..0.486 rows=1002 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 63kB
               ->  Seq Scan on clients clients_1  (cost=0.00..18.52 rows=1002 width=36) (actual time=0.007..0.376 rows=1002 loops=1)
                     Filter: (id IS NOT NULL)
 Planning Time: 0.241 ms
 Execution Time: 75.278 ms
(10 rows)

Twice as fast compared to the IS DISTINCT approach.

Note that until now, all plans were basically the same. But consider this:

IS NOT DISTINCT FROM NULL for all rows (anti join)

Query
EXPLAIN ANALYZE
WITH pgrst_source AS
  (SELECT projects.id, "test"."projects"."name",
          row_to_json("projects_clients_1".*) AS "clients"
   FROM "test"."projects"
   LEFT JOIN LATERAL
     (SELECT "clients_1"."name"
      FROM "test"."clients" AS "clients_1"
      WHERE "clients_1"."id" = "test"."projects"."client_id") AS "projects_clients_1" ON TRUE
   WHERE "projects_clients_1" IS NOT DISTINCT FROM NULL)
SELECT NULL::bigint AS total_result_set,
       pg_catalog.count(_postgrest_t) AS page_total,
       coalesce(json_agg(_postgrest_t), '[]') AS body,
       nullif(current_setting('response.headers', TRUE), '') AS response_headers,
       nullif(current_setting('response.status', TRUE), '') AS response_status
FROM
  (SELECT *
   FROM pgrst_source) _postgrest_t;
                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1839.22..1839.24 rows=1 width=112) (actual time=24.212..24.214 rows=1 loops=1)
   ->  Hash Left Join  (cost=28.55..1834.22 rows=500 width=43) (actual time=0.347..24.203 rows=1 loops=1)
         Hash Cond: (projects.client_id = clients_1.id)
         Filter: ((ROW(clients_1.name)) IS NOT DISTINCT FROM NULL)
         Rows Removed by Filter: 100004
         ->  Seq Scan on projects  (cost=0.00..1542.05 rows=100005 width=15) (actual time=0.006..6.062 rows=100005 loops=1)
         ->  Hash  (cost=16.02..16.02 rows=1002 width=36) (actual time=0.334..0.335 rows=1002 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 73kB
               ->  Seq Scan on clients clients_1  (cost=0.00..16.02 rows=1002 width=36) (actual time=0.006..0.191 rows=1002 loops=1)
 Planning Time: 0.215 ms
 Execution Time: 24.242 ms
(11 rows)

Note that we still have a left join here for the distinct approach...

"join column is null" for all rows (anti join)

Query
EXPLAIN ANALYZE WITH pgrst_source AS
  (SELECT projects.id, "test"."projects"."name",
          "projects_clients_1"."clients_1_json" AS "clients"
   FROM "test"."projects"
   LEFT JOIN LATERAL
     (SELECT row_to_json("clients_1_row".*) AS "clients_1_json",
             "clients_1"."id"
      FROM "test"."clients" AS "clients_1",
      LATERAL (SELECT "clients_1"."name") AS "clients_1_row") AS "projects_clients_1"
     ON "projects_clients_1"."id" = "test"."projects"."client_id"
   WHERE "projects_clients_1"."id" IS NULL)
SELECT NULL::bigint AS total_result_set,
       pg_catalog.count(_postgrest_t) AS page_total,
       coalesce(json_agg(_postgrest_t), '[]') AS body,
       nullif(current_setting('response.headers', TRUE), '') AS response_headers,
       nullif(current_setting('response.status', TRUE), '') AS response_status
FROM
  (SELECT *
   FROM pgrst_source) _postgrest_t;
                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1948.13..1948.15 rows=1 width=112) (actual time=22.187..22.188 rows=1 loops=1)
   ->  Hash Anti Join  (cost=31.05..1948.12 rows=1 width=39) (actual time=0.736..22.178 rows=1 loops=1)
         Hash Cond: (projects.client_id = clients_1.id)
         ->  Seq Scan on projects  (cost=0.00..1542.05 rows=100005 width=11) (actual time=0.004..6.207 rows=100005 loops=1)
         ->  Hash  (cost=18.52..18.52 rows=1002 width=36) (actual time=0.726..0.726 rows=1002 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 63kB
               ->  Seq Scan on clients clients_1  (cost=0.00..18.52 rows=1002 width=36) (actual time=0.008..0.562 rows=1002 loops=1)
 Planning Time: 0.217 ms
 Execution Time: 22.217 ms
(9 rows)

... but a hash anti join node for the is null approach.

Of course, the performance of this query is roughly the same, because there is only one project without a client. But this shows that PostgreSQL can rewrite and optimize the query better when not using the distinct approach.
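
For reference, the is null form is the classic anti-join pattern, which is what lets the planner pick the Hash Anti Join node above. A minimal sketch of the same idea against the fixture tables, without the embedding machinery (plan shape may vary with version and statistics):

-- Keep only projects that have no matching client. Filtering the left join
-- on the join column being NULL is the pattern the planner can turn into an
-- anti join, whereas the whole-row IS NOT DISTINCT FROM NULL test keeps the
-- left join plus a post-join filter, as the plans above show.
EXPLAIN
SELECT p.id
FROM projects p
LEFT JOIN clients c ON c.id = p.client_id
WHERE c.id IS NULL;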
