fix: Bug when Null Filtering on embedded resources #2951

laurenceisla · 2023-09-15T00:49:45Z

wolfgangwalther · 2023-09-15T07:04:17Z

I'm not sure whether the IS DISTINCT FROM fix is the best way to go here. We always have at least one column that is joined on. The correct check would be to translate table=not.is.null to something like table.join_column IS NOT NULL.

wolfgangwalther · 2023-09-15T07:05:05Z

Or even better... maybe just translate table=not.is.null to an INNER JOIN entirely?

steve-chavez · 2023-09-15T16:18:46Z

Or even better... maybe just translate table=not.is.null to an INNER JOIN entirely?

That might be difficult as joins much change the shape of the query.

Edit: additionally ANTI JOINs and OR queries wouldn't work no more.

steve-chavez · 2023-09-15T16:21:06Z

The correct check would be to translate table=not.is.null to something like table.join_column IS NOT NULL.

What would be the advantage of that one? DISTINCT FROM null is more similar to table=not.is.null.

laurenceisla · 2023-09-15T19:11:11Z

Or even better... maybe just translate table=not.is.null to an INNER JOIN entirely?

It could be possible for anti joins but the OR queries wouldn't work this way. Although I get that it's not 1-1 with PostgreSQL (table=isdistinct.null might be more precise). With that in mind, using is not null gives the errors from the issue: it checks if there's a null value for every column in the row instead of the existence of the row.

wolfgangwalther · 2023-09-18T07:15:45Z

Or even better... maybe just translate table=not.is.null to an INNER JOIN entirely?

That might be difficult as joins much change the shape of the query.

Edit: additionally ANTI JOINs and OR queries wouldn't work no more.

Yes. I didn't say table=is.null should be replaced with a join. Only table=is.not.null.

The correct check would be to translate table=not.is.null to something like table.join_column IS NOT NULL.

What would be the advantage of that one? DISTINCT FROM null is more similar to table=not.is.null.

Performance. I don't think the distinct from approach will be optimized very well, while comparing only the join columns is very likely optimized nicely.

laurenceisla · 2023-09-19T02:55:35Z

I did some benchmarks using pgbench for each of the approaches to the resulting query of:

curl "http://localhost:3000/projects?select=name,clients(name)&clients=not.is.null"

Queries:

-- IS NOT NULL (Before the change)

WITH pgrst_source AS
  (SELECT "test"."projects"."name",
          row_to_json("projects_clients_1".*) AS "clients"
   FROM "test"."projects"
   LEFT JOIN LATERAL
     (SELECT "clients_1"."name"
      FROM "test"."clients" AS "clients_1"
      WHERE "clients_1"."id" = "test"."projects"."client_id") AS "projects_clients_1" ON TRUE
   WHERE "projects_clients_1" IS NOT NULL)
SELECT NULL::bigint AS total_result_set,
       pg_catalog.count(_postgrest_t) AS page_total,
       coalesce(json_agg(_postgrest_t), '[]') AS body,
       nullif(current_setting('response.headers', TRUE), '') AS response_headers,
       nullif(current_setting('response.status', TRUE), '') AS response_status
FROM
  (SELECT *
   FROM pgrst_source) _postgrest_t;


-- IS DISTINCT FROM NULL

WITH pgrst_source AS
  (SELECT "test"."projects"."name",
          row_to_json("projects_clients_1".*) AS "clients"
   FROM "test"."projects"
   LEFT JOIN LATERAL
     (SELECT "clients_1"."name"
      FROM "test"."clients" AS "clients_1"
      WHERE "clients_1"."id" = "test"."projects"."client_id") AS "projects_clients_1" ON TRUE
   WHERE "projects_clients_1" IS DISTINCT FROM NULL)
SELECT NULL::bigint AS total_result_set,
       pg_catalog.count(_postgrest_t) AS page_total,
       coalesce(json_agg(_postgrest_t), '[]') AS body,
       nullif(current_setting('response.headers', TRUE), '') AS response_headers,
       nullif(current_setting('response.status', TRUE), '') AS response_status
FROM
  (SELECT *
   FROM pgrst_source) _postgrest_t;


-- INNER JOIN

WITH pgrst_source AS
  (SELECT "test"."projects"."name",
          row_to_json("projects_clients_1".*) AS "clients"
   FROM "test"."projects"
   INNER JOIN LATERAL
     (SELECT "clients_1"."name"
      FROM "test"."clients" AS "clients_1"
      WHERE "clients_1"."id" = "test"."projects"."client_id") AS "projects_clients_1" ON TRUE)
SELECT NULL::bigint AS total_result_set,
       pg_catalog.count(_postgrest_t) AS page_total,
       coalesce(json_agg(_postgrest_t), '[]') AS body,
       nullif(current_setting('response.headers', TRUE), '') AS response_headers,
       nullif(current_setting('response.status', TRUE), '') AS response_status
FROM
  (SELECT *
   FROM pgrst_source) _postgrest_t;

The `pgbench` for each of the queries:

-- IS NOT NULL (Before the change)

scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
maximum number of tries: 1
duration: 25 s
number of transactions actually processed: 152304
number of failed transactions: 0 (0.000%)
latency average = 0.164 ms
initial connection time = 1.737 ms
tps = 6092.564790 (without initial connection time)


 -- IS DISTINCT FROM NULL

scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
maximum number of tries: 1
duration: 25 s
number of transactions actually processed: 158569
number of failed transactions: 0 (0.000%)
latency average = 0.158 ms
initial connection time = 2.025 ms
tps = 6343.240056 (without initial connection time)


-- INNER JOIN

scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
maximum number of tries: 1
duration: 25 s
number of transactions actually processed: 169379
number of failed transactions: 0 (0.000%)
latency average = 0.148 ms
initial connection time = 1.733 ms
tps = 6775.588488 (without initial connection time)

Performance. I don't think the distinct from approach will be optimized very well, while comparing only the join columns is very likely optimized nicely.

The INNER JOIN is indeed the most performant, but I don't see how it can be easily rewritten to allow the OR between ?or=(table1.not.is.null,table2.not.is.null), for instance.

The correct check would be to translate table=not.is.null to something like table.join_column IS NOT NULL.

I didn't know how to implement this for the test (and allow the OR across joins), just used the IS NOT NULL before the change in this PR.

steve-chavez · 2023-09-19T13:53:22Z

-- IS NOT NULL (Before the change)
tps = 6092.564790 (without initial connection time)

-- IS DISTINCT FROM NULL
tps = 6343.240056 (without initial connection time)

So we getter perf now! 🎉

The INNER JOIN is indeed the most performant, but I don't see how it can be easily rewritten to allow the OR

Yeah, INNER JOIN cannot be done and this goes back to the introduction of top-level filtering: #1949 (comment). Iced-Sun's query helped us there.

wolfgangwalther · 2023-09-21T06:40:03Z

The INNER JOIN is indeed the most performant, but I don't see how it can be easily rewritten to allow the OR between ?or=(table1.not.is.null,table2.not.is.null), for instance.

Right, the OR approach can't be written with JOIN easily. But in my latest comment I didn't say "write as JOIN", but "check only join columns". This is possible to do with OR as well.

The test you did is very artificial, because the tables are really small (few columns!). Once you add a lot more columns, those tests will show that both before the change and after the change with IS DISTINCT are both very slow compared to just testing the join columns.

So we getter perf now!

As I said before this is misleading. The performance is not really better, it's less bad. But not good, yet.

Yeah, INNER JOIN cannot be done

If you test the join columns, PG might be able to rewrite this to an inner join automatically.

laurenceisla · 2023-09-21T17:16:05Z

But in my latest comment I didn't say "write as JOIN", but "check only join columns".

Yes, at the end of the tests I added that I couldn't find a way on how to do this, that's why it's not included there.

If you test the join columns, PG might be able to rewrite this to an inner join automatically.

Cool, then the problem right now is to rewrite the queries to allow filtering on the JOIN columns. It does not seem to be trivial: as it is now, we select from the LEFT JOIN LATERAL subquery, if the join column is not selected by the user ?select=joincol, then it's not included in that subquery, hence the filter cannot be done. We need to find a way around that, if anyone have some ideas I could give them a try.

Edit: opened an issue to keep track of this #2961

laurenceisla added 3 commits September 14, 2023 18:29

Add failing tests

61dedf9

change 'is null' to 'is not distinct from null'

d79a478

Add is.null test

4ffb29c

laurenceisla marked this pull request as ready for review September 15, 2023 02:08

steve-chavez approved these changes Sep 15, 2023

View reviewed changes

laurenceisla merged commit fa182c2 into PostgREST:main Sep 15, 2023
30 checks passed

laurenceisla deleted the fix-notisnull branch September 18, 2023 18:41

laurenceisla mentioned this pull request Sep 18, 2023

Add missing changelog entries #2957

Merged

laurenceisla mentioned this pull request Sep 21, 2023

Make the Null Filtering query more performant #2961

Open

laurenceisla mentioned this pull request Oct 14, 2023

perf: make the Null Filtering query more performant #3009

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: Bug when Null Filtering on embedded resources #2951

fix: Bug when Null Filtering on embedded resources #2951

laurenceisla commented Sep 15, 2023

wolfgangwalther commented Sep 15, 2023

wolfgangwalther commented Sep 15, 2023

steve-chavez commented Sep 15, 2023 •

edited

Loading

steve-chavez commented Sep 15, 2023

laurenceisla commented Sep 15, 2023

wolfgangwalther commented Sep 18, 2023

laurenceisla commented Sep 19, 2023 •

edited

Loading

steve-chavez commented Sep 19, 2023

wolfgangwalther commented Sep 21, 2023

laurenceisla commented Sep 21, 2023 •

edited

Loading

fix: Bug when Null Filtering on embedded resources #2951

fix: Bug when Null Filtering on embedded resources #2951

Conversation

laurenceisla commented Sep 15, 2023

wolfgangwalther commented Sep 15, 2023

wolfgangwalther commented Sep 15, 2023

steve-chavez commented Sep 15, 2023 • edited Loading

steve-chavez commented Sep 15, 2023

laurenceisla commented Sep 15, 2023

wolfgangwalther commented Sep 18, 2023

laurenceisla commented Sep 19, 2023 • edited Loading

steve-chavez commented Sep 19, 2023

wolfgangwalther commented Sep 21, 2023

laurenceisla commented Sep 21, 2023 • edited Loading

steve-chavez commented Sep 15, 2023 •

edited

Loading

laurenceisla commented Sep 19, 2023 •

edited

Loading

laurenceisla commented Sep 21, 2023 •

edited

Loading