Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[batch] make batches query go brrrrrrr (for realsies) #14649

Merged

Conversation

ehigham
Copy link
Member

@ehigham ehigham commented Aug 1, 2024

#14629 improved the speed of listing a user's batches but introduced a large
regression for listing all batches readble by that user. This change fixes that
regression by making use index hints and STRAIGHT_JOINs.

The index hint tells MySQL to never consider the index batches_deleted as it
has very low cardinality. In some forms of this query, the planner tries to use
it to its peril.

A problem in query 0 with #14629 (see below) was that fewer filters on batches
made the optimiser consider joins in a suboptimal order - it did a table scan
on job_groups first then sorted the results by to batches.id DESC instead
of doing an index scan on batches in reverse.

Using STRAIGHT_JOINs instead of INNER JOIN mades the optimiser start from
batches and read its index in reverse before considering other tables in
subsequent joins. From the documentation:

STRAIGHT_JOIN is similar to JOIN, except that the left table is always read
before the right table. This can be used for those (few) cases for which the
join optimizer processes the tables in a suboptimal order.

This is advantageous for a couple of reasons:

  • We want to list newer batches first
  • For this query, the batches table has more applicables indexes
  • We want the variable to order by to be in the primary key of the first
    table so we can read the index in reverse

Before and after timings, collected by running the query 5 times, then using
profiles gathered by MySQL.

+-------+---------------------------------------------------*
| query |  description                                      |                                                                                                                                                                                                                                                         
+-------+---------------------------------------------------+
|     0 | All batches accessible to user `ci`               |
|     1 | All batches accessible to user `ci` owned by `ci` |
+-------+---------------------------------------------------*

+-------+--------+--------------------------------------------------------+------------+------------+
| query | branch | timings                                                |    mean    |    stdev   |                                                                                                                                                                                                                                             
+-------+--------+--------------------------------------------------------+------------+------------+
|     0 |  main  | 0.05894400,0.05207850,0.07067875,0.06281800,0.060250   | 0.06095385 | 0.00602255 |
|     1 |  main  | 14.1106150,12.2619323,13.8442850,12.0749633,14.0297822 | 13.2643156 | 0.90087263 |
+-------+--------+--------------------------------------------------------+------------+------------+
|     0 |   PR   | 0.04717375,0.04974350,0.04312150,0.04070850,0.04193650 | 0.04453675 | 0.00339069 |
|     1 |   PR   | 0.04423925,0.03967550,0.03935425,0.04056875,0.05286850 | 0.04334125 | 0.00507140 |
+-------+--------+--------------------------------------------------------+------------+------------+

I'm hopeful that this won't introduce regressions for most use cases. While I
haven't benchmarked other queries, the MySQL client does feel more responsive
for a wider array of users. One notable exception is for the user dking who
owns 3.7x more batches than has access to, of which all have been deleted. I
don't think this is a common enough use case to make this query even more
complicated than it already is.

Resolves #14599

@ehigham
Copy link
Member Author

ehigham commented Aug 2, 2024

image

@ehigham ehigham marked this pull request as ready for review August 2, 2024 18:42
@ehigham ehigham requested review from iris-garden and removed request for chrisvittal August 2, 2024 19:21
@cjllanwarne
Copy link
Collaborator

@ehigham ehigham requested a review from iris-garden August 6, 2024 19:11
Copy link
Collaborator

@cjllanwarne cjllanwarne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a good reinterpretation of the original PR, with more specific control over the index being used and the ordering of the joins.

Since this is some specific customization based on knowledge of the underlying data, it may be worth documenting in the PR description (or code? or docs?) why this particular ordering (and choice of indexes) is important?

Comment on lines +130 to +136
, cancelled_t.cancelled IS NOT NULL AS cancelled
, job_groups_n_jobs_in_complete_states.n_completed
, job_groups_n_jobs_in_complete_states.n_succeeded
, job_groups_n_jobs_in_complete_states.n_failed
, job_groups_n_jobs_in_complete_states.n_cancelled
, cost_t.cost
, cost_t.cost_breakdown
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Presumably this layout makes adding and removing lines easier? But did you intend to keep this in the final version?

Copy link
Member Author

@ehigham ehigham Aug 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did. MySQL (rightly) does not support trailing commas. This is the One True Way (TM) to do split comma-separated lists.

, job_groups_n_jobs_in_complete_states.n_cancelled
, cost_t.cost
, cost_t.cost_breakdown
FROM batches IGNORE INDEX (batches_deleted)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a slight forward-looking concern over introducing so much mysql-specific syntax to the query. I have a suspicion that postgres support might be a useful option to have in certain potential terra futures. Not enough to block this, but enough to make me mildly uneasy...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I remember correctly, rawls and sam use MySQL (or cloud sql). Anyway, all that is kind of irrelevant as

  • different databases have different optimisers and may plan queries differently
  • many of the queries in batch are very tuned to MySQL (eg STRAIGHT_JOIN, LATERAL, (a, b, ..) IN Expression, etc)
  • we'd therefore have to consider using different DBMSs very carefully.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Point well taken that this is not "the thin end of the wedge" and more like "a pattern already throughout batch". Point also well taken that adding support is not necessarily a trivial thing to do. So 👍 for following the pattern here.

For more context on my thoughts here - there are certainly 'ways to terra' that do not go through psql, but our terra-on-azure app, for example, cannot have a managed database because of this reliance, though maybe that's more because it was never added than some fundamental blocker

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I have a natural prior that 'all this optimization for mysql' is just not as necessary in psql because their optimizer is less likely to do the weird things that mysql does. But there's not data behind that, just biases.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

our terra-on-azure app, for example, cannot have a managed database because of this reliance, though maybe that's more because it was never added than some fundamental blocker

That's good to know for future work, thanks for sharing.

@ehigham
Copy link
Member Author

ehigham commented Aug 7, 2024

This looks like a good reinterpretation of the original PR, with more specific control over the index being used and the ordering of the joins.

Since this is some specific customization based on knowledge of the underlying data, it may be worth documenting in the PR description (or code? or docs?) why this particular ordering (and choice of indexes) is important?

Good idea. I'll do that.

@hail-ci-robot hail-ci-robot merged commit 7d25779 into hail-is:main Aug 8, 2024
4 checks passed
@ehigham ehigham deleted the ehigham/batches-go-brrrrr-take-2 branch August 8, 2024 23:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[batch] Searching in the batches UI is slow for users with many batches
4 participants