Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Query building performance discussion #435

Open
tomjaguarpaw opened this issue Jun 30, 2019 · 14 comments
Open

Query building performance discussion #435

tomjaguarpaw opened this issue Jun 30, 2019 · 14 comments

Comments

@tomjaguarpaw
Copy link
Owner

Migrating from #434

@tomjaguarpaw
Copy link
Owner Author

I would be very happy to have benchmarks in the CI. That would be a good way of catching performance regressions early. I'm also open to improving rendering performance by switching away from String. What exactly did you have in mind? I'm surprised that the pretty package is not already optimised as much as possible!

@csasarak
Copy link

csasarak commented Jun 30, 2019

I had a look at the pretty source code and it appears that while the Doc type is a rope structure, when it comes to rendering it uses Strings, which have the O(n) cost of appending. I'm actually not too surprised, the documentation says its well suited to things like compilers and if the whole point is to generate user-facing messages then the slowness probably isn't worth fixing. That SPJ did a lot of work on it suggests to me also that this is the case, given his work on GHC.

I was running into some trouble with the more complex queries in my program at work (I do data warehousing/analytics), what I found is that if I dump the generated code into postgres it runs quite quickly, but for whatever reason opaleye was taking a long time. When I profiled, most of the time was attributed to formatAndShowSql and the only CAF that it creates is the query String. You can see an example of flamegraph (made with ghc-prof-flamegraph) showing this here. This graph shows where time is spent proportionally, but for larger queries it can turn into a lot of real time.

For most of my simpler queries this isn't much of a problem, but I have a few which aggregate data across multiple tables, with multiple left-joins, inner joins, and unions. Opaleye is really great because I can abstract a lot of the common behavior into query combinators, but maybe this is abusive?

In any case, you can see a really rough non-working example of what I'm envisioning here: https://github.com/csasarak/haskell-opaleye/tree/bytestrings It's basically a direct copy of the existing rendering code, just with bytestring's Builder type. One thing it doesn't do is print a pretty version though.

The bytestring builders have O(1) append and postgresql-simple appears to be able to handle them directly. I can try to work on an example in the benchmark repo that I worked in for the unionAll question. I'd like to convince someone else that this is indeed a problem or that it can be done better before pushing too hard for change.

@tomjaguarpaw
Copy link
Owner Author

Opaleye is really great because I can abstract a lot of the common behavior into query combinators, but maybe this is abusive?

This is exactly what Opaleye is designed for so this is definitely a use case I want to perform well.

Could it be possible just to rewrite the internals of pretty to use bytestring, or come up with a roughly API-compatible version that does? That sounds like the right place to fix the problem and would benefit the community.

I will take a look at your analysis and example later.

@csasarak
Copy link

That is also possible, I was looking for a render function from Doc -> Text but could not find one - text also doesn't seem to have the O(1) api for appending. There does seem to be some work in this direction: haskell/pretty#42 and haskell/pretty#36.

I will familiarize myself more with the pretty library in order to know what writing an API compatible version would look like. I'm hoping I'm understanding what's happening correctly.

@ocharles
Copy link
Collaborator

You may want to explore the prettyprinter library - imo it's best in class

@csasarak
Copy link

csasarak commented Jun 30, 2019 via email

@tomjaguarpaw
Copy link
Owner Author

Thanks @ocharles. I happy to change library if there's something with better performance.

@ocharles
Copy link
Collaborator

I'm going to do some work on this, because I'm also getting hit by the performance of formatAndShowSQL:

image

This is from a Hedgehog test run - it takes 20s to run 100 tests, and the majority of this time is just formatting queries - this is pretty mad!

@csasarak
Copy link

Hey, I apologize for not working more diligently on this. The company I was working for closed-up so I've been busy working on other things. So, anecodotally PR #436 shows a bit of a decrease in time when we switch to prettyprinter.

I have more time recently and since there's interest here I'm happy to try to help out if you like. I have a few very simple benchmarks which might be useful here: https://github.com/csasarak/haskell-opaleye/tree/bench
The main thing they try to do is a few successive unions or joins since this seems to be where queries start getting bigger. In particular, I recall seeing a lot of time spent in the commaV function, I think where it would be concatenating column names.

@tomjaguarpaw
Copy link
Owner Author

Great, it would be good to get some more insight into what is causing the slow performance.

@ocharles
Copy link
Collaborator

Investigations so far:

  • prettyprinter doesn't make a difference
  • I defined Opaleye.Pretty, which has just the pretty printing API we need, but just uses Endo [String] and doesn't actually do any pretty printing at all (just concatenates strings). This is barely faster than the pretty printer we already have.

This could be tougher than we think!

@tomjaguarpaw
Copy link
Owner Author

tomjaguarpaw commented Feb 20, 2020

Is it possible that generating the SQL expression tree from the PrimQuery expression tree is the thing that is taking a lot of time, and it is wrongly being attributed to showSqlExplicit?

@ocharles
Copy link
Collaborator

I don't think so, but I'll play around with deepseq and see if giving that a fully evaluated thing to print changes anything

@saurabhnanda
Copy link
Contributor

I think in view of #476 this issue should be re-opened. Probably prepare if for Hackoberfest so that we can tackle this as a community and get it fixed once and for all.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants