Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A database client should be based on asyncio #141

Open
elrik75 opened this issue Mar 4, 2023 · 17 comments
Open

A database client should be based on asyncio #141

elrik75 opened this issue Mar 4, 2023 · 17 comments
Labels
enhancement New feature or request

Comments

@elrik75
Copy link

elrik75 commented Mar 4, 2023

Major DBs have their asyncio client nowdays:

  • postgresql: asyncpg
  • redis: aioredis
  • kafka: aiokafka
  • etc

It's logical for any I/O access for python to be async.
Note that there is an unmaintained async client for clickhouse: https://github.com/maximdanilchenko/aiochclient
Having an official async client for clickhouse with cython and full-featured will be great !

@elrik75 elrik75 added the enhancement New feature or request label Mar 4, 2023
@genzgd
Copy link
Collaborator

genzgd commented Mar 4, 2023

Agreed, this should have had its own enhancement request/issue before now.

@pkit
Copy link

pkit commented Apr 20, 2023

@elrik75 No, it shouldn't yet.
asyncio is so brain-dead right now that it would need a couple of years and a lot of changes to become usable.

@vsbaldeev
Copy link

Hello everyone!
As I understand official aio client for python is not developed yet.
So, which programming language and library should I use to make asynchronous requests to clickhouse ?

@genzgd
Copy link
Collaborator

genzgd commented Jul 14, 2023

The await/async keywords are part of the Python standard asyncio library and work fine with clickhouse-connect. The main caveat is that ClickHouse doesn't allow concurrent queries in the same session, so you should disable the autogenerate_session_id common setting before calling get_client. See the explanation here: https://clickhouse.com/docs/en/integrations/python#managing-clickhouse-session-ids

@genzgd
Copy link
Collaborator

genzgd commented Jul 14, 2023

To clarify, I see this issue as an enhancement request to either create a clean async wrapper around the library that feels more "async native" or even better, use a true async HTTP client. The urllib3 library doesn't use any async io so in that sense clickhouse-connect doesn't really do things in a real event driven way. But you can still use clickhouse-connect with async code.

@araa47
Copy link

araa47 commented Aug 21, 2023

+1 , feel like this is a required feature for a modern python library

@chrisprobst
Copy link

@elrik75 No, it shouldn't yet. asyncio is so brain-dead right now that it would need a couple of years and a lot of changes to become usable.

Mind elaborating what you mean with brain-dead? It's very usable and used everywhere. What aspect of it is brain-dead and better than thread-blocking urllib3 (which means much more brain-dead to me)?

@zerafachris
Copy link

Hi @elrik75 @araa47 @genzgd @vsbaldeev ,
there seems to be a package which wraps on the old clickhouse driver here https://github.com/long2ice/asynch .
Maybe this could help?

@pkit
Copy link

pkit commented May 16, 2024

Mind elaborating what you mean with brain-dead? It's very usable and used everywhere. What aspect of it is brain-dead and better than thread-blocking urllib3 (which means much more brain-dead to me)?

See here: https://stackoverflow.com/questions/46827007/runtimeerror-this-event-loop-is-already-running-in-python

Essentially brain-dead is the "all or nothing" implementation. No project can be converted to async ever without a complete rewrite of that project and all of the libraries it uses. And the libraries that use these libraries.
No nested tasks are allowed, no gradual I/O conversion, full rewrite only.

@elrik75
Copy link
Author

elrik75 commented May 16, 2024

Essentially brain-dead is the "all or nothing" implementation.

That's correct. The point is more: what kind of DB client should CH support then? sync or async? What kind makes more sense? My point of view (because I work on backend servers) is that a sync IO lib is not very helpful.

@genzgd
Copy link
Collaborator

genzgd commented May 18, 2024

@alexted We in the process of defining the roadmap for clickhouse-connect and other integrations, but for the moment this work is not planned for the immediate future. The current thinking is that true async support would involve swapping out to the httpx HTTP client with integrated async support (instead of the venerable urllib3), but that is of course a fairly major change.

To be clear, this is definitely work we want to do, but it may take a while to come to the top of the priority list given all of the work on other projects. As always, community contributions to help us get moving in the right direction are always appreciated. :)

@samuelcolvin
Copy link

samuelcolvin commented Jun 13, 2024

In case anyone is struggling with this, if the postgres connection port is open, you can connect to clickhouse with psycopg3:

    conn = await psycopg.AsyncConnection.connect(
        dbname='default',
        user='default',
        password='...',
        host='localhost',
        port='9005',
        cursor_factory=psycopg.AsyncClientCursor,
        autocommit=True,
    )

You need cursor_factory=psycopg.AsyncClientCursor, autocommit=True to get psycopg to use the simple postgres protocol since clickhouse doesn't support the extended protocol, for the same reason you can't use asyncpg.

All data types seem to be returned as strings, but it's better than nothing (and pydantic does a pretty good job of coercing data to an expected type)

@genzgd
Copy link
Collaborator

genzgd commented Jun 13, 2024

I've added an example of how to run clickhouse-connect queries asynchronously (it includes a semaphore just for fun as well). Based on very limited testing, this solution still significantly outperforms other ClickHouse "async" libraries. You're still running on a single Python thread at a time, but the HTTP requests are obviously very much I/O bound and the Cython transform code is extremely fast, so the GIL isn't as big a problem as you might think.

If you do experiment with similar solutions, please report your results (good or bad) here.

@slvrtrn
Copy link
Contributor

slvrtrn commented Jul 9, 2024

Since 0.7.16, ClickHouse-Connect provides a convenience AsyncClient wrapper over the standard Client, so that it is no longer required to write your own. See this new entry in the docs. The async usage example was also updated.

@samuelcolvin
Copy link

samuelcolvin commented Jul 10, 2024

Since 0.7.16, ClickHouse-Connect provides a convenience AsyncClient wrapper over the standard Client, so that it is no longer required to write your own. See this new entry in the docs. The async usage example was also updated.

This is very unfortunate. As I think I said to @tbragin, pretending to have any async client by just running the sync code in a thread pool is an own goal for ClickHouse. It'll often be slower than just using the sync client. You'd be better off either recommending loop.run_in_executor or the HTTP API and httpx.

def _query():
return self.client.query(query=query, parameters=parameters, settings=settings, query_formats=query_formats,
column_formats=column_formats, encoding=encoding, use_none=use_none,
column_oriented=column_oriented, use_numpy=use_numpy, max_str_len=max_str_len,
context=context, query_tz=query_tz, column_tzs=column_tzs,
external_data=external_data)
loop = asyncio.get_running_loop()
result = await loop.run_in_executor(None, _query)
return result

To actually solve this properly, you need to either:

  • implement a real async client from the ground up
  • or (preferable) support Arrow Flight SQL in ClickHouse, and then rely on a real high quality Arrow Flight SQL client library in Python, there's already a Rust client for Arrow Flight SQL, so building a Python one should be pretty easy

@mshustov
Copy link
Member

mshustov commented Jul 10, 2024

@samuelcolvin, As you can see, the issue is not closed. We will support asyncio natively, however the implementation will take time. AsyncClient wrapper is created to close the gap already today

@genzgd
Copy link
Collaborator

genzgd commented Jul 10, 2024

@samuelcolvin Putting aside the Arrow Flight question for the moment, I'm not sure what problem you are trying to "solve". Queries run using the wrapper yield the asyncio event loop when waiting for the HTTP I/O. So your main thread will spend its time in the CPU bound parts of the query (generally, transforming data). The fact that urllib3 itself doesn't have an async API doesn't change how the asyncio event loop behaves while waiting for network data.

I can assure you that using clickhouse-connect with the async wrapper will outperform any code using a client based on httpx and a text format like JSON or CSV. So if you are getting an async API that yields the main event loop thread while waiting for I/O, and has high performance, what exactly is missing?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

10 participants