I'm using a Presto cluster to process large amounts of data.
To visualize the data, I use the connector recommended by the official Superset documentation (PyHive via SQLAlchemy), with the default connection settings.
When I run a very simple query, "SELECT * FROM test_table", through the PyHive Presto connector, the result set contains fewer rows than the same query executed in the presto-cli app, the official client documented by Presto.
I created two simple Python scripts to test the Presto connection: one using PyHive and one using the official jdbc.jar driver.
The PyHive connector returned an incorrect row count of about 817,000 rows, exactly the number of rows shown in the Superset chart. The script using the official JDBC driver returned the correct amount of data: 875,000 rows.
This suggests the issue is in the PyHive connector. Is it possible to change the connection method from PyHive to the official JDBC driver?
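Before switching drivers, one way to narrow down where the rows go missing is to compare a server-side `SELECT COUNT(*)` with the number of rows the client actually fetches over the same connection: if the counts disagree, the client is dropping rows; if they agree, the server is truncating the result. The sketch below is generic DB-API code, demonstrated here against an in-memory SQLite database as a stand-in (since it works with any DB-API connection); pointing it at the `presto.connect(...)` connection from the first script would apply the same check to PyHive.

```python
import sqlite3  # stand-in for any DB-API connection, e.g. pyhive.presto


def cross_check_row_count(conn, table, batch_size=1000):
    """Compare a server-side COUNT(*) with the rows the client fetches."""
    cur = conn.cursor()
    cur.execute(f"SELECT COUNT(*) FROM {table}")
    server_count = cur.fetchone()[0]

    cur.execute(f"SELECT * FROM {table}")
    fetched = 0
    while True:
        rows = cur.fetchmany(batch_size)
        if not rows:
            break
        fetched += len(rows)
    cur.close()
    return server_count, fetched


if __name__ == "__main__":
    # Demo with SQLite; with Presto you would pass presto.connect(...) instead.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE test_table (x INTEGER)")
    conn.executemany("INSERT INTO test_table VALUES (?)",
                     [(i,) for i in range(2500)])
    server_count, fetched = cross_check_row_count(conn, "test_table")
    print(server_count, fetched)  # a mismatch here would implicate the client
```

With SQLite both numbers match; run against the Presto cluster, a COUNT of 875,000 alongside 817,000 fetched rows would confirm the loss happens on the client side.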
I'm attaching the two Python scripts that I used to reproduce the issue.
```python
# This Python script uses PyHive
from pyhive import presto


def execute_presto_query(host, port, user, catalog, schema, table, max_rows):
    connection = presto.connect(
        host=host,
        port=port,
        username=user,
        catalog=catalog,
        schema=schema,
        session_props={'query_max_output_size': '1TB'},
    )
    total_rows = 0  # defined up front so the finally block can always print it
    try:
        cursor = connection.cursor()
        cursor.execute(f"SELECT * FROM {table}")
        while True:
            rows = cursor.fetchmany(max_rows)
            if not rows:
                break
            for row in rows:
                total_rows += 1
                print(row)
    except Exception as e:
        print("Error executing the query:", e)
    finally:
        print(total_rows)
        cursor.close()
        connection.close()


if __name__ == "__main__":
    host = "localhost"
    port = 30000
    user = "testUser"
    catalog = "pinot"
    schema = "default"
    table = "test_table"
    max_rows = 1000000
    execute_presto_query(host, port, user, catalog, schema, table, max_rows)
```
```python
# This Python script uses the official Presto JDBC driver
import jaydebeapi
import jpype


def execute_presto_query(host, port, user, catalog, schema, table, max_rows):
    jar_file = '/home/admin1/Downloads/presto-jdbc-0.282.jar'
    jpype.startJVM(jpype.getDefaultJVMPath(), "-Djava.class.path=" + jar_file)
    connection_url = f'jdbc:presto://{host}:{port}/{catalog}/{schema}'
    conn = jaydebeapi.connect(
        'com.facebook.presto.jdbc.PrestoDriver',
        connection_url,
        {'user': user},
        jar_file,
    )
    try:
        cursor = conn.cursor()
        cursor.execute(f"SELECT * FROM {table}")
        rows = cursor.fetchall()
        for row in rows:
            print(row)
        print(f"Total rows returned: {len(rows)}")
    except Exception as e:
        print("Error executing the query:", e)
    finally:
        cursor.close()
        conn.close()
        jpype.shutdownJVM()


if __name__ == "__main__":
    host = "localhost"
    port = 30000
    user = "testUsername"
    catalog = "pinot"
    schema = "default"
    table = "test_table"
    max_rows = 1000000
    execute_presto_query(host, port, user, catalog, schema, table, max_rows)
```
Sorry for my late reply, I was on vacation.
The PyHive version is 0.6.2 and the SQLAlchemy version is 2.0.19.
The issue is reproducible with result sets larger than 150,000 records.
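A threshold like 150,000 rows suggests a paging boundary: Presto streams results to clients in multiple batches, and under DB-API semantics `fetchmany()` may return fewer rows than requested without that meaning the result set is exhausted. Any loop (in application code or inside a driver) that treats the first short batch as end-of-data will silently undercount. The stub below is a hypothetical illustration of that failure mode, not a claim about where PyHive's bug actually lives; it contrasts a drain loop that stops only on an empty batch with one that stops at the first short batch.

```python
class UnevenBatchCursor:
    """Hypothetical stand-in for a DB-API cursor whose fetchmany()
    returns uneven batch sizes, as a paging Presto client can."""

    def __init__(self, batches):
        self._batches = list(batches)

    def fetchmany(self, size):
        # Ignores `size`, returning whatever the next page holds.
        return self._batches.pop(0) if self._batches else []


def drain_correct(cursor, size):
    # Correct: only an empty batch marks end-of-data.
    total = 0
    while True:
        rows = cursor.fetchmany(size)
        if not rows:
            break
        total += len(rows)
    return total


def drain_buggy(cursor, size):
    # Buggy: treats the first short batch as end-of-data.
    total = 0
    while True:
        rows = cursor.fetchmany(size)
        total += len(rows)
        if len(rows) < size:
            break
    return total


if __name__ == "__main__":
    batches = [["r"] * 5, ["r"] * 2, ["r"] * 5]  # 12 rows; middle batch is short
    print(drain_correct(UnevenBatchCursor(batches), 5))  # 12
    print(drain_buggy(UnevenBatchCursor(batches), 5))    # 7: rows silently lost
```

The reproduction scripts above already use the correct pattern (break only on an empty batch), which is why a loss inside the driver's own paging would be a plausible place to look next.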