[CT-431] Support for Pyspark driver #305
Comments
A linked pull request was created: #308 |
@cccs-jc Really interesting! Thanks for opening the issue and the accompanying PR. We just merged a new connection method. As always, the questions are:
Neat :) We're definitely thinking about what it might look like to implement "dataframe" support in dbt + Spark/Databricks. I don't think a dbt-managed pyspark session is the way we'd want to go, though. |
@JCZuurmond's implementation is very close to mine. The main differences are that mine has these additional features:
- A hook to create a custom spark session
- A hook to register a custom view
Being able to create custom python UDFs and views is important to us. Curious to know what alternatives you are thinking about? |
@cccs-jc : Great minds think alike! 💯
You solve this by first creating the Spark session with the configuration that you prefer.
Why not do this outside of dbt, or right before you call dbt programmatically, with the Spark session approach? The example you give about the json schema is solved by using dbt external tables. If you have improvements to the current implementation, feel free to add changes to it. |
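For context, a minimal sketch of that approach (create the session up front, then invoke dbt from the same process). The config values and the UDF are illustrative assumptions, and the exact programmatic dbt entry point depends on your dbt version, so it is only described in a comment:

```python
from pyspark.sql import SparkSession

# Build the Spark session up front with whatever config and UDFs you need; a
# session-based connection can then pick it up again via getOrCreate().
spark = (
    SparkSession.builder
    .appName("dbt-spark-session")
    .config("spark.sql.shuffle.partitions", "8")  # illustrative config
    .getOrCreate()
)
spark.udf.register("clean_str", lambda s: s.strip().lower() if s else None)

# Then call dbt programmatically from this same process so the active session
# is reused (the programmatic entry point varies by dbt version).
```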
Our initial proof of concept used the approach you mention, that is, a modified dbt launch script which creates a spark context and registers views. The dbt adapter then retrieves the session and executes sql. The issues with this approach are:
I'm aware of dbt-external-tables. Also, I'm not sure it could handle use cases like reading from Azure Kusto: https://docs.microsoft.com/en-us/azure/data-explorer/spark-connector The improvement I would like to contribute is hooks into spark session creation and view creation. |
Hi @cccs-jc, excuse me for the late reply. I want to help you with solving your issue whilst finding a way to not add too-specific code to the code base. Note that I am not able to approve PRs; that is up to the dbt folks.
Spark session creation: will that help you?
Hooks: will that help you?
dbt-external-tables
|
I'm reconsidering the way I've been creating a spark session. One of the issues I have at the moment is that I'm creating a brand new spark session for every dbt run, which is a bit slow. I've come to realize that I could separate the creation of the spark JVM from the creation of the pyspark client. I can launch the PythonGatewayServer and give it a file location to write the listening port and secret of the gateway:

```shell
export _PYSPARK_DRIVER_CONN_INFO_PATH=/tmp/DbtPySparkShell
spark-submit --name DbtPySparkShell pyspark-shell
```

This will start the pyspark-shell JVM and wait for a python client to connect to the gateway. I can then extract the port and secret from this file. Here's an example in python to do that:

```python
from pyspark.serializers import read_int, UTF8Deserializer

conn_info_file = '/tmp/DbtPySparkShell'
info = open(conn_info_file, "rb")
gateway_port = read_int(info)
gateway_secret = UTF8Deserializer().loads(info)
```

Setting these environment variables will tell the pyspark client to connect to the existing gateway:

```shell
export PYSPARK_GATEWAY_PORT=<gateway_port>
export PYSPARK_GATEWAY_SECRET=<gateway_secret>
```

So the new dbt-spark connection could attach to an already running gateway. Note it's important to close the connection; failing to do so will cause the next client to fail to connect. Do you currently close the spark context? This makes it much faster to run dbt commands since the python client only connects to an already running spark JVM. I still like the hook we have to create pyspark-based views. We might just keep this feature in our version of dbt-spark for now. |
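For illustration, a minimal sketch of the client side under the assumptions above. It reuses the internal `pyspark.serializers` helpers already shown; whether `SparkSession.builder.getOrCreate()` honours the `PYSPARK_GATEWAY_PORT`/`PYSPARK_GATEWAY_SECRET` variables depends on the pyspark version:

```python
import os

from pyspark.serializers import read_int, UTF8Deserializer
from pyspark.sql import SparkSession

# Read the port and secret written by the already-running gateway JVM.
with open("/tmp/DbtPySparkShell", "rb") as info:
    gateway_port = read_int(info)
    gateway_secret = UTF8Deserializer().loads(info)

# Point the pyspark client at the existing gateway instead of spawning a new JVM.
os.environ["PYSPARK_GATEWAY_PORT"] = str(gateway_port)
os.environ["PYSPARK_GATEWAY_SECRET"] = gateway_secret

# With the env vars set, this should attach to the running JVM rather than start one.
spark = SparkSession.builder.appName("dbt-pyspark-client").getOrCreate()
print(spark.sql("select 1 as ok").collect())

# Per the discussion above, close the connection when done so the next client
# can connect to the gateway.
spark.stop()
```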
We do not close the Spark session. I think we should add this. |
@cccs-jc : Do you want to contribute closing the Spark connection? And will that solve your questions? |
I don't have the time to contribute this at the moment. Does it resolve my questions? Yes and no. I can work around the limitation of the spark context creation, but I would still like to be able to create "views", so a hook in the source creation is still of interest. For now I'll keep our fork of dbt-spark which includes these hooks. Maybe these will become available in a more general form in dbt core as a pre-hook or something similar. |
Hi all. Is the spark session still not closed? I could potentially help out with this. @JCZuurmond In the cursor class, is the dataframe that's used and its associated spark session safe to close when the cursor is closed? |
Oh, never mind, I wasn't aware that #308 would potentially resolve this. Let me know if I'm wrong. Otherwise, the active spark session can be easily pulled off of the dataframe itself (though I don't know if it's necessarily safe to close it right away). |
It really depends on whether the Spark session associated with the dataframe inside the cursor is going to be used any more once the cursor is exhausted / closed. If there are tests, I can just open a PR and we'll see? |
Sounds like a plan! And the data frame is not used outside the class. I think it fits dbt's standard workflow: after a cursor is closed, no methods are called on the cursor anymore. |
Conceptually, dbt expects to close connections; see dbt-spark/dbt/adapters/spark/connections.py, lines 174 to 182 (at ca1b5b6).
|
Sorry for the late reply on this. Got sidetracked for a bit. I will go ahead and open a PR that just tries to close the Spark Session when the cursor is closed. Given how shared the Spark Session is (typical applications have one Spark Session throughout their whole lifetime, but I can't say for sure yet about dbt-spark), I'm not sure this approach is really going to work, but I think it is worth investigating and giving it a try. |
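A minimal sketch of that idea, assuming a hypothetical `Cursor` that holds the backing DataFrame as `self._df`; the names are illustrative, not dbt-spark's actual internals, and `DataFrame.sparkSession` is only available as a public accessor on newer Spark versions:

```python
from pyspark.sql import DataFrame


class Cursor:
    """Illustrative cursor that owns a DataFrame produced by spark.sql(...)."""

    def __init__(self, df: DataFrame):
        self._df = df

    def close(self) -> None:
        # Pull the active session off the dataframe and stop it, as discussed
        # above. Spark 3.3+ exposes df.sparkSession; older versions go through
        # df.sql_ctx.sparkSession. Whether stopping the shared session here is
        # safe depends on how dbt-spark reuses sessions across cursors.
        if self._df is not None:
            self._df.sparkSession.stop()
            self._df = None
```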
If Spark 2.x weren't supported, there would be a simpler way to do this. Am I correct in my understanding that Spark 2.4.x is still supported? |
@kbendick We are planning to add functionality for this in a forthcoming version. At that point, future versions of dbt-spark should be able to handle this. |
@kbendick : Are you still interested in creating the PR? |
I don't really have time at the moment. I thought closing the cursor was added already? |
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days. |
Describe the feature
Add a fourth connection option to the dbt-adapter. This fourth connection would create a pyspark context and utilize the `spark.sql(<sql>)` function to execute sql statements.
Describe alternatives you've considered
The alternative is to run your own thrift server, which is difficult to set up.
Who will this benefit?
A pyspark context gives you control over the spark application configuration and amount of resources to use in the spark cluster. It also enables the user to register custom pyspark UDF functions and to create custom views based on pyspark Dataframes.
Are you interested in contributing this feature?
Yes for sure. I have an implementation of this feature which I would like to contribute to this project.
Here's how it works from the user's point of view.
Specify a new connection method `pyspark` in `profiles.yml`.
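For illustration, a sketch of what such a profile might look like; apart from `method: pyspark` and `python_module`, which come from the description here, the keys and values are assumptions rather than the adapter's confirmed schema:

```yaml
# profiles.yml -- illustrative only; keys other than `method` and
# `python_module` are assumptions.
my_project:
  target: dev
  outputs:
    dev:
      type: spark
      method: pyspark
      python_module: spark/spark_context.py   # resolved via PYTHONPATH
      schema: analytics
```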
The `python_module` points to a python file found in the PYTHONPATH, in this case `spark/spark_context.py`. The pyspark adapter will call the `create_spark_context()` function found in this file. The user can thus create their own spark context (either a local instance or one that leverages a cluster). This hook also lets you create custom pyspark UDF registrations. An example `spark/spark_context.py` is sketched below.
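A minimal sketch of such a hook module, assuming the adapter uses whatever the function returns as its session; the app name, master, and UDF are illustrative:

```python
# spark/spark_context.py -- illustrative sketch of the proposed hook.
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType


def create_spark_context():
    # Build the session however you like: local for development, or pointed at
    # a cluster.
    spark = (
        SparkSession.builder
        .appName("dbt-pyspark")
        .master("local[*]")
        .getOrCreate()
    )
    # Custom UDF registrations live here so they are visible to all dbt models.
    spark.udf.register("clean_str", lambda s: s.strip().lower() if s else None, StringType())
    return spark
```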
A second hook is available in the sources. This hook lets the user register a custom pyspark Dataframe as an sql view via `df.createOrReplaceTempView(<view name>)`. The user is free to create any Dataframe logic they need, and the hook is configured in `sources.yml`.
The `python_module` is loaded the same way, except that the hook function has a different signature: `def create_dataframe(spark, start_time, end_time)`.
Using pyspark you can work around the limitations of SparkSQL when reading data files. For example, you can supply the schema of the json you are going to read, thus avoiding spark's schema discovery process and making sure that the data is read with the schema you want (see `models/staging/raw_users.py`, sketched below).
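A minimal sketch of what such a hook could look like, using the `create_dataframe` signature described above; the path, schema, and time filter are illustrative assumptions, and the adapter is assumed to register the returned dataframe with `createOrReplaceTempView()`:

```python
# models/staging/raw_users.py -- illustrative sketch of a source hook that reads
# json with an explicit schema instead of relying on Spark's schema inference.
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

RAW_USERS_SCHEMA = StructType([
    StructField("user_id", StringType(), nullable=False),
    StructField("name", StringType(), nullable=True),
    StructField("created_at", TimestampType(), nullable=True),
])


def create_dataframe(spark, start_time, end_time):
    # Read with a fixed schema and restrict to the requested time window.
    return (
        spark.read.schema(RAW_USERS_SCHEMA)
        .json("/data/raw/users/")
        .where(
            (F.col("created_at") >= F.lit(start_time))
            & (F.col("created_at") < F.lit(end_time))
        )
    )
```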
The implementation of this feature consists of a new `PysparkConnectionWrapper`, which executes sql statements via the spark context, and a new method on the `SparkRelation`, which registers pyspark dataframes as sql views. Registration of pyspark views is initiated by a modified source macro. A sketch of the wrapper is shown below.
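To make the shape concrete, a hedged sketch of what a `PysparkConnectionWrapper` could look like; only the class name and the idea of executing sql via the spark context come from the description above, while the method names and cursor shape are assumptions:

```python
# Illustrative sketch only -- not the actual implementation proposed in the PR.
class PysparkConnectionWrapper:
    def __init__(self, spark):
        self.spark = spark   # SparkSession created by create_spark_context()
        self._df = None      # result of the last executed statement
        self._rows = None

    def cursor(self):
        # dbt adapters commonly use a DB-API-like cursor; reuse self for simplicity.
        return self

    def execute(self, sql, bindings=None):
        # Run the statement through the spark context, as described above.
        self._df = self.spark.sql(sql)
        self._rows = None

    def fetchall(self):
        if self._rows is None and self._df is not None:
            self._rows = self._df.collect()
        return self._rows

    def close(self):
        # Per the discussion in the comments, the session should eventually be
        # closed so the gateway is freed for the next client.
        self._df = None
        self._rows = None
```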