Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extensions API for third-party table/catalog providers #43

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

ccciudatu
Copy link

@ccciudatu ccciudatu commented Nov 12, 2024

Add support for third party table/catalog providers, to allow integrations with databases, table formats etc. to be enabled with dedicated features.
This patch adds an extensions API and a first extension for Flight SQL tables.
As a side note, the Delta Lake extension can be added once the datafusion dependency in the official project is upgraded to 42+.


/// A composite extension registry for enabled extensions.
#[derive(Debug)]
pub(crate) struct Extensions(Box<[Box<dyn Extension>]>);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not Vec<Box<dyn Extension>>? What is the advantage of Box<[T]>?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit lighter (hardly relevant in this scenario) and it's immutable by design (arguably also irrelevant, though I personally prefer it when I can choose). I think I got this habit after watching this video. :)
But I don't mind switching to Vec if you think that would improve readability.

@ccciudatu ccciudatu marked this pull request as ready for review November 25, 2024 12:16
@ccciudatu
Copy link
Author

Rebased on top of the main branch and extracted the ShuffleCodec changes as a separate PR marked as a dependency for the one here.

Comment on lines +51 to +72
if let Ok(py_plan) = py_plan.to_object(py).downcast_bound::<PyExecutionPlan>(py) {
// For session contexts created with datafusion_ray.extended_session_context(), the inner
// execution plan can be used as such (and the enabled extensions are all available).
Ok(py_plan.borrow().plan.clone())
} else {
// The session context originates from outside our library, so we'll grab the protobuf plan
// by calling the python method with no extension codecs.
let py_proto = py_plan.call_method0("to_proto")?;
let plan_bytes: &[u8] = py_proto.extract()?;
let plan_node = protobuf::PhysicalPlanNode::try_decode(plan_bytes).map_err(|e| {
PyRuntimeError::new_err(format!(
"Unable to decode physical plan protobuf message: {}",
e
))
})?;

let runtime = RuntimeEnv::default();
let registry = SessionContext::new();
plan_node
.try_into_physical_plan(&registry, &runtime, Extensions::codec())
.map_err(|e| e.into())
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of my goals was to remove the datafusion-python dependency from datafusion-ray so that you wouldn't have hard requirements about having the exact same version between the two. It can be worse in that you also have to have the same compiler for both. Now for datafusion-ray we may be able to get away with it for official releases since we control the build pipeline for both. This does place a restriction on end users in that they have to make sure they keep these versions synced on their machine. In my opinion it would be better to lean on things like the FFI interface that is coming in datafusion-python 43.0.0. I know that right now doesn't solve the problem of having all extensions, though.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @timsaucer !
Makes sense, and I'm entirely onboard with this goal. Which is why I tried to preserve this guarantee by keeping the existing code within the else branch here, where no assumption is made about the datafusion-python version/compiler.

The "embedded" datafusion-python dependency is only supposed to be an opt-in alternative for users who decide to call the new function for creating an ABI compatible context preconfigured with the enabled extensions (that's the only way the above downcast can succeed, if I'm not mistaken).
So any existing or future code that doesn't switch to using the new extended_session_context() function will continue to work without any compatibility restrictions.
However, if I somehow failed to preserve this guarantee or if I missed something that introduces any potential risks, please let me know so I'll revisit the approach.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to hold up progress, but this feels like a step in the wrong direction. But I also don't have enough time right now to give a better solution. I think I'd want to do something like use the FFI_ExecutionPlan in df43 to share these across packages.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, I'll look into that. Thanks!

@ccciudatu ccciudatu marked this pull request as draft November 26, 2024 09:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants