
Calcite Table Functions


Planning Flow

Startup

  • MSQSqlModule provides a binding to ExternalOperatorConversion (a sketch of this wiring follows this list):
    SqlBindings.addOperatorConversion(binder, ExternalOperatorConversion.class);
  • ExternalOperator is created via Guice. It is not registered in Guice; it is just created as needed.
  • ExternalOperatorConversion is created via Guice, passing in the ExternalOperator instance.
  • ExternalOperatorConversion holds an instance of SqlOperator, specifically SqlUserDefinedTableMacro, which is created via the ExternalOperatorConversion (called from Guice).
  • In this case, the Druid-specific class is ExternalOperator, which extends SqlUserDefinedTableMacro.
  • The ExternalOperator constructor causes the parameters to be created so they can be passed to the super constructor.
  • The ExternalOperator macro is given the ExternalTableMacro instance, and calls ExternalTableMacro.getParameters() to get the list of parameters.
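A minimal sketch of this wiring, for reference. It assumes Druid's SqlOperatorConversion interface and Guice injection as described above; MyExternalOperatorConversion is an illustrative stand-in for ExternalOperatorConversion, not the real MSQ class, and the injected operator stands in for ExternalOperator.

```java
import com.google.inject.Inject;

import org.apache.calcite.rex.RexNode;
import org.apache.calcite.sql.SqlOperator;
import org.apache.calcite.sql.validate.SqlUserDefinedTableMacro;
import org.apache.druid.segment.column.RowSignature;
import org.apache.druid.sql.calcite.expression.DruidExpression;
import org.apache.druid.sql.calcite.expression.SqlOperatorConversion;
import org.apache.druid.sql.calcite.planner.PlannerContext;

/**
 * Stand-in for ExternalOperatorConversion: holds the table-macro-backed operator
 * and hands it to the planner via calciteOperator().
 */
public class MyExternalOperatorConversion implements SqlOperatorConversion
{
  private final SqlUserDefinedTableMacro operator;

  // In Druid, the injected operator is ExternalOperator, a Guice-created
  // SqlUserDefinedTableMacro built around ExternalTableMacro.
  @Inject
  public MyExternalOperatorConversion(final SqlUserDefinedTableMacro operator)
  {
    this.operator = operator;
  }

  @Override
  public SqlOperator calciteOperator()
  {
    return operator;
  }

  @Override
  public DruidExpression toDruidExpression(
      final PlannerContext plannerContext,
      final RowSignature rowSignature,
      final RexNode rexNode
  )
  {
    // Table functions are handled during SQL-to-Rel conversion, not as row-level
    // expressions, so there is nothing to translate here.
    return null;
  }
}
```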

Relationships:

CalcitePlannerModule      MSQSqlModule
      |                        |
    Guice                    Guice                                             Guice
      |                        |                                                 |
DruidOperatorTable o-- ExternalOperatorConversion o-- ExternalOperator o-- ExternalTableMacro
                               |                            |                    |
                               v                            v                    v
                        SqlOperatorConversion     SqlUserDefinedTableMacro   TableMacro
                                                            |
                                                            v
                                                       SqlFunction

This means:

  • ExternalOperatorConversion is statically defined, via Guice.
  • Each ExternalOperatorConversion holds onto the Calcite operator, in this case ExternalOperator, which extends SqlUserDefinedTableMacro.
  • So, ExternalOperator is also a singleton, created at startup.
  • ExternalOperator is an operator definition, which holds onto an ExternalTableMacro, which is also a definition, in its tableMacro field.
  • The ExternalTableMacro parameters are created once, via the Guice-created instance.

Questions:

  • Why is ExternalTableMacro created via Guice, other than for completeness? It is only ever used by ExternalOperatorConversion and probably could have been created directly within the constructor. The answer is probably the ObjectMapper required by the constructor. Changed to not create through Guice.

AST

Resolution

  • BaseDruidSqlValidator (which extends SqlValidatorImpl): validateNamespace(.) calls
  • ProcedureNamespace.validateImpl(.), which special-cases SqlUserDefinedTableMacro.
  • The special case calls udf.getTable(.) where udf is the ExternalOperator extends SqlUserDefinedTableMacro instance.
  • getTable(.) retrieves the TableMacro tableMacro instance, in this case, ExternalTableMacro.
  • SqlUserDefinedTableMacro.getTable(.) calls convertArguments(.)
  • convertArguments() calls getParameters() on ExternalTableMacro (which implements TableMacro), which creates another instance of the parameters. (A macro sketch follows this list.)
  • SqlUserDefinedTableMacro.getTable() (on the ExternalOperator instance, which extends SqlUserDefinedTableMacro) then calls apply(.) to apply the arguments.
  • The arguments are given as a list of Java objects which match up to the parameters by position. The values are coerced to Java types using the TypeFactory associated with the planner.
  • ExternalTableMacro.apply() grabs the three String arguments, converts the values from JSON, and returns an instance of ExternalTable that has an ExternalDataSource that holds the converted arguments.
  • The ExternalTable then becomes the "real" table referenced in the FROM clause.
  • ProcedureNamespace.validateImpl(.) then calls getRowType() on ExternalTable (which implements TranslatableTable) to get the row signature.
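To make the getParameters()/apply() contract concrete, here is a hedged sketch of a three-argument macro in the spirit of ExternalTableMacro. The class name, parameter names, and the MyExternalTable it returns (sketched under Conversion, below) are illustrative; the real Druid code deserializes the JSON arguments into an ExternalDataSource, and the signatures shown match the older Calcite releases Druid used at the time.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.calcite.rel.type.RelDataType;
import org.apache.calcite.rel.type.RelDataTypeFactory;
import org.apache.calcite.schema.FunctionParameter;
import org.apache.calcite.schema.TableMacro;
import org.apache.calcite.schema.TranslatableTable;
import org.apache.calcite.sql.type.SqlTypeName;

public class MyExternalTableMacro implements TableMacro
{
  // Built once and reused: both validation and conversion call getParameters().
  private static final List<FunctionParameter> PARAMETERS = Arrays.asList(
      param(0, "inputSource"),
      param(1, "inputFormat"),
      param(2, "signature")
  );

  @Override
  public List<FunctionParameter> getParameters()
  {
    return PARAMETERS;
  }

  @Override
  public TranslatableTable apply(final List<Object> arguments)
  {
    // Arguments arrive positionally as Java objects, already coerced by the
    // planner's type factory. The real macro parses these JSON strings and builds
    // an ExternalTable around an ExternalDataSource.
    final String inputSource = (String) arguments.get(0);
    final String inputFormat = (String) arguments.get(1);
    final String signature = (String) arguments.get(2);
    return new MyExternalTable(inputSource, inputFormat, signature);
  }

  private static FunctionParameter param(final int ordinal, final String name)
  {
    return new FunctionParameter()
    {
      @Override
      public int getOrdinal()
      {
        return ordinal;
      }

      @Override
      public String getName()
      {
        return name;
      }

      @Override
      public RelDataType getType(final RelDataTypeFactory typeFactory)
      {
        return typeFactory.createSqlType(SqlTypeName.VARCHAR);
      }

      @Override
      public boolean isOptional()
      {
        return false;
      }
    };
  }
}
```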

Basic structure:

Validator
   |
   | (calls)
   |
ProcedureNamespace
   |
   | (is given instance of)
   |
RelDataType o-- SqlUserDefinedTableMacro o-- TableMacro
   |                                             |
   |                                             | (creates)
   |                                             |
ProcedureNamespace                          ExternalTable o-- ExternalDataSource

Notes:

  • It would seem that we can create the ExternalTableMacro parameters once, and reuse them: no need to create them over and over. (Done, use DruidTypeSystem.TYPE_FACTORY as the type factory.)

Authorization

  • SqlResourceCollectorShuttle gets the SqlOperator from the SqlCall node when walking the tree.
  • The SqlCall.getOperator() method returns the associated operator, here ExternalOperator.
  • After casting to AuthorizableOperator, the shuttle calls ExternalOperator.computeResources(.) to return the resource, which is EXTERNAL_RESOURCE_ACTION.
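The check amounts to something like the sketch below. It assumes the AuthorizableOperator interface exposes computeResources(SqlCall) as described above and that it lives in Druid's sql.calcite.expression package; the real logic is in SqlResourceCollectorShuttle.

```java
import java.util.Collections;
import java.util.Set;

import org.apache.calcite.sql.SqlCall;
import org.apache.calcite.sql.SqlOperator;
import org.apache.druid.server.security.ResourceAction;
import org.apache.druid.sql.calcite.expression.AuthorizableOperator;

public class ResourceCheckSketch
{
  /**
   * For each SqlCall visited, ask the operator which resources it requires.
   * For the external-table function this yields EXTERNAL_RESOURCE_ACTION.
   */
  static Set<ResourceAction> resourcesFor(final SqlCall call)
  {
    final SqlOperator operator = call.getOperator();
    if (operator instanceof AuthorizableOperator) {
      return ((AuthorizableOperator) operator).computeResources(call);
    }
    return Collections.emptySet();
  }
}
```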

Conversion

  • SqlToRelConverter.convertCollectionTable(.) obtains the SqlOperator from SqlCall.getOperator().
  • The operator here is again ExternalOperator (which extends SqlUserDefinedTableMacro), registered via ExternalOperatorConversion (which implements SqlOperatorConversion).
  • convertCollectionTable(.) special-cases SqlUserDefinedTableMacro and again calls getTable().
  • getTable() repeats the process above: again creating the parameters and again creating an instance of ExternalTable.
  • convertCollectionTable(.) calls RelOptTableImpl.toRel(.) which calls ExternalTable.toRel(.).
  • ExternalTable.toRel(.) creates an ExternalTableScan instance to represent the scan.
  • ExternalTableScan.deriveRowType() again calls ExternalTable.getRowType() to convert the row type.
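The table half of this flow looks roughly like the sketch below: a TranslatableTable whose getRowType() result is cached (see the first question that follows) and whose toRel() produces a scan node. MyExternalTable is the illustrative class from the macro sketch above; Druid's real ExternalTable returns an ExternalTableScan rather than a plain LogicalTableScan, and the fixed two-column row type stands in for one derived from the signature argument.

```java
import java.util.Collections;

import org.apache.calcite.plan.RelOptTable;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.logical.LogicalTableScan;
import org.apache.calcite.rel.type.RelDataType;
import org.apache.calcite.rel.type.RelDataTypeFactory;
import org.apache.calcite.schema.TranslatableTable;
import org.apache.calcite.schema.impl.AbstractTable;
import org.apache.calcite.sql.type.SqlTypeName;

public class MyExternalTable extends AbstractTable implements TranslatableTable
{
  // Kept for the eventual external data source; unused in this sketch.
  private final String inputSource;
  private final String inputFormat;
  private final String signature;

  private RelDataType rowType;  // cached so repeated getRowType() calls are cheap

  public MyExternalTable(final String inputSource, final String inputFormat, final String signature)
  {
    this.inputSource = inputSource;
    this.inputFormat = inputFormat;
    this.signature = signature;
  }

  @Override
  public RelDataType getRowType(final RelDataTypeFactory typeFactory)
  {
    // Called during validation, conversion, and deriveRowType(); cache the result.
    if (rowType == null) {
      // A real implementation would build this from the signature argument.
      rowType = typeFactory.builder()
                           .add("x", typeFactory.createSqlType(SqlTypeName.VARCHAR))
                           .add("y", typeFactory.createSqlType(SqlTypeName.BIGINT))
                           .build();
    }
    return rowType;
  }

  @Override
  public RelNode toRel(final RelOptTable.ToRelContext context, final RelOptTable relOptTable)
  {
    // Druid returns an ExternalTableScan here; a plain LogicalTableScan stands in.
    // (The hints argument exists on recent Calcite versions; older ones omit it.)
    return LogicalTableScan.create(context.getCluster(), relOptTable, Collections.emptyList());
  }
}
```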

Questions:

  • Can the row type be cached in ExternalTable to avoid multiple conversions? (Yes, this works.)
  • Can the ExternalTable be cached to avoid multiple conversions? (Possibly not possible as coded, since the table is created from the table macro, which is a singleton. There is no place to hang a cached instance, that I can easily see.)

Optimization

  • ExternalTableScan calls ExternalTable.getDataSource() multiple times.

Extends Functionality

See UserDefinedTableMacroFunction for details.

  • The parser creates an instance of a SqlCall with ExtendsOperator as the operator.
  • The first argument to the above call is the table function, the second is the schema.
  • ExtendsOperator.rewriteCall(.) gets the first argument, which must be an instance of UserDefinedTableMacroFunction.
  • It then calls UserDefinedTableMacroFunction.rewriteCall(tableFnCall, schema) where tableFnCall is a SqlBasicCall to the UserDefinedTableMacroFunction.
  • UserDefinedTableMacroFunction.rewriteCall(.) passes the schema into an ad-hoc copy of the InputTableMacro, which now holds onto the schema for later use.
  • Calcite obtains the table macro from the macro function, so both are cloned.
  • The above also creates a new call as an instance of ExtendedCall that also holds the schema, primarily for use in unparse(.).
  • From here on, the flow is like that described earlier.

Note that the need to make a copy of the macro provides an opportunity to cache the ExternalTable instance.
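Only the operand layout is sketched below; how the schema is then attached to a copy of the macro is specific to Druid's UserDefinedTableMacroFunction, so the helper class here is purely illustrative.

```java
import org.apache.calcite.sql.SqlBasicCall;
import org.apache.calcite.sql.SqlCall;
import org.apache.calcite.sql.SqlNodeList;

public class ExtendRewriteSketch
{
  /**
   * In ExtendsOperator.rewriteCall(), the EXTEND call carries the table-function
   * call as operand 0 and the declared schema as operand 1.
   */
  static void unwrap(final SqlCall extendCall)
  {
    final SqlBasicCall tableFnCall = extendCall.operand(0);  // call to the table function
    final SqlNodeList schema = extendCall.operand(1);        // (x VARCHAR, y BIGINT, ...)
    // A copy of the table macro is made that holds the schema, and a new
    // ExtendedCall wraps tableFnCall so that unparse() can reproduce the
    // EXTEND clause.
  }
}
```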

Create a New Table Function

To create an external-table-like table function:

  • Define a subclass of TableMacro to define function parameters and convert arguments to a TranslatableTable.
  • Define a subclass of SqlUserDefinedTableMacro which defines the above macro.
  • Define a subclass of SqlOperatorConversion to hold the above function definition.
  • Add the above operator conversion to MSQSqlModule.
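The registration step would look roughly like the sketch below, assuming Druid's SqlBindings helper in the sql.guice package; the module and conversion class names are illustrative (in MSQ the binding lives in MSQSqlModule, as shown at the top of this page).

```java
import com.google.inject.Binder;
import com.google.inject.Module;

import org.apache.druid.sql.guice.SqlBindings;

public class MyTableFunctionModule implements Module
{
  @Override
  public void configure(final Binder binder)
  {
    // Registers the operator conversion (and, through it, the table function)
    // with DruidOperatorTable.
    SqlBindings.addOperatorConversion(binder, MyExternalOperatorConversion.class);
  }
}
```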

For a Catalog-Linked Table Function

  • The bridge is ExternalTableSpec: produced by an ExternalTableDefn, used to construct an ExternalDataSource and an ExternalTable.
  • ExternalTableDefn.applyParameters() converts from a resolved table and a parameter map to an ExternalTableSpec.

To do:

  • Need a way to do the above without a resolved table.
  • When unparameterized, no merging. Instead, use table properties.
  • Must validate the resulting properties.
  • Easiest to create a ResolvedTable, but with properties from SQL.
  • Need to merge in the extends schema.
  • Filter the list of properties to get the SQL arguments for "raw" function. Probably just a list of names, in preferred order.
  • Custom macros must be defined statically, which means they need access to a table defn statically or via Guice.
  • Code to translate Table Defn properties to parameters. Control ordering so pass-by-position is stable.

So:

  • The TableMacro takes an injected table defn or registry.
  • The TableMacro converts the defn properties to Calcite parameters (once, statically, using a fixed type factory).
  • TableMacro.apply(.) converts positional args to a map, using the parameter definitions (a sketch follows this list).
  • Then uses the TableDefn to create an ExternalTableDefn.
  • From there, create an ExternalTable as in the existing code.
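A sketch of the positional-to-named step in this plan, assuming Calcite's FunctionParameter list supplies the names and ordering; the class and method names are illustrative.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.calcite.schema.FunctionParameter;

public class ArgMapSketch
{
  /** Convert positional table-function arguments into a property-style map. */
  static Map<String, Object> argsToMap(
      final List<FunctionParameter> parameters,
      final List<Object> arguments
  )
  {
    final Map<String, Object> args = new HashMap<>();
    for (final FunctionParameter parameter : parameters) {
      final Object value = arguments.get(parameter.getOrdinal());
      if (value != null) {
        // A null value means an optional parameter was not provided.
        args.put(parameter.getName(), value);
      }
    }
    return args;
  }
}
```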

Dynamic Table Functions

A dynamic table function is one that is defined via metadata, rather than via a static definition at compile time. In our case, we want external tables to appear as table functions so the user can parameterize them.

Resolution:

  • CalcitePlanner is Druid's version of the Calcite planner, with customizations.
  • CalcitePlanner defines the operator table to be used to resolve functions.
  • CalcitePlanner creates an instance of ChainedSqlOperatorTable to hold both the usual DruidOperatorTable and a representation of dynamic table functions.
  • The chained table also holds an instance of CalciteCatalogReader which can retrieve functions from a Calcite schema.
  • BaseDruidSqlValidator extends SqlValidatorImpl, which drives validation.
  • SqlValidatorImpl.validate(.) calls validateScopedExpression(.) to resolve our table function.
  • After several more steps, the above calls ChainedSqlOperatorTable.lookupOperatorOverloads(.) to find the function.
  • ChainedSqlOperatorTable first calls DruidOperatorTable.lookupOperatorOverloads(.). Of course, our table-specific function isn't found there.
  • Then, ChainedSqlOperatorTable calls CalciteCatalogReader.lookupOperatorOverloads(.) to resolve. This code checks that the category is FUNCTION and, in particular, a table function. Our reference is a table function.
  • CalciteCatalogReader.getFunctionsFrom() determines the schemas to use to resolve the function: both the current (default) schema, which is druid, and the root. Since external tables reside in the ext schema, we must explicitly reference them this way: ext.myTable.
  • getFunctionsFrom(.), once it finds the ext schema, calls getFunctions() on that schema.
  • That schema is represented by a CalciteSchema, which wraps the Schema instance that Druid provides.
  • CalciteSchema.getFunctions() loads "implicit" functions from the schema by calling Schema.getFunctions(String name).
  • Druid schemas derive from the Druid AbstractTableSchema, which returns an empty list of functions by default.
  • ExternalSchema overrides this method to return a ParameterizedTableMacro for the table. This class implements TableMacro, which matches the predicate that getFunctionsFrom(.) uses to match functions. (A schema sketch follows this list.)
  • Calcite wraps the macro in a SqlUserDefinedTableMacro.
  • Function processing then continues as described earlier.
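A hedged sketch of the schema side, using Calcite's AbstractSchema as a base for brevity (Druid's ExternalSchema extends Druid's own AbstractTableSchema and overrides getFunctions(String) directly). MyExternalSchema and the table name are illustrative; MyExternalTableMacro is the macro from the Resolution sketch above.

```java
import com.google.common.collect.ImmutableMultimap;
import com.google.common.collect.Multimap;

import org.apache.calcite.schema.Function;
import org.apache.calcite.schema.impl.AbstractSchema;

public class MyExternalSchema extends AbstractSchema
{
  @Override
  protected Multimap<String, Function> getFunctionMultimap()
  {
    // Each external table is exposed as a like-named table macro. When the
    // validator resolves ext.myTable(...), Calcite wraps the returned TableMacro
    // in a SqlUserDefinedTableMacro and planning proceeds as described earlier.
    return ImmutableMultimap.of("myTable", new MyExternalTableMacro());
  }
}
```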

Notes:

ChainedSqlOperatorTable "implements the SqlOperatorTable interface by chaining together any number of underlying operator table instances."

CalciteCatalogReader is a "SqlOperatorTable based on tables and functions defined schemas."

Issues:

  • Function resolution happens twice. How can we cache the values?
  • The visitor misses the dynamic function, thus omitting security checks for the external table.