# Calcite Notes
See the Glossary for concepts and the set of patterns which Calcite uses.
- `RexNode`: scalar expression.
- `RelNode`: relational algebra node, AKA relational operator.
- Adapter:
  - A custom schema is created with a `SchemaFactory`. (A minimal sketch follows this list.)
  - How is the schema factory registered from code?
  - The schema returns an instance of `Table`; if updatable, then `ModifiableTable`.
  - Also, to push operations into the table, `FilterableTable` or `ProjectableFilterableTable`.
  - A table gives rise to a `TableScan`.
  - Optimization is done via planner rules.
  - Each engine implements its own variation of the operators.
  - Operators have a "convention", which is the calling convention. The built-in convention is the enumerable convention.
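A minimal sketch of the adapter pattern, assuming the standard `SchemaFactory`/`AbstractSchema`/`AbstractTable` extension points; the class and table names are made up for illustration:

```java
import java.util.Map;

import org.apache.calcite.rel.type.RelDataType;
import org.apache.calcite.rel.type.RelDataTypeFactory;
import org.apache.calcite.schema.Schema;
import org.apache.calcite.schema.SchemaFactory;
import org.apache.calcite.schema.SchemaPlus;
import org.apache.calcite.schema.Table;
import org.apache.calcite.schema.impl.AbstractSchema;
import org.apache.calcite.schema.impl.AbstractTable;
import org.apache.calcite.sql.type.SqlTypeName;

/** Hypothetical adapter: one schema exposing one table. */
public class ExampleSchemaFactory implements SchemaFactory {
  // Calcite calls this when a model file names this factory.
  @Override public Schema create(SchemaPlus parentSchema, String name,
      Map<String, Object> operand) {
    return new AbstractSchema() {
      @Override protected Map<String, Table> getTableMap() {
        return Map.of("EXAMPLE", new ExampleTable());
      }
    };
  }

  /** A table's minimum obligation is to describe its row type; scan and
   * update behavior come from ScannableTable, ModifiableTable, etc. */
  private static class ExampleTable extends AbstractTable {
    @Override public RelDataType getRowType(RelDataTypeFactory typeFactory) {
      return typeFactory.builder()
          .add("ID", SqlTypeName.BIGINT)
          .add("NAME", SqlTypeName.VARCHAR)
          .build();
    }
  }
}
```

As for registering from code (the open question above): one way is to attach a schema instance directly to the root schema, e.g. `rootSchema.add("example", new ExampleSchemaFactory().create(rootSchema, "example", Map.of()))`, so the factory itself is only strictly needed for model-file setup.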
See slide 23 of this presentation:
- SQL is parsed to `SqlNode`. A converter produces `RelNode`s.
- Validation occurs in two (?) stages:
  - `performUnconditionalRewrites()`, which resolves functions, then calls `rewriteCall()` to optionally rewrite the node to something else. "These rewrites massage the expression tree into a standard form so that the rest of the validation logic can be simpler."
  - Type inference (overload resolution), which seems to look up functions a second time (in the rewritten form?).
- The unconditional rewrite stage resolves functions. At this point, we don't have type information, so we just use the name. If there is only one matching name, we set the operator on the call node to this one and only candidate.
- When resolving the new syntax, we have the following node stack: `EXTEND(TABLE(<table-fn>), <schema>)`. After rewrite, we have `TABLE(<table-fn>)` with the schema stashed in a new `ExtendedCall` node. In detail: `TABLE(ExtendedCall(ShimTableMacroFunction(UserDefinedTableMacroFunction(<table-macro>, <schema>))))`.
- The validator then walks the tree again to derive types: `SqlValidator.deriveTypeImpl()`.
- `SqlUserDefinedTableMacro` defers to `SqlFunction` to re-resolve the functions once types are known. This second resolution uses the operator table, which is wrapped to include the catalog. This second lookup uses argument names and types.
- As this is happening, each call to `SqlUserDefinedTableMacro.getTable()` converts arguments and calls `TableMacro.apply()`, which results in multiple recreations of the underlying table.
- The `apply()` method is called again in the SQL-to-Rel converter. Thus, the user-defined table macro remains in the parse tree.
- As an aside, there is a `SqlValidatorImpl.validateAccess()` method that is intended to do table validation. Perhaps we can switch to that.
To summarize:
- The unconditional rewrite stage first looks up functions by name. If there is a single match, the call is rewritten. In our case, `EXTEND` is rewritten. (A sketch of this kind of by-name lookup follows this list.)
- The type derivation stage looks up functions again, then uses the underlying table to provide the row type for the function call.
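For reference, the same kind of by-name lookup can be driven from outside the validator. A sketch against the standard operator table; the `lookupOperatorOverloads` signature shown here is the recent one that takes a `SqlNameMatcher`:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.calcite.sql.SqlFunctionCategory;
import org.apache.calcite.sql.SqlIdentifier;
import org.apache.calcite.sql.SqlOperator;
import org.apache.calcite.sql.SqlSyntax;
import org.apache.calcite.sql.fun.SqlStdOperatorTable;
import org.apache.calcite.sql.parser.SqlParserPos;
import org.apache.calcite.sql.validate.SqlNameMatchers;

public class LookupDemo {
  public static void main(String[] args) {
    // Name-only lookup, as in the unconditional rewrite stage: no type
    // information yet, just the identifier and the syntax class.
    List<SqlOperator> overloads = new ArrayList<>();
    SqlStdOperatorTable.instance().lookupOperatorOverloads(
        new SqlIdentifier("UPPER", SqlParserPos.ZERO),
        SqlFunctionCategory.STRING,
        SqlSyntax.FUNCTION,
        overloads,
        SqlNameMatchers.withCaseSensitive(false));
    System.out.println(overloads); // a single match: the built-in UPPER
  }
}
```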
Type inference happens in `SqlValidatorImpl`. The validator walks the tree to, among other things, validate the type of each node. Basics:
- The planner invokes the `CalcitePlanner.validate()` method.
- The validator invokes `validateQuery()` to walk the `SELECT` nodes. Or, in the more recent version of Druid, `validateInsert()` to validate the `INSERT` node, which then calls `validateSelect()`, which calls `validateQuery()`.
- This code then validates each namespace using `validateNamespace()`.
- If our expression is in the `SELECT` list (the most common case), then `validateSelectItem()` does the work.
- `deriveType()` computes the type of that one item.
- The `DeriveTypeVisitor` drives the process by validating operands. The visitor itself is not recursive: a different instance is used for each expression level.
- Most expressions are functions. The `visit(SqlCall)` method of the `DeriveTypeVisitor` does the function-specific validation, starting with the call to the function.
- `visit()` obtains the operator from the call. The operator is a subclass of `SqlFunction`. It then calls `deriveType()` on the operator.
- `SqlFunction.deriveType()` looks up the actual `SqlFunction`. (This is confusing: we're already in a `SqlFunction`.)
- Each function has a `deriveType()` implementation which invokes `validateOperands()` and then `inferReturnType()`.
The result is that each function is responsible for its own type inference. Along the way:
- The validator builds up a map from node to type. The recursive process will repeatedly ask for the type of a node; the validator uses this cache to avoid running type inference multiple times. Note that this means the node type is not an attribute of the `SqlNode`; it is kept in a separate work space.
- `checkOperandTypes()` checks if an operand can be cast from its actual type to the desired argument type.
- While one might expect `deriveType()` to build up a list of operand types, it does so indirectly: first putting the types in the cache, then pulling them out again.
- `deriveType()` says that "type coercion" is disabled "for builtin operator operands", since "they are handled by the TypeCoercion specifically".
- `deriveType()` sets the operator on the call, then lets the call validate the operands.
- An elaborate mechanism exists to infer return types, based on a set of rules. For plus, the rule in question is `ReturnTypes.LEAST_RESTRICTIVE`, based on a call to `SqlTypeFactory.leastRestrictive()`, which computes the widest of the operand types. (See the sketch after this list.)
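A small illustration of that widening computation, using the stock type factory; the `INTEGER`/`DOUBLE` pair is an arbitrary example:

```java
import java.util.List;

import org.apache.calcite.rel.type.RelDataType;
import org.apache.calcite.rel.type.RelDataTypeFactory;
import org.apache.calcite.rel.type.RelDataTypeSystem;
import org.apache.calcite.sql.type.SqlTypeFactoryImpl;
import org.apache.calcite.sql.type.SqlTypeName;

public class WideningDemo {
  public static void main(String[] args) {
    RelDataTypeFactory typeFactory =
        new SqlTypeFactoryImpl(RelDataTypeSystem.DEFAULT);
    // leastRestrictive() picks the widest type that can hold all operands.
    RelDataType widest = typeFactory.leastRestrictive(List.of(
        typeFactory.createSqlType(SqlTypeName.INTEGER),
        typeFactory.createSqlType(SqlTypeName.DOUBLE)));
    System.out.println(widest); // DOUBLE
  }
}
```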
Each `SqlCall` provides a type inference algorithm in the form of a `SqlOperandTypeInference`. `InferTypes` provides some standard forms: `FIRST_KNOWN`, `RETURN_TYPE`, etc. Functions (presumably operators) can provide a custom version. Oddly, the addition operator uses `FIRST_KNOWN`. (Question: when is widening done?)
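Putting the strategy kinds together: a sketch of a custom function wired with a return-type rule, an operand-type inference, and an operand checker. The name `MY_ADD` and the particular rules are illustrative, not Calcite's actual `+` operator:

```java
import org.apache.calcite.sql.SqlFunction;
import org.apache.calcite.sql.SqlFunctionCategory;
import org.apache.calcite.sql.SqlKind;
import org.apache.calcite.sql.type.InferTypes;
import org.apache.calcite.sql.type.OperandTypes;
import org.apache.calcite.sql.type.ReturnTypes;

public class MyOperators {
  // Return type: widest ("least restrictive") of the operand types.
  // Operand types: inferred from the first operand whose type is known.
  // Operand check: exactly two numeric arguments.
  public static final SqlFunction MY_ADD = new SqlFunction(
      "MY_ADD",
      SqlKind.OTHER_FUNCTION,
      ReturnTypes.LEAST_RESTRICTIVE,
      InferTypes.FIRST_KNOWN,
      OperandTypes.NUMERIC_NUMERIC,
      SqlFunctionCategory.NUMERIC);
}
```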
Setup:
- A Calcite `Schema` can hold a `Function`, which may be a compile-time function (AKA "macro"). (See the sketches after this list.)
- A `TableMacro` is a kind of `Function` that produces a `TranslatableTable`.
- The parser reduces the table call to a `SqlStdOperatorTable.COLLECTION_TABLE.createCall()`, which is a `SqlCollectionTableOperator`.
- On the other hand, `EXPLICIT_TABLE` "allows an identifier to be explicitly flagged as a table. For example, `select * from (TABLE t)` or `TABLE t`." (If used, the `TABLE` prefix form can't literally be used? It would clash with current usage?)
- Thus, the parser invokes `SqlCollectionTableOperator` to create the table function call. At this point, all the parser has is names and operands.
- `ExternalOperatorConversion` is the bridge from function name to function definition (e.g. `ExternalOperator`). It is registered via `SqlBindings.addOperatorConversion()` in a Guice module.
- The `Set` of `SqlOperatorConversion` items is injected into the `DruidOperatorTable`.
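A sketch of the macro idea using pieces that exist in Calcite core: `ViewTable.viewMacro()` returns a `TableMacro` (a `Function` that expands at validation time into a `TranslatableTable`), and `SchemaPlus.add()` registers it under a name. The schema and view names are made up:

```java
import java.util.List;

import org.apache.calcite.schema.SchemaPlus;
import org.apache.calcite.schema.impl.ViewTable;
import org.apache.calcite.tools.Frameworks;

public class MacroSetup {
  public static void main(String[] args) {
    SchemaPlus root = Frameworks.createRootSchema(true);
    // Registers a zero-argument table macro. When the name is resolved in
    // a query, the macro expands into a TranslatableTable (here, a view
    // over a VALUES clause).
    root.add("EMPS", ViewTable.viewMacro(root,
        "SELECT * FROM (VALUES (1, 'a'), (2, 'b')) AS t(id, name)",
        List.of(), List.of("EMPS"), false));
  }
}
```

The parser-side shape can also be observed directly. This sketch just parses a `TABLE(...)` call and prints the AST; there is no validation, so the hypothetical `my_fn` need not exist:

```java
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.sql.parser.SqlParser;

public class ParseDemo {
  public static void main(String[] args) throws Exception {
    // The FROM item becomes a call whose operator is
    // SqlStdOperatorTable.COLLECTION_TABLE, wrapping the my_fn(...) call.
    SqlNode node = SqlParser.create("SELECT * FROM TABLE(my_fn(1, 2))")
        .parseQuery();
    System.out.println(node);
  }
}
```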
In the `validate()` step:
- `SqlValidatorImpl.performUnconditionalRewrites()` looks up the function by name using the operator table, with the result as a `SqlBasicCall`.
- `DruidOperatorTable.lookupOperatorOverloads()` looks up the function (by name), and gets the `SqlOperatorConversion` from the definition by calling `calciteOperator()`, which becomes the result above. (There is only one overload in this case.)
- The above is repeated a second time during validation.
- `SqlUtil.lookupSubjectRoutesByName` does the second call, and filters by type which, for our use, is a `SqlSyntax` of either `FUNCTION` or `SPECIAL`.
- `ProcedureNamespace.validate()` calls `validateImpl()`, which calls `getTable()` on the UDF (the `ExternalOperatorConversion`) to get a `TranslatableTable`, on which it then calls `getRowType()`. Here, it seems each procedure defines a "namespace", of name `CATALOG`.
- `ExternalOperatorConversion.getTable()` is in the base class, which calls `SqlUserDefinedTableMacro.convertArguments()` to get the list of arguments, which are computed from a call to `ExternalTableMacro.getParameters()` compared against the actual operand list.
- `ExternalOperatorConversion.getTable()`: "Returns the table in this UDF, or null if there is no table." The rest is details of how this is done by default.
- `convertArguments()` converts arguments to Java format. At this point, there is a 1:1 mapping between actual arguments and parameter definitions. The type of the definition is used to coerce the actual argument to the desired type. The list of parameters is obtained from `ExternalTableMacro`.
- `ExternalOperatorConversion.getTable()`, after converting the arguments, calls `ExternalTableMacro.apply()` to "apply" the arguments and return a `TranslatableTable`. That is, we go from call + arguments to a table node. In this case, a `DruidTable`.
Once we have the Druid table, the table resolution process kicks in.

Classes related to `ExternalTableMacro`:
- `ExternalTableScan`: rel node for the table scan. Holds the Druid table and the JSON mapper.
- `ExternalTableScanRule`: converts the `ExternalTableScan` rel node into a `DruidQueryRel`.
- `ExternalDataSource`: attached to a `DruidTable` to define the external data source.
From Drill:
- `SqlKind.COLLECTION_TABLE` is said to support `TABLE(udx(x, y, z))`, which is close to what we want. Specifically: `TABLE(<file name>(<pluginProp> => <value>, ...))`.
- Drill has a way to have the "udx" above reference a format plugin.
- Used in `DrillTableInfo` to do the parameter/arg conversion.
- Inside, the only argument must be a `SqlUserDefinedTableMacro`, with the first argument as an identifier.
- Can also be a `SqlKind.IDENTIFIER` directly.
- From plugins, get the list of fields, convert to args, and use the plugin name (?) to define a table within the proper schema namespace.
- This means that the set of arguments is known ahead of time: by inspecting the format plugin.
Other notes:
- `OperandTypes.VARIADIC`: "Operand type-checking strategy for an operator which takes no operands."
- `TranslatableTable`: "Extension to Table that specifies how it is to be translated to a relational expression. It is optional for a Table to implement this interface. If Table does not implement this interface, it will be converted to an EnumerableTableScan. Generally a Table will implement this interface to create a particular subclass of RelNode, and also register rules that act on that particular subclass of RelNode."
- The operator's `deriveType()` method is called with a `SqlValidator` and a `SqlValidatorScope`, which is used to look up names, such as table names.
Calcite provides a planner interface, `Planner`, and an implementation, `PlannerImpl`, which appears to be more of an example. Druid works around the fact that the validator is not visible even though it holds lots of useful information. Drill, for example, incorporates the planner functionality directly.
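A minimal sketch of driving the stock `Planner` end to end (parse, validate, convert to rel), using only the public `Frameworks` entry points:

```java
import org.apache.calcite.plan.RelOptUtil;
import org.apache.calcite.rel.RelRoot;
import org.apache.calcite.schema.SchemaPlus;
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.tools.FrameworkConfig;
import org.apache.calcite.tools.Frameworks;
import org.apache.calcite.tools.Planner;

public class PlannerDemo {
  public static void main(String[] args) throws Exception {
    SchemaPlus rootSchema = Frameworks.createRootSchema(true);
    FrameworkConfig config = Frameworks.newConfigBuilder()
        .defaultSchema(rootSchema)
        .build();
    Planner planner = Frameworks.getPlanner(config);
    SqlNode parsed = planner.parse("SELECT 1 + 2 AS x");
    SqlNode validated = planner.validate(parsed); // SqlValidatorImpl runs here
    RelRoot relRoot = planner.rel(validated);     // SqlToRelConverter runs here
    System.out.println(RelOptUtil.toString(relRoot.project()));
  }
}
```

Note that this interface gives no access to the validator itself, which is exactly the limitation Druid works around.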
Calcite provides this table syntax, but the syntax seems incomplete relative to the actual parser. Calcite provides a way to extend the table ref syntax.

From the Calcite docs:

```
tableReference:
      tablePrimary
      [ FOR SYSTEM_TIME AS OF expression ]
      [ pivot ]
      [ unpivot ]
      [ matchRecognize ]
      [ [ AS ] alias [ '(' columnAlias [, columnAlias ]* ')' ] ]

tablePrimary:
      [ [ catalogName . ] schemaName . ] tableName
      '(' TABLE [ [ catalogName . ] schemaName . ] tableName ')'
  |   tablePrimary [ hintComment ] [ EXTEND ] '(' columnDecl [, columnDecl ]* ')'
  |   [ LATERAL ] '(' query ')'
  |   UNNEST '(' expression ')' [ WITH ORDINALITY ]
  |   [ LATERAL ] TABLE '(' [ SPECIFIC ] functionName '(' expression [, expression ]* ')' ')'
```
Inferred syntax from `Parser.jj`:

```
FromClause: FROM <TableRef> ...

TableRef:
      <TableRefDetails>
      [ <Pivot> ] [ <Unpivot> ]
      [ [ AS ] SimpleIdentifier [ ParenthesizedSimpleIdentifierList ] ]
      [ TABLESAMPLE ... ]

TableRefDetails:
      <TableRefWithHintsOpt> <TableOverOpt>
  |   [ LATERAL ] <ParenthesizedExpression> <TableOverOpt>
  |   UNNEST <ParenthesizedQueryOrCommaList>
  |   [ LATERAL ] TABLE '(' <TableFunctionCall> ')'
  |   <ExtendedTableRef>

TableRefWithHintsOpt:
      <CompoundTableIdentifier> [ <CommaSeparatedSqlHints> ]

CompoundTableIdentifier:
      <TableIdentifierSegment> ( . <TableIdentifierSegment> )*
```
The AST is that produced by the Calcite SQL planner. "Rel" (relational element) nodes are a set of relational operators and related items on the path to a physical query plan. The `rel()` method of `CalcitePlanner` (Druid's version of `PlannerImpl`) does the deed.
- The `SqlToRelConverter` class does the conversion.
- It is parameterized with the validator, the catalog reader, the config, and a few other items.
- The `convertQuery()` method takes the validated SQL node and produces the root rel.
- A few follow-on operations occur.
In the planner test framework, the rel returned from conversion is visible in the `plan` section of a test case.