
Calcite Notes


Introduction

See Glossary for concepts, and the set of patterns which Calcite uses.

Terms

Basics

  • A custom schema is created with a SchemaFactory (see the sketch after this list).
  • How is the schema factory registered from code?
  • The schema returns an instance of Table. If updatable, then ModifiableTable.
  • Also, to push operations, FilterableTable or ProjectableFilterableTable.
  • Table gives rise to TableScan.
  • Optimization done via planner rules.
  • Each engine implements its own variation of the operators.
  • Operators have a "convention", which is the calling convention. The built-in convention is the enumerable convention.
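
A minimal sketch of how these pieces fit together, assuming a single hard-coded table (the class, schema, and column names are invented; a real factory would use the operand map to locate its data):

    import java.util.Collections;
    import java.util.Map;
    import org.apache.calcite.DataContext;
    import org.apache.calcite.linq4j.Enumerable;
    import org.apache.calcite.linq4j.Linq4j;
    import org.apache.calcite.rel.type.RelDataType;
    import org.apache.calcite.rel.type.RelDataTypeFactory;
    import org.apache.calcite.schema.ScannableTable;
    import org.apache.calcite.schema.Schema;
    import org.apache.calcite.schema.SchemaFactory;
    import org.apache.calcite.schema.SchemaPlus;
    import org.apache.calcite.schema.Table;
    import org.apache.calcite.schema.impl.AbstractSchema;
    import org.apache.calcite.schema.impl.AbstractTable;
    import org.apache.calcite.sql.type.SqlTypeName;

    public class ExampleSchemaFactory implements SchemaFactory {
      // Called when the schema is registered (e.g. from a model file or from code).
      @Override public Schema create(SchemaPlus parentSchema, String name,
          Map<String, Object> operand) {
        return new AbstractSchema() {
          @Override protected Map<String, Table> getTableMap() {
            return Collections.singletonMap("EXAMPLE", new ExampleTable());
          }
        };
      }

      // A table the engine can scan directly; a real table would read external data.
      static class ExampleTable extends AbstractTable implements ScannableTable {
        @Override public RelDataType getRowType(RelDataTypeFactory typeFactory) {
          return typeFactory.builder()
              .add("ID", SqlTypeName.INTEGER)
              .add("NAME", SqlTypeName.VARCHAR)
              .build();
        }

        @Override public Enumerable<Object[]> scan(DataContext root) {
          return Linq4j.asEnumerable(
              Collections.singletonList(new Object[] {1, "calcite"}));
        }
      }
    }

Registering such a schema from code amounts to adding it to the root SchemaPlus; FilterableTable or ProjectableFilterableTable would replace ScannableTable to push filters or projections.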

See slide 23 of this presentation:

  • SQL is parsed to 'SqlNode'. A converter produces RelNodes.
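
As a tiny illustration of that first step, the parser can be driven directly (the SQL text here is arbitrary):

    import org.apache.calcite.sql.SqlNode;
    import org.apache.calcite.sql.parser.SqlParseException;
    import org.apache.calcite.sql.parser.SqlParser;

    public class ParseExample {
      public static void main(String[] args) throws SqlParseException {
        SqlParser parser = SqlParser.create("SELECT 1 + 2 AS x");
        SqlNode ast = parser.parseQuery();   // the SqlNode tree (AST), not yet validated
        System.out.println(ast);
        // After validation, SqlToRelConverter turns the SqlNode tree into RelNodes.
      }
    }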

Function Resolution

  • Validation occurs in two (?) stages.
    • performUnconditionalRewrites(), which resolves functions, then calls rewriteCall() to optionally rewrite the node to something else. "These rewrites massage the expression tree into a standard form so that the rest of the validation logic can be simpler."
    • Type inference (overload resolution) which seems to look up functions a second time (in the rewritten form?)
  • The unconditional rewrite stage resolves functions. At this point, we don't have type information, so we just use the name. If there is only one matching name, we set the operator on the call node to this one and only candidate.
  • When resolving the new syntax, we have the following node stack: EXTEND(TABLE(<table-fn>), <schema>). After rewrite, we have TABLE(<table-fn>) with the schema stashed in a new ExtendedCall node. In detail: TABLE(ExtendedCall(ShimTableMacroFunction(UserDefinedTableMacroFunction(<table-macro>, <schema>)))).
  • The validator then walks the tree again to derive types: SqlValidator.deriveTypeImpl().
  • SqlUserDefinedTableMacro defers to SqlFunction to re-resolve the functions once types are known. This second resolution uses the operator table, which is wrapped to include the catalog. This second lookup uses argument names and types.
  • As this is happening, each call to SqlUserDefinedTableMacro.getTable() converts arguments and calls TableMacro.apply() which results in multiple recreations of the underlying table.
  • The apply() method is called again in the SQL-to-Rel converter. Thus, the user-defined table macro remains in the parse tree.
  • As an aside, there is a SqlValidatorImpl.validateAccess() method that is intended to do table validation. Perhaps we can switch to that.

To summarize:

  • The unconditional rewrite stage first looks up functions. If there is a single match, the call is rewritten to use that match. In our case, EXTEND is rewritten.
  • The type derivation stage looks up functions again, then uses the underlying table to provide the row type for the function call. (A sketch of how a function becomes visible to these lookups follows.)
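
As a reminder of where those name lookups find their candidates: functions registered in the schema are exposed to the validator through the catalog reader, which is wrapped into the operator table. A minimal sketch, assuming a hypothetical MyFunctions class:

    import org.apache.calcite.schema.SchemaPlus;
    import org.apache.calcite.schema.impl.ScalarFunctionImpl;
    import org.apache.calcite.tools.Frameworks;

    public class RegisterFunctionExample {
      // Hypothetical UDF implementation; create() wraps the named method.
      public static class MyFunctions {
        public static int myAdd(int a, int b) {
          return a + b;
        }
      }

      public static void main(String[] args) {
        SchemaPlus root = Frameworks.createRootSchema(true);
        // Registered under the name that the rewrite and type-derivation stages resolve.
        root.add("MY_ADD", ScalarFunctionImpl.create(MyFunctions.class, "myAdd"));
        // A query such as SELECT MY_ADD(1, 2) can now be validated against this schema.
      }
    }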

Type Inference

Type inference happens in SqlValidatorImpl. The validator walks the tree to, among other things, validate the type of each node. Basics:

  • The planner invokes the CalcitePlanner.validate() method.
  • The validator invokes validateQuery() to walk the SELECT nodes. Or, in the more recent version of Druid, validateInsert() to validate the INSERT node, which then calls validateSelect() which calls validateQuery().
  • This code then validates each namespace using validateNamespace().
  • If our expression is in the SELECT list (the most common case), then validateSelectItem() does the work.
  • deriveType() computes the type of that one item.
  • The DeriveTypeVisitor drives the process by validating operands. The visitor itself is not recursive: a different instance is used for each expression level.
  • Most expressions are functions. The visit(SqlCall) method of the DeriveTypeVisitor does the function-specific validation, starting with the call to the function.
  • visit() obtains the operator from the call. The operator is a subclass of SqlFunction. It then calls deriveType() on the operator.
  • The SqlFunction.deriveType() looks up the actual SqlFunction. (This is confusing: we're already in a SqlFunction.)
  • Each function has a deriveType() implementation which invokes validateOperands and then inferReturnType.

The result is that each function is responsible for its own type inference. Along the way:

  • The validator builds up a map from node to type. The recursive process will repeatedly ask the type of a node; the validator uses this map as a cache to avoid running type inference multiple times. Note that this means that node type is not an attribute of the SqlNode; it is kept in a separate work space.
  • checkOperandTypes checks if an operand can be cast from its actual type to the desired argument type.
  • While one might expect deriveType() to build up a list of operand types, it does so indirectly: first putting the types in the cache, then pulling them out again.
  • deriveType() notes that "type coercion" is disabled "for builtin operator operands", since "they are handled by the TypeCoercion specifically".
  • deriveType() sets the operator on the call, then lets the call validate the operands.
  • An elaborate mechanism exists to infer types, based on a set of rules. For plus, the rule in question is ReturnTypes.LEAST_RESTRICTIVE, based on a call to SqlTypeFactory.leastRestrictive(), which computes the widest of the operand types.

Each call's operator provides a type inference algorithm in the form of SqlOperandTypeInference. InferTypes provides some standard forms: FIRST_KNOWN, RETURN_TYPE, etc. Functions (presumably operators) can provide a custom version. Oddly, the addition operator uses FIRST_KNOWN. (Question: when is widening done?)
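
To make the strategy objects concrete, here is a sketch of an operator declared with the standard strategies mentioned above (the function name is invented; real operators normally live in an operator table):

    import org.apache.calcite.sql.SqlFunction;
    import org.apache.calcite.sql.SqlFunctionCategory;
    import org.apache.calcite.sql.SqlKind;
    import org.apache.calcite.sql.type.InferTypes;
    import org.apache.calcite.sql.type.OperandTypes;
    import org.apache.calcite.sql.type.ReturnTypes;

    public class MyAddFunction extends SqlFunction {
      public MyAddFunction() {
        super("MY_ADD",
            SqlKind.OTHER_FUNCTION,
            ReturnTypes.LEAST_RESTRICTIVE,  // return type: widest of the operand types
            InferTypes.FIRST_KNOWN,         // operand type inference for unknown operands
            OperandTypes.NUMERIC_NUMERIC,   // operand check: two numeric arguments
            SqlFunctionCategory.NUMERIC);
      }
    }

deriveType() on such an operator runs the operand checker and then the return-type strategy, as described above.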

Calcite Table Functions

Setup:

  • A Calcite Schema can hold a Function, which may be a compile-time function (AKA a "macro").
  • A TableMacro is a kind of Function that produces a TranslatableTable (see the sketch after this list).
  • The parser reduces the table call to a SqlStdOperatorTable.COLLECTION_TABLE.createCall(), which is SqlCollectionTableOperator.
  • On the other hand, EXPLICIT_TABLE "allows an identifier to be explicitly flagged as a table. For example, select * from (TABLE t) or TABLE t." (If used, the TABLE prefix form can't literally be used? Would clash with current usage?)
  • Thus, the parser invokes SqlCollectionTableOperator to create the table function call. At this point, all the parser has is names and operands.
  • ExternalOperatorConversion is the bridge from function name to function definition (e.g. ExternalOperator.) It is registered via SqlBindings.addOperatorConversion() in a Guice module.
  • The Set of SqlOperatorConversion items is injected into the DruidOperatorTable.
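
Here is a sketch of a TableMacro and the TranslatableTable it produces (names invented; note that the apply() parameter is declared as List<? extends Object> in recent Calcite versions and List<Object> in older ones):

    import java.util.Collections;
    import java.util.List;
    import org.apache.calcite.plan.RelOptTable;
    import org.apache.calcite.rel.RelNode;
    import org.apache.calcite.rel.logical.LogicalValues;
    import org.apache.calcite.rel.type.RelDataType;
    import org.apache.calcite.rel.type.RelDataTypeFactory;
    import org.apache.calcite.schema.FunctionParameter;
    import org.apache.calcite.schema.TableMacro;
    import org.apache.calcite.schema.TranslatableTable;
    import org.apache.calcite.schema.impl.AbstractTable;
    import org.apache.calcite.sql.type.SqlTypeName;

    // Compile-time table function: apply() runs during planning, not at run time.
    public class ExampleTableMacro implements TableMacro {

      @Override public List<FunctionParameter> getParameters() {
        return Collections.singletonList(new FunctionParameter() {
          @Override public int getOrdinal() { return 0; }
          @Override public String getName() { return "colName"; }
          @Override public RelDataType getType(RelDataTypeFactory typeFactory) {
            return typeFactory.createSqlType(SqlTypeName.VARCHAR);
          }
          @Override public boolean isOptional() { return false; }
        });
      }

      // Older Calcite releases declare this as apply(List<Object> arguments).
      @Override public TranslatableTable apply(List<? extends Object> arguments) {
        String colName = (String) arguments.get(0);
        return new ExampleTable(colName);
      }

      // Table whose translation is simply an empty VALUES with one VARCHAR column.
      static class ExampleTable extends AbstractTable implements TranslatableTable {
        private final String colName;

        ExampleTable(String colName) {
          this.colName = colName;
        }

        @Override public RelDataType getRowType(RelDataTypeFactory typeFactory) {
          return typeFactory.builder().add(colName, SqlTypeName.VARCHAR).build();
        }

        @Override public RelNode toRel(RelOptTable.ToRelContext context,
            RelOptTable relOptTable) {
          return LogicalValues.createEmpty(context.getCluster(), relOptTable.getRowType());
        }
      }
    }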

In validate() step:

  • SqlValidatorImpl.performUnconditionalRewrites looks up the function by name using the operator table, with the result as a SqlBasicCall.
  • DruidOperatorTable.lookupOperatorOverloads looks up the function (by name), and gets the operator from the SqlOperatorConversion by calling calciteOperator(); that operator becomes the result above. (There is only one overload in this case.)
  • Above is repeated a second time during validation.
  • SqlUtil.lookupSubjectRoutinesByName does the second call and filters by syntax which, for our use, is SqlSyntax of either FUNCTION or SPECIAL.
  • ProcedureNamespace.validate() calls validateImpl() which calls getTable() on the UDF (the ExternalOperatorConversion) to get a TranslatableTable on which it then calls getRowType(). Here, it seems each procedure defines a "namespace", of name CATALOG.
  • ExternalOperatorConversion.getTable() is in the base class which calls SqlUserDefinedTableMacro.convertArguments() to get the list of arguments, which are computed from a call to ExternalTableMacro.getParameters() compared against the actual operand list.
  • ExternalOperatorConversion.getTable: “Returns the table in this UDF, or null if there is no table.” The rest is details of how this is done by default.
  • convertArguments() converts arguments to Java format. At this point, there is a 1:1 mapping between actual arguments and parameter definitions. The type of the definition is used to coerce the actual argument to the desired type. The list of parameters is obtained from ExternalTableMacro.
  • ExternalOperatorConversion.getTable(), after converting the arguments, calls ExternalTableMacro.apply() to “apply” the arguments and return a TranslatableTable. That is, we go from call + arguments to a table node; in this case, a DruidTable. (A registration sketch follows this list.)
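
Outside of Druid, the equivalent setup is simply registering the macro in a schema; the validator then resolves it by name, calls apply() to obtain the TranslatableTable, and reads its row type. A sketch reusing the ExampleTableMacro from above:

    import org.apache.calcite.schema.SchemaPlus;
    import org.apache.calcite.tools.Frameworks;

    public class RegisterTableMacro {
      public static void main(String[] args) {
        SchemaPlus root = Frameworks.createRootSchema(true);
        // Resolved by name during validation; apply() is called again during SQL-to-Rel conversion.
        root.add("EXAMPLE_MACRO", new ExampleTableMacro());
        // Invoked from SQL as:  SELECT * FROM TABLE(EXAMPLE_MACRO('x'))
      }
    }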

Once we have the Druid table, then the table resolution process kicks in.

  • ExternalTableMacro - The table macro itself: provides the parameter list, and apply() creates the Druid table.
  • ExternalTableScan - Rel node for the table scan. Holds the Druid table and JSON mapper.
  • ExternalTableScanRule - Converts the ExternalTableScan rel node into a DruidQueryRel.
  • ExternalDataSource - Attached to a DruidTable to define the external data source.

From Drill:

  • SqlKind.COLLECTION_TABLE is said to support TABLE(udx(x, y, z)), which is close to what we want.
  • Specifically, TABLE(<file name>(<pluginProp> => <value>, ...))
  • Drill has a way to have the "udx" above reference a format plugin.
  • Used in DrillTableInfo to do the parameter/arg conversion.
  • Inside, the only argument must be a SqlUserDefinedTableMacro, with the first argument as an identifier.
  • Can also be an SqlKind.IDENTIFIER directly.
  • From plugins, get the list of fields, convert to args, use plugin name (?) to define a table within the proper schema namespace.
  • This means that the set of arguments is known ahead of time: by inspecting the format plugin.

Other notes:

  • OperandTypes.VARIADIC: "Operand type-checking strategy for an operator which takes no operands."
  • TranslatableTable: "Extension to Table that specifies how it is to be translated to a relational expression. It is optional for a Table to implement this interface. If Table does not implement this interface, it will be converted to an EnumerableTableScan. Generally a Table will implement this interface to create a particular subclass of RelNode, and also register rules that act on that particular subclass of RelNode."
  • The operator's deriveType() method is called with SqlValidator and SqlValidatorScope, which is used to look up names, such as table names.

Other Notes

Calcite provides a planner: the Planner interface and PlannerImpl, which appears to be more of an example. Druid works around the fact that the validator is not visible but holds lots of useful information. Drill, for example, incorporates the functionality directly.

Parser Table Syntax

Calcite provides this table syntax, but the syntax seems incomplete relative to the actual parser.

Calcite provides a way to extend the table ref syntax.

From Calcite docs:

tableReference:
      tablePrimary
      [ FOR SYSTEM_TIME AS OF expression ]
      [ pivot ]
      [ unpivot ]
      [ matchRecognize ]
      [ [ AS ] alias [ '(' columnAlias [, columnAlias ]* ')' ] ]

tablePrimary:
      [ [ catalogName . ] schemaName . ] tableName
      '(' TABLE [ [ catalogName . ] schemaName . ] tableName ')'
  |   tablePrimary [ hintComment ] [ EXTEND ] '(' columnDecl [, columnDecl ]* ')'
  |   [ LATERAL ] '(' query ')'
  |   UNNEST '(' expression ')' [ WITH ORDINALITY ]
  |   [ LATERAL ] TABLE '(' [ SPECIFIC ] functionName '(' expression [, expression ]* ')' ')'

Inferred syntax from Parser.jj:

FromClause: FROM <TableRef> ...

TableRef: 
      <TableRefDetails> 
      [ <Pivot> ] [ <Unpivot> ] 
      [ [ AS ] SimpleIdentifier [ ParenthesizedSimpleIdentifierList ] ] 
      [ TABLESAMPLE ... ]

TableRefDetails:
      <TableRefWithHintsOpt> <TableOverOpt>
  |   [LATERAL] <ParenthesizedExpression> <TableOverOpt>
  |   UNNEST <ParenthesizedQueryOrCommaList>
  |   [LATERAL] TABLE '(' <TableFunctionCall> ')'
  |   <ExtendedTableRef>

TableRefWithHintsOpt:
      <CompoundTableIdentifier> [ <CommaSeparatedSqlHints> ] 

CompoundTableIdentifier: 
      <TableIdentifierSegment> ( . <TableIdentifierSegment> )*

AST-to-Rel Conversion

The AST is the SqlNode tree produced by the Calcite SQL parser. "Rel" (relational element) nodes are a set of relational operators and related items on the path to a physical query plan. The CalcitePlanner (Druid's version of PlannerImpl) rel() method does the deed.

  • The SqlToRelConverter class does the conversion.
  • It is parameterized with the validator, the catalog reader, the config and a few other items.
  • The convertQuery() method takes the validated SQL node and produces the root rel.
  • A few follow-on operations occur. (An end-to-end sketch follows this list.)
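
A minimal end-to-end sketch of the pipeline using the Frameworks planner (the query is arbitrary and the default config is used; Druid's CalcitePlanner wraps a similar parse / validate / rel sequence):

    import org.apache.calcite.plan.RelOptUtil;
    import org.apache.calcite.rel.RelNode;
    import org.apache.calcite.schema.SchemaPlus;
    import org.apache.calcite.sql.SqlNode;
    import org.apache.calcite.tools.FrameworkConfig;
    import org.apache.calcite.tools.Frameworks;
    import org.apache.calcite.tools.Planner;

    public class SqlToRelExample {
      public static void main(String[] args) throws Exception {
        SchemaPlus root = Frameworks.createRootSchema(true);
        FrameworkConfig config = Frameworks.newConfigBuilder()
            .defaultSchema(root)
            .build();
        Planner planner = Frameworks.getPlanner(config);

        SqlNode parsed = planner.parse("SELECT 1 + 2 AS x");  // AST (SqlNode tree)
        SqlNode validated = planner.validate(parsed);         // validator assigns types
        RelNode rel = planner.rel(validated).rel;             // SqlToRelConverter under the hood
        System.out.println(RelOptUtil.toString(rel));
      }
    }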

In the planner test framework the rel returned from conversion is visible in the plan section of a test case.
