Skip to content

Calcite Notes

Paul Rogers edited this page Aug 9, 2022 · 8 revisions

Introduction

See Glossary for concepts, and the set of patterns which Calcite uses.

Terms

Basics

  • A custom schema is created with a SchemaFactory.
  • How is the schema factory registered from code?
  • The schema returns an instance of Table. If updatable, then ModifiableTable.
  • Also, to push operations, FilterableTable or ProjectableFilterableTable.
  • Table gives rise to TableScan.
  • Optimization done via planner rules.
  • Each engine implements its own variation of the operators.
  • Operators have "conventions" which is the calling convention. The built-in convention is the enumerable convention.

See slide 23 of this presentation:

  • SQL is parsed to 'SqlNode'. A converter produces RelNodes.

Type Inference

Type inference happens in SqlValidatorImpl. The validator walks the tree to, among other things, validate the type of each node. Basics:

  • The planner invoked the CanclitePlanner.validate() method.
  • The validator invokes validateQuery() to walk the SELECT nodes.
  • If our expression is in the SELECT list (the most common case), then validateSelectItem() does the work.
  • deriveType() computes the type of that one item.
  • Most expressions are functions. The accept() method of the function does the function-specific validation.
  • The DeriveTpyeVisitor drives the process by validating operands. The visitor itself is not recursive: a different instance is used for each expression level.
  • Each function has a deriveType implementation which invokes valdateOperands and then inferReturnType.

The result is that each function is responsible for its own type inference. Along the way:

  • The validator builds up a map from node to type. The recursive process will repeatedly ask the type of a node. The validator uses the cache to avoid running the type inference multiple times. Note that this means that node type is not an attribute of the SqlNode, it is kept in a separate work space.
  • checkOperandTypes checks if an operand can be cast from its actual type to the desired argument type.
  • While one might expect deriveType() to build up a list of operand types, it does so indirectly: first putting the types in the cache, then pulling them out again.
  • deriveType() says that "type coercion" is disabled "for builtin operator operands". Since "they are handled by the TypeCoercion specifically".
  • deriveType() sets the operator on the call, then lets the call validate the operands.
  • An elaborate mechanism exists to infer types, based on a set of rules. For plus, the rule in question is ReturnTypes.LEAST_RESTRICTIVE, based on a call to SqlTypeFactory.leastRestrictive(), which computes the widest of the operand types.

Each SqlCall provides a type inference algorithm in the form of SqlOperandTypeInference. InferTypes provides some standard forms: FIRST_KNOWN, RETURN_TYPE, etc. Functions (presumably operators) can provide a custom version. Oddly, the addition operator uses FIRST_KNOWN. (Question, when is widening done?)

Calcite Table Functions

Setup:

  • A Calcite Schema can hold a Function, which may be a compile-time function (AKA "macro".)
  • A TableMacro is a kind of Function that produces a TranslatableTable.
  • The parser reduces the table call to a SqlStdOperatorTable.COLLECTION_TABLE.createCall(), which is SqlCollectionTableOperator.
  • On the other hand, EXPLICIT_TABLE "allows an identifier to be explicitly flagged as a table, For example, select * from (TABLE t) or TABLE t." (If used, the TABLE prefix form can't literally be used? Would clash with current usage?)
  • Thus, the parser invokes SqlCollectionTableOperator to create the table function call. At this point, all the parser has is names and operands.
  • ExternalOperatorConversion is the bridge from function name to function definition (e.g. ExternalOperator.) It is registered via SqlBindings.addOperatorConversion() in a Guice module.
  • The Set of SqlOperatorConversion items is injected into the DruidOperatorTable.

In validate() step:

  • SqlValidatorImpl.performUnconditionalRewrites looks up the function by name using the operator table, with the result as a SqlBasicCall.
  • DruidOperatorTable.lookupOperatorOverloads looks up the function (by name). Gets the SqlOperatorConversion from the definition by calling calciteOperator(), which becomes the result above. (There is only one overload in this case.)
  • Above is repeated a second time during validation.
  • SqlUtil.lookupSubjectRoutesByName does the second call, and filters by type which, for our use, is SqlSyntax of either FUNCTION or SPECIAL.
  • ProcedureNamespace.validate() calls valiateImpl() which calls getTable() on the UDF (the ExternalOperatorConversion) to get a ? on which it then calls getRowType(). Here, it seems each procedure defines a "namespace", of name CATALOG.
  • ExternalOperatorConversion.getTable() is in the base class. It calls SqlUserDefinedTableMacro.convertArguments() to get the list of arguments, which are computed from a call to ExternalTableMacro.getParameters() compared against the actual operand list.
  • ExternalOperatorConversion.getTable: “Returns the table in this UDF, or null if there is no table.” The rest is details of how this is done by default.
  • convertArguments() converts arguments to Java format. At this point, there is a 1:1 mapping between actual arguments and parameter definitions. The type of the definition is used to coerce the actual argument to the desired type. The list of parameters is obtained from ExternalTableMacro.
  • ExternalOperatorConversion.getTable() after converting the arguments, calls ExternalTableMacro.apply() to “apply” the arguments and return a TranslatableTable table. That is, we go from call + arguments to a table node. In this case, a DruidTable.

Once we have the Druid table, then the table resolution process kicks in.

  • ExternalTableMacro
  • ExternalTableScan - Rel node for the table scan. Holds the Druid table and JSON mapper.
  • ExternalTableScanRule - Converts the ExternalTableScan rel node into a DruidQueryRel.
  • ExternalDataSource - Attached to a DruidTable to define the external data source.

From Drill:

  • SqlKind.COLLECTION_TABLE is said to support TABLE(udx(x, y, z)), which is close to what we want.
  • Specifically, TABLE(<file name>(<pluginProp> => <value>, ...))
  • Drill has a way have "udx" above reference a format plugin.
  • Used in DrillTableInfo to do the parameter/arg conversion.
  • Inside, the only argument must be a SqlUserDefinedTableMacro, with the first argument as an identifier.
  • Can also be an SqlKind.IDENTIFIER directly.
  • From plugins, get the list of fields, convert to args, use plugin name (?) to define a table within the proper schema namespace.
  • This means that the set of arguments is known ahead of time: by inspecting the format plugin.

Other notes:

  • OperandTypes.VARIADIC: "Operand type-checking strategy for an operator which takes no operands."
  • TranslatableTable: "Extension to Table that specifies how it is to be translated to a relational expression. It is optional for a Table to implement this interface. If Table does not implement this interface, it will be converted to an EnumerableTableScan. Generally a Table will implement this interface to create a particular subclass of RelNode, and also register rules that act on that particular subclass of RelNode."
  • The operator's deriveType() method is called with SqlValidator and SqlValidatorScope, which is used to look up names, such as table names.

Other Notes

Calcite provides a planner, Planner and PlannerImpl which appears more of an example. Druid works around the fact that the validator is not visible, but has lots of useful information. Drill, for example, incorporates the functionality .

Parser Table Syntax

Calcite provides this table syntax, but the syntax seems incomplete relative to the actual parser.

Calcite provides a way to extend the table ref syntax.

From Calcite docs:

tableReference:
      tablePrimary
      [ FOR SYSTEM_TIME AS OF expression ]
      [ pivot ]
      [ unpivot ]
      [ matchRecognize ]
      [ [ AS ] alias [ '(' columnAlias [, columnAlias ]* ')' ] ]

tablePrimary:
      [ [ catalogName . ] schemaName . ] tableName
      '(' TABLE [ [ catalogName . ] schemaName . ] tableName ')'
  |   tablePrimary [ hintComment ] [ EXTEND ] '(' columnDecl [, columnDecl ]* ')'
  |   [ LATERAL ] '(' query ')'
  |   UNNEST '(' expression ')' [ WITH ORDINALITY ]
  |   [ LATERAL ] TABLE '(' [ SPECIFIC ] functionName '(' expression [, expression ]* ')' ')'

Inferred syntax from Parser.jj:

FromClause: FROM <TableRef> ...

TableRef: 
      <TableRefDetails> 
      [ <Pivot> ] [ <Unpivot> ] 
      [ [ AS ] SimpleIdentifier [ ParenthesizedSimpleIdentifierList ] ] 
      [ TABLESAMPLE ... ]

TableRefDetails:
      <TableRefWithHintsOpt> <TableOverOpt>
  |   [LATERAL] <ParenthesizedExpression> <TableOverOpt>
  |   UNNEST <ParenthesizedQueryOrCommaList>
  |   [LATERAL] TABLE '(' <TableFunctionCall> ')'
  |   <ExtendedTableRef>

TableRefWithHintsOpt:
      <CompoundTableIdentifier> [ <CommaSeparatedSqlHints> ] 

CompoundTableIdentifier: 
      <TableIdentifierSegment> ( . <TableIdentifierSegment> )*

AST-to-Rel Conversion

The AST is that produced by the Calcite SQL planner. "Rel" (relational element) nodes are a set of relational operators and related items on the path to a physical query plan. The CalcitePlanner (Druid's version of PlannerImpl), rel(.) method does the deed.

  • The SqlToRelConverter class does the conversion.
  • It is parameterized with the validator, the catalog reader, the config and a few other items.
  • The convertQuery() method takes the validated SQL node and produces the root rel.
  • A few follow-on operations occur.

In the planner test framework the rel returned from conversion is visible in the plan section of a test case.