Change stance on SQL dialect support #130

Merged
merged 4 commits into from
Jun 9, 2021
190 changes: 20 additions & 170 deletions SPEC.md
@@ -436,187 +436,37 @@ The Data Connect service responds with the following table:

## SQL Functions

Data Connect specification is implementation-agnostic and does not prescribe use of a relational database or a particular database technology. However, for convenience, its SQL dialect has been selected for compatibility with current major open source database platforms including Trino, PostgreSQL, MySQL, as well as BigQuery. There are occasional name or signature differences, but a Data Connect API implementation atop any of the major database platforms should be able to pass through queries that use the functions listed below with only minor tweaks.
The Data Connect specification is implementation-agnostic and does not prescribe use of a relational database or a particular database technology. As a result, the exact SQL dialect that is available will vary between implementations of the standard.

With Trino (formerly PrestoSQL) being a popular database choice in Data Connect implementations, we've chosen its grammar as the basis for the grammar supported by Data Connect. Functions below are a subset of those available in Trino 341, and must behave according to the Trino documentation in a conforming Data Connect implementation. To assist with implementations directly on other database platforms, the [Trino Functions Support Matrix](https://docs.google.com/document/d/1y51qNuoe2ELX9kCOyQbFB4jihiKt2N8Qcd6-zzadIvk) captures the differences between platforms in granular detail.
The functions listed below SHOULD be supported by any implementation of Data Connect that supports the `search` endpoint. These functions are supported by major database platforms including Trino, PostgreSQL, MySQL and BigQuery. There are occasional name or signature differences, but a Data Connect API implementation atop any of the major database platforms should be able to pass through queries that use the functions listed below with only minor tweaks.

* ga4gh_type (described above)
* **Logical Operators**
* `AND`, `OR`, `NOT`
* **Comparison Operators**
* `<`, `>`, `<=`, `>=`, `=`, `<>`, `!=`
Collaborator:

I've just noticed < is not escaped correctly.

Contributor Author:

I changed this just to < since it shouldn't be escaped at all AFAIK.

* `BETWEEN, IS NULL, IS NOT NULL`
* `IS DISTINCT FROM`*
* `IS NOT DISTINCT FROM`*
* `GREATEST`, `LEAST`
* Quantified Comparison Predicates: `ALL`, `ANY` and `SOME`*
* `IS NULL, IS NOT NULL`
* Pattern Comparison: `LIKE`
* **Conditional Expressions**
* `CASE`, `IF`, `COALESCE`, `NULLIF`
* **Conversion Functions**
* `cast(value AS type)` → `type`
* `format(format, args...)` → `varchar`
* **Mathematical Functions**
* Most basic functions are supported across implementations. Notably missing are hyperbolic trig functions, infinity, floating point, and statistical/CDF functions.
* `abs(x)` → [same as input]
* `ceil(x)` → [same as input]
* `ceiling(x)` → [same as input]
* `degrees(x)` → `double`*
* `exp(x)` → `double`
* `floor(x)` → [same as input]
* `ln(x)` → `double`
* `log(b, x)` → `double`
* `log10(x)` → `double`
* `mod(n, m)` → [same as input]
* `pi()` → `double`
* `pow(x, p)` → `double`*
* `power(x, p)` → `double`
* `radians(x)` → `double`*
* `round(x)` → [same as input]
* `round(x, d)` → [same as input]
* `sign(x)` → [same as input]
* `sqrt(x)` → `double`
* `truncate(x)` → `double`*
* Random Functions:
* `rand()` → `double`*
* `random()` → `double`*
* `random(n)` → [same as input]*
* `random(m, n)` → [same as input]*
* Trigonometric Functions:
* `acos(x)` → `double`
* `asin(x)` → `double`
* `atan(x)` → `double`
* `atan2(y, x)` → `double`
* `cos(x)` → `double`
* `sin(x)` → `double`
* `tan(x)` → `double`
* **Bitwise Functions**
* `bitwise_and(x, y)` → `bigint`
* `bitwise_or(x, y)` → `bigint`
* `bitwise_xor(x, y)` → `bigint`
* `bitwise_not(x)` → `bigint`
* `bitwise_left_shift(value, shift)` → [same as value]
* `bitwise_right_shift(value, shift, digits)` → [same as value]
* `bit_count(x, bits)` → `bigint`*
* **Regular Expression Functions**
* `regexp_extract_all(string, pattern)` → `array(varchar)`*
* `regexp_extract_all(string, pattern, group)` → `array(varchar)`*
* `regexp_extract(string, pattern)` → `varchar`*
* `regexp_extract(string, pattern, group)` → `varchar`*
* `regexp_like(string, pattern)` → `boolean`*
* `regexp_replace(string, pattern)` → `varchar`*
* `regexp_replace(string, pattern, replacement)` → `varchar`*
* `regexp_replace(string, pattern, function)` → `varchar`*
jfuerth marked this conversation as resolved.
* **UUID Functions**
* `uuid()`*
* **Session Information Functions**
* `current_user`*
* `IF`
Contributor:

If I could only have one conditional expression, I'd want CASE. I can't think of a time I've seen IF in a SQL query, but CASE pops up quite a bit. I'm not aware of any incompatibilities between SQL implementations here.

COALESCE isn't 100% necessary, but it's part of ANSI SQL and because of the way SQL handles nulls, it's often important to have.

Contributor Author:

Replaced IF with CASE.

Contributor Author:

Added COALESCE and IF both actually

* **String manipulation**
* **Operators:**
* `Concatenation (||)`*
Contributor:

I think we will need a way to splice strings together

Contributor Author:

Added, was an oversight!

* `LIKE`
* **Functions:**
* `chr(n)` → `varchar`*
* `codepoint(string)` → `integer`*
* `format(format, args...)` → `varchar`
* `length(string)` → `bigint`
* `lower(string)` → `varchar`
* `lpad(string, size, padstring)` → `varchar`
* `ltrim(string)` → `varchar`
* `position(substring IN string)` → `bigint`*
* `replace(string, search, replace)` → `varchar`
* `reverse(string)` → `varchar`
* `rpad(string, size, padstring)` → `varchar`
* `rtrim(string)` → `varchar`
* `split(string, delimiter, limit)` → `array(varchar)`*
* `starts_with(string, substring)` → `boolean`*
* `strpos(string, substring)` → `bigint`*
* `substr(string, start)` → `varchar`*
* `substring(string, start)` → `varchar`
* `substr(string, start, length)` → `varchar`*
* `substring(string, start, length)` → `varchar`
* `trim(string)` → `varchar`
* `upper(string)` → `varchar`
* `substring(string, start)` → `varchar`
* **Date manipulation**
**Be aware of different quotation (‘) syntax requirements between MySQL and PostgreSQL. BigQuery does not support the +/- operators for dates. Convenience methods could be replaced with EXTRACT().**
* **Operators:**
* `+`, `-`*
* `AT TIME ZONE`*
* **Functions:**
* `current_date`
* `current_time`
* `current_timestamp`
* `current_timestamp(p)`*
Comment on lines -545 to -548
Contributor:

We'll need these (or now()) along with date arithmetic to convert a date of birth to an age at time of observation.

Contributor:

Sorry, of course I should have said date of birth and observation date.

Contributor Author:

Added

* `date(x)` → `date`*
* `date_trunc(unit, x)` → [same as input]*
* `date_add(unit, value, timestamp)` → [same as input]*
* `date_diff(unit, timestamp1, timestamp2)` → `bigint`*
* `extract(field FROM x)` → `bigint`*
* `from_unixtime(unixtime)` → `timestamp(3)`*
* `from_unixtime(unixtime, zone)` → `timestamp(3) with time zone`*
* `from_unixtime(unixtime, hours, minutes)` → `timestamp(3) with time zone`*
* `localtime`*
* `localtimestamp`*
* `localtimestamp(p)`*
* `now()` → `timestamp(3) with time zone`*
* `to_unixtime(timestamp)` → `double`*
* **MySQL-like date functions:**
* `date_format(timestamp, format)` → `varchar`*
* `date_parse(string, format)` → `timestamp(3)`*
* `extract(field FROM date)`
* **Aggregate functions**
**Note that Trino provides a much larger superset of functions. Bitwise, map, and approximate aggregations are mostly absent. Only BigQuery has a few native approximate aggregation functions.**
* `array_agg(x)` → `array<`[same as input]`>`*
* `avg(x)` → `double`
* `bool_and(boolean)` → `boolean`*
* `bool_or(boolean)` → `boolean`*
* `count(*)` → `bigint`*
* `count(x)` → `bigint`
* `count_if(x)` → `bigint`*
* `every(boolean)` → `boolean`*
* `max(x)` → [same as input]
* `max(x, n)` → `array<`[same as x]`>`*
* `min(x)` → [same as input]
* `min(x, n)` → `array<`[same as x]`>`*
* `sum(x)` → [same as input]
* **Statistical Aggregate Functions:**
* `corr(y, x)` → `double`*
* `covar_pop(y, x)` → `double`*
* `covar_samp(y, x)` → `double`*
* `stddev(x)` → `double`
* `stddev_pop(x)` → `double`
* `stddev_samp(x)` → `double`
* `variance(x)` → `double`
* `var_pop(x)` → `double`
* `var_samp(x)` → `double`
* **Window functions**
* **Ranking Functions:**
* `cume_dist()` → `bigint`
* `dense_rank()` → `bigint`
* `ntile(n)` → `bigint`
* `percent_rank()` → `double`
* `rank()` → `bigint`
* `row_number()` → `bigint`
* **Value Functions:**
* `first_value(x)` → [same as input]
* `last_value(x)` → [same as input]
* `nth_value(x, offset)` → [same as input]
* `lead(x[, offset[, default_value]])` → [same as input]
* `lag(x[, offset[, default_value]])` → [same as input]
* **JSON functions**
**In general, function signatures and behaviour differ across implementations for many JSON-related functions.**
* `json_array_length(json)` → `bigint`*
* `json_extract(json, json_path)` → `json`*
* `json_extract_scalar(json, json_path)` → `varchar`*
* `json_format(json)` → `varchar`*
* `json_size(json, json_path)` → `bigint`*
* **Functions for working with nested and repeated data (ROW and ARRAY)**
See also `UNNEST`, which is part of the SQL grammar and allows working with nested arrays as if they were rows in a joined table.

Note: Arrays are mostly absent in MySQL.
* Array Subscript Operator: `[]`
* Array Concatenation Operator: `||`
* `concat(array1, array2, ..., arrayN)` → `array`
* `cardinality(x)` → `bigint`*
* ga4gh_type (described above)
* `count(*)`
* `max(x)`
* `min(x)`
* `sum(x)`
* **Structured data**
* `json_extract(json, json_path)`
* `unnest(array)`

An implementation of Data Connect MAY support any number of additional SQL functions.

Trino (formerly PrestoSQL) has proven to be a popular choice in existing Data Connect implementations, owing to the highly configurable nature of the engine. A simplified version of Trino's SQL grammar is presented in [Appendix A](#appendix-a-sql-grammar).

To assist with implementations directly on other database platforms, the [Trino Functions Support Matrix](https://docs.google.com/document/d/1y51qNuoe2ELX9kCOyQbFB4jihiKt2N8Qcd6-zzadIvk) captures differences in the implementation of common functions in granular detail.
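
Because support for functions beyond the list above varies between implementations, a client that wants portable queries may find it useful to check which functions a query calls. The following is a minimal Python sketch of such a check, not part of the specification. The names `PORTABLE_FUNCTIONS`, `NON_FUNCTION_KEYWORDS` and `non_portable_functions` are hypothetical, the function set is one reading of the list above, and the regular expression is a rough approximation rather than a real SQL parser.

```python
import re

# Assumption: one reading of the functions the list above says implementations
# SHOULD support. Operators and CASE are omitted because they are not written
# as name(...) calls.
PORTABLE_FUNCTIONS = frozenset({
    "ga4gh_type",
    "if", "coalesce", "nullif",
    "substring",
    "extract",
    "count", "max", "min", "sum",
    "json_extract",
    "unnest",
})

# Keywords that may be followed by "(" without being function calls.
NON_FUNCTION_KEYWORDS = frozenset({
    "select", "from", "where", "and", "or", "not", "in", "on", "as",
    "exists", "join", "when", "then", "else", "over", "by", "having",
    "values", "between", "like", "all", "any", "some", "union",
})

_STRING_LITERAL = re.compile(r"'(?:[^']|'')*'")
_CALL = re.compile(r"\b([A-Za-z_][A-Za-z0-9_]*)\s*\(")


def non_portable_functions(sql):
    """Return lowercased names of called functions outside PORTABLE_FUNCTIONS."""
    # Blank out string literals so text such as 'f(x)' is not mistaken for a call.
    without_strings = _STRING_LITERAL.sub("''", sql)
    called = {m.group(1).lower() for m in _CALL.finditer(without_strings)}
    return called - PORTABLE_FUNCTIONS - NON_FUNCTION_KEYWORDS


# Hypothetical query: upper() is not in the list above, count() is.
query = "SELECT upper(name), count(*) FROM patients GROUP BY upper(name)"
print(sorted(non_portable_functions(query)))  # prints ['upper']
```

An empty result only suggests that the query stays within the listed functions. It does not guarantee identical behaviour across implementations, since names and signatures can still differ between platforms.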

## Pagination and Long Running Queries
