-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change stance on SQL dialect support #130
Changes from 1 commit
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -436,187 +436,37 @@ The Data Connect service responds with the following table: | |
|
||
## SQL Functions | ||
|
||
Data Connect specification is implementation-agnostic and does not prescribe use of a relational database or a particular database technology. However, for convenience, its SQL dialect has been selected for compatibility with current major open source database platforms including Trino, PostgreSQL, MySQL, and well as BigQuery. There are occasional name or signature differences, but a Data Connect API implementation atop any of the major database platforms should be able to pass through queries that use the functions listed below with only minor tweaks. | ||
The Data Connect specification is implementation-agnostic and does not prescribe use of a relational database or a particular database technology. As a result, the exact SQL dialect that is available will vary between implementations of the standard. | ||
|
||
With Trino (formerly PrestoSQL) being a popular database choice in Data Connect implementations, we've chosen its grammar as the basis for the grammar supported by Data Connect. Functions below are a subset of those available in Trino 341, and must behave according to the Trino documentation in a conforming Data Connect implementation. To assist with implementations directly on other database platforms, the [Trino Functions Support Matrix](https://docs.google.com/document/d/1y51qNuoe2ELX9kCOyQbFB4jihiKt2N8Qcd6-zzadIvk) captures the differences between platforms in granular detail. | ||
The functions listed below SHOULD be supported by any implementation of Data Connect that supports the `search` endpoint. These functions are supported by major database platforms including Trino, PostgreSQL, MySQL and BigQuery. There are occasional name or signature differences, but a Data Connect API implementation atop any of the major database platforms should be able to pass through queries that use the functions listed below with only minor tweaks. | ||
|
||
* ga4gh_type (described above) | ||
* **Logical Operators** | ||
* `AND`, `OR`, `NOT` | ||
* **Comparison Operators** | ||
* `<`, `>`, `<=`, `>=`, `=`, `<>`, `!=` | ||
* `BETWEEN, IS NULL, IS NOT NULL` | ||
* `IS DISTINCT FROM`* | ||
* `IS NOT DISTINCT FROM`* | ||
* `GREATEST`, `LEAST` | ||
* Quantified Comparison Predicates: `ALL`, `ANY` and `SOME`* | ||
* `IS NULL, IS NOT NULL` | ||
* Pattern Comparison: `LIKE` | ||
* **Conditional Expressions** | ||
* `CASE`, `IF`, `COALESCE`, `NULLIF` | ||
* **Conversion Functions** | ||
* `cast(value AS type)` → `type` | ||
* `format(format, args...)` → `varchar` | ||
* **Mathematical Functions** | ||
* Most basic functions are supported across implementations. Notably missing are hyperbolic trig functions, infinity, floating point, and statistical/CDF functions. | ||
* `abs(x)` → [same as input] | ||
* `ceil(x)` → [same as input] | ||
* `ceiling(x)` → [same as input] | ||
* `degrees(x)` → `double`* | ||
* `exp(x)` → `double` | ||
* `floor(x)` → [same as input] | ||
* `ln(x)` → `double` | ||
* `log(b, x)` → `double` | ||
* `log10(x)` → `double` | ||
* `mod(n, m)` → [same as input] | ||
* `pi()` → `double` | ||
* `pow(x, p)` → `double`* | ||
* `power(x, p)` → `double` | ||
* `radians(x)` → `double`* | ||
* `round(x)` → [same as input] | ||
* `round(x, d)` → [same as input] | ||
* `sign(x)` → [same as input] | ||
* `sqrt(x)` → `double` | ||
* `truncate(x)` → `double`* | ||
* Random Functions: | ||
* `rand()` → `double`* | ||
* `random()` → `double`* | ||
* `random(n)` → [same as input]* | ||
* `random(m, n)` → [same as input]* | ||
* Trigonometric Functions: | ||
* `acos(x)` → `double` | ||
* `asin(x)` → `double` | ||
* `atan(x)` → `double` | ||
* `atan2(y, x)` → `double` | ||
* `cos(x)` → `double` | ||
* `sin(x)` → `double` | ||
* `tan(x)` → `double` | ||
* **Bitwise Functions** | ||
* `bitwise_and(x, y)` → `bigint` | ||
* `bitwise_or(x, y)` → `bigint` | ||
* `bitwise_xor(x, y)` → `bigint` | ||
* `bitwise_not(x)` → `bigint` | ||
* `bitwise_left_shift(value, shift)` → [same as value] | ||
* `bitwise_right_shift(value, shift, digits)` → [same as value] | ||
* `bit_count(x, bits)` → `bigint`* | ||
* **Regular Expression Functions** | ||
* `regexp_extract_all(string, pattern)` -> `array(varchar)`* | ||
* `regexp_extract_all(string, pattern, group)` -> `array(varchar)`* | ||
* `regexp_extract(string, pattern)` → `varchar`* | ||
* `regexp_extract(string, pattern, group)` → `varchar`* | ||
* `regexp_like(string, pattern)` → `boolean`* | ||
* `regexp_replace(string, pattern)` → `varchar`* | ||
* `regexp_replace(string, pattern, replacement)` → `varchar`* | ||
* `regexp_replace(string, pattern, function)` → `varchar`* | ||
jfuerth marked this conversation as resolved.
Show resolved
Hide resolved
|
||
* **UUID Functions** | ||
* `uuid()*` | ||
* **Session Information Functions** | ||
* `current_user`* | ||
* `IF` | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If I could only have one conditional expression, I'd want CASE. I can't think of a time I've seen IF in a SQL query, but CASE pops up quite a bit. I'm not aware of any incompatibilities between SQL implementations here. COALESCE isn't 100% necessary, but it's part of ANSI SQL and because of the way SQL handles nulls, it's often important to have. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Replaced IF with CASE. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Added COALESCE and IF both actually |
||
* **String manipulation** | ||
* **Operators:** | ||
* `Concatenation (||)`* | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think we will need a way to splice strings together There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Added, was an oversight! |
||
* `LIKE` | ||
* **Functions:** | ||
* `chr(n)` → `varchar`* | ||
* `codepoint(string)` → `integer`* | ||
* `format(format, args...)` → `varchar` | ||
* `length(string)` → `bigint` | ||
* `lower(string)` → `varchar` | ||
* `lpad(string, size, padstring)` → `varchar` | ||
* `ltrim(string)` → `varchar` | ||
* `position(substring IN string)` → `bigint`* | ||
* `replace(string, search, replace)` → `varchar` | ||
* `reverse(string)` → `varchar` | ||
* `rpad(string, size, padstring)` → `varchar` | ||
* `rtrim(string)` → `varchar` | ||
* `split(string, delimiter, limit)` -> `array(varchar)`* | ||
* `starts_with(string, substring)` → `boolean`* | ||
* `strpos(string, substring)` → `bigint`* | ||
* `substr(string, start)` → `varchar`* | ||
* `substring(string, start)` → `varchar` | ||
* `substr(string, start, length)` → `varchar`* | ||
* `substring(string, start, length)` → `varchar` | ||
* `trim(string)` → `varchar` | ||
* `upper(string)` → `varchar` | ||
* `substring(string, start)` → `varchar` | ||
* **Date manipulation** | ||
**Be aware of different quotation (‘) syntax requirements between MySQL and PostgreSQL. BigQuery does not support the +/- operators for dates. Convenience methods could be replaced with EXTRACT().** | ||
* **Operators:** | ||
* `+`, `- *` | ||
* `AT TIME ZONE`* | ||
* **Functions:** | ||
* `current_date` | ||
* `current_time` | ||
* `current_timestamp` | ||
* `current_timestamp(p)`* | ||
Comment on lines
-545
to
-548
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We'll need these (or now()) along with date arithmetic to convert a date of birth to an age at time of observation. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sorry, of course I should have said date of birth and observation date. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Added |
||
* `date(x)` → `date`* | ||
* `date_trunc(unit, x)` → [same as input]* | ||
* `date_add(unit, value, timestamp)` → [same as input]* | ||
* `date_diff(unit, timestamp1, timestamp2)` → `bigint`* | ||
* `extract(field FROM x)` → `bigint`* | ||
* `from_unixtime(unixtime)` -> `timestamp(3)`* | ||
* `from_unixtime(unixtime, zone)` → `timestamp(3) with time zone`* | ||
* `from_unixtime(unixtime, hours, minutes)` → `timestamp(3) with time zone`* | ||
* `Localtime`* | ||
* `localtimestamp`* | ||
* `localtimestamp(p)`* | ||
* `now()` → `timestamp(3)` with time zone* | ||
* `to_unixtime(timestamp)` → `double`* | ||
* **MySQL-like date functions:** | ||
* `date_format(timestamp, format)` → `varchar`* | ||
* `date_parse(string, format)` → `timestamp(3)`* | ||
* `extract(field FROM date)` | ||
* **Aggregate functions** | ||
**Note that Trino provides a much larger superset of functions. Bitwise, map, and approximate aggregations are mostly absent. Only BigQuery has a few native approximate aggregation functions. | ||
* `array_agg(x)` → `array<`[same as input]>* | ||
* `avg(x)` → `double` | ||
* `bool_and(boolean)` → `boolean`* | ||
* `bool_or(boolean)` → `boolean`* | ||
* `count(*)` → `bigint`* | ||
* `count(x)` → `bigint` | ||
* `count_if(x)` → `bigint`* | ||
* `every(boolean)` → `boolean`* | ||
* `max(x)` → [same as input] | ||
* `max(x, n)` → `array<`[same as x]>* | ||
* `min(x)` → [same as input] | ||
* `min(x, n)` → `array<`[same as x]>* | ||
* `sum(x)` → [same as input] | ||
* **Statistical Aggregate Functions:** | ||
* `corr(y, x)` → `double`* | ||
* `covar_pop(y, x)`→ `double`* | ||
* `covar_samp(y, x)` → `double`* | ||
* `stddev(x)` → `double` | ||
* `stddev_pop(x)` → `double` | ||
* `stddev_samp(x)` → `double` | ||
* `variance(x)` → `double` | ||
* `var_pop(x)` → `double` | ||
* `var_samp(x)` → `double` | ||
* **Window functions** | ||
* **Ranking Functions:** | ||
* `cume_dist()` → `bigint` | ||
* `dense_rank()` → `bigint` | ||
* `ntile(n)` → `bigint` | ||
* `percent_rank()` → `double` | ||
* `rank()` → `bigint` | ||
* `row_number()` → `bigint` | ||
* **Value Functions:** | ||
* `first_value(x)` → [same as input] | ||
* `last_value(x)` → [same as input] | ||
* `nth_value(x, offset)` → [same as input] | ||
* `lead(x[, offset[, default_value]])` → [same as input] | ||
* `lag(x[, offset[, default_value]])` → [same as input] | ||
* **JSON functions | ||
**In general, function signatures and behaviour differs across implementations for many JSON related functions. | ||
* `json_array_length(json)` → bigint* | ||
* `json_extract(json, json_path)` → json* | ||
* `json_extract_scalar(json, json_path)` → varchar* | ||
* `json_format(json)` → `varchar`* | ||
* `json_size(json, json_path)` → `bigint`* | ||
* Functions for working with nested and repeated data (ROW and ARRAY) | ||
See also UNNEST, which is part of the SQL grammar and allows working with nested arrays as if they were rows in a joined table. | ||
|
||
Note: Arrays are mostly absent in MySQL | ||
* Array Subscript Operator: [] | ||
* Array Concatenation Operator: || | ||
* `concat(array1, array2, ..., arrayN)` → `array` | ||
* `cardinality(x)` → `bigint`* | ||
* ga4gh_type (described above) | ||
* `count(*)` | ||
* `max(x)` | ||
* `min(x)` | ||
* `sum(x)` | ||
* **Structured data** | ||
* `json_extract(json, json_path)` | ||
* `unnest(array)` | ||
|
||
An implementation of Data Connect MAY support any number of additional SQL functions. | ||
|
||
Trino (formerly PrestoSQL) has proven to be a popular choice in existing Data Connect implementations, owing to the highly configurable nature of the engine. A simplified version of Trino's SQL grammar is presented in [Appendix A](#appendix-a-sql-grammar). | ||
|
||
To assist with implementations directly on other database platforms, the [Trino Functions Support Matrix](https://docs.google.com/document/d/1y51qNuoe2ELX9kCOyQbFB4jihiKt2N8Qcd6-zzadIvk) captures differences in the implementation of common functions in granular detail. | ||
|
||
## Pagination and Long Running Queries | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've just noticed
<
is not escaped correctly.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed this just to
<
since it shouldn't be escaped at all AFAIK.