Update timestamp.md #4458
I agree that the docs need updating (and the blog post now has some errors in it), so thanks for the wakeup call, but the PR seems like an attempt to help people use naïve timestamps instead of avoiding them (which would make their lives better). I'll take a stab at it myself.
> Since there is not currently a `TIMESTAMP_NS WITH TIME ZONE` data type, external columns with nanosecond precision and "instant semantics", e.g., [Parquet timestamp columns with `isAdjustedToUTC=true`](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#instant-semantics-timestamps-normalized-to-utc), lose precision when read using DuckDB.

Timestamps can be created using the `TIMESTAMP` keyword (or its variants), where the data must be formatted according to the ISO 8601 format (`YYYY-MM-DD hh:mm:ss[.zzzzzz][+-TT[:tt]]`; three extra decimal places are supported by `TIMESTAMP_NS`). Decimal places beyond the targeted sub-second precision are ignored.

The `WITH TIME ZONE` data types exhibit *instant* semantics, which means that they represent points in absolute time, called *instants*, and are *displayed* in the system or a configured time zone. They require the [ICU extension]({% link docs/extensions/icu.md %}) to be installed. The `WITHOUT TIME ZONE` data types exhibit *local* semantics, which means they represent a local value of time for an unspecified observer. As such, a `WITHOUT TIME ZONE` data type together *with* a time zone defines an *instant* that can be stored in a `WITH TIME ZONE` data type:
All timestamp types use instant semantics. Instants are just a count from an epoch.
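To make the "count from an epoch" idea concrete, here is a minimal Python sketch (standard library only, not DuckDB code). The helper names `to_micros` and `from_micros` are hypothetical, and microsecond resolution is assumed here to mirror the plain `TIMESTAMP` type. Two wall-clock readings with different UTC offsets that denote the same instant produce the same count.

```python
from datetime import datetime, timedelta, timezone

# The epoch used for the count: 1970-01-01 00:00:00+00.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_micros(instant: datetime) -> int:
    """Whole microseconds between the epoch and a time-zone-aware datetime.

    Hypothetical helper for illustration only.
    """
    return (instant - EPOCH) // timedelta(microseconds=1)


def from_micros(micros: int) -> datetime:
    """Inverse of to_micros: rebuild the instant, expressed in UTC."""
    return EPOCH + timedelta(microseconds=micros)


# The same instant, read off two clocks with different UTC offsets.
utc_reading = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
plus9_reading = datetime(2024, 1, 1, 21, 0, tzinfo=timezone(timedelta(hours=9)))

# Both readings collapse to one integer count.
same_count = to_micros(utc_reading) == to_micros(plus9_reading)
```

The offset only affects how the instant is written down; the stored count does not depend on it.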
---

Timestamps represent points in absolute time, usually called *instants*.
DuckDB represents instants as the number of microseconds (µs) (or nanoseconds, for `TIMESTAMP_NS`) since `1970-01-01 00:00:00+00`.
Timestamps represent points in time and are internally stored as the `INT64` number of seconds / milliseconds / microseconds / nanoseconds since `1970-01-01 00:00:00+00`, depending on the chosen variant. Informally speaking, they contain both [`DATE`]({% link docs/sql/data_types/date.md %}) (year, month, day) and [`TIME`]({% link docs/sql/data_types/time.md %}) (hour, minute, second, microsecond or nanosecond) information.
Restore "absolute".
| `TIMESTAMP_MS` | | timestamp with millisecond precision (ignores time zone) |
| `TIMESTAMP_S` | | timestamp with second precision (ignores time zone) |
| `TIMESTAMPTZ` | `TIMESTAMP WITH TIME ZONE` | timestamp with microsecond precision (uses time zone) |
| `TIMESTAMP_NS` | | timestamp with nanosecond precision (local semantics) |
"ignores time zone" was correct! I don't know what "local semantics" means, but if you want to be precise, UTC would be more accurate. To be local, you would need a time zone.
A timestamp specifies a combination of [`DATE`]({% link docs/sql/data_types/date.md %}) (year, month, day) and a [`TIME`]({% link docs/sql/data_types/time.md %}) (hour, minute, second, microsecond or nanosecond). Timestamps can be created using the `TIMESTAMP` keyword, where the data must be formatted according to the ISO 8601 format (`YYYY-MM-DD hh:mm:ss[.zzzzzz][+-TT[:tt]]`; three extra decimal places are supported by `TIMESTAMP_NS`). Decimal places beyond the targeted sub-second precision are ignored.

> Warning Since there is not currently a `TIMESTAMP_NS WITH TIME ZONE` data type, external columns with nanosecond precision and instant semantics, e.g., [Parquet timestamp columns with `isAdjustedToUTC=true`](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#instant-semantics-timestamps-normalized-to-utc), are converted to `TIMESTAMP WITH TIME ZONE` and thus lose precision when read using DuckDB.
Maybe move warnings down to a bottom section? Or move this to the parquet documentation?
> Warning It is possible to convert between `WITH TIME ZONE` and `WITHOUT TIME ZONE` types using regular explicit and even implicit casts, which perform the same computation as the `timezone` function above, using the system or configured time zone. Similarly, there are regular explicit casts from `WITH TIME ZONE` data types to `DATE` and `TIME`, which also use the system or configured time zone.

> Bestpractice To convert between `WITH TIME ZONE` and `WITHOUT TIME ZONE` data, use the `timezone` function with explicitly specified time zone instead of explicit or implicit casts. Similarly, to convert `WITH TIME ZONE` to `DATE` or `TIME` data, use the `timezone` function to create `WITHOUT TIME ZONE` data using an explicitly specified time zone, and then convert that to `DATE` or `TIME`.
"Best practice" here and below.
That's a built-in box type:
Line 20 in b76956f
case "Bestpractice":
The `WITH TIME ZONE` data types exhibit *instant* semantics, which means that they represent points in absolute time, called *instants*, and are *displayed* in the system or a configured time zone. They require the [ICU extension]({% link docs/extensions/icu.md %}) to be installed. The `WITHOUT TIME ZONE` data types exhibit *local* semantics, which means they represent a local value of time for an unspecified observer. As such, a `WITHOUT TIME ZONE` data type together *with* a time zone defines an *instant* that can be stored in a `WITH TIME ZONE` data type:

```sql
```
This section is basically a long explanation of how to use naïve timestamps, which is itself not a best practice. I suppose it is a data cleaning task, but maybe it should be moved to a tips & tricks section somewhere?
@hawkfish Addressing your comments jointly here instead of in-line, since I believe it all boils down to this: I was in need of a succinct pair of labels for the two types of timestamps and found the "local" vs "instant" terminology at https://github.com/apache/parquet-format/blob/master/LogicalTypes.md, which distinguishes between "Instant semantics (timestamps normalized to UTC)" and "Local semantics (timestamps not normalized to UTC)". I am not an expert on the topic and believe you if you say that this is improper terminology (or that it's proper but I have misunderstood that page). Is there a pair of two-word descriptions to distinguish the two types that you're happy with? The "ignores time zone" and "uses time zone" pair is a fine pair of labels for describing how functions act on these data types, but it's not as easily linked to descriptions of what kind of data is morally stored in the two types of timestamps, like in the following sentence from the PR:
(btw, the "local" in this sentence was what convinced me I understood the apache docs correctly, but again I may well just have misunderstood things). I don't see why you think the following paragraph explains how to use naive timestamps (for clarification, by "naive", do you mean
Thanks for the Apache reference. What they call "local" is often called "naïve", which is the terminology I have been using, but I see why they might call it that. In my experience, "local" times are a data cleaning problem/headache. Once you have cleaned them to instants, you can then decide if you need sophisticated time zone binning or just UTC display. So I take the (possibly overly pedantic) view that all timestamps are instants, and the "local" thing is something to be avoided. In this view, plain

The problem with local timestamps is that they have holes and collisions. More generally, they treat a set of temporal bins as PGC/UTC bins and generate the corresponding instant, but that mapping is not bijective. It is even worse if you have a non-Gregorian calendar with 13 months (and these non-Gregorian calendars do get used: I have run into people using Chinese, Hebrew and Japanese calendars.)

So I see what you were trying to explain, but I'm not sure that it belongs in the data type description. In particular, there is documentation on the
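The holes and collisions mentioned above can be illustrated with a small Python sketch (standard library only, not DuckDB code). As a simplifying assumption, it hard-codes the two 2024 daylight-saving transitions for US Eastern time instead of consulting a time zone database, and the helper names `utc_offset` and `candidate_instants` are hypothetical. A local reading inside the spring gap maps to no instant, and one inside the autumn overlap maps to two, so the local-to-instant mapping is not bijective.

```python
from datetime import datetime, timedelta, timezone

# Assumption: hard-coded 2024 transitions for US Eastern time, given as instants.
DST_START = datetime(2024, 3, 10, 7, 0, tzinfo=timezone.utc)  # 02:00 EST -> 03:00 EDT
DST_END = datetime(2024, 11, 3, 6, 0, tzinfo=timezone.utc)    # 02:00 EDT -> 01:00 EST
STANDARD = timedelta(hours=-5)
DAYLIGHT = timedelta(hours=-4)


def utc_offset(instant: datetime) -> timedelta:
    """UTC offset in force at a given instant (instant -> local is well defined)."""
    return DAYLIGHT if DST_START <= instant < DST_END else STANDARD


def candidate_instants(local: datetime) -> list:
    """All instants whose wall-clock reading equals the naive datetime `local`."""
    results = []
    for offset in (DAYLIGHT, STANDARD):
        # Interpret the local reading under this offset...
        instant = (local - offset).replace(tzinfo=timezone.utc)
        # ...and keep it only if that offset is really in force at that instant.
        if utc_offset(instant) == offset:
            results.append(instant)
    return results


hole = candidate_instants(datetime(2024, 3, 10, 2, 30))       # skipped wall-clock time
collision = candidate_instants(datetime(2024, 11, 3, 1, 30))  # repeated wall-clock time
ordinary = candidate_instants(datetime(2024, 7, 1, 12, 0))    # unambiguous reading

print(len(hole), len(collision), len(ordinary))
```

Going from an instant to a local reading always succeeds, which is why cleaning local values into instants once, with an explicit time zone, avoids having to resolve these cases repeatedly.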
(Oh, I have my own PR up with some of the material from the blog post, and you were right it was getting lame to refer to it instead of the docs themselves...)
and
and
I disagree exactly because of binning and collisions: In my view, Similarly, I don't view arithmetic on Anyway, I'm happy for you to take over, and will leave my comments on your PR and possibly create a new one afterwards with links to other pages or fixes for what I think might be missing on other pages.
Add a more succinct version of https://duckdb.org/2022/01/06/time-zones.html to the docs.
Having to always refer to a blog post in issues/discussions feels like the docs are incomplete (which I believe they were, hence the PR).
Best practice / warning boxes might be controversial, maybe @hawkfish has an opinion (being the author of the blog post)?