
Update timestamp.md #4458

Closed
wants to merge 11 commits into from

@soerenwolfers soerenwolfers commented Jan 1, 2025

Add a more succinct version of https://duckdb.org/2022/01/06/time-zones.html to the docs.

Having to always refer to a blog post in issues/discussions makes the docs feel incomplete (which I believe they were, hence the PR).

Best practice / warning boxes might be controversial, maybe @hawkfish has an opinion (being the author of the blog post)?


@hawkfish hawkfish left a comment


I agree that the docs need updating (and the blog post now has some errors in it), so thanks for the wake-up call, but the PR seems like an attempt to help people use naïve timestamps instead of avoiding them (which would make their lives better). I'll take a stab at it myself.

> Since there is not currently a `TIMESTAMP_NS WITH TIME ZONE` data type, external columns with nano-second precision and "instant semantics", e.g., [parquet timestamp columns with `isAdjustedToUTC=true`](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#instant-semantics-timestamps-normalized-to-utc), lose precision when read using DuckDB.
Timestamps can be created using the `TIMESTAMP` keyword (or its variants), where the data must be formatted according to the ISO 8601 format (`YYYY-MM-DD hh:mm:ss[.zzzzzz][+-TT[:tt]]`, with three extra decimal places supported by `TIMESTAMP_NS`). Decimal places beyond the targeted sub-second precision are ignored.

The `WITH TIME ZONE` data types exhibit *instant* semantics, which means that they represent points in absolute time, called *instants*, and are *displayed* in the system or a configured time zone. They require the [ICU extension]({% link docs/extensions/icu.md %}) to be installed. The `WITHOUT TIME ZONE` data types exhibit *local* semantics, which means they represent a local value of time for an unspecified observer. As such, a `WITHOUT TIME ZONE` data type together *with* a time zone defines an *instant* that can be stored in a `WITH TIME ZONE` data type:
Contributor

All timestamp types use instant semantics. Instants are just a count from an epoch.
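The "count from an epoch" view can be illustrated outside DuckDB. The following is a minimal Python sketch, not DuckDB's implementation: the helper names `from_micros` and `to_micros` and the sample count are assumptions chosen for illustration. It shows that one integer count identifies one instant, and that rendering it at different UTC offsets changes only the display.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def from_micros(count):
    """Turn a microsecond count since the Unix epoch into an offset-aware datetime."""
    return EPOCH + timedelta(microseconds=count)

def to_micros(instant):
    """Turn an offset-aware datetime back into its microsecond count."""
    return (instant - EPOCH) // timedelta(microseconds=1)

# Hypothetical sample value: 2025-01-01 12:00:00 UTC.
instant = from_micros(1_735_732_800_000_000)

# Two renderings of the same instant; only the displayed offset differs.
shown_utc = instant.isoformat()
shown_plus9 = instant.astimezone(timezone(timedelta(hours=9))).isoformat()

print(shown_utc)
print(shown_plus9)
print(to_micros(instant.astimezone(timezone(timedelta(hours=9)))))
```

Converting the re-rendered value back yields the original count, which is the sense in which the display zone does not affect the stored instant.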

---

Timestamps represent points in absolute time, usually called *instants*.
DuckDB represents instants as the number of microseconds (µs) (or nanoseconds, for `TIMESTAMP_NS`) since `1970-01-01 00:00:00+00`.
Timestamps represent points in time and are internally stored as the `INT64` number of seconds / milliseconds / microseconds / nanoseconds since `1970-01-01 00:00:00+00`, depending on the chosen variant. Informally speaking, they contain both [`DATE`]({% link docs/sql/data_types/date.md %}) (year, month, day) and [`TIME`]({% link docs/sql/data_types/time.md %}) (hour, minute, second, microsecond or nanosecond) information.
Contributor

Restore "absolute".

| `TIMESTAMP_MS` | | timestamp with millisecond precision (ignores time zone) |
| `TIMESTAMP_S` | | timestamp with second precision (ignores time zone) |
| `TIMESTAMPTZ` | `TIMESTAMP WITH TIME ZONE` | timestamp with microsecond precision (uses time zone) |
| `TIMESTAMP_NS` | | timestamp with nanosecond precision (local semantics) |
Contributor

"ignores time zone" was correct! I don't know what "local semantics" means, but if you want to be precise, UTC would be more accurate. To be local, you would need a time zone.


A timestamp specifies a combination of [`DATE`]({% link docs/sql/data_types/date.md %}) (year, month, day) and a [`TIME`]({% link docs/sql/data_types/time.md %}) (hour, minute, second, microsecond or nanosecond). Timestamps can be created using the `TIMESTAMP` keyword, where the data must be formatted according to the ISO 8601 format (`YYYY-MM-DD hh:mm:ss[.zzzzzz][+-TT[:tt]]`, with three extra decimal places supported by `TIMESTAMP_NS`). Decimal places beyond the targeted sub-second precision are ignored.
> Warning Since there is not currently a `TIMESTAMP_NS WITH TIME ZONE` data type, external columns with nano-second precision and instant semantics, e.g., [parquet timestamp columns with `isAdjustedToUTC=true`](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#instant-semantics-timestamps-normalized-to-utc), are converted to `TIMESTAMP WITH TIME ZONE` and thus lose precision when read using DuckDB.
Contributor

Maybe move warnings down to a bottom section? Or move this to the parquet documentation?


> Warning It is possible to convert between `WITH TIME ZONE` and `WITHOUT TIME ZONE` types using regular explicit and even implicit casts, which perform the same computation as the `timezone` function above, using the system or configured time zone. Similarly, there are regular explicit casts from `WITH TIME ZONE` data types to `DATE` and `TIME`, which also use the system or configured time zone.

> Bestpractice To convert between `WITH TIME ZONE` and `WITHOUT TIME ZONE` data, use the `timezone` function with explicitly specified time zone instead of explicit or implicit casts. Similarly, to convert `WITH TIME ZONE` to `DATE` or `TIME` data, use the `timezone` function to create `WITHOUT TIME ZONE` data using an explicitly specified time zone, and then convert that to `DATE` or `TIME`.
Contributor

"Best practice" here and below.

Collaborator Author

That's a built-in box type:

case "Bestpractice":


The `WITH TIME ZONE` data types exhibit *instant* semantics, which means that they represent points in absolute time, called *instants*, and are *displayed* in the system or a configured time zone. They require the [ICU extension]({% link docs/extensions/icu.md %}) to be installed. The `WITHOUT TIME ZONE` data types exhibit *local* semantics, which means they represent a local value of time for an unspecified observer. As such, a `WITHOUT TIME ZONE` data type together *with* a time zone defines an *instant* that can be stored in a `WITH TIME ZONE` data type:

Contributor

This section is basically a long explanation of how to use naïve timestamps, which is itself not a best practice. I suppose it is a data cleaning task, but maybe it should be moved to a tips & tricks section somewhere?


soerenwolfers commented Jan 2, 2025

@hawkfish Addressing your comments jointly here instead of in-line, since I believe it all boils down to this: I was in need of a succinct pair of labels for the two types of timestamps and found the "local" vs "instant" terminology at https://github.com/apache/parquet-format/blob/master/LogicalTypes.md, which distinguishes between "Instant semantics (timestamps normalized to UTC)" and "Local semantics (timestamps not normalized to UTC)". I am not an expert on the topic and believe you if you say that this is improper terminology (or that it's proper but I have misunderstood that page). Is there a pair of two-word descriptions to distinguish the two types that you're happy with? The "ignores time zone" and "uses time zone" pair is a fine pair of labels for describing how functions act on these data types, but it's not as easily linked to descriptions of what kind of data is morally stored in the two types of timestamps, as in the following sentence from the PR:

> In the opposite direction, we can extract the local time for an observer in a given time zone at a given instant

(btw, the "local" in this sentence was what convinced me I understood the apache docs correctly, but again I may well just have misunderstood things).

I don't see why you think the following paragraph explains how to use naive timestamps (for clarification, by "naive", do you mean WITHOUT TIME ZONE?). I included that paragraph because I thought it'd be useful for understanding the difference between the two types of timestamps, e.g., via "they represent a local value of time for an unspecified observer", but if I got the "local" vs "instant" terminology wrong to begin with, this may need rewriting anyway.

> The WITH TIME ZONE data types exhibit instant semantics, which means that they represent points in absolute time, called instants, and are displayed in the system or a configured time zone. They require the [ICU extension]({% link docs/extensions/icu.md %}) to be installed. The WITHOUT TIME ZONE data types exhibit local semantics, which means they represent a local value of time for an unspecified observer. As such, a WITHOUT TIME ZONE data type together with a time zone defines an instant that can be stored in a WITH TIME ZONE data type:


hawkfish commented Jan 2, 2025

Thanks for the Apache reference. What they call "local" is often called "naïve", which is the terminology I have been using, but I see why they might call it that. In my experience, "local" times are a data cleaning problem/headache. Once you have cleaned them to instants, you can then decide if you need sophisticated time zone binning or just UTC display.

So I take the (possibly overly pedantic) view that all timestamps are instants, and the "local" thing is something to be avoided. In this view, plain TIMESTAMPs are instants and the fast binning operations for them are just UTC/proleptic Gregorian calendar operations. It is true that these binning operations are more suited for display than analytics, but for a lot of workloads (e.g., logs) they are both appropriate and much faster than ICU. In fact ICU is so slow I often recommend pre-generating calendar tables keyed off instants if the time resolution is known up front (e.g., hours for electrical utilities).
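The idea of a pre-generated calendar table keyed off instants can be sketched as follows. This is a minimal Python illustration under stated assumptions, not a DuckDB recipe: the function name `hourly_calendar` is hypothetical, the key is seconds since the Unix epoch, and a fixed UTC-05:00 offset stands in for a real time zone (a real table would be generated once with proper time zone rules).

```python
from datetime import datetime, timedelta, timezone

# Assumption for illustration: a fixed UTC-05:00 offset stands in for a real time zone.
LOCAL = timezone(timedelta(hours=-5))

def hourly_calendar(start_utc, hours):
    """Map each hourly instant (seconds since the Unix epoch) to its local (date, hour) bin."""
    table = {}
    for i in range(hours):
        instant = start_utc + timedelta(hours=i)
        local = instant.astimezone(LOCAL)
        table[int(instant.timestamp())] = (local.date().isoformat(), local.hour)
    return table

# Two days of hourly bins, starting at 2025-01-01 00:00:00 UTC.
calendar = hourly_calendar(datetime(2025, 1, 1, tzinfo=timezone.utc), 48)

# Binning an instant is now a plain lookup instead of a time zone computation.
print(calendar[1_735_689_600])
```

Once such a table exists, assigning an instant to its local bin is a key lookup (or a join, in a database), which is the point of generating it up front when the resolution is known.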

The problem with local timestamps is that they have holes and collisions. More generally, they treat a set of temporal bins as PGC/UTC bins and generate the corresponding instant, but that mapping is not bijective. It is even worse if you have a non-Gregorian calendar with 13 months (and these non-Gregorian calendars do get used: I have run into people using Chinese, Hebrew and Japanese calendars.)
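The holes and collisions can be made concrete with a toy model. The Python sketch below is only an illustration: `wall_clock` is a hypothetical helper, and the "zone" is an assumed pair of UTC offsets that changes at a single switch instant, not real time zone data.

```python
from datetime import datetime, timedelta, timezone

def wall_clock(instant, switch, before_h, after_h):
    """Naive local reading of `instant` in a toy zone whose UTC offset changes at `switch`."""
    hours = before_h if instant < switch else after_h
    return (instant + timedelta(hours=hours)).replace(tzinfo=None)

# Collision: the offset drops from +2 to +1, so one hour of local readings repeats.
fall = datetime(2024, 10, 27, 1, 0, tzinfo=timezone.utc)
a = wall_clock(fall - timedelta(minutes=30), fall, 2, 1)
b = wall_clock(fall + timedelta(minutes=30), fall, 2, 1)
collision = a == b  # two instants one hour apart share one local reading

# Hole: the offset jumps from +1 to +2, so one hour of local readings never occurs.
spring = datetime(2024, 3, 31, 1, 0, tzinfo=timezone.utc)
seen = {
    wall_clock(spring + timedelta(minutes=m), spring, 1, 2)
    for m in range(-180, 180)
}
hole = datetime(2024, 3, 31, 2, 30) not in seen

print(a, b, collision, hole)
```

Because two instants map to the same local reading and some local readings correspond to no instant, the mapping from local readings back to instants cannot be inverted without extra information.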

So I see what you were trying to explain, but I'm not sure that it belongs in the data type description. In particular, there is documentation on the timezone function, and maybe just a link to that would be a good idea? Maybe see if your additions would fit better there?


hawkfish commented Jan 2, 2025

(Oh, I have my own PR up with some of the material from the blog post, and you were right it was getting lame to refer to it instead of the docs themselves...)


soerenwolfers commented Jan 2, 2025

> All timestamp types use instant semantics. Instants are just a count from an epoch

and

> all timestamps are instants

and

> Restore "absolute".

I disagree exactly because of binning and collisions: In my view, TIMESTAMP '1970-01-01 03:00' is not a point in absolute time, and it is not a "count from an epoch"; it is merely an observation that someone made in a non-retained, local context. Whether that observation was made at the instant with count-from-Unix-epoch 0 or 3.6e9 or 10.8e9 microseconds cannot be known without more information (e.g., a time zone and what other libraries call the bin-fold number).
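The ambiguity can be checked with a little arithmetic. The following Python sketch is an illustration only: `epoch_micros` is a hypothetical helper, and the three UTC offsets are assumed observers chosen for the example. It attaches each offset to the same naive reading and computes the resulting microsecond count since the Unix epoch.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# The naive reading from the comment: no time zone attached.
naive = datetime(1970, 1, 1, 3, 0)

def epoch_micros(naive_dt, utc_offset_hours):
    """Microseconds since the Unix epoch if `naive_dt` was observed at the given UTC offset."""
    aware = naive_dt.replace(tzinfo=timezone(timedelta(hours=utc_offset_hours)))
    return (aware - EPOCH) // timedelta(microseconds=1)

# Assumed observers at UTC+3, UTC+2 and UTC.
candidates = {h: epoch_micros(naive, h) for h in (3, 2, 0)}

for offset, micros in candidates.items():
    print(f"UTC{offset:+d}: {micros}")
```

The same reading maps to three different counts (zero, one hour and three hours after the epoch), so the reading alone does not determine an instant.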

Similarly, I don't view arithmetic on TIMESTAMPs as happening in UTC, but as following naive (so I like your terminology there) rules of manipulating timestamps, akin to simply turning a manual clock (the old ones, with hands) by a fixed number of degrees without thinking about the meaning of that operation. Implementation-wise, it is just arithmetic, which every schoolchild can do without even knowing the term UTC.

Anyway, I'm happy for you to take over. I will leave my comments on your PR and possibly create a new one afterwards with links to other pages or fixes for what I think might be missing on other pages.
