[Iceberg] Add Iceberg metadata table $metadata_log_entries #24302

agrawalreetika · 2024-12-29T05:45:44Z

Description

Add Iceberg metadata table $metadata_log_entries

Motivation and Context

Add the Iceberg metadata table $metadata_log_entries, which exposes the metadata changes made to an Iceberg table.
See https://iceberg.apache.org/docs/latest/spark-queries/#metadata-log-entries

Impact

Iceberg Connector

Test Plan

Contributor checklist

Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
Documented new properties (with its default value), SQL syntax, functions, or other functionality.
If release notes are required, they follow the release notes guidelines.
Adequate tests were added if applicable.
CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

Iceberg Connector Changes
* Add Iceberg metadata table $metadata_log_entries :pr:`24302`

hantangwangd

Thanks for adding this feature. Overall it looks good to me, except for one small problem with timestamp with time zone and some nits.

hantangwangd · 2025-01-01T16:59:34Z

presto-iceberg/src/main/java/com/facebook/presto/iceberg/MetadataLogTable.java

+            .add(new ColumnMetadata("timestamp", TIMESTAMP_WITH_TIME_ZONE))
+            .add(new ColumnMetadata("file", VARCHAR))
+            .add(new ColumnMetadata("latest_snapshot_id", BIGINT))
+            .add(new ColumnMetadata("latest_schema_id", BIGINT))


nit: Should this type be INTEGER?

hantangwangd · 2025-01-01T17:00:20Z

presto-iceberg/src/main/java/com/facebook/presto/iceberg/MetadataLogTable.java

+    {
+        InMemoryRecordSet.Builder table = InMemoryRecordSet.builder(COLUMNS);
+
+        TableMetadata metadata = ((org.apache.iceberg.BaseTable) icebergTable).operations().current();


nit: use static import

hantangwangd · 2025-01-01T17:06:15Z

presto-iceberg/src/main/java/com/facebook/presto/iceberg/MetadataLogTable.java

+            Long snapshotId = null;
+            Snapshot snapshot = null;
+            try {
+                snapshotId = SnapshotUtil.snapshotIdAsOfTime(icebergTable, entry.timestampMillis());


Suggested change

-                snapshotId = SnapshotUtil.snapshotIdAsOfTime(icebergTable, entry.timestampMillis());
+                snapshotId = snapshotIdAsOfTime(icebergTable, entry.timestampMillis());

nit: I know this code is from the Iceberg library, but we can still use static imports as much as possible.

hantangwangd · 2025-01-01T17:12:18Z

presto-iceberg/src/main/java/com/facebook/presto/iceberg/MetadataLogTable.java

+
+    private void addRow(InMemoryRecordSet.Builder table, ConnectorSession session, long timestampMillis, String fileLocation, Long snapshotId, Snapshot snapshot)
+    {
+        table.addRow(packDateTimeWithZone(timestampMillis, session.getSqlFunctionProperties().getTimeZoneKey()),


Should we consider the case where session.getSqlFunctionProperties().isLegacyTimestamp() is false? As I understand it, we should use UTC as the time zone key in that case. Please let me know if I have misunderstood.

agrawalreetika · 2025-01-03

Thanks for your review, @hantangwangd.

Currently the timestamp output in $metadata_log_entries looks the same with and without isLegacyTimestamp:

presto:iceberg_schema> set session legacy_timestamp=true;
SET SESSION
presto:iceberg_schema> select * from "region_legacy$metadata_log_entries";
timestamp | file | latest_snapshot_id | latest_schema_id | latest_s>
--------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+---------------------+------------------+--------->
2025-01-02 22:39:00.666 Asia/Kolkata | hdfs://localhost:9000/user/hive/warehouse/iceberg_schema.db/region_legacy/metadata/00000-26e6389a-ab54-455b-b5ce-6648e241ce29.metadata.json | 7341611993609958569 | 0 | >
2025-01-02 22:39:12.478 Asia/Kolkata | hdfs://localhost:9000/user/hive/warehouse/iceberg_schema.db/region_legacy/metadata/00001-15c85566-fcf1-413d-8115-5fc4376426cf.metadata.json | 8958386941531340808 | 0 | >
(2 rows)

presto:iceberg_schema> set session legacy_timestamp=false;
SET SESSION
presto:iceberg_schema> select * from "region_nolegacy$metadata_log_entries";
timestamp | file | latest_snapshot_id | latest_schema_id | latest>
--------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------+---------------------+------------------+------->
2025-01-02 22:40:50.877 Asia/Kolkata | hdfs://localhost:9000/user/hive/warehouse/iceberg_schema.db/region_nolegacy/metadata/00000-1de7672a-8a9e-4249-afe8-d526b094ca57.metadata.json | 1517277585224920583 | 0 | >
2025-01-02 22:41:03.948 Asia/Kolkata | hdfs://localhost:9000/user/hive/warehouse/iceberg_schema.db/region_nolegacy/metadata/00001-8bf52dfe-2601-4ce8-bb3c-30ac435573ea.metadata.json | 2705037583472111886 | 0 | >
(2 rows)

It looks like both cases use my local time and time zone, since the session object carries the local time zone.
Could you please help me understand whether this is not the expected behavior?

hantangwangd · 2025-01-03

My mistake, I confused the result column type, timestamp with time zone, with timestamp. The isLegacyTimestamp property is used for the timestamp type, so there is no need to consider it here.
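The distinction can be illustrated with plain java.time, independent of Presto: a timestamp with time zone value denotes a fixed instant, and the zone only changes how that instant is rendered. The following is a minimal sketch that starts from the first row of the query output above. The class name `InstantRendering` and its `render` helper are hypothetical, and the sketch does not model how Presto handles legacy_timestamp.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;

final class InstantRendering
{
    private InstantRendering() {}

    // Re-renders the same instant in another zone; the instant itself is unchanged.
    // Hypothetical helper for illustration, not part of Presto or Iceberg.
    static ZonedDateTime render(Instant instant, String zoneId)
    {
        return ZonedDateTime.ofInstant(instant, ZoneId.of(zoneId));
    }

    public static void main(String[] args)
    {
        // First row of the query output above: 2025-01-02 22:39:00.666 Asia/Kolkata
        ZonedDateTime kolkata = ZonedDateTime.of(2025, 1, 2, 22, 39, 0, 666_000_000, ZoneId.of("Asia/Kolkata"));
        Instant instant = kolkata.toInstant();

        // Both lines describe the same point in time, shown with different zones.
        System.out.println(render(instant, "Asia/Kolkata"));
        System.out.println(render(instant, "UTC"));
    }
}
```

Comparing the two rendered values with `toInstant()` yields equal instants, which is why only the displayed zone, not the stored moment, depends on the session.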

hantangwangd · 2025-01-01T17:18:43Z

presto-iceberg/src/test/java/com/facebook/presto/iceberg/IcebergDistributedTestBase.java

+    @Test
+    public void testMetadataLogTable()
+    {
+        try {
+            assertUpdate("CREATE TABLE test_table_metadatalog (id1 BIGINT, id2 BIGINT)");
+            assertQuery("SELECT count(*) FROM \"test_table_metadatalog$metadata_log_entries\"", "VALUES 1");
+            //metadata file created at table creation
+            assertQuery("SELECT latest_snapshot_id FROM \"test_table_metadatalog$metadata_log_entries\"", "VALUES NULL");
+
+            assertUpdate("INSERT INTO test_table_metadatalog VALUES (0, 00), (1, 10), (2, 20)", 3);
+            Table icebergTable = loadTable("test_table_metadatalog");
+            Snapshot latestSnapshot = icebergTable.currentSnapshot();
+            assertQuery("SELECT count(*) FROM \"test_table_metadatalog$metadata_log_entries\"", "VALUES 2");
+            assertQuery("SELECT latest_snapshot_id FROM \"test_table_metadatalog$metadata_log_entries\" order by timestamp DESC limit 1", "values " + latestSnapshot.snapshotId());
+        }
+        finally {
+            assertUpdate("DROP TABLE IF EXISTS test_table_metadatalog");
+        }
+    }


Would it be feasible to add some test cases that cover different time zones and legacyTimestamp settings, and verify the output of the timestamp column?

agrawalreetika · 2025-01-04

@hantangwangd Could you please give me an example of the kind of test cases that would fit here for different time zones?
I looked at other metadata tables with a timestamp column, but couldn't find any examples of this.

hantangwangd · 2025-01-04

Referring to Iceberg's test case, I think we can add some tests similar to the following code:

Session session = sessionWithTimezone(zoneId);
assertUpdate(session, "CREATE TABLE test_table_metadatalog (id1 BIGINT, id2 BIGINT)");
assertQuery(session, "SELECT count(*) FROM \"test_table_metadatalog$metadata_log_entries\"", "VALUES 1");

Table icebergTable = loadTable("test_table_metadatalog");
TableMetadata tableMetadata = ((HasTableOperations) icebergTable).operations().current();
ZonedDateTime zonedDateTime1 = ZonedDateTime.ofInstant(Instant.ofEpochMilli(tableMetadata.lastUpdatedMillis()), ZoneId.of(zoneId));
String metadataFileLocation1 = "file:" + tableMetadata.metadataFileLocation();

assertUpdate(session, "INSERT INTO test_table_metadatalog VALUES (0, 00), (1, 10), (2, 20)", 3);
tableMetadata = ((HasTableOperations) icebergTable).operations().refresh();
ZonedDateTime zonedDateTime2 = ZonedDateTime.ofInstant(Instant.ofEpochMilli(tableMetadata.lastUpdatedMillis()), ZoneId.of(zoneId));
String metadataFileLocation2 = "file:" + tableMetadata.metadataFileLocation();
Snapshot latestSnapshot = tableMetadata.currentSnapshot();

MaterializedResult result = getQueryRunner().execute(session, "SELECT * FROM \"test_table_metadatalog$metadata_log_entries\"");
assertThat(result).hasSize(2);
assertThat(result)
        .anySatisfy(row -> assertThat(row)
                .isEqualTo(new MaterializedRow(MaterializedResult.DEFAULT_PRECISION, zonedDateTime1, metadataFileLocation1, null, null, null)))
        .anySatisfy(row -> assertThat(row)
                .isEqualTo(new MaterializedRow(MaterializedResult.DEFAULT_PRECISION, zonedDateTime2, metadataFileLocation2, latestSnapshot.snapshotId(), latestSnapshot.schemaId(), latestSnapshot.sequenceNumber())));

And test it under different zoneIds.

ZacBlanco

One minor thing. I also agree with @hantangwangd that we should make sure this works with proper TZ configuration. Otherwise LGTM.

ZacBlanco · 2025-01-03T17:09:01Z

presto-iceberg/src/main/java/com/facebook/presto/iceberg/MetadataLogTable.java

+    @Override
+    public RecordCursor cursor(ConnectorTransactionHandle transactionHandle, ConnectorSession session, TupleDomain<Integer> constraint)
+    {
+        InMemoryRecordSet.Builder table = InMemoryRecordSet.builder(COLUMNS);


Rather than use the builder, I would recommend using the public constructor and passing an iterator. It will help reduce memory pressure on the coordinator by streaming records rather than requiring us to aggregate all at once in-memory. The overall footprint of this table shouldn't be too large but I think using an iterator approach to generate the records is not difficult to implement.

When generating records, you can use Java's Stream and map operations and call .iterator() at the end.
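A minimal sketch of the streaming approach suggested above, written against plain Java collections rather than Presto's InMemoryRecordSet. The `LogEntry` class, the `toRows` helper, and the `ROWS_BUILT` counter are hypothetical stand-ins for illustration. The point is only that `Stream.map(...).iterator()` builds rows on demand instead of collecting them all first.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

final class LazyRows
{
    private LazyRows() {}

    // Hypothetical stand-in for an Iceberg metadata log entry.
    static final class LogEntry
    {
        final long timestampMillis;
        final String file;

        LogEntry(long timestampMillis, String file)
        {
            this.timestampMillis = timestampMillis;
            this.file = file;
        }
    }

    // Counts how many rows have actually been built, to make the laziness observable.
    static final AtomicInteger ROWS_BUILT = new AtomicInteger();

    // Maps each entry to a row lazily; no row is built until the iterator is advanced.
    static Iterator<List<Object>> toRows(List<LogEntry> entries)
    {
        return entries.stream()
                .map(entry -> {
                    ROWS_BUILT.incrementAndGet();
                    return Arrays.<Object>asList(entry.timestampMillis, entry.file);
                })
                .iterator();
    }

    public static void main(String[] args)
    {
        // File names here are made up for the example.
        List<LogEntry> entries = Arrays.asList(
                new LogEntry(1L, "00000.metadata.json"),
                new LogEntry(2L, "00001.metadata.json"));
        Iterator<List<Object>> rows = toRows(entries);
        System.out.println("rows built before iteration: " + ROWS_BUILT.get());
        while (rows.hasNext()) {
            System.out.println(rows.next());
        }
        System.out.println("rows built after iteration: " + ROWS_BUILT.get());
    }
}
```

In the connector, the resulting iterator would be handed to the record set constructor mentioned in the review comment; that wiring is omitted here because it depends on Presto classes.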

ZacBlanco · 2025-01-03T17:10:59Z

presto-iceberg/src/main/java/com/facebook/presto/iceberg/MetadataLogTable.java

+        List<MetadataLogEntry> metadataLogEntries = metadata.previousFiles();
+
+        processMetadataLogEntries(table, session, metadataLogEntries);
+        addLatestMetadataEntry(table, session, metadata);


To add the latest entry, I think you can just use Stream.concat with Stream.of().
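As a rough sketch of this suggestion, using strings as hypothetical stand-ins for metadata log entries: `Stream.concat` appends the single latest entry after the previous ones without copying them into an intermediate list. The class `ConcatLatest` and the helper `withLatest` are names invented for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class ConcatLatest
{
    private ConcatLatest() {}

    // Previous entries first, then the current (latest) one, as a single lazy stream.
    static <T> Stream<T> withLatest(List<T> previous, T latest)
    {
        return Stream.concat(previous.stream(), Stream.of(latest));
    }

    public static void main(String[] args)
    {
        // Made-up file names standing in for metadata.previousFiles() and the current metadata file.
        List<String> previous = Arrays.asList("00000.metadata.json", "00001.metadata.json");
        List<String> all = withLatest(previous, "00002.metadata.json").collect(Collectors.toList());
        System.out.println(all);
    }
}
```

The combined stream can then be mapped to rows and turned into an iterator, so the previous entries and the latest entry go through the same code path.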

steveburnett

LGTM! (docs)

Pull branch, local doc build, looks good. Thank you for the documentation!

agrawalreetika requested review from steveburnett, elharo, hantangwangd, ZacBlanco and a team as code owners December 29, 2024 05:45

agrawalreetika requested a review from presto-oss December 29, 2024 05:45

prestodb-ci added the from:IBM (PR from IBM) label Dec 29, 2024

prestodb-ci requested review from a team, infvg and pratyakshsharma and removed request for a team December 29, 2024 05:45

hantangwangd reviewed Jan 1, 2025

ZacBlanco requested changes Jan 3, 2025

steveburnett previously approved these changes Jan 3, 2025

agrawalreetika dismissed steveburnett’s stale review via b4d52b7 January 4, 2025 05:23

agrawalreetika force-pushed the metadata_log_entries branch 3 times, most recently from 8242d7d to 8f6f007 Compare January 4, 2025 18:52

Add Iceberg metadata table $metadata_log_entries

e66d10e

agrawalreetika force-pushed the metadata_log_entries branch from 8f6f007 to e66d10e Compare January 5, 2025 05:38

agrawalreetika marked this pull request as draft January 6, 2025 02:33

agrawalreetika self-assigned this Jan 6, 2025
