lakeFS enriches your Iceberg tables with Git capabilities: create a branch and make your changes in isolation, without affecting other team members.
See the instructions below on how to use it, and check out the integration in action in the lakeFS samples repository.
Use the following Maven dependency to install the lakeFS custom catalog:
<dependency>
  <groupId>io.lakefs</groupId>
  <artifactId>lakefs-iceberg</artifactId>
  <version>0.1.4</version>
</dependency>
Here is how to configure the lakeFS custom catalog in Spark:
conf.set("spark.sql.catalog.lakefs", "org.apache.iceberg.spark.SparkCatalog");
conf.set("spark.sql.catalog.lakefs.catalog-impl", "io.lakefs.iceberg.LakeFSCatalog");
conf.set("spark.sql.catalog.lakefs.warehouse", "lakefs://example-repo");
You will also need to configure the S3A Hadoop FileSystem to interact with lakeFS:
conf.set("spark.hadoop.fs.s3a.access.key", "AKIAlakefs12345EXAMPLE");
conf.set("spark.hadoop.fs.s3a.secret.key", "abc/lakefs/1234567bPxRfiCYEXAMPLEKEY");
conf.set("spark.hadoop.fs.s3a.endpoint", "https://example-org.us-east-1.lakefscloud.io");
conf.set("spark.hadoop.fs.s3a.path.style.access", "true");
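The catalog and S3A settings above can be gathered in one place so that credentials are not hard-coded. The following is a minimal sketch, not part of lakeFS or Spark: the class `LakeFSSparkSettings`, its `build` method, and the environment variable names `LAKEFS_ACCESS_KEY_ID` and `LAKEFS_SECRET_ACCESS_KEY` are hypothetical names chosen for illustration. The S3A keys carry the `spark.hadoop.` prefix, which is how Spark forwards properties set on a Spark configuration object to the Hadoop configuration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not a lakeFS or Spark API) that assembles the settings shown above. */
final class LakeFSSparkSettings {

    /** Returns the Spark properties for an Iceberg catalog backed by a lakeFS repository. */
    static Map<String, String> build(String catalogName, String repository,
                                     String endpoint, String accessKey, String secretKey) {
        String prefix = "spark.sql.catalog." + catalogName;
        Map<String, String> settings = new LinkedHashMap<>();

        // Iceberg catalog backed by the lakeFS custom catalog implementation.
        settings.put(prefix, "org.apache.iceberg.spark.SparkCatalog");
        settings.put(prefix + ".catalog-impl", "io.lakefs.iceberg.LakeFSCatalog");
        settings.put(prefix + ".warehouse", "lakefs://" + repository);

        // S3A FileSystem pointed at the lakeFS endpoint.
        settings.put("spark.hadoop.fs.s3a.access.key", accessKey);
        settings.put("spark.hadoop.fs.s3a.secret.key", secretKey);
        settings.put("spark.hadoop.fs.s3a.endpoint", endpoint);
        settings.put("spark.hadoop.fs.s3a.path.style.access", "true");
        return settings;
    }

    public static void main(String[] args) {
        // Assumed environment variable names; substitute whatever your deployment uses.
        Map<String, String> settings = build(
                "lakefs",
                "example-repo",
                "https://example-org.us-east-1.lakefscloud.io",
                System.getenv("LAKEFS_ACCESS_KEY_ID"),
                System.getenv("LAKEFS_SECRET_ACCESS_KEY"));

        // Print only the keys so that secrets never reach the console.
        settings.keySet().forEach(System.out::println);
    }
}
```

Each entry of the returned map can then be passed to `conf.set(key, value)` before the Spark session is created.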
To create a table on your main branch, use the following syntax:
CREATE TABLE lakefs.main.table1 (id int, data string);
We can now commit the creation of the table to the main branch:
lakectl commit lakefs://example-repo/main -m "my first iceberg commit"
Then, create a branch:
lakectl branch create lakefs://example-repo/dev -s lakefs://example-repo/main
We can now make changes on the branch:
INSERT INTO lakefs.dev.table1 VALUES (3, 'data3');
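If we query the table on the branch, we will see the row we inserted alongside the rows already present on main (the sample output assumes rows 1 and 2 were written to main before the branch was created):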
SELECT * FROM lakefs.dev.table1;
Results in:
+----+------+
| id | data |
+----+------+
|  1 | data1|
|  2 | data2|
|  3 | data3|
+----+------+
However, if we query the table on the main branch, we will not see the new changes:
SELECT * FROM lakefs.main.table1;
Results in:
+----+------+
| id | data |
+----+------+
|  1 | data1|
|  2 | data2|
+----+------+
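Because the branch name is simply the namespace that follows the catalog name, the two queries above can be combined to list the rows that exist on one branch but not on another. The following is a minimal sketch under that assumption: the class `BranchQueries` and its methods are hypothetical helpers, not part of lakeFS, and the generated statement relies on the standard SQL `EXCEPT` operator.

```java
/** Hypothetical helper (not a lakeFS API) that builds branch-qualified SQL strings. */
final class BranchQueries {

    /** Returns the fully qualified table name, e.g. lakefs.dev.table1. */
    static String tableOn(String catalog, String ref, String table) {
        return String.join(".", catalog, ref, table);
    }

    /** Returns a query selecting the rows present on ref but absent from baseRef. */
    static String rowsOnlyOn(String catalog, String ref, String baseRef, String table) {
        return "SELECT * FROM " + tableOn(catalog, ref, table)
                + " EXCEPT SELECT * FROM " + tableOn(catalog, baseRef, table);
    }

    public static void main(String[] args) {
        // prints: SELECT * FROM lakefs.dev.table1 EXCEPT SELECT * FROM lakefs.main.table1
        System.out.println(rowsOnlyOn("lakefs", "dev", "main", "table1"));
    }
}
```

With the sample data shown above, running the generated statement should return only the row inserted on the dev branch.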