- Docker + Compose
- Curl
- jq (optional)
┌───────────┐ ┌───────────┐ ┌───────┐
│ │ │ │ │ │
│ Trino │ ───────────────► │ Lakekeeper├─────────────► │ │
│ SQL │ │ Iceberg │ │ DB │
│ │ │ Catalog │ │ │
└─────┬─────┘ └─────┬─────┘ │ │
│ │ └───────┘
│ │
│ │
│ ┌─────▼─────┐
│ │ │
└───────────────────────►│ Object │
│ Storage │
│ │
└───────────┘
Start services using Docker Compose:
- Trino (SQL engine)
- Lakekeeper (Iceberg Catalog)
- Postgres (Catalog Concurrency Broker)
- Minio (S3 Object Storage)
docker compose up -d && docker compose rm -f
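Before moving on, it is worth confirming that all four services actually came up. One quick way to check is to list the services in the compose project and their state:

# Show the state of every service in this compose project.
docker compose ps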
Create an Iceberg warehouse in the Lakekeeper catalog using its management API.
curl -X POST -v \
-H "Content-Type: application/json" \
http://localhost:8181/management/v1/warehouse \
-d \
'{
"warehouse-name": "iceberg-warehouse",
"project-id": "00000000-0000-0000-0000-000000000000",
"storage-profile": {
"type": "s3",
"bucket": "databucks",
"endpoint": "http://s3gateway:9000/",
"region": "local-1",
"path-style-access": true,
"flavor": "minio",
"sts-enabled": true,
"sts_role_arn": null,
"assume-role-arn": null
},
"storage-credential": {
"type": "s3",
"credential-type": "access-key",
"aws-access-key-id": "storage",
"aws-secret-access-key": "storage123"
}
}'
Verify the warehouse is created.
curl -X GET -v \
-H "Content-Type: application/json" \
"http://localhost:8181/management/v1/warehouse?project-id=00000000-0000-0000-0000-000000000000" | jq
Take a moment to log into the object storage UI and check the structure under the main bucket. Then log into the Trino CLI.
docker container exec -it iceberg-getting-started-trino-coordinator-1 trino
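Once inside the Trino CLI, a quick sanity check before creating anything is to list the catalogs that already exist:

-- List the catalogs currently registered with Trino.
SHOW CATALOGS;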
Create `hive` and `iceberg` catalogs using the Hive Metastore and Lakekeeper, respectively. We will use the `hive` catalog to create data in the Hive format and then migrate it to Iceberg.
CREATE CATALOG hive USING hive
WITH (
"hive.metastore.uri" = 'thrift://hive-metastore:9083',
"fs.native-s3.enabled" = 'true',
"s3.region"= 'local-1',
"s3.path-style-access" = 'true',
"s3.endpoint" = 'http://s3gateway:9000',
"s3.aws-access-key" = 'storage',
"s3.aws-secret-key" = 'storage123'
);
CREATE CATALOG iceberg USING iceberg
WITH (
"iceberg.catalog.type" = 'rest',
"iceberg.rest-catalog.uri" = 'http://lakekeeper:8181/catalog',
"iceberg.rest-catalog.warehouse" = 'iceberg-warehouse',
"fs.native-s3.enabled" = 'true',
"s3.region"= 'local-1',
"s3.path-style-access" = 'true',
"s3.endpoint" = 'http://s3gateway:9000',
"s3.aws-access-key" = 'storage',
"s3.aws-secret-key" = 'storage123'
);
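If both CREATE CATALOG statements succeed, a cheap way to confirm that Trino can reach both the Hive Metastore and Lakekeeper is to list the schemas each catalog exposes:

-- Both statements should return at least information_schema.
SHOW SCHEMAS FROM hive;
SHOW SCHEMAS FROM iceberg;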
Create schemas in both the `hive` and `iceberg` catalogs.
CREATE SCHEMA hive.pokemon
WITH (location = 's3a://databucks/pokemon/');
NOTE: we use the `s3a` protocol when specifying the location since Trino utilizes the Hadoop `org.apache.hadoop.fs.s3a.S3AFileSystem` class. The Iceberg catalog will use the `s3` protocol, based on the Iceberg class used by Trino and Lakekeeper.
CREATE SCHEMA iceberg.pokemon
WITH (location = 's3://databucks/pokemon/');
Copy the pokemon pokedex CSV data to S3.
docker run --rm -it \
-e AWS_ACCESS_KEY_ID=storage \
-e AWS_SECRET_ACCESS_KEY=storage123 \
--network compose_network \
-v $PWD/data/pokemon_pokedex.csv:/tmp/pokemon_pokedex.csv \
amazon/aws-cli --region local-1 \
--endpoint-url http://s3gateway:9000/ \
s3 cp /tmp/pokemon_pokedex.csv s3://databucks/pokemon/pokedex_csv/
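To confirm the upload landed where the Hive table will look for it, you can list the prefix with the same AWS CLI container used above:

docker run --rm -it \
-e AWS_ACCESS_KEY_ID=storage \
-e AWS_SECRET_ACCESS_KEY=storage123 \
--network compose_network \
amazon/aws-cli --region local-1 \
--endpoint-url http://s3gateway:9000/ \
s3 ls s3://databucks/pokemon/pokedex_csv/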
Create the pokedex table from the CSV file in Hive.
CREATE TABLE hive.pokemon.pokedex_csv(
"Number" VARCHAR,
"Name" VARCHAR,
"Type 1" VARCHAR,
"Type 2" VARCHAR,
"Abilities" VARCHAR,
"HP" VARCHAR,
"Att" VARCHAR,
"Def" VARCHAR,
"Spa" VARCHAR,
"Spd" VARCHAR,
"Spe" VARCHAR,
"BST" VARCHAR,
"Mean" VARCHAR,
"Standard Deviation" VARCHAR,
"Generation" VARCHAR,
"Experience type" VARCHAR,
"Experience to level 100" VARCHAR,
"Final Evolution" VARCHAR,
"Catch Rate" VARCHAR,
"Legendary" VARCHAR,
"Mega Evolution" VARCHAR,
"Alolan Form" VARCHAR,
"Galarian Form" VARCHAR,
"Against Normal" VARCHAR,
"Against Fire" VARCHAR,
"Against Water" VARCHAR,
"Against Electric" VARCHAR,
"Against Grass" VARCHAR,
"Against Ice" VARCHAR,
"Against Fighting" VARCHAR,
"Against Poison" VARCHAR,
"Against Ground" VARCHAR,
"Against Flying" VARCHAR,
"Against Psychic" VARCHAR,
"Against Bug" VARCHAR,
"Against Rock" VARCHAR,
"Against Ghost" VARCHAR,
"Against Dragon" VARCHAR,
"Against Dark" VARCHAR,
"Against Steel" VARCHAR,
"Against Fairy" VARCHAR,
"Height" VARCHAR,
"Weight" VARCHAR,
"BMI" VARCHAR
)
WITH (
format = 'CSV',
external_location = 's3a://databucks/pokemon/pokedex_csv',
skip_header_line_count = 1
);
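Before converting anything, a quick query confirms that the CSV mapping works and that the header row was skipped (the first rows should contain real data, not the column names):

-- Spot-check a few raw CSV columns; every value is still a VARCHAR at this point.
SELECT "Number", "Name", "Type 1", "Type 2"
FROM hive.pokemon.pokedex_csv
LIMIT 5;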
Now convert the CSV data to Parquet with proper Hive data types, along with partitioning and bucketing for read efficiency.
CREATE TABLE hive.pokemon.pokedex
WITH (
format = 'PARQUET',
partitioned_by = ARRAY['type1'],
bucketed_by = ARRAY['type2'],
bucket_count = 2,
external_location = 's3a://databucks/pokemon/pokedex'
) AS
SELECT
CAST(number AS INTEGER) AS number,
name,
"Type 2" AS type2,
CAST(json_parse(replace(replace(Abilities, '''s', 's'), '''', '"')) AS ARRAY(VARCHAR)) AS abilities,
CAST(hp AS INTEGER) AS hp,
CAST(att AS INTEGER) AS att,
CAST(def AS INTEGER) AS def,
CAST(spa AS INTEGER) AS spa,
CAST(spd AS INTEGER) AS spd,
CAST(spe AS INTEGER) AS spe,
CAST(bst AS INTEGER) AS bst,
CAST(mean AS DOUBLE) AS mean,
CAST("Standard Deviation" AS DOUBLE) AS std_dev,
CAST(generation AS DOUBLE) AS generation,
"Experience type" AS experience_type,
CAST("Experience to level 100" AS BIGINT) AS experience_to_lvl_100,
CAST(CAST("Final Evolution" AS DOUBLE) AS BOOLEAN) AS final_evolution,
CAST("Catch Rate" AS INTEGER) AS catch_rate,
CAST(CAST("Legendary" AS DOUBLE) AS BOOLEAN) AS legendary,
CAST(CAST("Mega Evolution" AS DOUBLE) AS BOOLEAN) AS mega_evolution,
CAST(CAST("Alolan Form" AS DOUBLE) AS BOOLEAN) AS alolan_form,
CAST(CAST("Galarian Form" AS DOUBLE) AS BOOLEAN) AS galarian_form,
CAST("Against Normal" AS DOUBLE) AS against_normal,
CAST("Against Fire" AS DOUBLE) AS against_fire,
CAST("Against Water" AS DOUBLE) AS against_water,
CAST("Against Electric" AS DOUBLE) AS against_electric,
CAST("Against Grass" AS DOUBLE) AS against_grass,
CAST("Against Ice" AS DOUBLE) AS against_ice,
CAST("Against Fighting" AS DOUBLE) AS against_fighting,
CAST("Against Poison" AS DOUBLE) AS against_poison,
CAST("Against Ground" AS DOUBLE) AS against_ground,
CAST("Against Flying" AS DOUBLE) AS against_flying,
CAST("Against Psychic" AS DOUBLE) AS against_psychic,
CAST("Against Bug" AS DOUBLE) AS against_bug,
CAST("Against Rock" AS DOUBLE) AS against_rock,
CAST("Against Ghost" AS DOUBLE) AS against_ghost,
CAST("Against Dragon" AS DOUBLE) AS against_dragon,
CAST("Against Dark" AS DOUBLE) AS against_dark,
CAST("Against Steel" AS DOUBLE) AS against_steel,
CAST("Against Fairy" AS DOUBLE) AS against_fairy,
CAST("Height" AS DOUBLE) AS height,
CAST("Weight" AS DOUBLE) AS weight,
CAST("BMI" AS DOUBLE) AS bmi,
"Type 1" AS type1
FROM hive.pokemon.pokedex_csv;
Notice that because we are partitioning the Hive data, the partition column has to be placed at the end of the column list, which forces users to understand the storage layout of the data. Let's take a look at the data.
SELECT * FROM hive.pokemon.pokedex LIMIT 50;
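Because the table is partitioned by type1, the Hive connector also exposes a hidden `$partitions` table you can query to see which partitions the CTAS wrote (the quotes are needed because of the `$` in the name):

-- List the partitions created for hive.pokemon.pokedex.
SELECT * FROM hive.pokemon."pokedex$partitions";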
Since Hive and Iceberg share the same data storage location, they should both be able to access that data, right?
SELECT * FROM iceberg.pokemon.pokedex LIMIT 50;
Let's dig around in the metadata using our Iceberg REST catalog and check what namespaces exist. First, we need to get the warehouse id from Lakekeeper.
export WAREHOUSE_ID=$(curl -s -H "Content-Type: application/json" \
"http://localhost:8181/management/v1/warehouse?project-id=00000000-0000-0000-0000-000000000000" | jq -r .warehouses[0].id)
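The id should come back as a UUID; echo it to make sure jq actually found a warehouse before using it in the next request.

# Should print a UUID, not "null".
echo "$WAREHOUSE_ID"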
Now we'll switch away from Lakekeeper's `/management` endpoint to the `/catalog` endpoint, which acts like a regular Iceberg REST catalog.
curl -v -H "Content-Type: application/json" http://localhost:8181/catalog/v1/${WAREHOUSE_ID}/namespaces
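The Iceberg REST API also lets you list the tables inside a namespace via the standard `GET /v1/{prefix}/namespaces/{namespace}/tables` route. A sketch, assuming the `pokemon` namespace appears in the previous response; the table list may well be empty at this point, since we have not created any Iceberg tables yet:

curl -v -H "Content-Type: application/json" \
"http://localhost:8181/catalog/v1/${WAREHOUSE_ID}/namespaces/pokemon/tables"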
Let's migrate the `hive.pokemon.pokedex` table to `iceberg.pokemon.pokedex`.
CALL hive.system.migrate(
schema_name => 'pokemon',
table_name => 'pokedex',
recursive_directory => 'true'
);
docker run --rm -it \
-e AWS_ACCESS_KEY_ID=storage \
-e AWS_SECRET_ACCESS_KEY=storage123 \
--network compose_network \
amazon/aws-cli s3api help