Skip to content

Latest commit

 

History

History
59 lines (37 loc) · 3.11 KB

File metadata and controls

59 lines (37 loc) · 3.11 KB

ADB workspace with external hive metastore

Credits to [email protected] and [email protected] for notebook logic for database initialization steps. This architecture will be deployed:

alt text

Get Started:

This template will complete 99% process for external hive metastore deployment with Azure Databricks, using hive version 3.1.0. The last 1% step is just to run only once a pre-deployed Databricks job to initialize the external hive metastore. After successful deployment, your cluster can connect to external hive metastore (using azure sql database).

On your local machine:

  1. Clone this repository to local.

  2. Update terraform.tfvars file and provide values to each defined variable. Some variabes may have default values defined in variables.tf file.

  3. For step 2, variables for db_username and db_password, you can also use your environment variables: terraform will automatically look for environment variables with name format TF_VAR_xxxxx.

    export TF_VAR_db_username=yoursqlserveradminuser

    export TF_VAR_db_password=yoursqlserveradminpassword

  4. Init terraform and apply to deploy resources:

    terraform init

    terraform apply

Now we log into the Databricks workspace, such that you are added into the workspace (since the user identity deployed workspace and have at least Contributor role on the workspace, upon lauching workspace, user identity will be added as workspace admin).

Once logged into workspace, the final step is to manually trigger the pre-deployed job.

Go to databricks workspace - Job - run the auto-deployed job only once; this is to initialize the database with metastore schema.

alt text

Then you can verify in a notebook:

alt text

We can also check inside the sql db (metastore), we've successfully linked up cluster to external hive metastore and registered the table here:

alt text

Now you can config all other clusters to use this external metastore, using the same spark conf and env variables of cold start cluster.

Notes: Migrate from your existing managed metastore to external metastore

Refer to tutorial: https://kb.databricks.com/metastore/create-table-ddl-for-metastore.html

dbs = spark.catalog.listDatabases()
for db in dbs:
    f = open("your_file_name_{}.ddl".format(db.name), "w")
    tables = spark.catalog.listTables(db.name)
    for t in tables:
        DDL = spark.sql("SHOW CREATE TABLE {}.{}".format(db.name, t.name))
        f.write(DDL.first()[0])
        f.write("\n")
    f.close()