Sharing Metadata Across Different Databricks Workspaces Using Hive External Metastore




 

 

 

 

Contents

  • Background
  • Hive External Metastore
  • Prerequisites
  • Important steps for the configuration
  • Hive External Metastore Database

Background

 

 

Every Databricks deployment comes with a managed, built-in Hive metastore. (If you aren’t familiar, a Hive metastore is a database that holds metadata about our data, such as the paths to the data in the data lake and the format of the data: Parquet, Delta, CSV, etc.) Instead of using the out-of-the-box Hive metastore, wouldn't it be great to have one central metastore and have multiple workspaces point to it? Some scenarios call for an external metastore that is shareable and independent of the workspace's existence. This article covers the end-to-end process of setting up an external metastore using Azure SQL DB and connecting multiple workspaces to that metastore.

 

 

 


 

 

 

Scenario 1: Consider a scenario where we have several developer groups, and they all want to use their own Databricks workspace for development. However, they are also required to collaborate with each other. By default, Databricks uses the local built-in metastore in the DBFS file system to keep the logical schema of all the Delta and Hive tables. If the teams start registering databases and tables in their workspaces separately, we end up with metadata spread across the different workspaces, which makes it harder for the teams to collaborate.

 

 

 

A solution to this is to create a Hive external metastore that different Databricks workspaces can share, so that each workspace registers and uses the commonly shared metastore. We detail the end-to-end setup process in the following steps.

 

 

 


 

 

 

Scenario 2: Now let's paint the picture for disaster recovery. Imagine we need to create a disaster recovery (DR) site for the primary Databricks workspace, and one of our requirements calls for minimal downtime while failing over from the primary site to the DR site.

 

 

 

We can use an external Hive metastore database to store the metadata and make our lives easier. The moment disaster strikes, we can attach the replicated Hive metastore database in the secondary region to the secondary Databricks workspace (the DR site). A huge advantage is that we avoid the overhead of rerunning DDL scripts to recreate the databases and tables in the secondary site, because the external Hive metastore already holds the logical schema.

 

 

 

(Note: We will discuss the disaster recovery scenario above, and how to use an external Hive metastore with Azure Databricks for DR, later in this blog series.)

 

 

 


 

 

 

Reference link(s):

  • Data Redundancy in Azure Storage
  • Disaster Recovery for Azure SQL DB

 

 

 

Hive External Metastore

 

 

Here are the steps we can follow to create the external Hive metastore. Remember that the config values depend on the Hive version we are using, and the Hive version depends on the Databricks Runtime version.

 

 

 

Prerequisites

 

  1. Two Databricks workspaces are created.
  2. You have created an Azure SQL server and a database that will store the Hive metastore. You can use any name for the Azure SQL server and the Azure SQL DB.
  3. You have a storage account (preferably ADLS Gen2) where the tables' data will be stored (i.e., the data lake).
  4. You have already created a service principal and noted down the application ID, the application secret, and the tenant (directory) ID. The steps to create a service principal in the Azure portal are here.
  5. Grant the service principal the Storage Blob Data Contributor role on the storage account created in #3.

 

 

 

Important steps for the configuration

 

 

Below, we are going to create a Hive external metastore that will be shared by two Databricks workspaces (Scenario 1 from the Background section above).

 

 

 

 

 

Create Hive metastore schema

 

  1. Download the DDL script for creating the Hive metastore schema from here. Since I am using Hive 2.3.7, I am selecting the 2.3.x version.

 


 

 

 

 

 

  2. Unzip the downloaded file, and you will see a folder structure like the one below. Then select the hive-schema-2.3.0.mssql.sql file.

 

[Screenshot: extracted folder structure showing hive-schema-2.3.0.mssql.sql]

 

 

 

 

 

  3. Run the script in your Azure SQL database. This database is going to be our Hive external metastore.
  4. After running the script, you will see that the Hive metastore schema has been created (see the image below). You can also confirm this from a Databricks notebook, as sketched after the screenshot.

 

 

 

[Screenshot: Hive metastore schema objects created in the Azure SQL database]
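If you prefer to verify the schema from a Databricks notebook rather than the Azure portal, a minimal sketch like the following reads the metastore's VERSION table over JDBC. The server, database, user, and password placeholders are the same ones used later in the Spark configuration.

# Read the schema version recorded by the DDL script directly from the Azure SQL database.
jdbc_url = "jdbc:sqlserver://<azure sql server name>.database.windows.net:1433;database=<azure sql db name>"
df_version = (spark.read.format("jdbc")
              .option("url", jdbc_url)
              .option("query", "select SCHEMA_VERSION from dbo.VERSION")
              .option("user", "<sql user name>")
              .option("password", "<azure sql password>")
              .load())
df_version.show()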

 

 

 

 

 

Cluster creation and Spark configuration

 

 

 

 

  5. In both workspaces, create a cluster with a Databricks Runtime version that matches the Hive version: Hive 2.3.7 (Databricks Runtime 7.0 - 9.x) or Hive 2.3.9 (Databricks Runtime 10.0 and above).
  6. In the Spark configuration of the cluster, provide the config values below:

spark.hadoop.javax.jdo.option.ConnectionURL jdbc:sqlserver://<azure sql server name>.database.windows.net:1433;database=<azure sql db name>;encrypt=true;trustServerCertificate=false;loginTimeout=30;
spark.hadoop.javax.jdo.option.ConnectionUserName <sql user name>
spark.hadoop.javax.jdo.option.ConnectionPassword <azure sql password>
spark.hadoop.javax.jdo.option.ConnectionDriverName com.microsoft.sqlserver.jdbc.SQLServerDriver
spark.sql.hive.metastore.version 2.3.7
spark.sql.hive.metastore.jars builtin

Note:

  • Since we are using Databricks Runtime 9.1, we set the Hive version to 2.3.7: spark.sql.hive.metastore.version 2.3.7

  • For Hive versions 2.3.7 and 2.3.9, we don't need to explicitly specify the JAR file location; we can simply provide builtin: spark.sql.hive.metastore.jars builtin

  • For the password and other secrets, you can use Databricks secrets. Provide the config value as follows: spark.hadoop.javax.jdo.option.ConnectionPassword {{secrets/xxxscope/xxxsecretname}}
    • In the example above, we provided the password in clear text, which is not recommended.

  • For production environments, it is recommended that you set hive.metastore.schema.verification to true. This prevents the Hive metastore client from implicitly modifying the metastore database schema when the metastore client version does not match the metastore database version.

 

 


 

 

 

 

 

  7. After creating the cluster, you will find the Hive metastore present in the catalog section of your workspace. The Hive metastore always has a default database.

[Screenshot: the catalog view showing the Hive metastore with its default database]
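As a quick sanity check, you can confirm from a notebook attached to the configured cluster that it is talking to the external metastore. This is a minimal sketch; the expected version matches the spark.sql.hive.metastore.version value set above.

# Confirm the Hive metastore version configured for this cluster (2.3.7 in our setup).
print(spark.conf.get("spark.sql.hive.metastore.version"))

# List the databases; every workspace sharing the external metastore returns the same list.
spark.sql("show databases").show()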

 

Database and table creation

 

  8. We are going to create a mount point for the external storage account that will store the actual data.

# Service principal and storage account details (replace the placeholders with your values).
application_ID = "<<provide-your-application-ID>>"
application_secret = "<<provide-your-app-secret-key>>"
directory_id = "<<provide-your-directory-ID>>"
storage_account_name = "<<storage-account-name>>"
file_system_name = "<<file-system-name>>"

# OAuth configuration for accessing ADLS Gen2 with the service principal.
configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": application_ID,
           "fs.azure.account.oauth2.client.secret": application_secret,
           "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/" + directory_id + "/oauth2/token"}

# Mount the container at /mnt/deltalake.
dbutils.fs.mount(
  source = "abfss://" + file_system_name + "@" + storage_account_name + ".dfs.core.windows.net/",
  mount_point = "/mnt/deltalake",
  extra_configs = configs)
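To confirm that the mount succeeded, you can list its contents. This is a minimal check; the folder will simply be empty until we write data to it in the next step.

# The mount point should now resolve to the ADLS Gen2 container.
display(dbutils.fs.ls("/mnt/deltalake"))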

  9. Now let's create a Delta table.

spark.sql('''create database if not exists mytestDB''')

# Read the sample data into a dataframe.
df_flight_data = spark.read.csv("/databricks-datasets/flights/departuredelays.csv", header=True)

# Write the Delta table to the mount point that we created earlier.
dbutils.fs.rm("/mnt/deltalake/mytestDB/flight_data", recurse=True)
df_flight_data.write.format("delta").mode("overwrite").save("/mnt/deltalake/mytestDB/flight_data")

# Register the table in the shared metastore against that location.
spark.sql('''drop table if exists mytestDB.flight_data''')
spark.sql('''create table if not exists mytestDB.flight_data using delta location "/mnt/deltalake/mytestDB/flight_data"''')
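As a quick check that the table is registered correctly in the shared metastore, the following minimal sketch can be run from either workspace once its cluster points at the external metastore.

# The row count proves the data is readable through the metastore entry.
spark.sql("select count(*) as row_count from mytestDB.flight_data").show()

# The registered location should be the mounted path, not a DBFS-managed path.
spark.sql("describe detail mytestDB.flight_data").select("location").show(truncate=False)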

  10. Once the Delta table is created, we can see it in the database section of both workspaces.

 

In dbkworkspace1

 

[Screenshot: the mytestDB database and flight_data table visible in dbkworkspace1]

 

 

 

 

 

  11. In dbkworkspace2, if you navigate to the database, you will see an error. This is because we have only attached the external Hive metastore, which contains the logical metadata of the table. The physical data and folder structure live in the data lake, which we still need to mount in dbkworkspace2 to view the data.

 

[Screenshot: error shown in dbkworkspace2 when browsing the table before mounting the storage account]

 

 

 

 

 

  12. Mount the storage account in dbkworkspace2, just as we did in step 8 for dbkworkspace1. We should always give the mount point the same name. Once we mount the storage account under the same name, we can see the data; a sketch of the call follows.
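For reference, the call in dbkworkspace2 is the same as in step 8; the important part is that mount_point matches exactly. This sketch assumes the same placeholder variables and configs dictionary as in step 8.

# Reuse the same OAuth configs as in step 8. The mount point name must match the one
# used in dbkworkspace1 so that the paths stored in the shared metastore resolve.
dbutils.fs.mount(
  source = "abfss://" + file_system_name + "@" + storage_account_name + ".dfs.core.windows.net/",
  mount_point = "/mnt/deltalake",
  extra_configs = configs)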

 

[Screenshot: the flight_data table data visible in dbkworkspace2 after mounting the storage account]

 

 

 

 

 

Let's make a change to the table from dbkworkspace2.
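For example, an update like the one below could be run from dbkworkspace2. The column names come from the departuredelays.csv sample data; the filter and the new value are purely illustrative.

# Any change made here is written to the shared storage and recorded in the shared
# metastore, so dbkworkspace1 sees it immediately.
spark.sql("update mytestDB.flight_data set delay = 0 where origin = 'ABE'")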

 

[Screenshots: updating the table from dbkworkspace2]

 

 

 

In dbkworkspace1, we can see the changes:

 

[Screenshot: the updated data visible from dbkworkspace1]

 

 

 

Hive External Metastore Database

 

 

 

 

Running the query below against the metastore database gives us information about the tables registered in the Hive external metastore for a given database:

select
    databases.name
  , tables.tbl_name
  , tables.tbl_type
  , sds.location
from [dbo].[DBS] as databases
inner join [dbo].[TBLS] as tables on tables.db_id = databases.db_id
inner join [dbo].[SDS] as sds on sds.sd_id = tables.sd_id
where databases.name = 'mytestdb'

[Screenshot: query results listing the registered tables and their storage locations]

 

 

 

The Hive external metastore contains the pointer to the external storage.

 

 

 

Here is the pointer for the database: dbfs:/user/hive/warehouse/mytestdb.db

 

[Screenshot: the database location recorded in the metastore]

 

 

 

Now, when we look at the table, we see that the actual path (the mounted external storage location) is mapped to the table.
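The same mapping can be cross-checked from a notebook in either workspace. This is a minimal sketch; the Location row it returns should match the location column from the SQL query above.

# The registered storage location of the table, as seen through Spark.
spark.sql("describe extended mytestDB.flight_data") \
     .where("col_name = 'Location'") \
     .show(truncate=False)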

 

 

 

[Screenshot: the table location pointing to the mounted external storage path]

 

 

 

Final Thoughts

 

 

Whether you have a simple Databricks setup or a complicated one, every Azure Databricks workspace has a Hive metastore that is critical to your jobs, your collaboration, and your disaster recovery. The setup above lets you share tables across the business, register data from different workspaces into one commonly shared metastore, and simplify your disaster recovery plan. We hope you gained some new insights and that they come in handy. The only thing left is to get your hands dirty and have fun!

 
