Working with Unity Catalog in Azure Databricks

 

 

In this blog post, we will work with the following Unity Catalog features in Azure Databricks.

 

  • Setting up the Unity Catalog metastore in an Azure Databricks environment.
  • Sharing the Unity Catalog across Azure Databricks environments.
  • Table Access Control in Unity Catalog.
  • Data Lineage in Unity Catalog.
  • Row Level Security and Column Level Filtering using Dynamic Views.

What is Unity Catalog?

 

 

Unity Catalog provides a bundle of features that help administrators manage access policies on data and help data owners manage and view data schemas and lineage. Unity Catalog is a metastore that uses a storage account to keep the metadata. The same catalog can be shared across different workspaces (define once, secure everywhere). By sharing the metadata across different Azure Databricks workspaces, teams working in their individual workspaces can collaborate while the metadata of their tables and schemas is stored in a centralized metastore.

 

 

 

[Diagram: Unity Catalog overview]

 

 

 

 

 

reference: Overview of Unity Catalog

 

 

 

What is stored inside a metastore?

 

 

 

A metastore is the top-level container of objects. It stores the metadata of those objects and the object-level permissions for the Azure Databricks environment. Once the metastore is configured, we access objects using a three-level naming convention (catalog.schema.table).
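
For example, once a metastore is attached, a table is referenced by its full three-level name. A minimal sketch, using object names that we create later in this post:

%sql
-- catalog.schema.table
select * from myfirstcatalog.mytestDB.myFirstManagedTable limit 10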

 

Each metastore requires a root storage location (generally a container/file system in ADLS Gen2) to store the metadata and the actual data for managed tables.

 

[Diagram: metastore object hierarchy (metastore > catalog > schema > table/view)]

 

 

 

 

 

We can have a one-to-many mapping from top to bottom in the above diagram: a metastore can contain multiple catalogs, a catalog can contain multiple schemas (databases), and a schema (database) can contain multiple tables.

 

An external table stores its physical data in an external location (Blob Storage or ADLS Gen2), not in the root storage account used by the Unity Catalog metastore. We need to take some additional steps, such as creating a storage credential and an external location, to register an external table in the metastore. When we run a DROP statement, only the logical schema gets deleted; the physical data remains as-is. We may need to run the dbutils.fs.rm command to remove the physical files from the external storage location, as sketched below.
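
A minimal sketch of that clean-up, using the external table and path created later in this post (run the SQL and Python snippets as separate notebook cells):

%sql
-- dropping an external table removes only its metadata from the metastore
drop table if exists myfirstcatalog.mytestDB.MyFirstExternalTable

# the underlying files remain in the external location and must be removed explicitly
dbutils.fs.rm("abfss://dbkdata@adldbkunityctlg.dfs.core.windows.net/mytestDB/MyFirstExternalTable", recurse=True)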

 

A managed table stores its physical data in the Unity Catalog metastore's root storage location. If we don't specify a location path while creating a table, it is created as a managed table and stored in the metastore location. When we run a DROP statement, the physical data also gets deleted from the storage account.

 

Prerequisites

 

 

Here are some prerequisites that we need to consider before using Unity Catalog in Azure Databricks.

 

  1. The Azure Databricks workspace should be on the Premium plan.
  2. The first Azure Databricks account admin must be an Azure Active Directory Global Administrator at the time they first log in to the Azure Databricks account console (Databricks - Sign in). Upon that first login, the AAD Global Administrator becomes an Azure Databricks account admin and can then assign the account admin role to other users, who can perform the role going forward without further AAD Global Administrator involvement.
  3. If we have Azure Databricks workspaces spanning multiple regions (e.g., East US and West US), we need a separate metastore for each region; a metastore cannot be shared across regions. For example, if we have two workspaces, one in East US and one in West US, we need to create two metastores, one per region.

Setting up the Unity Catalog metastore in an Azure Databricks environment.

 

 

Here are the steps we need to perform to create our first Unity Catalog metastore in the East US region.

 

Step 1: Create the root storage account for the metastore

 

 

From the Azure portal, create an ADLS Gen2 storage account and make sure the hierarchical namespace option is enabled.

 

Note down the URI of the storage account.

 

abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<metastore-name>

 

In our case the URI is abfss://unitycatalogmetadata@adldbkunityctlg.dfs.core.windows.net/FirstMetastore

 


 

 

 

Step 2: Create the Azure Databricks access connector.

 


    1. From the Azure portal, create an Access Connector for Azure Databricks resource. After deployment, the connector's resource ID acts as the managed identity for Azure Databricks; authentication from Azure Databricks to the other resources will happen using this managed identity.

 

 

 

 


 

 

 

 

 


 

 

 

  2. Note down the access connector ID. It will be in the following form: /subscriptions/xxxxxxxxxxx/resourceGroups/RGMisc/providers/Microsoft.Databricks/accessConnectors/accessconnectordbkwrkspace
  3. Grant the access connector's managed identity the "Storage Blob Data Contributor" role on the root storage account that we created in the earlier section.

Step 3: Create the metastore in Azure Databricks Account Console

 

 

It is important to note that an AAD Global Administrator has to log in to the account console the first time. Once logged in, they can add any user or group as an account admin so that, going forward, the account console can be managed by that user. The added user becomes an Azure Databricks account admin.

 

 

 

  • Create a premium tier Azure Databricks workspace.

 


 

 

 

  • Create a cluster where the Unity Catalog feature is enabled.

 


 

 

 

  • Log in to the account console (Databricks - Sign in) using an AAD Global Administrator account.
  • Go to user management and add a different user as an account admin.

 


 

 

 

 

 

  • We will now create our first metastore, for the "eastus" region. In the ADLS Gen2 path, provide the URI of the root storage account that we created in the previous section. In Access Connector ID, provide the resource ID of the Access Connector for Azure Databricks.

 


 

 

 

  • Now, let's attach the metastore to the Azure Databricks workspace. Since we created both the workspace and the metastore in the East US region, we are able to attach them. If there are multiple workspaces in the same region, we can attach all of them to the single metastore. By sharing the metastore, we can share objects across the workspaces.

 


 

 

 

  • Now, when we log in to the Azure Databricks workspace, we can see a default catalog, "main", that was created when we attached the Unity Catalog metastore to the workspace.
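
A quick way to verify this from a notebook attached to the Unity Catalog-enabled cluster (a minimal sketch):

%sql
-- the result should include the default "main" catalog
show catalogs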

 


 

 

 

Step 4a: Create a catalog and a managed table.

 

 

 

 

Let's create our first catalog and managed table.

 

spark.sql('''create catalog if not exists myfirstcatalog''')
spark.sql('''create database if not exists myfirstcatalog.mytestDB''')

# read the sample data into a dataframe
df_flight_data = spark.read.csv("/databricks-datasets/flights/departuredelays.csv", header=True)
df_flight_data.createOrReplaceTempView("temp_tbl")

%sql
create table if not exists myfirstcatalog.mytestDB.myFirstManagedTable
AS
Select * from temp_tbl

 

 

 

As we can see from the below screenshot, the managed table got created in the metastore location.

 

[Screenshot: managed table data files created under the metastore root storage location]

 

 

 

Step 4b: Create an external table.

 

 

 

 

Let's consider that we have data present in some external location, like an Azure Blob Storage or ADLS Gen2 account, in Hive or Delta format. We would like to register that table in our Unity Catalog metastore.

 

It is important to note that the Unity Catalog metastore might not have direct access to the external location. To establish that access, we need to create a storage credential and an external location that points to the actual path. Unity Catalog will use the access connector's managed identity (Access Connector for Azure Databricks) to access the external location, so we need to grant the access connector the "Storage Blob Data Contributor" role on the external storage account.

 

 

 

Creation of the storage credential

 

We are using the access connector's resource ID (managed identity) to create the storage credential. We could also use a service principal instead.

 


 

 

 

 

 

Creation of the External Location

 

Before we create the external table, we need to map the physical data location to the storage credential that we created earlier.
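
The storage credential and external location are created here through the Data Explorer UI; an equivalent SQL sketch for the external location, assuming a storage credential named my_storage_credential (an illustrative name) and the container used later in this post, would look roughly like this:

%sql
CREATE EXTERNAL LOCATION IF NOT EXISTS my_external_location
URL 'abfss://dbkdata@adldbkunityctlg.dfs.core.windows.net/mytestDB/'
WITH (STORAGE CREDENTIAL my_storage_credential)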

 

 

 


 

 

 

 

 

Let's create an external table.

spark.sql('''create catalog if not exists myfirstcatalog''')
spark.sql('''create database if not exists myfirstcatalog.mytestDB''')

# read the sample data into a dataframe
df_flight_data = spark.read.csv("/databricks-datasets/flights/departuredelays.csv", header=True)

# write the data as a Delta table to the external location that we registered earlier
dbutils.fs.rm("abfss://dbkdata@adldbkunityctlg.dfs.core.windows.net/mytestDB/MyFirstExternalTable", recurse=True)
df_flight_data.write.format("delta").mode("overwrite").save("abfss://dbkdata@adldbkunityctlg.dfs.core.windows.net/mytestDB/MyFirstExternalTable")

%sql
create table if not exists myfirstcatalog.mytestDB.MyFirstExternalTable
USING DELTA
LOCATION 'abfss://dbkdata@adldbkunityctlg.dfs.core.windows.net/mytestDB/MyFirstExternalTable'

 

 

 

 

 

 

 

 

 

 

 


 

 

 

Sharing the Unity Catalog across Azure Databricks environments.

 

 

 

 

Now we have created a new workspace called dbkworkspace2 in the same region, East US. From the account console, we will attach the metastore to the new Azure Databricks workspace.

 


 

 

 

We can see the tables & metadata even before creating clusters. This is possible because metastore data is present in the ADLS g2 which can be accessible without creating cluster. However, to see the physical data we need to create a cluster.

 

 

 


 

 

 

Access Control in Unity Catalog.

 

 

 

 

Here are the securable objects in Unity Catalog. Access is inherited from parent-level objects down to child-level objects. For example, if we grant read access on a catalog, it is inherited all the way down the hierarchy (schema, managed table, view, etc.). We can manage privileges for metastore objects using SQL commands, the Unity Catalog CLI, or Data Explorer.

 

 

 

IMPORTANT: Privileges granted on the Unity Catalog metastore are not inherited. Privileges can be granted by the metastore admin.

 

[Diagram: securable object hierarchy in Unity Catalog]

 

 

 

Example:

 

 

 

We have a test user. We would like to give access to a specific table (myfirstcatalog.mytestdb.mytestflightdata).

 

 

 

While we can easily grant access using the Unity Catalog UI, here is an example of granting access using Databricks SQL.

 

Step 1: Let's create an Azure Databricks group that will contain all the users who are going to have read-only permission on the table (myfirstcatalog.mytestdb.mytestflightdata). For that, we navigate to the group section of the Databricks account console and then add the user to the group.

 


 

 

 

Step 2: Run the GRANT commands in Azure Databricks. These should be run by the metastore admin.
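
The exact statements are shown in the screenshot below; a minimal sketch, assuming the group created in Step 1 is named flight_readers (an illustrative name):

%sql
-- the group needs USE privileges on the parent catalog and schema, plus SELECT on the table
-- (on older Unity Catalog privilege models, USAGE is granted instead of USE CATALOG / USE SCHEMA)
GRANT USE CATALOG ON CATALOG myfirstcatalog TO `flight_readers`;
GRANT USE SCHEMA ON SCHEMA myfirstcatalog.mytestdb TO `flight_readers`;
GRANT SELECT ON TABLE myfirstcatalog.mytestdb.mytestflightdata TO `flight_readers`;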

 

[Screenshot: GRANT statements executed by the metastore admin]

 

 

 

 

 

If we log in as the test user, we will be able to see only the mytestflightdata table.

 

 

 


 

 

 

You can find more details about the different privilege types here.

 

 

 

Data Lineage in Unity Catalog.

 

 

 

 

Unity Catalog captures runtime data lineage across queries in any language executed on an Azure Databricks cluster. Lineage is captured at both the table level and the column level, and lineage data includes notebooks, workflows, and dashboards related to the query.

 

 

 

Lineage graphs share the same permission model as Unity Catalog, discussed in the previous section. A table that a user does not have access to will not be shown in that user's lineage graph.

 

 

 

Example:

 

After executing a script that reads from and writes to Unity Catalog tables, we can access the lineage from the Unity Catalog UI; a sketch of such a script is shown below.
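
A minimal sketch of such a script, assuming the mytestflightdata table referenced earlier (the derived table name is illustrative):

%sql
-- reading from one Unity Catalog table and writing to another produces table- and column-level lineage
CREATE TABLE IF NOT EXISTS myfirstcatalog.mytestdb.flightdata_delay_by_origin
AS
SELECT origin, avg(CAST(delay AS INT)) AS avg_delay
FROM myfirstcatalog.mytestdb.mytestflightdata
GROUP BY origin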

 

 

 

[Screenshots: lineage graph for the table in the Unity Catalog UI]

 

 

 

Row Level Security and Column Level Filtering using Dynamic Views.

 

 

 

 

In Unity Catalog, we can use dynamic views to configure fine-grained access control over rows and columns, and we can mask data depending on the user's permissions.

 

The built-in function is_account_group_member() returns TRUE if the current user is a member of a specific account-level group. It is the recommended function for use in dynamic views against Unity Catalog data.

 

 

 

Column Level Permission

 

 

 

Example 1: A normal test user should not be able to see the delay column value; however, a flight admin should be able to see it.

 

 

 

In the Azure Databricks account console, we have created a group, flight_admin, and added the user to this group. The column can then be masked with a dynamic view along the lines of the sketch below.
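
A minimal sketch of such a dynamic view, assuming the mytestflightdata table and the column names of the departuredelays.csv dataset (the view name is illustrative):

%sql
CREATE OR REPLACE VIEW myfirstcatalog.mytestdb.vw_flightdata AS
SELECT
  date,
  origin,
  destination,
  -- only members of the flight_admin group see the real delay value
  CASE WHEN is_account_group_member('flight_admin') THEN delay ELSE NULL END AS delay
FROM myfirstcatalog.mytestdb.mytestflightdata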

 

 

 

This user is part of the flight_admin group and hence is able to see the delay data.

 


 

 

 

Now, a user who is not in the flight_admin group is not able to see the delay data.

 

 

 


 

 

 

 

 

Row Level Permission

 

 

Now we are going to extend our previous example. We have several origin-specific groups; for example, ABE_admin should only be able to see the data where origin = 'ABE', and similarly the ATL_admin group should only be able to see the data where origin = 'ATL'. If a user is not part of any group, they won't be able to see any data. Users in the flight_admin group can see all the data. A row-filtering dynamic view along the lines of the sketch below achieves this.
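
A minimal sketch of such a row-filtering dynamic view under those assumptions (the view name is illustrative):

%sql
CREATE OR REPLACE VIEW myfirstcatalog.mytestdb.vw_flightdata_by_origin AS
SELECT *
FROM myfirstcatalog.mytestdb.mytestflightdata
WHERE
  is_account_group_member('flight_admin')                      -- flight_admin members see every row
  OR (is_account_group_member('ABE_admin') AND origin = 'ABE') -- origin-specific groups see only their rows
  OR (is_account_group_member('ATL_admin') AND origin = 'ATL')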

 

 

 

Flight_admin group members can see all the records.

 

 

 


 

 

 

 

 

ATL_admin group members can only see ATL related records.

 

 

 


 

 

 

That's all for this blog post. Hope this helps!

 
