Working with Unity Catalog in Azure Databricks

In this blog post, we will work with the following Unity Catalog features in Azure Databricks:

- Setting up the Unity Catalog metastore in an Azure Databricks environment.
- Sharing the Unity Catalog metastore across Azure Databricks environments.
- Table access control in Unity Catalog.
- Data lineage in Unity Catalog.
- Row-level security and column-level filtering using dynamic views.

What is Unity Catalog?

Unity Catalog provides a bundle of features that help administrators manage access policies on data, and help data owners manage and view schemas and lineage. Unity Catalog is a metastore that uses a storage account to keep its metadata. The same catalog can be shared across different workspaces (define once, secure everywhere). By sharing the metadata across Azure Databricks workspaces, teams working in their individual workspaces can collaborate while the metadata of tables and schemas is kept in a centralized metastore.

Reference: Overview of Unity Catalog

What is stored inside a metastore?

A metastore is the top-level container of objects; it stores the metadata of those objects and the object-level permissions for the Azure Databricks environment. Once the metastore is configured, we generally access objects using a three-level naming convention (catalog.schema.table). Each metastore requires a root storage location (generally a file system in an ADLS Gen2 account) to store the metadata and the actual data of managed tables.

The mapping from top to bottom in this hierarchy is one-to-many: a metastore can contain multiple catalogs, a catalog can contain multiple schemas (databases), and a schema can contain multiple tables.

An external table stores its physical data in an external location (Blob Storage or ADLS Gen2), not in the root storage account used for the Unity Catalog metastore. We need to take some additional steps, such as creating a storage credential and an external location, to register an external table in the metastore. When we run a DROP statement, only the logical schema is deleted; the physical data remains as-is. We may need to run the dbutils.fs.rm command to remove the physical files from the external storage location.

A managed table stores its physical data in the same location where the Unity Catalog metastore is stored. If we don't specify a location path while creating a table, it is stored in the Unity Catalog metastore location. When we run a DROP statement, the physical data is also deleted from the storage account.
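To make the distinction concrete, here is a minimal sketch of the two DROP behaviours; the catalog, schema, table, and path names are placeholders only, and a Unity-Catalog-enabled cluster is assumed.

# Dropping a managed table removes both the metadata and the underlying data files.
spark.sql("DROP TABLE IF EXISTS some_catalog.some_schema.my_managed_table")

# Dropping an external table removes only the metadata; the files stay at the external path.
spark.sql("DROP TABLE IF EXISTS some_catalog.some_schema.my_external_table")

# If the underlying files should also be removed, that has to be done explicitly.
dbutils.fs.rm("abfss://<container>@<storage-account>.dfs.core.windows.net/path/to/my_external_table", recurse=True)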
Prerequisites

Here are some prerequisites to consider before using Unity Catalog in Azure Databricks.

- The Azure Databricks workspace must be on the Premium plan.
- The first Azure Databricks account admin must be an Azure Active Directory Global Administrator at the time they first log in to the Azure Databricks account console (Databricks - Sign in). Upon that first login, the AAD admin becomes the Azure Databricks global admin and can assign the Azure Databricks global admin permission to another user, who can then perform the administrator role without further AAD global admin involvement.
- If we have Azure Databricks workspaces spanning multiple regions (for example East US and West US), we need a separate metastore for each region; a metastore cannot be shared across regions. For example, with one workspace in East US and one in West US, two metastores have to be created, one per region.

Setup the Unity Catalog metastore in an Azure Databricks environment

Here are the steps we need to perform to create our first Unity Catalog metastore in the East US region.

Step 1: Create the root storage account for the metastore

From the Azure portal, create an ADLS Gen2 storage account and make sure the hierarchical namespace option is enabled. Note down the URI of the storage account:

abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<metastore-name>

In our case the URI is abfss://unitycatalogmetadata@adldbkunityctlg.dfs.core.windows.net/FirstMetastore

Step 2: Create the Azure Databricks access connector

From the Azure portal, create an Access Connector for Azure Databricks resource. After deployment, the connector's resource URI is the managed identity for Azure Databricks; authentication from Azure Databricks to other resources will happen using this managed identity. Note down the access connector ID, which has the following form:

/subscriptions/xxxxxxxxxxx/resourceGroups/RGMisc/providers/Microsoft.Databricks/accessConnectors/accessconnectordbkwrkspace

Grant the managed identity (access connector) the "Storage Blob Data Contributor" role on the root storage account we created in the earlier section.

Step 3: Create the metastore in the Azure Databricks account console

It is important to note that an AAD Global Administrator has to log in to the account console for the first time. Once logged in, they can add any user or group as the account administrator so that, going forward, the account console can be managed by the added user; the added user becomes an Azure Databricks account administrator.

- Create a Premium tier Azure Databricks workspace.
- Create a cluster on which the Unity Catalog feature is enabled.
- Log in to the account console (Databricks - Sign in) using an AAD Global Administrator account.
- Go to user management and add a different account as the account admin.

We will now create our first metastore, for the "eastus" region. In the ADLS Gen2 path, we provide the resource URI of the root storage account created in the previous section. In Access Connector ID, we provide the resource URI of the Access Connector for Azure Databricks.

Now let's attach the metastore to the Azure Databricks workspace. Since we created both the workspace and the metastore in the East US region, we can attach the metastore to the workspace. If there are multiple workspaces in the same region, we can attach all of them to the single metastore; by sharing the metastore, we share objects across those workspaces.

When we now log in to the Azure Databricks workspace, we can see a default catalog "main" that was created when the Unity Catalog metastore was attached to the workspace.
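To quickly confirm the attachment from a notebook on a Unity-Catalog-enabled cluster, something along these lines can be run (the exact output depends on your environment):

# List the catalogs exposed through the attached metastore; "main" should appear.
spark.sql("SHOW CATALOGS").show()

# Show which metastore the workspace is currently attached to.
spark.sql("SELECT current_metastore()").show()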
Step 4a: Create catalog and managed table

Let's create our first catalog and managed table.

spark.sql('''create catalog if not exists myfirstcatalog''')
spark.sql('''create database if not exists myfirstcatalog.mytestDB''')

# read the sample data into a dataframe
df_flight_data = spark.read.csv("/databricks-datasets/flights/departuredelays.csv", header=True)
df_flight_data.createOrReplaceTempView("temp_tbl")

%sql
create table if not exists myfirstcatalog.mytestDB.myFirstManagedTable
AS
Select * from temp_tbl

As we can see from the screenshot, the managed table got created in the metastore location.

Step 4b: Create an external table

Let's consider that we have data present in some external location, like an Azure Blob Storage or ADLS Gen2 account, in Hive or Delta format, and we would like to attach that table to our Unity Catalog metastore. It is important to note that the Unity Catalog metastore might not have direct access to the external location. To establish that access, we need to create a storage credential and an external location that points to the actual path. Unity Catalog will use the access connector managed identity (Access Connector for Azure Databricks) to access the external location, so we need to grant the access connector "Storage Blob Data Contributor" access on the external storage location.

Creation of the storage credential

We are using the access connector URI (managed identity) to create the storage credential. We can also provide service principal details instead.

Creation of the external location

Before we create the external table, we need to map the physical data location to the storage credential that we created earlier.

Let's create an external table.

spark.sql('''create catalog if not exists myfirstcatalog''')
spark.sql('''create database if not exists myfirstcatalog.mytestDB''')

# read the sample data into a dataframe
df_flight_data = spark.read.csv("/databricks-datasets/flights/departuredelays.csv", header=True)

# write the delta table to the external location registered earlier
dbutils.fs.rm("abfss://dbkdata@adldbkunityctlg.dfs.core.windows.net/mytestDB/MyFirstExternalTable", recurse=True)
df_flight_data.write.format("delta").mode("overwrite").save("abfss://dbkdata@adldbkunityctlg.dfs.core.windows.net/mytestDB/MyFirstExternalTable")

%sql
create table if not exists myfirstcatalog.mytestDB.MyFirstExternalTable
USING DELTA
LOCATION 'abfss://dbkdata@adldbkunityctlg.dfs.core.windows.net/mytestDB/MyFirstExternalTable'

Sharing the Unity Catalog metastore across Azure Databricks environments

We have now created a new workspace called dbkworkspace2 in the same region, East US. From the account console we attach the metastore to the new Azure Databricks workspace. We can see the tables and metadata even before creating a cluster; this is possible because the metastore data is kept in ADLS Gen2, which is accessible without a cluster. However, to see the physical data we need to create a cluster.

Access control in Unity Catalog

Here are the securable objects in Unity Catalog. Access is inherited from parent-level objects down to child-level objects: for example, if we grant read access on a catalog, it is inherited all the way down the hierarchy (schema, managed table, view, and so on). We can manage privileges for metastore objects by using SQL commands, the Unity Catalog CLI, or Data Explorer. IMPORTANT: privileges granted on the Unity Catalog metastore itself are not inherited. Privileges can be granted by the metastore admin.
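As an illustration of the SQL route, here is a minimal sketch of typical grant statements; the group name flight_readers is a placeholder, and the catalog, schema, and table names follow the examples used in this post.

# Allow the group to see and use the catalog and the schema.
spark.sql("GRANT USE CATALOG ON CATALOG myfirstcatalog TO `flight_readers`")
spark.sql("GRANT USE SCHEMA ON SCHEMA myfirstcatalog.mytestDB TO `flight_readers`")

# Allow the group to read a single table.
spark.sql("GRANT SELECT ON TABLE myfirstcatalog.mytestDB.mytestflightdata TO `flight_readers`")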
Example: we have a test user, and we would like to give that user access to a specific table (myfirstcatalog.mytestdb.mytestflightdata). While we can easily provide access using the Unity Catalog UI, here is an example of providing access using Databricks SQL.

Step 1: Let's create an Azure Databricks group that will contain all the users who should have read-only permission on the table (myfirstcatalog.mytestdb.mytestflightdata). For that, we navigate to the group section of the Databricks account console and then add the user to the group.

Step 2: Run the GRANT command in Azure Databricks (along the lines of the statements sketched above); this should be run by the metastore admin. If we log in as the test user, we can see only the mytestflightdata table.

You can find more details about the different privilege types here.

Data lineage in Unity Catalog

We can capture runtime data lineage across queries in any language executed on an Azure Databricks cluster. Lineage is captured at both the table level and the column level, and lineage data includes the notebooks, workflows, and dashboards related to a query. Lineage graphs share the same permission model as Unity Catalog, discussed in the previous section: a table that a user does not have access to is not shown in that user's lineage graph. Example: after executing the script, we can access the lineage from the Unity Catalog UI.

Row-level security and column-level filtering using dynamic views

In Unity Catalog we can use dynamic views to configure fine-grained access control on rows and columns, and to mask data depending on permissions. The built-in function is_account_group_member() returns TRUE if the current user is a member of a specific account-level group; it is the recommended function for use in dynamic views against Unity Catalog data.

Column-level permission

Example 1: the normal test user should not be able to see the value of the delay column, while the flight admin should be able to see it. In the Azure Databricks account console we have created a group called flight_admin and added the user to this group. Since the user is part of the flight_admin group, they can see the delay data, while a user who is not in the flight_admin group cannot see it.

Row-level permission

Now we extend the previous example. We have several origin-specific groups: for example, the ABE_admin group should only be able to see the data where origin = 'ABE', and similarly the ATL_admin group should only be able to see the data where origin = 'ATL'. Any user who is not part of one of these groups should not be able to see the data, while users in the flight_admin group can see all of it. Flight_admin group members can see all the records; ATL_admin group members can only see ATL-related records.
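As a rough illustration, a dynamic view combining both ideas might look something like the sketch below; this is only an approximation of the approach described above (the view name is a placeholder, the column names follow the departure-delays sample data, and the exact definition used in the post may differ).

spark.sql("""
CREATE OR REPLACE VIEW myfirstcatalog.mytestDB.vw_flightdata_secure AS
SELECT
  origin,
  destination,
  -- Column-level masking: only flight_admin members see the real delay value.
  CASE WHEN is_account_group_member('flight_admin') THEN delay ELSE NULL END AS delay
FROM myfirstcatalog.mytestDB.mytestflightdata
-- Row-level filtering: flight_admin sees everything, origin-specific groups see only their origin.
WHERE is_account_group_member('flight_admin')
   OR (is_account_group_member('ABE_admin') AND origin = 'ABE')
   OR (is_account_group_member('ATL_admin') AND origin = 'ATL')
""")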
That's all for this blog post. Hope this helps!