Building the Lakehouse - Implementing a Data Lake Strategy with Azure Synapse



 

Author(s): Arshad Ali and Abid Nazir Guroo are Program Managers on the Azure Synapse Customer Success Engineering (CSE) team.

 

 

 

Introduction

 

 

Data Lakehouse architecture has become the de facto standard for designing and building data platforms for analytics, as it bridges the gap and breaks the silos created by the traditional/modern data warehouse and the data lake. This blog post introduces you to the world of the data lakehouse and goes into the details of how to implement it successfully in Azure with Azure Synapse Analytics.

 

 

 

What is a Lakehouse

 

 

Traditionally, organizations have used a data warehouse for their analytical needs. As business requirements evolved and data scale increased, they started adopting a modern data warehouse architecture, which can process massive amounts of data in a relational format and in parallel across multiple compute nodes. At the same time, they started collecting and managing their non-relational big data, in semi-structured or unstructured formats, with a data lake.

 

Figure 1 - Data Lake vs Modern Data Warehouse - A silo created by two different yet related data platforms

 

 

 

These two disparate yet related systems ran in silos, increasing development time, operational overhead, and overall total cost of ownership. They also inconvenienced end users, who had to integrate data themselves whenever they needed access to data from both systems to meet their business requirements.

 

Figure 2 - Data Lakehouse - bridging the gap by combining the best of both worlds

 

 

 

Data Lakehouse architecture combines the best of both worlds, bringing the capabilities of both these earlier data platform architectures together in a single unified data platform, sometimes also referred to as the medallion architecture. In other words, the data lakehouse is the one platform to unify all your data, analytics, and Artificial Intelligence/Machine Learning (AI/ML) workloads.

 

Figure 3 - Data Lakehouse - Architectural View

 

 

 

This lakehouse data platform architecture provides several advantages:

 

 

 

  • Unified data platform architecture for all your data
    • Lakehouse brings the best of the data lake and the data warehouse together in a single unified data platform
    • It’s a single source of truth for data of all types and formats: structured, semi-structured, unstructured, streaming, and transactional
    • It has integrated capabilities for storage, ETL (Extraction, Transformation, and Loading) / ELT (Extraction, Loading, and Transformation), parallel processing for scale, and reporting
    • This single unified data platform provides unmatched Total Cost of Ownership (TCO): a single support model, plus built-in security, identity, and compliance plumbing with Azure

 

 

 

  • Collaborative environment for use cases and to boost productivity
    • Data engineers, data scientists, and business analysts can work together on the same unified data lakehouse
    • They can collaborate on the common data set using their favorite tools and IDEs (Integrated Development Environments)
    • When you are using Apache Spark with a Spark pool in Azure Synapse, you can write your Spark application in the language of your choice (PySpark, Scala, SQL, .NET, R) or whichever language you are most comfortable with, so you don’t have to learn a new language. Additionally, irrespective of which language you use, all the code you write against the DataFrame and Dataset APIs goes through the same Spark Catalyst optimizer and gives you a similar level of optimization and performance (a short sketch follows this list).
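
As a quick illustration of that point, here is a minimal sketch, run from a Synapse notebook attached to a Spark pool, showing the same aggregation expressed once with the PySpark DataFrame API and once with Spark SQL; the storage path, view name, and columns are hypothetical. Both forms compile to the same optimized plan via Catalyst.

# Hypothetical Delta table of sales data in the data lake
sales_path = "abfss://lakehouse@<storageaccount>.dfs.core.windows.net/silver/sales"
sales_df = spark.read.format("delta").load(sales_path)

# The same aggregation via the DataFrame API...
by_region_api = sales_df.groupBy("region").sum("amount")

# ...and via Spark SQL; both go through the Catalyst optimizer
sales_df.createOrReplaceTempView("sales")
by_region_sql = spark.sql("SELECT region, SUM(amount) AS amount FROM sales GROUP BY region")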

 

 

 

  • Proven ecosystem for data and Artificial Intelligence (AI) innovation
    • It has support for batch and streaming for data ingestion and processing
    • It has support for open source, open standards, open libraries, and frameworks on top of data lakes with no lock-in
    • It has a single place for governance, data lineage, and security

How to implement it

 

 

You can architect and implement a data lakehouse on Azure with Azure Synapse Analytics, which natively integrates with other services and offers features and capabilities for end-to-end analytical needs.

 


Azure Synapse Analytics is an enterprise analytics service that accelerates time to insight across data warehouses and big data systems. Azure Synapse brings together the best of SQL technologies used in enterprise data warehousing, Spark technologies used for big data, Data Explorer for log and time series analytics, Pipelines for data integration and ETL/ELT, and deep integration with other Azure services such as Power BI, CosmosDB, and AzureML.

What is Azure Synapse Analytics? - Azure Synapse Analytics

 

 

 

With the goal of incrementally and progressively improving the structure and quality of the data as it flows through each stage, data lakehouse architecture proposes multiple stages (zones) of data, as follows:

 

  • Raw or Bronze Zone – data lands in this zone directly from the source systems in its original format. This zone is generally considered append-only and immutable.
  • Enriched or Silver Zone – data lands in this zone from the raw zone after it has been cleaned of erroneous records, normalized in format, and enriched (with other sources) to provide an “enterprise view” of all business entities (such as customers, products, and sales transactions). Typically, one folder is created for each date of data ingestion (no overwrite of the data).
  • Curated or Gold Zone – data lands in this zone from the enriched zone after it has been merged/aggregated and is ready to be served to users via reporting services or other downstream systems. Usually, data in this zone is de-normalized and optimized for reads by minimizing the joins needed to query the data.

 

Azure Synapse Analytics provides the Spark pool compute engine, which can be leveraged to prepare and transform data from one stage of the data lakehouse to the next. Spark pools are powered by Apache Spark, a parallel processing framework that supports in-memory processing to boost the performance of big data analytic applications. Apache Spark in Azure Synapse Analytics is one of Microsoft's implementations of Apache Spark in the cloud. Azure Synapse makes it easy to create and configure an Apache Spark pool in Azure. Spark pools in Azure Synapse are compatible with Azure Storage and Azure Data Lake Storage Gen2, where the data for the data lakehouse is stored.
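
To make the movement between zones concrete, here is a minimal sketch of a Synapse Spark (PySpark) notebook cell that reads raw CSV files from a bronze folder, applies a simple cleansing step, and appends the result to the silver zone as a Delta Lake table partitioned by ingestion date. The storage account, container, folder names, and columns are all hypothetical, and the cleansing logic stands in for whatever normalization and enrichment your data actually needs.

from pyspark.sql import functions as F

# Hypothetical ADLS Gen2 paths for the bronze and silver zones
bronze_path = "abfss://lakehouse@<storageaccount>.dfs.core.windows.net/bronze/sales/2022/11/01/"
silver_path = "abfss://lakehouse@<storageaccount>.dfs.core.windows.net/silver/sales"

# Read the raw files exactly as they landed from the source system
raw_df = spark.read.option("header", "true").csv(bronze_path)

# Minimal cleansing/normalization: drop rows without a key, cast types, stamp the ingestion date
clean_df = (raw_df
    .dropna(subset=["customer_id"])
    .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
    .withColumn("ingest_date", F.lit("2022-11-01")))

# Append into the silver zone as a Delta Lake table, partitioned by ingestion date
(clean_df.write
    .format("delta")
    .mode("append")
    .partitionBy("ingest_date")
    .save(silver_path))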

 

Figure 4 - Implementing Data Lakehouse with Azure Synapse

 

 

 

Apache Parquet is a columnar data storage file format and is the default file format in Apache Spark. While you can use it as the file format for the different stages of the data lakehouse, we would recommend using the Delta Lake format, an open-format storage layer that brings ACID (atomicity, consistency, isolation, and durability) transactions to Apache Spark and big data workloads. It delivers reliability, security, and performance on the data lakehouse for both streaming and batch operations. Underneath, it stores the data in the Parquet file format; however, it additionally maintains transaction logs and statistics to provide features and performance improvements over standard Parquet.

 

Using the Delta Lake format over generic file formats gives advantages such as:

 

  • Supports ACID properties, especially durability, and prevents data corruption
  • Faster read queries
  • Increases data freshness
  • Supports data rollback using time travel (see the sketch after this list)
  • Supports both batch and streaming use cases
  • Enables regulatory compliance using the audit history when required
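
As an illustration of the time travel and audit history points, the sketch below reads an earlier version of a hypothetical silver-zone Delta table and inspects its change history; the path and version number are assumptions.

from delta.tables import DeltaTable

silver_path = "abfss://lakehouse@<storageaccount>.dfs.core.windows.net/silver/sales"

# Read the table as it looked at an earlier version (time travel)
previous_df = spark.read.format("delta").option("versionAsOf", 0).load(silver_path)

# Inspect the audit history (versions, timestamps, operations) kept in the transaction log
history_df = DeltaTable.forPath(spark, silver_path).history()
history_df.select("version", "timestamp", "operation").show(truncate=False)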

 

With native Delta Lake support in Azure Synapse, you can build the different zones of the data lakehouse with Delta Lake tables. In a typical data lakehouse, the raw zone contains the data in the same format as the source, whereas the enriched and curated zones are implemented using Delta Lake tables. However, there are scenarios (or preferences) where you can use Delta Lake tables for the raw zone as well.

 


Azure Synapse Analytics is compatible with Linux Foundation Delta Lake. The current version of Delta Lake included with Azure Synapse has language support for Scala, PySpark, and .NET.

What is Delta Lake - Azure Synapse Analytics

 

 

 

Figure 5 - Data Lakehouse architecture and the different use cases it can serve

 

 

 

Lakehouse and Data Mesh Compatibility

 

 

Data mesh is a data platform architectural pattern for implementing enterprise data platforms in large, complex organizations. Data mesh is a technical pattern that also requires organizational change. Data mesh can be an effective way to implement large/complex enterprise data platforms, but it isn't the best solution for all organizations (especially the smaller ones with minimal complexity). Data mesh's goal is to let distributed teams work with and share information in a decentralized and agile manner. The benefits of a data mesh approach are achieved by implementing multi-disciplinary teams that publish and consume data products. You can learn more about Data Mesh here: Data Mesh Principles and Logical Architecture (martinfowler.com)

 

The proposed lakehouse architecture is compatible with and can be used to implement a data mesh for large/complex enterprise data platforms. What is a data mesh? - Cloud Adoption Framework | Microsoft Docs

 

 

 

Key considerations

 

  • For the bronze zone, you can store the data in its original format, Parquet, or Delta Lake.
  • For the silver and gold zones, we would recommend the Delta Lake format because of the additional capabilities and performance enhancements it provides.
  • Any big data platform is designed for a small number of large files; a large number of small files degrades performance because the compute engine must deal with many metadata and file operations. It is suggested to coalesce/combine data into files of approximately 1 GB each.
  • Delta Lake has an Optimize feature that coalesces smaller files into larger files transparently, without impacting other readers and writers of the data.
  • Consider how much data you are storing in the data lake – Delta Lake maintains a history of all changes by default, so the historical data will grow over time. Based on your business requirements, you may want to keep historical data only for a certain duration (for example, the last month or the last year) to optimize your storage costs. To clean up older historical data from a Delta Lake table, Delta Lake provides the VACUUM command. Please note, by default you cannot delete historical data from within the last 7 days; this is to maintain consistency in the data. It is controlled by a table property (delta.deletedFileRetentionDuration = "interval <interval>") that determines how long ago a file must have been deleted before it becomes a candidate for VACUUM. A maintenance sketch follows this list.
  • When storing data within each zone, it is recommended to use a partitioned folder structure wherever applicable. This technique helps improve data manageability and query performance: partitioned data in a folder structure lets query engines find specific data entries faster through partition pruning/elimination.
  • Typically, you will append data into your target table as new data arrives; however, in some cases you might have to merge the data (for example, update existing records), and for that use case you can use the MERGE command with Delta Lake tables. When your target table is partitioned, include a partition filter so that the Spark engine can prune the partitions not needed for the update and optimize merge performance (see the merge sketch after this list).
  • Finally, plan who needs access to which parts of the data in the lakehouse, and which transaction patterns they will use when accessing it.
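
Following up on the history and VACUUM consideration above, here is a minimal maintenance sketch for a hypothetical silver-zone Delta table: it sets the deleted-file retention to roughly 30 days and then vacuums files older than that window. The path and retention value are assumptions; adjust them to your own compliance and cost requirements.

from delta.tables import DeltaTable

silver_path = "abfss://lakehouse@<storageaccount>.dfs.core.windows.net/silver/sales"

# Keep deleted files for ~30 days before they become VACUUM candidates
spark.sql(f"ALTER TABLE delta.`{silver_path}` "
          "SET TBLPROPERTIES (delta.deletedFileRetentionDuration = 'interval 30 days')")

# Remove files that were deleted more than 30 days (720 hours) ago
DeltaTable.forPath(spark, silver_path).vacuum(720)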
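
And for the MERGE consideration, the sketch below upserts a batch of changed rows into a hypothetical gold-zone Delta table partitioned by ingest_date, with a partition filter in the merge condition so that Spark only touches the affected partition. The paths, column names, and the partition value are assumptions.

from delta.tables import DeltaTable

gold_path = "abfss://lakehouse@<storageaccount>.dfs.core.windows.net/gold/sales_summary"
target = DeltaTable.forPath(spark, gold_path)

# Hypothetical batch of changed rows for a single ingestion date, taken from the silver zone
updates_df = (spark.read.format("delta")
    .load("abfss://lakehouse@<storageaccount>.dfs.core.windows.net/silver/sales")
    .where("ingest_date = '2022-11-01'"))

# MERGE with a partition filter in the condition so only that partition is rewritten
(target.alias("t")
    .merge(updates_df.alias("s"),
           "t.ingest_date = '2022-11-01' AND t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())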

Summary

 

 

With the proliferation of the data lakehouse as a data platform, it is important to understand what technologies and capabilities are available to build it successfully for scale, performance, and security. In this blog post, we discussed the data lakehouse architecture, why it matters, how to implement it successfully with Azure Synapse Analytics, and the key considerations during implementation.

 

 

 

Our team will be publishing blogs regularly and you can find all these blogs here: Azure Synapse Analytics Blog

 

 

 

For a deeper understanding of Synapse implementation best practices, please refer to our Success By Design (SBD) site: https://aka.ms/Synapse-Success-By-Design

 
