
Synapse Spark - Encryption, Decryption and Data Masking



 

Author(s): Arshad Ali is a Program Manager on the Azure Synapse Customer Success Engineering (CSE) team.

 

 

 

Introduction

 

 

As data engineers, we often get requirements to encrypt, decrypt, mask, or anonymize certain columns of data in files sitting in the data lake when preparing and transforming data with Apache Spark. The extensibility of Spark allows us to leverage libraries that are not native to Spark. One such library is Microsoft Presidio, which provides fast identification and anonymization modules for private entities in text such as credit card numbers, names, locations, social security numbers, bitcoin wallets, US phone numbers, financial data, and more. It facilitates both fully automated and semi-automated PII (Personally Identifiable Information) de-identification and anonymization flows on multiple platforms.

 

 

 

In this blog post, I am going to demonstrate, step by step, how to download and use this library to meet the above requirements with the Spark pool of Azure Synapse Analytics.

 

 

 

Getting it ready

 

 

Microsoft Presidio is an open-source library from Microsoft, which can be used with Spark to ensure private and sensitive data is properly managed and governed. It mainly provides two modules: the analyzer module for fast identification of private entities in text, such as credit card numbers, names, locations, social security numbers, bitcoin wallets, US phone numbers, financial data, and more, and the anonymizer module to anonymize those entities.

 

 

 

Presidio analyzer

 

 

The Presidio analyzer is a Python based service for detecting PII entities in text. During analysis, it runs a set of different PII Recognizers, each one in charge of detecting one or more PII entities using different mechanisms. It comes with a set of predefined recognizers but can easily be extended with other types of custom recognizers. Predefined and custom recognizers leverage regex, Named Entity Recognition (NER), and other types of logic to detect PII in unstructured text.
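As a quick illustration (not part of the original post), here is a minimal sketch of extending the analyzer with a simple deny-list based custom recognizer; the TITLE entity name and the deny list values are made up for the example:

from presidio_analyzer import AnalyzerEngine, PatternRecognizer

# A simple custom recognizer built from a deny list
# (the "TITLE" entity and the deny list values are illustrative)
titles_recognizer = PatternRecognizer(supported_entity="TITLE",
                                      deny_list=["Mr.", "Mrs.", "Miss"])

# Register the custom recognizer alongside the predefined ones
analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(titles_recognizer)

results = analyzer.analyze(text="Mr. Smith called this morning", language="en")
print(results)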

 


 

 

 

You can download this library from here by clicking on “Download files” under Navigation on the left of the page: presidio-analyzer

 

 

 

Presidio anonymizer

 

 

The Presidio anonymizer is a Python based module for anonymizing detected PII text entities with desired values. The Presidio anonymizer supports both anonymization and deanonymization by applying different operators. Operators are built-in text manipulation classes which, like custom recognizers on the analyzer side, can easily be extended. It contains both Anonymizers and Deanonymizers:

 

  • Anonymizers are used to replace a PII entity text with some other value by applying a certain operator (e.g., replace, mask, redact, encrypt).
  • Deanonymizers are used to revert the anonymization operation (e.g., to decrypt an encrypted text).

 


 

 

 

This library includes several built-in operators, such as replace, redact, mask, hash, and encrypt.

 

Step 1 - You can download this library from here by clicking on “Download files” under Navigation on the left of the page: presidio-anonymizer

 


Additionally, Presidio also contains an Image Redactor module, another Python based module, used for detecting and redacting PII text entities in images. You can learn more about it here: Handling images - Microsoft Presidio
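For reference, a minimal sketch of the image redactor API, assuming the presidio-image-redactor package (and the Tesseract OCR engine it relies on) is available in your environment; the file names and fill color are illustrative:

from PIL import Image
from presidio_image_redactor import ImageRedactorEngine

# Open an image that contains PII text (file name is a placeholder)
image = Image.open("id_card.png")

# Redact the detected PII text by drawing filled boxes over it (fill color is RGB)
engine = ImageRedactorEngine()
redacted_image = engine.redact(image, (255, 192, 203))
redacted_image.save("id_card_redacted.png")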

 

 

 

By default, Presidio uses an open-source NLP model (the en_core_web_lg model from spaCy); however, it can be customized to leverage other NLP engines as well, either public or proprietary. You can download this default NLP engine library from here:

 

English · spaCy Models Documentation

 

Release en_core_web_lg-3.4.0 · explosion/spacy-models
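As an aside (not part of the original walkthrough), this is roughly how the analyzer can be pointed at a specific spaCy model through Presidio's NlpEngineProvider; swap the model name if you want to use a different engine:

from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NlpEngineProvider

# Configure the NLP engine to use the en_core_web_lg spaCy model
configuration = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "en", "model_name": "en_core_web_lg"}],
}

provider = NlpEngineProvider(nlp_configuration=configuration)
nlp_engine = provider.create_engine()

# Build an analyzer that uses the configured NLP engine
analyzer = AnalyzerEngine(nlp_engine=nlp_engine, supported_languages=["en"])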

 

Step 2 - Once you have downloaded all three libraries, you can upload them to the Synapse workspace, as documented here (Manage workspace libraries for Apache Spark - Azure Synapse Analytics) and shown in the image below:

 

 

 


 

Figure 1 - Upload Libraries to Synapse Workspace

 

 

 

Given that the NLP engine library is relatively large, you might have to wait a couple of minutes for the upload to complete. Once successfully uploaded, you will see a “Succeeded” status message for each of these libraries, as shown below:

 


 

Figure 2 - Required libraries uploaded to Synapse Workspace

 

 

 

Step 3 - Next, you have to apply these libraries from the Synapse workspace to the Spark pool where you are going to use them. Here are the instructions on how to do that, and the screenshot below shows how it looks:

 

Manage Spark pool level libraries for Apache Spark - Azure Synapse Analytics

 


 

Figure 3 - Applying libraries to Synapse Spark pool

 

 

 

Once you hit “Apply”, Synapse triggers a system job to install and cache the specified libraries on the selected Spark pool. This process helps reduce overall session startup time. Once this system job completes successfully, all new sessions will pick up the updated pool libraries.

 

 

 

Putting it all together in action

 

 

Step 1 - First things first, we need to import the relevant classes/modules from the libraries we just applied to the Spark pool (along with a few classes/modules from other existing libraries).

 

 

 

 

 

from presidio_analyzer import AnalyzerEngine, PatternRecognizer, EntityRecognizer, Pattern, RecognizerResult
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig
from pyspark.sql.types import StringType
from pyspark.sql.functions import input_file_name, regexp_replace
from pyspark.sql.functions import col, pandas_udf
import pandas as pd

 

 

 

 

 

Presidio Analyzer

 

 

Step 2 - Next, you can use the analyzer module to detect PII entities in text. Here is an example that detects a phone number in the given text.

 

 

 

 

 

# Set up the engine, loads the NLP module (spaCy model by default) and other PII recognizers
analyzer = AnalyzerEngine()

# Call analyzer to get results
results = analyzer.analyze(text="My phone number is 212-555-5555",
                           entities=["PHONE_NUMBER"],
                           language='en')
print(results)

 

 

 

 

 

 

 

As you can see, it detected the phone number, which starts at position 19 and ends at position 31, with a score of 75%; the printed result looks similar to [type: PHONE_NUMBER, start: 19, end: 31, score: 0.75].

 


 

 

 

In addition to the PHONE_NUMBER entity, which we used earlier, you can use any of the other built-in entities (for example, PERSON, EMAIL_ADDRESS, CREDIT_CARD, IP_ADDRESS, LOCATION, and US_SSN) or use custom developed entities.
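As a hedged aside (not in the original post), you can also list the entity types supported by the analyzer's predefined recognizers directly from code, assuming the get_supported_entities helper on the analyzer engine:

# List the entity types the analyzer's predefined recognizers can detect
analyzer = AnalyzerEngine()
print(analyzer.get_supported_entities(language="en"))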

 


 

 

 

For example, the next code uses two entities, PERSON and PHONE_NUMBER, to detect the name of the person and the phone number in the given text:

 

 

 

 

 

# Set up the engine, loads the NLP module (spaCy model by default) and other PII recognizers
analyzer = AnalyzerEngine()

# Call analyzer to get results
results = analyzer.analyze(text="My name is David and my number is 212-555-1234",
                           entities=["PERSON", "PHONE_NUMBER"],
                           language='en')
print(results)

 

 

 

 

 


 

Presidio anonymizer

 

 

Step 3 - Once you have used the analyzer to identify private or sensitive data in the text, you can use the anonymizer class to anonymize it using different operators.

 

 

 

Anonymization Example

 

 

Here is an example of anonymizing the identified sensitive data by using the replace operator. In this example, for simplicity, I am using hard-coded recognizer results; however, you can have this information coming directly from the analyzer at runtime.

 

 

 

 

 

# Anonymization Example

from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import RecognizerResult, OperatorConfig

# Initialize the engine.
engine = AnonymizerEngine()

# Invoke the anonymize function with the text,
# analyzer results (potentially coming from presidio-analyzer) and
# operators to get the anonymization output:
result = engine.anonymize(
    text="My name is Bond, James Bond",
    analyzer_results=[
        RecognizerResult(entity_type="PERSON", start=11, end=15, score=0.8),
        RecognizerResult(entity_type="PERSON", start=17, end=27, score=0.8),
    ],
    operators={"PERSON": OperatorConfig("replace", {"new_value": "<ANONYMIZED>"})},
)

print(result)

 

 

 

 

 


 

Figure 4 - Person name anonymized by using replace operator
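The data masking mentioned in the title is not shown explicitly in the original examples; as a hedged sketch, masking all but the last four digits of the phone number detected in the earlier analyzer example might look like this (the operator parameters follow the presidio-anonymizer documentation):

# Masking Example (sketch)

from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import RecognizerResult, OperatorConfig

engine = AnonymizerEngine()

# Mask the first 8 characters of the detected phone number, keeping the last 4 digits
masked_result = engine.anonymize(
    text="My phone number is 212-555-5555",
    analyzer_results=[
        RecognizerResult(entity_type="PHONE_NUMBER", start=19, end=31, score=0.75),
    ],
    operators={"PHONE_NUMBER": OperatorConfig("mask", {"masking_char": "*",
                                                       "chars_to_mask": 8,
                                                       "from_end": False})},
)

print(masked_result.text)  # expected: My phone number is ********5555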

 

 

 

Encryption Example

 

 

Step 4 - The next example demonstrates how you can use the encrypt operator to encrypt identified sensitive data in the text. Again, in this example, for simplicity, I am using hard-coded recognizer results; however, you can have this information coming directly from the analyzer at runtime.

 

Also, I have hard-coded the encryption key; however, in your case you would retrieve it from Azure Key Vault, as sketched below and described in: Introduction to Microsoft Spark utilities - Azure Synapse Analytics
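For illustration, here is a minimal sketch of fetching the key from Azure Key Vault with Synapse's mssparkutils; the Key Vault name, secret name, and linked service name below are placeholders:

from notebookutils import mssparkutils  # available by default in Synapse notebooks

# Retrieve the encryption key from Azure Key Vault through a Synapse linked service
# ("myKeyVault", "encryption-key", and "myKeyVaultLinkedService" are placeholder names)
encryption_key = mssparkutils.credentials.getSecret(
    "myKeyVault",              # Key Vault name
    "encryption-key",          # secret name
    "myKeyVaultLinkedService"  # linked service name
)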

 

 

 

 

 

# Encryption Example

encryption_key = "WmZq4t7w!z%C&F)J"  # in the real world, this will come from Azure Key Vault

engine = AnonymizerEngine()

# Invoke the anonymize function with the text,
# analyzer results (potentially coming from presidio-analyzer)
# and an 'encrypt' operator to get an encrypted anonymization output:
anonymize_result = engine.anonymize(
    text="My name is Bond, James Bond",
    analyzer_results=[
        RecognizerResult(entity_type="PERSON", start=11, end=15, score=0.8),
        RecognizerResult(entity_type="PERSON", start=17, end=27, score=0.8),
    ],
    operators={"PERSON": OperatorConfig("encrypt", {"key": encryption_key})},
)

anonymize_result

 

 

 

 

 


 

Figure 5 - Person name anonymized by using encrypt operator

 

 

 

Decryption Example

 

 

Step 5 - Just as the encrypt operator encrypts identified private and sensitive data, the decrypt operator can be used to decrypt already encrypted private data with the same key used during encryption.

 

 

 

 

 

# Decryption Example

from presidio_anonymizer import DeanonymizeEngine

# Initialize the engine:
engine = DeanonymizeEngine()

# Fetch the anonymized text from the result.
anonymized_text = anonymize_result.text

# Fetch the anonymized entities from the result.
anonymized_entities = anonymize_result.items

# Invoke the deanonymize function with the text, anonymizer results
# and a 'decrypt' operator to get the original text as output.
deanonymized_result = engine.deanonymize(
    text=anonymized_text,
    entities=anonymized_entities,
    operators={"DEFAULT": OperatorConfig("decrypt", {"key": encryption_key})},
)

deanonymized_result

 

 

 

 

 


 

Figure 6 - Person name decrypted by using decrypt operator

 

 

 

Spark Example – Using it with Dataframe and UDF

 

 

Apache Spark is a distributed data processing platform, and to use these libraries in Spark you can encapsulate your logic in a user defined function (UDF). You can then use that function to perform operations (anonymize, encrypt, decrypt, etc.) on a Spark DataFrame, as demonstrated below for the replace operator.

 

 

 

 

 

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()
broadcasted_analyzer = sc.broadcast(analyzer)
broadcasted_anonymizer = sc.broadcast(anonymizer)

# Define a pandas UDF function and a series function over it.
# Note that the analyzer and anonymizer are broadcasted.

def anonymize_text(text: str) -> str:
    analyzer = broadcasted_analyzer.value
    anonymizer = broadcasted_anonymizer.value
    analyzer_results = analyzer.analyze(text=text, language="en")
    anonymized_results = anonymizer.anonymize(
        text=text,
        analyzer_results=analyzer_results,
        operators={"DEFAULT": OperatorConfig("replace", {"new_value": "<ANONYMIZED>"})},
    )
    return anonymized_results.text

def anonymize_series(s: pd.Series) -> pd.Series:
    return s.apply(anonymize_text)

# Register the function as a pandas UDF
anonymize = pandas_udf(anonymize_series, returnType=StringType())
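The original post shows the input and anonymized DataFrames only as screenshots (Figures 7 and 8 below); as a hedged illustration, applying the UDF might look roughly like this, where df and the Email and IPAddress column names are assumptions based on those figures:

# Apply the UDF to the columns that contain PII
# (df and the column names "Email" and "IPAddress" are assumptions based on Figures 7 and 8)
anonymized_df = (
    df.withColumn("Email", anonymize(col("Email")))
      .withColumn("IPAddress", anonymize(col("IPAddress")))
)
display(anonymized_df)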

 

 

 

 

 


 

Figure 7 - Sample data with no encryption yet (for Email and IP Address)

 

 

 


 

Figure 8 - Sample data anonymized for Email and IP Address

 

 

 

What we discussed so far barely scratched the surface. The possibilities are endless, and Presidio includes several samples for various kinds of scenarios. You can find more details here: Samples - Microsoft Presidio

 

Here is the FAQ: FAQ - Microsoft Presidio

 

 

 

Summary

 

 

As we look to have better control and governance for compliance, we are often tasked with encrypting, decrypting, masking, or anonymizing certain columns containing private or sensitive information. In this blog post, I demonstrated how you can use the Microsoft Presidio library with the Spark pool of Azure Synapse Analytics to perform these operations on data at scale.

 

 

 

Our team will be publishing blogs regularly and you can find all these blogs here: Azure Synapse Analytics Blog

 

 

 

For deeper level understanding of Synapse implementation best practices, please refer to our Success By Design (SBD) site: https://aka.ms/Synapse-Success-By-Design

 
