Guest ArshadAliTMMBA
Posted October 6, 2022

Author(s): Arshad Ali is a Program Manager in the Azure Synapse Customer Success Engineering (CSE) team.

Introduction

As data engineers, we often get requirements to encrypt, decrypt, mask, or anonymize certain columns of data in files sitting in the data lake when preparing and transforming data with Apache Spark. Spark's extensibility lets us use libraries that are not native to Spark. One such library is Microsoft Presidio, which provides fast identification and anonymization modules for private entities in text such as credit card numbers, names, locations, social security numbers, bitcoin wallets, US phone numbers, financial data, and more. It facilitates both fully automated and semi-automated PII (Personally Identifiable Information) de-identification and anonymization flows on multiple platforms.

In this blog post, I am going to demonstrate step by step how to download and use this library to meet the above requirements with a Spark pool in Azure Synapse Analytics.

Getting it ready

Microsoft Presidio is an open-source library from Microsoft, which can be used with Spark to ensure private and sensitive data is properly managed and governed. It mainly provides two modules: the analyzer module for fast identification and the anonymizer module for anonymizing private entities in text, such as the entity types listed above.

Presidio analyzer

The Presidio analyzer is a Python-based service for detecting PII entities in text. During analysis, it runs a set of PII recognizers, each in charge of detecting one or more PII entities using different mechanisms. It comes with a set of predefined recognizers but can easily be extended with custom recognizers.
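To make the idea of a recognizer more concrete, here is a minimal sketch of a regex-based detector that uses only the Python standard library. It is not Presidio's implementation: the Span class, the find_phone_numbers helper, the pattern, and the fixed 0.75 score are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    """Hypothetical stand-in for a recognizer result (not a Presidio class)."""
    entity_type: str
    start: int   # offset of the first matched character
    end: int     # offset one past the last matched character
    score: float


# Assumed, deliberately simple pattern for numbers shaped like 212-555-5555.
PHONE_PATTERN = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def find_phone_numbers(text: str, score: float = 0.75) -> List[Span]:
    """Return one Span per regex match, with character offsets into text."""
    return [
        Span("PHONE_NUMBER", m.start(), m.end(), score)
        for m in PHONE_PATTERN.finditer(text)
    ]


spans = find_phone_numbers("My phone number is 212-555-5555")
print(spans)
```

A real recognizer would usually combine several patterns with context words or validation logic; the sketch only shows the shape of the result: an entity type, character offsets, and a confidence score.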
Predefined and custom recognizers leverage regex, Named Entity Recognition (NER), and other types of logic to detect PII in unstructured text.

You can download this library from here by clicking on “Download files” under Navigation on the left of the page: presidio-analyzer

Presidio anonymizer

The Presidio anonymizer is a Python-based module for anonymizing detected PII text entities with desired values. It supports both anonymization and deanonymization by applying different operators. Operators are built-in text manipulation classes that can be extended, just as the analyzer can be extended with custom recognizers. It contains both Anonymizers and Deanonymizers:

Anonymizers replace a PII entity text with some other value by applying a certain operator (e.g., replace, mask, redact, encrypt).
Deanonymizers revert the anonymization operation (e.g., decrypt an encrypted text).

This library includes several built-in operators.

Step 1 - You can download this library from here by clicking on “Download files” under Navigation on the left of the page: presidio-anonymizer

Presidio also contains an Image Redactor module, another Python-based module, which is used for detecting and redacting PII text entities in images. You can learn more about it here: Handling images - Microsoft Presidio

Presidio uses an NLP engine, which by default is an open-source model (the en_core_web_lg model from spaCy); however, it can be customized to leverage other NLP engines as well, either public or proprietary.
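Returning to the anonymizer operators described above, the idea can be sketched without Presidio. The block below is a standard-library illustration only: apply_operators, the operator factory functions, and the (entity_type, start, end) tuples are assumptions for this sketch, not the library's API. It rewrites spans from right to left so that earlier offsets stay valid as the text changes length.

```python
from typing import Callable, Dict, List, Tuple

# (entity_type, start, end): hypothetical span shape used by this sketch only.
Span = Tuple[str, int, int]
Operator = Callable[[str], str]


def replace_with(new_value: str) -> Operator:
    """Operator that swaps the matched text for a fixed value."""
    return lambda original: new_value


def mask_with(char: str = "*") -> Operator:
    """Operator that keeps the length but hides every character."""
    return lambda original: char * len(original)


def redact() -> Operator:
    """Operator that removes the matched text entirely."""
    return lambda original: ""


def apply_operators(text: str, spans: List[Span],
                    operators: Dict[str, Operator]) -> str:
    """Apply the operator registered for each span's entity type.

    Falls back to the "DEFAULT" operator when the type has no entry and
    leaves the span untouched when neither exists.
    """
    # Process the right-most span first so earlier offsets are not shifted.
    for entity_type, start, end in sorted(spans, key=lambda s: s[1], reverse=True):
        op = operators.get(entity_type, operators.get("DEFAULT"))
        if op is None:
            continue
        text = text[:start] + op(text[start:end]) + text[end:]
    return text


text = "My name is Bond, James Bond"
spans = [("PERSON", 11, 15), ("PERSON", 17, 27)]

replaced = apply_operators(text, spans, {"PERSON": replace_with("<ANONYMIZED>")})
masked = apply_operators(text, spans, {"DEFAULT": mask_with("*")})
print(replaced)  # My name is <ANONYMIZED>, <ANONYMIZED>
print(masked)    # My name is ****, **********
```

Keying operators by entity type, with a "DEFAULT" fallback, mirrors the way the operators dictionary is used in the examples later in this post.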
You can download the default NLP engine library (the spaCy en_core_web_lg model) from here: English · spaCy Models Documentation and Release en_core_web_lg-3.4.0 · explosion/spacy-models

Step 2 - Once you have downloaded all three libraries, you can upload them to the Synapse workspace, as documented here (Manage workspace libraries for Apache Spark - Azure Synapse Analytics) and shown in the image below:

Figure 1 - Upload Libraries to Synapse Workspace

Because the NLP engine library is larger than the other two, you might have to wait a couple of minutes for the upload to complete. Once successfully uploaded, you will see a “Succeeded” status message for each of these libraries, as shown below:

Figure 2 - Required libraries uploaded to Synapse Workspace

Step 3 - Next, apply these libraries from the Synapse workspace to the Spark pool where you are going to use them. Here are the instructions on how to do that, and the screenshot below shows how it looks: Manage Spark pool level libraries for Apache Spark - Azure Synapse Analytics

Figure 3 - Applying libraries to Synapse Spark pool

Once you hit “Apply”, Synapse triggers a system job to install and cache the specified libraries on the selected Spark pool. This process helps reduce overall session startup time. Once this system job has completed successfully, all new sessions will pick up the updated pool libraries.

Putting it all together in action

Step 1 - First things first: we need to import the relevant classes/modules from the libraries we just applied to the Spark pool, along with classes/modules from other existing libraries.
    from presidio_analyzer import AnalyzerEngine, PatternRecognizer, EntityRecognizer, Pattern, RecognizerResult
    from presidio_anonymizer import AnonymizerEngine
    from presidio_anonymizer.entities import OperatorConfig
    from pyspark.sql.types import StringType
    from pyspark.sql.functions import input_file_name, regexp_replace
    from pyspark.sql.functions import col, pandas_udf
    import pandas as pd

Presidio Analyzer

Step 2 - Next, you can use the analyzer module to detect PII entities in text. Here is an example that detects a phone number in the given text.

    # Set up the engine; this loads the NLP module (spaCy model by default)
    # and the other PII recognizers
    analyzer = AnalyzerEngine()

    # Call the analyzer to get results
    results = analyzer.analyze(text="My phone number is 212-555-5555",
                               entities=["PHONE_NUMBER"],
                               language='en')
    print(results)

As you can see, it detected the phone number, which starts at position 19 and ends at position 31, with a score of 75%.

In addition to the phone number entity used above, you can use any of the other built-in entities or custom-developed entities. For example, the next code uses two entities, PERSON and PHONE_NUMBER, to detect the name of the person and the phone number in the given text:

    # Set up the engine; this loads the NLP module (spaCy model by default)
    # and the other PII recognizers
    analyzer = AnalyzerEngine()

    # Call the analyzer to get results
    results = analyzer.analyze(text="My name is David and my number is 212-555-1234",
                               entities=["PERSON", "PHONE_NUMBER"],
                               language='en')
    print(results)

Presidio anonymizer

Step 3 - Once you have used the analyzer to identify the text with private or sensitive data, you can use the anonymizer class to anonymize it using different operators.

Anonymization Example

Here is an example of anonymizing the identified sensitive data by using the replace operator.
In this example, for simplicity, I am using hard-coded recognizer results; however, you can have this information coming directly from the analyzer at runtime.

    # Anonymization Example
    from presidio_anonymizer import AnonymizerEngine
    from presidio_anonymizer.entities import RecognizerResult, OperatorConfig

    # Initialize the engine.
    engine = AnonymizerEngine()

    # Invoke the anonymize function with the text,
    # analyzer results (potentially coming from presidio-analyzer) and
    # operators to get the anonymization output:
    result = engine.anonymize(
        text="My name is Bond, James Bond",
        analyzer_results=[
            RecognizerResult(entity_type="PERSON", start=11, end=15, score=0.8),
            RecognizerResult(entity_type="PERSON", start=17, end=27, score=0.8),
        ],
        operators={"PERSON": OperatorConfig("replace", {"new_value": "<ANONYMIZED>"})},
    )
    print(result)

Figure 4 - Person name anonymized by using replace operator

Encryption Example

Step 4 - The next example demonstrates how you can use the encrypt operator to encrypt identified sensitive data in the text. Again, for simplicity, I am using hard-coded recognizer results; however, you can have this information coming directly from the analyzer at runtime.
Also, I have the encryption key hard-coded; in your case, this information should come from Azure Key Vault: Introduction to Microsoft Spark utilities - Azure Synapse Analytics

    # Encryption Example
    encryption_key = "WmZq4t7w!z%C&F)J"  # in the real world, this will come from Azure Key Vault
    engine = AnonymizerEngine()

    # Invoke the anonymize function with the text,
    # analyzer results (potentially coming from presidio-analyzer)
    # and an 'encrypt' operator to get an encrypted anonymization output:
    anonymize_result = engine.anonymize(
        text="My name is Bond, James Bond",
        analyzer_results=[
            RecognizerResult(entity_type="PERSON", start=11, end=15, score=0.8),
            RecognizerResult(entity_type="PERSON", start=17, end=27, score=0.8),
        ],
        operators={"PERSON": OperatorConfig("encrypt", {"key": encryption_key})},
    )
    anonymize_result

Figure 5 - Person name anonymized by using encrypt operator

Decryption Example

Step 5 - Just as the encrypt operator encrypts identified private and sensitive data, the decrypt operator decrypts already-encrypted data, using the same key that was used during encryption.

    # Decryption Example
    # DeanonymizeEngine was not part of the earlier imports, so import it here.
    from presidio_anonymizer import DeanonymizeEngine

    # Initialize the engine:
    engine = DeanonymizeEngine()

    # Fetch the anonymized text from the result.
    anonymized_text = anonymize_result.text

    # Fetch the anonymized entities from the result.
    anonymized_entities = anonymize_result.items

    # Invoke the deanonymize function with the text, anonymizer results
    # and a 'decrypt' operator to get the original text as output.
    deanonymized_result = engine.deanonymize(
        text=anonymized_text,
        entities=anonymized_entities,
        operators={"DEFAULT": OperatorConfig("decrypt", {"key": encryption_key})},
    )
    deanonymized_result

Figure 6 - Person name decrypted by using decrypt operator

Spark Example – Using it with Dataframe and UDF

Apache Spark is a distributed data processing platform, and to use these libraries in Spark you can encapsulate your logic in a user-defined function (UDF).
Next, you can use that function to perform operations (anonymize, encrypt, decrypt, etc.) on the Spark dataframe, as demonstrated below for the replace operator.

    analyzer = AnalyzerEngine()
    anonymizer = AnonymizerEngine()

    # Broadcast both engines so each executor reuses one copy.
    broadcasted_analyzer = sc.broadcast(analyzer)
    broadcasted_anonymizer = sc.broadcast(anonymizer)

    # Define a per-value function and a pandas Series function over it.
    # Note that the analyzer and anonymizer are broadcasted.
    def anonymize_text(text: str) -> str:
        analyzer = broadcasted_analyzer.value
        anonymizer = broadcasted_anonymizer.value
        analyzer_results = analyzer.analyze(text=text, language="en")
        anonymized_results = anonymizer.anonymize(
            text=text,
            analyzer_results=analyzer_results,
            operators={"DEFAULT": OperatorConfig("replace", {"new_value": "<ANONYMIZED>"})},
        )
        return anonymized_results.text

    def anonymize_series(s: pd.Series) -> pd.Series:
        return s.apply(anonymize_text)

    # Register the Series function as a pandas UDF
    anonymize = pandas_udf(anonymize_series, returnType=StringType())

Figure 7 - Sample data with no encryption yet (for Email and IP Address)

Figure 8 - Sample data anonymized for Email and IP Address

What we have discussed so far barely scratches the surface. Presidio includes several samples for various kinds of scenarios. You can find more details here: Samples - Microsoft Presidio

Here is the FAQ: FAQ - Microsoft Presidio

Summary

As we look to have better control and governance for compliance, we are often tasked with encrypting, decrypting, masking, or anonymizing certain columns with private or sensitive information. In this blog post, I demonstrated how you can use the Microsoft Presidio library with the Spark pool of Azure Synapse Analytics to perform these operations on large volumes of data.
Our team will be publishing blogs regularly, and you can find all of them here: Azure Synapse Analytics Blog

For a deeper understanding of Synapse implementation best practices, please refer to our Success By Design (SBD) site: https://aka.ms/Synapse-Success-By-Design