Guest MarcoCardoso Posted August 19, 2022

Synapse Link bridges an important scenario when dealing with Cosmos DB: the efficient processing of analytical workloads, without risking the integrity of the transactional applications supplying the data. To put this feature to the test, we used an accelerometer dataset for activity recognition (you can check it out here: UCI Machine Learning Repository: Activity Recognition from Single Chest-Mounted Accelerometer Data Set). Our goal is to perform analytical workloads - in this case, training an ML classification model - over data stored in Cosmos DB.

The problem

Suppose we would like to create an application that can predict, in real time, what an end user is doing based on accelerometer data. This has many applications, such as detecting when an elderly person requires help, or generating analytics for a fitness app. Whatever the reason behind it, our objective is to expose an endpoint that receives acceleration data as input and outputs the user's current activity - that is, standing, sitting, walking, etc.

Solution architecture

We propose the following architecture as a solution to this problem. We can break down the diagram into the following steps:

1. Some initial labeled data is loaded into Cosmos DB. This data may look something like this:

| Timestamp | ParticipantId | AccelerationX | AccelerationY | AccelerationZ | Activity |
|---|---|---|---|---|---|
| 00:00:00.0000 | 1 | 1222 | 1402 | 2037 | Walking |
| 00:00:00.0010 | 1 | 1245 | 1426 | 1956 | Walking |
| 00:00:00.0020 | 1 | 1142 | 1363 | 1986 | Walking |
| 00:00:00.0030 | 1 | … | … | … | Standing |
| … | … | … | … | … | … |

We call this labeled data because it contains labels - meaning, the actual activity being performed, as reported by the user. Since we know this activity to be correct, we also refer to this data as our ground truth.

2. Because Cosmos DB has Synapse Link enabled, this data is piped in real time to an analytical column store - in a way very similar to a storage account.
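For illustration, one labeled reading from the table above could be shaped into a Cosmos DB item like this (a sketch only; the `id` scheme, partition key choice, and exact field names are our assumptions, not taken from the repo):

```python
import json
import uuid

def to_cosmos_item(timestamp, participant_id, ax, ay, az, activity):
    """Shape one labeled accelerometer reading as a Cosmos DB document.

    Field names mirror the table above; the id/partition-key scheme
    is an assumption for illustration.
    """
    return {
        "id": str(uuid.uuid4()),
        "participantId": participant_id,   # a natural partition key
        "timestamp": timestamp,
        "accelerationX": ax,
        "accelerationY": ay,
        "accelerationZ": az,
        "activity": activity,              # the label, i.e. our ground truth
    }

item = to_cosmos_item("00:00:00.0000", 1, 1222, 1402, 2037, "Walking")
print(json.dumps(item, indent=2))
```

Once documents of this shape land in a Synapse Link-enabled container, the analytical store copy is maintained automatically - no ETL code is involved.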
3. Spark notebooks in Synapse use the analytical store to perform data science workloads. This includes aggregations, plots, joins, and other operations that would be either too costly or too slow against Cosmos DB directly. That's because Cosmos DB is optimized for transactional - not analytical - workloads.

4. Using the Azure ML SDK, the model is deployed to an Azure Machine Learning workspace. This makes it available for real-time predictions.

5. Now, our application can send unlabeled data to the endpoint in Azure ML and generate predictions - or labeled data. Here's what an application might send to the endpoint:

| Timestamp | AccelerationX | AccelerationY | AccelerationZ |
|---|---|---|---|
| 00:00:00.0000 | 1222 | 1402 | 2037 |
| 00:00:00.0010 | 1245 | 1426 | 1956 |
| 00:00:00.0020 | 1142 | 1363 | 1986 |
| 00:00:00.0030 | … | … | … |
| … | … | … | … |

And here's a possible output:

| Timestamp | Activity |
|---|---|
| 00:00:00.0000 | Running |
| 00:00:00.0010 | Running |
| 00:00:00.0020 | Going up stairs |
| 00:00:00.0030 | Going up stairs |
| … | … |

This labeled data allows for interesting insights: does the user need assistance? What does their day look like?

6. Optionally, we may now close the loop: by having a human review the predictions, we generate more ground truth, which in turn improves our training.

Procedure

The first thing we did was spin up the infrastructure. This includes:

- Cosmos DB
- Synapse Analytics workspace
- Key Vault - for storing secrets for Synapse
- Storage Account - Synapse's default storage
- Azure Machine Learning

You can check out the ARM templates in the IaC/ directory of our repo, linked at the end of the article.

Next, we needed to load our ground truth into Cosmos DB. We used a Synapse Pipeline for this.

Once the data is in Cosmos DB, we can use Synapse Spark to load it into a data frame and train our model. Check out the notebook for more instructions on how this is done. The notebook performs exploratory data analysis, trains and evaluates a model, and deploys it to Azure Machine Learning.
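The repo's notebook does the real training with Spark and deploys via the Azure ML SDK. As a self-contained sketch of the labeled-data-to-prediction flow it enables, here is a toy nearest-centroid classifier in plain Python (the data values and function names are illustrative only, not the model used in the notebook):

```python
from collections import defaultdict
import math

def train_centroids(labeled_rows):
    """Average the (x, y, z) acceleration vectors per activity label."""
    sums = defaultdict(lambda: [0.0, 0.0, 0.0, 0])
    for x, y, z, activity in labeled_rows:
        s = sums[activity]
        s[0] += x; s[1] += y; s[2] += z; s[3] += 1
    return {a: (s[0] / s[3], s[1] / s[3], s[2] / s[3]) for a, s in sums.items()}

def predict(centroids, x, y, z):
    """Label an unlabeled reading with its nearest centroid's activity."""
    return min(centroids, key=lambda a: math.dist(centroids[a], (x, y, z)))

# Toy ground truth, in the spirit of the dataset (values made up).
labeled = [
    (1222, 1402, 2037, "Walking"),
    (1245, 1426, 1956, "Walking"),
    (1980, 1610, 1500, "Standing"),
    (1975, 1630, 1490, "Standing"),
]
centroids = train_centroids(labeled)
print(predict(centroids, 1230, 1410, 2000))  # → Walking
```

In the real solution, this training step runs in a Synapse Spark notebook against the analytical store, so the transactional side of Cosmos DB never sees the query load.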
Once it's there, you can test your deployment with some sample data to verify that it's working.

Conclusion

As a result, we were able to perform predictive analytics on a dataset in Cosmos DB without having to create any additional storage or ETL processes. Everything is taken care of and optimized by Microsoft through Synapse Link. But don't take my word for it - go check out our repository, with instructions on how to run this solution in your own subscription, at https://github.com/MarcoABCardoso/2b1-luti-marco

Next steps

Want to know more about how all of this works? Stay tuned for more posts as we discuss the configuration and best practices for each component in this architecture!
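As a footnote on the testing step above: a scoring request for the deployed endpoint could be shaped like this (a sketch; Azure ML's exact input schema depends on the scoring script, so the `data` field layout here is an assumption):

```python
import json

# Unlabeled sample readings, shaped like the input table above.
sample = {
    "data": [
        {"Timestamp": "00:00:00.0000", "AccelerationX": 1222,
         "AccelerationY": 1402, "AccelerationZ": 2037},
        {"Timestamp": "00:00:00.0010", "AccelerationX": 1245,
         "AccelerationY": 1426, "AccelerationZ": 1956},
    ]
}

body = json.dumps(sample)
print(body)
# POST `body` to the endpoint's scoring URI with an
# "Authorization: Bearer <key>" header to get predicted activities back.
```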