Earlier posts in this series:
Part 1: Is my Chatbot Ready for Production? – A 10,000 foot overview to LLMOps
Part 2: How do I Evaluate my LLM Chatbot? - A guide to different LLM chatbot evaluation techniques and how to implement them
The LLMOps journey does not end when an application is deployed to Production. Continuous monitoring is needed to ensure security and performance, and it is crucial to a successful LLM-powered app. For this blog, think of monitoring as visibility into application and user behavior in near real-time while the application is in Production.
Foremost, monitoring is an important mechanism in application defense. There are many common attacks that an adversary, or even an unassuming user, can leverage to coax an AI into doing something outside of its guardrails. Attacks such as prompt injection and jailbreaking can lead to data leaks or other unplanned chatbot behaviors. Screening for these events (and taking action) in near real-time is the most effective way to prevent them.
Another reason to implement a robust monitoring system is to improve customer experience. Among other things, a monitoring system can be used to identify bottlenecks causing latency, compare actual usage to expected usage, and log user inputs and generated outputs for future analysis (Be on the lookout for Part 4 in this series – a deeper dive on Feedback!).
Monitor for Security
Monitoring your LLM-powered chatbot for security can be as simple as adding a step or two to your orchestration. Whether using LangChain, Semantic Kernel, Promptflow, or something else entirely, executing an intelligent and fast method before revealing responses to users could make all the difference.
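As a rough sketch of where such a step slots into the orchestration, the wrapper below screens the user's prompt before generation and the generated answer before it is revealed. The screening callables are placeholders for whatever checks you plug in (Content Safety, a blocklist, a hallucination detector, and so on).

```python
from typing import Callable

def guarded_invoke(
    generate: Callable[[str], str],        # any LangChain / Semantic Kernel / Promptflow call
    screen_input: Callable[[str], bool],   # placeholder: e.g., prompt-injection / jailbreak check
    screen_output: Callable[[str], bool],  # placeholder: e.g., content or hallucination check
    user_input: str,
) -> str:
    """Run generation only after the prompt passes screening, and screen the
    generated answer before revealing it to the user."""
    if not screen_input(user_input):
        return "Sorry, I can't help with that request."

    answer = generate(user_input)

    if not screen_output(answer):
        return "Sorry, I couldn't produce a safe answer for that request."

    return answer
```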
For example, Azure Content Safety is a free, lightweight, and constantly updated tool to use in these situations. Specifically, PromptShield uses a custom language model to detect adversarial user inputs or documents with hidden embedded instructions. Check out this sample repository that uses Promptflow + PromptShield to identify attacks... adding only 145 ms to the workflow! On top of PromptShield, Content Safety also lets you implement custom blocklists for text moderation and provides levers for screening generative output in additional categories. (Read more about Content Safety here)
Profile stats for a Sample RAG Chatbot, with PromptShield as the final step
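Below is a minimal sketch of calling PromptShield through the Content Safety REST API before a response is returned. The `text:shieldPrompt` operation, api-version, and response field names follow the public documentation at the time of writing, so verify them against the current docs; the endpoint and key come from your own Content Safety resource.

```python
import os
import requests

def prompt_attack_detected(user_prompt: str, documents: list[str] | None = None) -> bool:
    """Return True if PromptShield flags the prompt or any attached document."""
    endpoint = os.environ["CONTENT_SAFETY_ENDPOINT"]  # e.g. https://<name>.cognitiveservices.azure.com
    key = os.environ["CONTENT_SAFETY_KEY"]

    response = requests.post(
        f"{endpoint}/contentsafety/text:shieldPrompt",
        params={"api-version": "2024-09-01"},          # assumed version; check current docs
        headers={"Ocp-Apim-Subscription-Key": key},
        json={"userPrompt": user_prompt, "documents": documents or []},
        timeout=5,
    )
    response.raise_for_status()
    result = response.json()

    user_attack = result.get("userPromptAnalysis", {}).get("attackDetected", False)
    doc_attack = any(d.get("attackDetected", False) for d in result.get("documentsAnalysis", []))
    return user_attack or doc_attack
```

A helper like this could back the `screen_input` hook in the earlier wrapper sketch.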
Another potential area to monitor is hallucination in model output. Incorporating a call to RAGAS’s faithfulness metric or Vectara’s Hughes Hallucination Detection model is a low-latency way to identify potential hallucinations as they happen. This unlocks the ability to warn a user of a potential inaccuracy or hide the output from the user altogether, then log the pattern and take corrective action to improve the model.
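As one example of the second option, the snippet below scores an answer against its retrieved context with Vectara's open HHEM model, following the usage pattern shown on its Hugging Face model card. The exact loading and predict calls can differ between model versions, so treat this as a sketch and verify it locally; the threshold is an assumption to tune for your application.

```python
from transformers import AutoModelForSequenceClassification

# Open-weights hallucination detection model; requires trust_remote_code for its custom head.
hhem = AutoModelForSequenceClassification.from_pretrained(
    "vectara/hallucination_evaluation_model", trust_remote_code=True
)

def looks_grounded(context: str, answer: str, threshold: float = 0.5) -> bool:
    """Score how well `answer` is supported by `context`; low scores suggest hallucination."""
    score = hhem.predict([(context, answer)])[0]   # consistency score in [0, 1]
    return float(score) >= threshold
```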
Monitor for Performance
Techniques for monitoring performance vary slightly depending on the setup of the LLM powering the chatbot. However, the core goal remains the same – ensuring high end-user satisfaction by delivering a low-latency solution and collecting production input and output data to drive future improvement.
If using a Model-as-a-Service type architecture (such as Azure OpenAI), using an API Manager to direct traffic is important to avoid dreaded 429 errors. At scale, a Circuit Breaker architecture is the best way to ensure a good user experience. For data collection, logging inputs and outputs to a NoSQL datastore such as CosmosDB, or to a storage bucket, is a good way to save results for analysis.
Circuit Breaker Architecture in the Context of APIM + AOAI
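For the data-collection piece, a minimal sketch of logging each request/response pair to Cosmos DB might look like the following; the database, container, field names, and partition key are placeholders to adapt to your own schema.

```python
import os
import uuid
from datetime import datetime, timezone

from azure.cosmos import CosmosClient

client = CosmosClient(os.environ["COSMOS_ENDPOINT"], credential=os.environ["COSMOS_KEY"])
container = client.get_database_client("chatbot-logs").get_container_client("interactions")

def log_interaction(session_id: str, user_input: str, model_output: str, latency_ms: float) -> None:
    """Persist one chatbot turn for later analysis (assumes /sessionId is the partition key)."""
    container.create_item({
        "id": str(uuid.uuid4()),
        "sessionId": session_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "input": user_input,
        "output": model_output,
        "latencyMs": latency_ms,
    })
```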
If using a Model-as-a-Platform type architecture (i.e., running the model on VMs in your own environment), Azure Machine Learning Managed Online Endpoints can take care of data collection out of the box. Model Data Collection will monitor inputs, outputs, and other metadata for each request, then write that data to a bucket for cheap storage until the data is used later.
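A hedged sketch of turning on Model Data Collection with the Azure ML Python SDK v2 is shown below. The `DataCollector` and `DeploymentCollection` entities follow the documented pattern, but verify the names and fields against the azure-ai-ml version you have installed; the endpoint, model, and instance values are placeholders.

```python
from azure.ai.ml.entities import (
    DataCollector,
    DeploymentCollection,
    ManagedOnlineDeployment,
)

# Collect both request payloads and model responses for every call to the deployment.
collector = DataCollector(
    collections={
        "model_inputs": DeploymentCollection(enabled="true"),
        "model_outputs": DeploymentCollection(enabled="true"),
    }
)

deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name="chatbot-endpoint",     # placeholder endpoint name
    model="azureml:chatbot-model:1",      # placeholder model reference
    instance_type="Standard_DS3_v2",
    instance_count=1,
    data_collector=collector,             # collected data lands in workspace blob storage
)
# ml_client.online_deployments.begin_create_or_update(deployment) would then roll it out.
```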
In both cases, plugging the solutions into Azure Monitor + Metrics is the most effective way to track standard metrics such as requests per minute, CPU/GPU utilization, CPU/GPU memory utilization, errors, disk utilization, and others.
Azure Monitor Sample Reference Architecture
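As a small illustration, the sketch below pulls one standard metric for a resource with the azure-monitor-query SDK. The resource ID and metric name are placeholders that depend on whether the model sits behind Azure OpenAI, an AML online endpoint, or something else entirely.

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient

client = MetricsQueryClient(DefaultAzureCredential())

# Placeholder resource ID -- substitute your Azure OpenAI account or AML endpoint resource.
resource_id = (
    "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/"
    "Microsoft.CognitiveServices/accounts/<aoai-account>"
)

response = client.query_resource(
    resource_id,
    metric_names=["TotalCalls"],          # assumed metric name; varies by resource type
    timespan=timedelta(hours=1),
    granularity=timedelta(minutes=1),
)

# Print one data point per minute for the last hour.
for metric in response.metrics:
    for series in metric.timeseries:
        for point in series.data:
            print(point.timestamp, point.total)
```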
In summary, monitoring is an important part of the LLMOps process, but it does not need to be overbearing. The right amount of monitoring will strike a perfect balance between securing the application and enabling a positive user experience.