We’re thrilled to announce that Azure Load Balancer now supports health event logs! These new logs are published to the Azure Monitor resource log category LoadBalancerHealthEvent and are intended to help you monitor and troubleshoot your load balancer resources.
As part of this public preview, you can now receive the following 5 health event types when the associated conditions are met. These health event types are targeted to address the top issues that could affect your load balancer’s health and availability:
| LoadBalancerHealthEventType | Scenario |
| --- | --- |
| DataPathAvailabilityWarning | Detect when the Data Path Availability metric of the frontend IP is less than 90% due to platform issues |
| DataPathAvailabilityCritical | Detect when the Data Path Availability metric of the frontend IP is less than 25% due to platform issues |
| NoHealthyBackends | Detect when all backend instances in a pool are not responding to the configured health probes |
| HighSnatPortUsage | Detect when a backend instance utilizes more than 75% of its allocated ports from a single frontend IP |
| SnatPortExhaustion | Detect when a backend instance has exhausted all allocated ports and will fail further outbound connections until ports have been released or more ports are allocated |
What can I do with Azure Load Balancer health event logs?
- Create a diagnostic setting to archive or analyze these logs
- Use Log Analytics querying capabilities (see the sample query sketch below)
- Configure an alert to trigger an action based on the generated logs
Pictured above is a sample load balancer health event log in the Azure portal.
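For instance, once the logs are routed to a Log Analytics workspace, they can be queried programmatically as well as in the portal. The following is a minimal sketch using the azure-monitor-query Python package; the workspace ID is a placeholder, and the ALBHealthEvent table and column names are assumptions to confirm against the schema in your own workspace.

```python
# Minimal sketch: query recent load balancer health events from a Log Analytics
# workspace (assumes a diagnostic setting already routes the
# LoadBalancerHealthEvent category there).
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient, LogsQueryStatus

WORKSPACE_ID = "<log-analytics-workspace-guid>"  # placeholder

# Table and column names are assumptions; confirm the schema in your workspace.
QUERY = """
ALBHealthEvent
| where TimeGenerated > ago(1d)
| project TimeGenerated, HealthEventType, Severity, Description
| order by TimeGenerated desc
"""

client = LogsQueryClient(DefaultAzureCredential())
response = client.query_workspace(WORKSPACE_ID, QUERY, timespan=timedelta(days=1))

if response.status == LogsQueryStatus.SUCCESS:
    for table in response.tables:
        for row in table.rows:
            print(dict(zip(table.columns, row)))
else:
    # Partial results or failures are possible; inspect the response for details.
    print("Query did not fully succeed:", response.status)
```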
Why should I use health event logs?
Health events not only give you more insight into the health of your load balancer, they also mean you no longer have to pick thresholds for metric-based alerts or store hard-to-parse metric data just to identify historical impact to your load balancer resources.
As an example, let’s take a look at how customers monitored their outbound connectivity health before health event logs were available.
Previously in Azure…
Context
Contoso is using a Standard Public Load Balancer with outbound rules so that their application can connect to public APIs when needed. Following the recommended guidance, they have configured their outbound rules with a dedicated public IP used only for outbound connections and have ensured that the backend instances can fully utilize the 64k available ports by selecting manual port allocation. For their load balancers, they anticipate having at most 8 backend instances in a pool at any given time, so they allocate 8k ports to each backend instance using an outbound rule.
Problem
However, Contoso is still concerned about the risk of SNAT port exhaustion. They also aren’t sure how much traffic they anticipate receiving, or what their traffic patterns will look like. As a result, they want to create an alert to warn them in advance if it looks like any backend instances are close to consuming all of the allocated SNAT ports.
Alerting with metrics
Using the Used SNAT ports metric, they create an alert that triggers when the metric value exceeds 6k ports, indicating that 75% of the 8k allocated ports have been used. This works until they receive the alert and decide to add another public IP, doubling the number of allocated ports per backend instance. Now Contoso needs to update their alert to trigger when the metric value exceeds 12k ports instead.
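To make the maintenance burden concrete, here is a tiny illustrative sketch (plain Python, not Azure SDK code) showing how a static metric-alert threshold is tied to the current port allocation and has to be recomputed whenever that allocation changes; the 75% ratio and port counts come from the scenario above.

```python
# Illustration only: with metric-based alerting, the alert threshold is a
# function of the current SNAT port allocation, so the rule must be edited
# whenever that allocation changes.
WARNING_RATIO = 0.75  # alert once 75% of allocated ports are in use

def used_snat_ports_threshold(allocated_ports_per_instance: int) -> int:
    return int(allocated_ports_per_instance * WARNING_RATIO)

print(used_snat_ports_threshold(8_000))   # 6000  -> the original alert rule
print(used_snat_ports_threshold(16_000))  # 12000 -> needed after a second
                                          #          public IP doubles allocation
```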
Now: with the HighSnatPortUsage and SnatPortExhaustion events…
The team at Contoso learns about Load Balancer’s new health event logs and decides to configure two alerts:
- Send an email and create an incident whenever the HighSnatPortUsage event is generated, warning their network engineers that more SNAT ports may need to be allocated to the load balancer’s backend instances
- Notify the on-call engineer whenever the SnatPortExhaustion event is generated, to immediately address any potentially critical impact to their applications (a sketch of the underlying event query follows this list)
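Both alerts boil down to a condition over the new health event logs. The sketch below illustrates that condition as a query plus simple routing logic, assuming the logs land in a Log Analytics table named ALBHealthEvent with a HealthEventType column (both names are assumptions, as are the notification helpers); in practice these conditions would be expressed as Azure Monitor log search alert rules rather than a polling script.

```python
# Illustrative sketch of the two alert conditions: query the SNAT-related health
# events and route them to different notification paths. Table/column names and
# the notify helpers are assumptions; in production these conditions would live
# in Azure Monitor alert rules rather than a polling script.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient, LogsQueryStatus

WORKSPACE_ID = "<log-analytics-workspace-guid>"  # placeholder

QUERY = """
ALBHealthEvent
| where TimeGenerated > ago(5m)
| where HealthEventType in ("HighSnatPortUsage", "SnatPortExhaustion")
| project TimeGenerated, HealthEventType, Description
"""

def warn_network_engineers(event: dict) -> None:
    # Hypothetical helper: e-mail the team and open an incident.
    print("WARN:", event)

def page_on_call_engineer(event: dict) -> None:
    # Hypothetical helper: page the on-call engineer immediately.
    print("PAGE:", event)

client = LogsQueryClient(DefaultAzureCredential())
response = client.query_workspace(WORKSPACE_ID, QUERY, timespan=timedelta(minutes=5))

if response.status == LogsQueryStatus.SUCCESS:
    for table in response.tables:
        for row in table.rows:
            event = dict(zip(table.columns, row))
            if event.get("HealthEventType") == "SnatPortExhaustion":
                page_on_call_engineer(event)
            else:
                warn_network_engineers(event)
```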
Now, even when more ports are allocated, Contoso doesn’t have to worry about readjusting their alert rules.
What’s next?
As part of this public preview announcement, we’re ushering in a new era of health and monitoring improvements for Azure Load Balancer. These five health event types are just the start of empowering you to identify, troubleshoot, and resolve issues related to your resources as quickly as possible.
Stay tuned as we add additional health event types to cover other types of scenarios, ranging from providing configuration guidance and best practices, to surfacing warnings when you’re approaching service-related limits.
Feel free to share feedback by leaving comments on this Azure Feedback post; we look forward to hearing from you and are excited for you to try out health event logs.
Get started
Load balancer health event logs are now rolling out to all Azure public regions. For more information on the current regional availability, along with more about these logs and how to start collecting and troubleshooting them, take a look at our public documentation.
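As one illustration of how collection could be wired up, the sketch below creates a diagnostic setting that routes the LoadBalancerHealthEvent category to a Log Analytics workspace using the azure-mgmt-monitor Python package; the resource IDs are placeholders, and the exact settings should be confirmed against the public documentation.

```python
# Minimal sketch: create a diagnostic setting that sends the
# LoadBalancerHealthEvent log category to a Log Analytics workspace.
# All resource IDs below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient
from azure.mgmt.monitor.models import DiagnosticSettingsResource, LogSettings

SUBSCRIPTION_ID = "<subscription-id>"
LB_RESOURCE_ID = (
    "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
    "/providers/Microsoft.Network/loadBalancers/<load-balancer-name>"
)
WORKSPACE_RESOURCE_ID = (
    "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
    "/providers/Microsoft.OperationalInsights/workspaces/<workspace-name>"
)

client = MonitorManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

client.diagnostic_settings.create_or_update(
    resource_uri=LB_RESOURCE_ID,
    name="lb-health-event-logs",
    parameters=DiagnosticSettingsResource(
        workspace_id=WORKSPACE_RESOURCE_ID,
        logs=[LogSettings(category="LoadBalancerHealthEvent", enabled=True)],
    ),
)
```

From there, the events can be queried, alerted on, or archived as described above.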