VamsiKalapala
Introduction
Advanced Network Observability is the inaugural feature of the Advanced Container Networking Services (ACNS) suite, bringing the power of Hubble’s control plane to both Cilium and non-Cilium Linux data planes. It unlocks Hubble metrics, the Hubble command line interface (CLI), and the Hubble user interface (UI) on your AKS clusters, providing deep insights into your containerized workloads. Advanced Network Observability empowers customers to precisely detect and root-cause network-related issues in a Kubernetes cluster.
Prerequisites
This blog focuses on an Azure Kubernetes Service (AKS) cluster with ACNS, Azure Managed Prometheus, and Azure Managed Grafana enabled.
Before setting up AKS, ensure that you have an Azure account and subscription with permissions that allow you to create resource groups and deploy AKS clusters. Follow the instructions in this guide to set up an AKS cluster and run the scenarios below.
High-level steps (a CLI sketch follows this list):
- Create an AKS cluster
- Enable Advanced Container Networking Services on this cluster
- Create and attach Azure Managed Prometheus and Grafana
- Install the Hubble CLI on your local machine following these instructions.
- Deploy the Hubble UI component on this cluster following these instructions.
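For reference, here is a minimal sketch of these steps using the Azure CLI. The resource names, region, version, and resource IDs below are placeholders, and flag availability can vary by CLI version, so treat the linked guides as authoritative.

```bash
# Create a resource group and an AKS cluster with ACNS enabled
# (names and region are placeholders).
az group create --name myResourceGroup --location eastus

az aks create \
  --name myAKSCluster \
  --resource-group myResourceGroup \
  --generate-ssh-keys \
  --network-plugin azure \
  --enable-acns

# Attach Azure Managed Prometheus and Grafana
# (replace the resource IDs with your own workspace and Grafana instance).
az aks update \
  --name myAKSCluster \
  --resource-group myResourceGroup \
  --enable-azure-monitor-metrics \
  --azure-monitor-workspace-resource-id <azure-monitor-workspace-resource-id> \
  --grafana-resource-id <grafana-resource-id>

# Fetch credentials so kubectl and the Hubble CLI can reach the cluster.
az aks get-credentials --name myAKSCluster --resource-group myResourceGroup

# Install the Hubble CLI (Linux amd64 shown; the version is a placeholder,
# see the linked instructions for the recommended one).
HUBBLE_VERSION=v0.13.0
curl -L --fail --remote-name-all \
  "https://github.com/cilium/hubble/releases/download/${HUBBLE_VERSION}/hubble-linux-amd64.tar.gz"
sudo tar -xzvf hubble-linux-amd64.tar.gz -C /usr/local/bin hubble
```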
Concepts
Cilium: Cilium is an open source, cloud native solution for providing, securing, and observing network connectivity between workloads, fueled by the revolutionary kernel technology eBPF.
Hubble: Hubble is a fully distributed networking and security observability platform. It is built on top of Cilium and eBPF to enable deep visibility into the communication and behavior of services as well as the networking infrastructure in a completely transparent manner.
Retina: Retina is a cloud-agnostic, open-source, eBPF-based Kubernetes network observability platform. It is the technology behind Advanced Network Observability on non-Cilium Linux nodes.
Customer Scenario 1: Intermittent Domain Name System (DNS) failures
Ruling out DNS issues is the first step in troubleshooting any major network issue. Powerful, pod-level visibility into DNS requests and responses enables faster incident resolution and cloud cost optimization. With Advanced Network Observability, customers can not only view requests and responses by type and fully qualified domain name (FQDN), but also see the error codes returned for requests, the IP addresses returned in response to a given request, and much more.
Retina uses eBPF programs to examine every DNS request and response packet in the Linux kernel and to pass the packet and its metadata to a user-space program, where the metadata is further processed to extract the IPs returned in DNS response packets. All of this metadata is used to produce relevant metrics that show status at both the node level and the pod level.
This is an example of how advanced metrics can help you. DNS latency, errors, and timeouts are hard to troubleshoot and can cause severe application issues, but our dashboards make it easier for DevOps engineers to detect and fix DNS problems. The dashboard panel below shows a sudden rise in missing DNS responses within the cluster, the most common DNS errors, and which nodes have the most errors.
The dashboard shows a summary of all DNS activity in the cluster – which kinds of queries lack responses, the most common query, and the most common response. All of this information can help administrators spot potential usage and security problems and act to mitigate them.
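As a hedged illustration of what backs these panels, the same signals can be queried from any Prometheus-compatible endpoint. The metric and label names below (hubble_dns_queries_total, hubble_dns_responses_total, qtypes, rcode) are standard Hubble DNS metrics but are assumptions about the exact names exported in your cluster, and PROM_URL is a placeholder for your query endpoint.

```bash
# PROM_URL is a placeholder for your Prometheus-compatible query endpoint
# (for Azure Managed Prometheus, the Azure Monitor workspace query endpoint).
PROM_URL='http://localhost:9090'

# DNS queries per second, broken down by query type.
curl -sG "$PROM_URL/api/v1/query" \
  --data-urlencode 'query=sum(rate(hubble_dns_queries_total[5m])) by (qtypes)'

# Responses by return code, surfacing errors such as NXDOMAIN.
curl -sG "$PROM_URL/api/v1/query" \
  --data-urlencode 'query=sum(rate(hubble_dns_responses_total[5m])) by (rcode)'

# Queries minus responses; a growing gap points at missing DNS responses.
curl -sG "$PROM_URL/api/v1/query" \
  --data-urlencode 'query=sum(rate(hubble_dns_queries_total[5m])) - sum(rate(hubble_dns_responses_total[5m]))'
```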
Customer Scenario 2: Network Policy Drops at Pod level
Debugging network policies in large, intricate clusters with multiple namespaces can be a daunting task, especially when there are numerous network policies per namespace. To address this challenge, the network policy addon leverages eBPF in Linux to collect crucial information about dropped packets. By attaching kprobes at various critical locations in the Linux kernel, such as the netfilter drop function and the netfilter nat function, the network policy addon effectively determines if a packet is being dropped.
When a dropped packet is detected, the associated eBPF programs generate an event that includes packet metadata, along with the drop reason and location. This event is then processed by a userspace program, which parses the data and converts it into Prometheus metrics. These metrics offer valuable insights into the dropped packets, aiding in the identification and resolution of network policy configuration issues.
Let’s walk through an example and see how pod-level metrics and flows can help debug packet drops in a cluster. Below is a snapshot of a workload running in an AKS cluster. The panel shows a heatmap of the pods running as part of a deployment and the number of packets originating from those pods that are being dropped. This panel is very useful because it lets administrators know there is a problem in real time, and which pods are impacted. The panel also immediately shows the reason for the drops – “policy_denied” – indicating the drops are happening because of a network policy applied in the cluster.
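To sketch what backs a panel like this, the drop metrics can be queried directly. The metric name hubble_drop_total and its reason label are standard Hubble metrics; the exact pod-level metric names ACNS exports may differ, so treat these as assumptions.

```bash
PROM_URL='http://localhost:9090'   # placeholder for your Prometheus query endpoint

# Packet drops per second, grouped by drop reason (for example POLICY_DENIED).
curl -sG "$PROM_URL/api/v1/query" \
  --data-urlencode 'query=sum(rate(hubble_drop_total[5m])) by (reason)'

# Alongside the dashboard, list the network policies that could be responsible.
kubectl get networkpolicies --all-namespaces
```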
To dig deeper, we can leverage the Hubble CLI tool to inspect flows in real time. The snapshot below shows how we can filter traffic by namespace and type. The Hubble CLI shows the source and destination pods of the packets being dropped, helping us narrow down the policy even further.
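A hedged sketch of such a session is below; the hubble-relay service name and port follow the AKS setup guide, and TLS configuration for the CLI may also be required per that guide.

```bash
# Forward the Hubble relay locally so the CLI can reach it
# (service name/port follow the AKS setup guide; verify in your cluster).
kubectl port-forward -n kube-system svc/hubble-relay --address 127.0.0.1 4245:443 &

# Watch only dropped flows in the agnhost namespace, in real time.
hubble observe --namespace agnhost --verdict DROPPED --follow

# Alternatively, filter by event type; drop events include source and
# destination pod names, which help narrow down the offending policy.
hubble observe --namespace agnhost --type drop
```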
Another tool users can leverage is the Hubble UI, which shows the traffic flows occurring in a namespace. Below, we see the pods in the agnhost namespace interacting with other pods in the same namespace as well as with pods in different namespaces; the namespace is also receiving packets from outside the cluster. The UI also shows which packets are getting dropped, with details that include source and destination pod names as well as pod and namespace labels. Using this information, we can dig through the network policies applied in the cluster and quickly identify the offending policy.
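Reaching the UI is typically a single port-forward away (the service name and port follow the AKS Hubble UI guide; verify them in your cluster):

```bash
# Expose the Hubble UI locally, then browse to http://localhost:12000.
kubectl port-forward -n kube-system svc/hubble-ui 12000:80
```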
Customer Scenario 3: Imbalance of traffic for pods within a workload
Pods fronted by a service expect an even distribution of traffic when requests reach the service. However, that may not always be the case. Faulty settings can introduce subtle distribution bugs, and these may only manifest as degraded application performance even after the workload is scaled up.
Retina deploys eBPF programs that attach at various interfaces in the Linux kernel and observe all TCP/UDP packets flowing through the node. This allows Retina to generate rich pod-level L4 metrics that can show, among other things, the traffic distribution among all the pods under a workload (a deployment, for example).
The panel below shows a heatmap of the incoming and outgoing traffic of the pods under a workload. As is evident, one of the three pods is receiving a higher volume of traffic than the other two. Administrators can be proactive and mitigate this issue before application performance degrades and impacts end users.
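As a rough sketch, a per-pod traffic query like the one below would reveal such an imbalance. The metric and label names (networkobservability_adv_forward_bytes, direction, podname) are assumptions about the pod-level L4 metrics exported in your cluster; check your scrape targets for the exact names.

```bash
PROM_URL='http://localhost:9090'   # placeholder for your Prometheus query endpoint

# Ingress bytes per second for each pod of the workload; an uneven spread
# across pods signals a load-distribution problem. Metric/label names here
# are assumptions; confirm the exact names exposed in your cluster.
curl -sG "$PROM_URL/api/v1/query" \
  --data-urlencode 'query=sum(rate(networkobservability_adv_forward_bytes{direction="ingress"}[5m])) by (podname)'
```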
Conclusion:
ACNS with Advanced Network Observability enables deep insights into container networks and enhances the operability of AKS. This blog has explored the feature through real-world customer scenarios, demonstrating how it helps tackle common network challenges. We’d also love to hear how enhanced observability could make your deployment scenarios easier; let us know in a comment below.
Resources:
- For more information about ACNS, see Advanced Container Networking Services for the Azure Kubernetes Service (AKS) - Azure Kubernetes Service.
- To set up Advanced Network Observability, see Set up Advanced Network Observability for Azure Kubernetes Service (AKS) - Azure managed Prometheus and Grafana - Azure Kubernetes Service.
- For ACNS pricing, see Advanced Container Networking Services - Pricing | Microsoft Azure.