danielbates
Microsoft Defender for Containers secures Kubernetes clusters deployed in Azure, AWS, GCP, or on-premises using sensor data, audit logs and security events, control plane configuration information, and Azure Policy enforcement. In this blog, we'll take a look at Azure Policy for Kubernetes and explore the Gatekeeper engine that is responsible for policy enforcement on the cluster.
Each Kubernetes environment is architected differently, but Azure Policy is enforced the same way across Azure Kubernetes Service (AKS), Amazon Elastic Kubernetes Service (EKS) in AWS, Google Kubernetes Engine (GKE) in GCP, and on-premises or IaaS. Defender for Containers uses an open-source framework called Gatekeeper to deploy safeguards and enforcements at scale. We'll get into what Gatekeeper is in a moment, but first, let's orient ourselves with a simplified reference architecture for AKS.
Every Kubernetes environment has two main components: the control plane, which provides the core Kubernetes services for orchestration, and the nodes, which house the infrastructure that runs the applications themselves. In Azure managed clusters, the control plane includes the following components:
- An API server named kube-apiserver which exposes the Kubernetes API and acts as the front end for the control plane
- A scheduler named kube-scheduler which assigns newly created pods to available nodes based on scheduling criteria such as resource requirements, affinity and anti-affinity, and so on
- A controller manager named kube-controller-manager which responds to node health events and other tasks
- A key-value store named etcd which backs all cluster data
- A cloud controller manager, logically named cloud-controller-manager, that links the cluster into Azure (this is the primary difference between Kubernetes on-premises and any cloud-managed Kubernetes)
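In a managed AKS cluster these components run on Microsoft-managed infrastructure, so you won't see them as nodes or pods in your cluster. You can still confirm that the API server and its internal components are healthy by querying the health endpoints directly (assuming your credentials are permitted to call raw non-resource URLs):

```shell
# Ask the API server for its readiness status; ?verbose lists the individual
# checks it runs internally (etcd, admission plugin post-start hooks, and so on)
kubectl get --raw='/readyz?verbose'
```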
We look to the API server when we need to enforce and validate a policy. For example, let's say we want to set limits on container CPU and memory usage. This is a good idea to protect against resource exhaustion attacks, and it's generally good practice to set resource limits on cloud compute anyway. This configuration lives in the resources section of the container spec in the deployment's YAML template.
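A minimal sketch of such a container spec (hypothetical names and image), with the resources block omitted entirely, might look like this:

```yaml
# Illustrative Deployment manifest - note that spec.containers[].resources
# is missing, so the container has no CPU or memory limit at all
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-nolimit
spec:
  replicas: 1
  selector:
    matchLabels:
      app: web-nolimit
  template:
    metadata:
      labels:
        app: web-nolimit
    spec:
      containers:
        - name: web
          image: nginx:1.25
          ports:
            - containerPort: 80
          # resources:
          #   limits:
          #     cpu: "500m"
          #     memory: "256Mi"
```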
In this case, I didn't specify any limit on CPU or memory usage for this container. Defender for Cloud will flag this as a recommendation that we can delegate, remediate, automate via a Logic App, or deny outright:
It's not hard to imagine how Defender for Cloud can identify affected containers - it's simply looking for quota values populated in the container spec. But Defender for Cloud is also giving us the option to enforce this recommendation by denying the deployment of any container with no specified resource limit. How does this work? To answer this, we need to dive into Gatekeeper.
Defender for Containers enforces Azure Policy through an add-on called Azure Policy for Kubernetes. This is deployed as an Arc-enabled Kubernetes extension in AWS, GCP, and on-premises environments and as a native AKS add-on in Azure. The add-on is powered by Gatekeeper, deployed as pods on the cluster's nodes.
Gatekeeper is a widely deployed solution that allows us to decouple policy decisions from the Kubernetes API server. Our built-in and custom benchmark policies are translated into "CustomResourceDefinition" (CRD) policies that are executed by Gatekeeper's policy engine. Kubernetes includes admission controllers that can view and/or modify authenticated, authorized requests to create, modify, and delete objects in the Kubernetes environment. There are dozens of admission controllers in the Kubernetes API server, but there are two that we specifically rely on for Gatekeeper enforcement. First, the MutatingAdmissionWebhook is a controller that calls mutating webhooks - in serial, one after another - to read and modify the pending request. Second, the ValidatingAdmissionWebhook controller goes into action during the final validation phase of the operation and calls validating webhooks in parallel to inspect the request. A validating webhook can reject the request which will deny creation, modification, or deletion of the resource. Because the validating controller is invoked after all object modifications are complete, we use validating admission webhooks to guarantee that we are inspecting the final state of an object.
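Gatekeeper registers itself with both of these controllers through ordinary webhook configuration resources, so you can see exactly where it hooks into the API server. Exact resource names vary by installation, but a typical Gatekeeper deployment creates one of each:

```shell
# Webhook registrations that tell the API server to call Gatekeeper
# during the mutating and validating admission phases
kubectl get mutatingwebhookconfigurations
kubectl get validatingwebhookconfigurations
```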
Gatekeeper has several components called "operations" that can be deployed into one monolithic pod or as multiple individual pods in a service-oriented architecture. The Azure Policy add-on deploys Gatekeeper's operations individually in three pods:
- The audit process, which evaluates and reports policy violations on existing resources (this should always run as a singleton pod to avoid contention and prevent overburdening the API server)
- The validating webhook, and
- The mutating webhook.
You can see these pods in your cluster by filtering on the 'gatekeeper.sh/system' label:
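(The add-on typically places these pods in the gatekeeper-system namespace.)

```shell
# List every pod carrying the gatekeeper.sh/system label, across all namespaces
kubectl get pods -A -l gatekeeper.sh/system
```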
Here we can see one gatekeeper-audit pod and two gatekeeper-controller pods. Note that the two webhook pods are not distinguished by function - we'll encounter this later on when we view logs from the mutating admission controller. Running these operations in different pods allows for horizontal scaling on the webhooks and enables operational resilience among the three components.
In our earlier example, we wanted to deny the creation of any container that doesn't have CPU and/or memory usage limits defined in its container spec. Defender for Containers will use Gatekeeper's validating admission webhook to reject any misconfigured requests at the API server. But what if we wanted to take some other action - for instance, if we were rolling out a new policy and wanted to audit compliance rather than directly move into enforcement? Or what if we want to exempt certain namespaces or labels from a policy rule? For this, we will need to explore parameters and effects.
First, let's find our policy definition in the Azure portal by navigating to Microsoft Defender for Cloud > Environment settings and opening the Security Policies in the settings for our Azure subscription. Our built-in policy definitions come from the default Microsoft Cloud Security Benchmark which contains 240 recommendations covering all Defender for Cloud workload protections. Filtering on a keyword will surface our policy definition:
Click the ellipsis at the right of the definition to view the context menu. Select "Manage effect and parameters" to open a configuration panel with several options:
First, let's talk about the policy effects. Sorted by their order of evaluation from first to last, we have:
- Disabled - this will prevent rule evaluation throughout this subscription.
- Deny - this will block creation of a new resource that fails the policy. (Note that it will not remove existing resources that have already been deployed.)
- Audit - this will generate an alert but not block resource creation. Audit is evaluated after Deny to prevent double-logging of an undesired resource.
What about the additional parameters? Our policy rule allows us to set rule parameters such as the maximum allowed memory and CPU values, as well as exclude namespaces from monitoring, select labels for monitoring, and exclude images from all container policy inspection. This configuration block is critical for managing exemptions, such as containers that should be allowed to run as root or similar scenarios. Several Kubernetes namespaces - kube-system, gatekeeper-system, and azure-arc among them - are excluded from these policy definitions by default.
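As a rough sketch, here is how those parameters might be supplied when assigning the policy from the Azure CLI. The parameter names and the placeholder definition ID below are illustrative and should be checked against the actual policy definition before use:

```shell
# Illustrative assignment of the container-limits policy with custom parameters
az policy assignment create \
  --name 'k8s-container-limits' \
  --scope '/subscriptions/<subscription-id>' \
  --policy '<container-limits-policy-definition-id>' \
  --params '{
    "effect":             { "value": "Deny" },
    "cpuLimit":           { "value": "500m" },
    "memoryLimit":        { "value": "512Mi" },
    "excludedNamespaces": { "value": ["kube-system", "gatekeeper-system", "azure-arc"] }
  }'
```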
If we inspect the policy itself, we will see its execution logic. Of particular interest is the "templateInfo" section in lines 178-181:
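The fields of interest look roughly like this (reconstructed here rather than copied verbatim; the definition itself is the authoritative source):

```json
"templateInfo": {
  "sourceType": "PublicURL",
  "url": "https://store.policy.core.windows.net/kubernetes/container-resource-limits/v3/template.yaml"
}
```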
This invokes the URI for the CustomResourceDefinition (CRD), a YAML file that describes the schema of the constraint and specifies the actual constraint logic in the Rego declarative language. In our example, the CRD is located at
https://store.policy.core.windows.net/kubernetes/container-resource-limits/v3/template.yaml - check it out to see what the Gatekeeper engine is actually executing in the cluster control plane!
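If you'd rather not open the link, the general shape of a Gatekeeper constraint template is shown below. This is a simplified, hypothetical illustration of the pattern - a parameter schema plus Rego logic - not a copy of the Microsoft-hosted file:

```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlimits
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLimits
      validation:
        # Parameters that constraints built from this template can set
        openAPIV3Schema:
          type: object
          properties:
            cpuLimit:
              type: string
            memoryLimit:
              type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlimits

        # Flag any container in the admission request that declares no limits
        violation[{"msg": msg}] {
          container := input.review.object.spec.containers[_]
          not container.resources.limits
          msg := sprintf("container <%v> has no resource limits", [container.name])
        }
```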
You might have noticed that our Azure policy effects of "audit" and "deny" map directly to the validating admission webhook, which can check resource create/modify/delete requests against our policy configuration. What about the other Gatekeeper component, the mutating admission webhook? Instead of simply rejecting creation of a container that is missing a resource usage quota, we could dynamically edit the API request to set our own limit and allow the container to spawn. Let's check out another built-in Azure policy definition to see this one in action.
First, let's take a look at the policy reference list from the AKS documentation. Search or scroll down to find a policy named "[Preview]: Sets Kubernetes cluster containers CPU limits to default values in case not present." The documentation includes links to the Azure portal (login required) and the JSON source code for the definition in the Azure-Policy GitHub, currently at version 1.2.0-preview as of the date of this blog post. Let's click into the Azure portal where we can view the policy definition and assign it to our Kubernetes cluster. Notice our available effects - instead of "Audit" and "Deny", we now have "Mutate":
The linked CRD (line 64) is a short one, assigning a limit of "500m" if not present:
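A simplified sketch of such a Gatekeeper Assign mutator (illustrative, not a verbatim copy of the hosted file):

```yaml
apiVersion: mutations.gatekeeper.sh/v1
kind: Assign
metadata:
  name: default-cpu-limit
spec:
  applyTo:
    - groups: [""]
      kinds: ["Pod"]
      versions: ["v1"]
  # Path to mutate: the CPU limit of every container in the pod
  location: "spec.containers[name:*].resources.limits.cpu"
  parameters:
    # Only assign the value when no CPU limit is already present
    pathTests:
      - subPath: "spec.containers[name:*].resources.limits.cpu"
        condition: MustNotExist
    assign:
      value: "500m"
```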
(Direct link: https://store.policy.core.windows.net/kubernetes/mutate-resource-cpu-limits/v1/mutation.yaml)
We can assign the policy to the tenant, subscription, or resource group(s) in our environment, set exclusions, and optionally configure resource selectors and overrides to customize the rollout of this policy. Once the policy is assigned, we may need to wait up to 15 minutes for the Azure Policy add-on to pull the change; when it does, the add-on will add the appropriate constraint template and constraints to the policy engine. On the same 15-minute timer, the add-on executes a full scan of the cluster using the Audit operation.
Let's connect to our Kubernetes cluster and run some commands to validate our new mutate-effect policy. First, we'll need to set up kubeconfig by setting subscription context and saving credentials for our cluster. Follow the instructions in the documentation and check by running 'kubectl cluster-info' to validate that the shell is connected correctly:
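On AKS, that typically looks like the following (the subscription, resource group, and cluster names are placeholders):

```shell
# Point the Azure CLI at the right subscription, merge the cluster's
# credentials into kubeconfig, then confirm kubectl is connected
az account set --subscription "<subscription-id>"
az aks get-credentials --resource-group "<resource-group>" --name "<aks-cluster-name>"
kubectl cluster-info
```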
View the mutation resources (Assign objects) downloaded by the Azure Policy add-on using 'kubectl get assign':
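Both commands below are standard kubectl queries against the custom resources the add-on installs:

```shell
# Constraint templates synced by the add-on (used by audit/deny policies)
kubectl get constrainttemplates

# Assign mutators synced by the add-on - our new CPU-limit mutation
# should appear in this list
kubectl get assign
```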
Now let's spawn a container that will violate this policy to view the mutation in action. You can use any YAML template or the single-image application wizard in the Azure console. If you use the wizard, be sure to zero out the default limits in Application Details.
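If you'd rather work from the CLI, one quick way (hypothetical deployment name) is to create a deployment that declares no resource limits at all:

```shell
# kubectl create deployment does not set any resources.limits by default,
# so this request will be caught and mutated by the new policy
kubectl create deployment web-nolimit --image=nginx:1.25
```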
Since we're using a mutation effect, the mutating admission webhook in Gatekeeper should insert default values for CPU and memory when it's called by the admission controller before passing the object creation request back to the API server. The container should deploy without any interference from a Deny effect policy because the request was modified prior to the validating admission webhook being called. Sure enough, our deployment is successful!
Now let's check the logs for the gatekeeper pod to view audit and mutation events. Note that the two gatekeeper-controller webhook pods are not differentiated in the console - check both pod names to find the one that is executing mutate actions in your cluster.
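A hedged example of pulling those logs (pod names will differ in your cluster; the add-on's Gatekeeper pods typically live in the gatekeeper-system namespace):

```shell
# Find the controller pod names, then search each one's log for mutation events
kubectl get pods -n gatekeeper-system
kubectl logs -n gatekeeper-system <gatekeeper-controller-pod-name> | grep mutation_applied
```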
We can see the mutate event at the end of the log:
Copied in text form, it reads as follows:
{"level":"info","ts":1723829551.9305975,"logger":"mutation","msg":"Mutation applied","process":"mutation","Mutation Id":"a4155642-5417-48c9-a15a-e31040807e66","event_type":"mutation_applied","resource_group":"","resource_kind":"Pod","resource_api_version":"v1","resource_namespace":"default-1723829546418","resource_name":"web-dvwa-nolimit-8c9f967d4-","resource_source_type":"Original","resource_labels":{"app":"web-dvwa-nolimit","pod-template-hash":"8c9f967d4"},"iteration_0":"Assign//azurepolicy-k8sazurev1resourcelimitscpu-f81c1c050a0fb6b965bc:1"}
We can validate that our new container has a limit applied by inspecting the pod YAML:
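For example (pod name and namespace are placeholders):

```shell
# Print the limits that ended up on the running pod's first container;
# expect to see the injected CPU limit of 500m
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.spec.containers[0].resources.limits}'
```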
There it is - the mutation applied a CPU limit before passing the request back to the API server, and the resource was created successfully!
For more reading on Gatekeeper and Azure Policy for Kubernetes, check out these resources:
- Learn Azure Policy for Kubernetes - Azure Policy
- Use Azure Policy to secure your Azure Kubernetes Service (AKS) clusters - Azure Kubernetes Service
- Built-in policy definitions for Azure Kubernetes Service - Azure Kubernetes Service
- Introduction | Gatekeeper
- GitHub - open-policy-agent/opa: Open Policy Agent (OPA) is an open source, general-purpose policy engine.