Revisiting basics of Kubernetes Networking
Kubernetes Cluster
Let's start with a short introduction to a Kubernetes (k8s) cluster. The cluster is a collection of Master and Worker nodes, which can be either bare-metal servers or virtual machines.
The master nodes run the control plane components needed to manage and operate the cluster, while the worker nodes run the user applications. Both the control plane and user applications run as pods, which are collections of one or more tightly coupled containers, and are separated by namespaces, which provide isolation and multi-tenancy.
Network Connectivity Requirements
The basic network connectivity requirements that need to be implemented for any Kubernetes cluster are as follows:
- All pods can communicate with all other pods without NAT
- All nodes can communicate with all pods (and vice-versa) without NAT
- The IP that a pod sees itself as is the same IP that others see it as
Kubernetes clearly separates the different aspects of network connectivity from each other (a quick way to inspect each of these on a live cluster is sketched after the list):
1/ Node IP addressing (the nodes are deployed in a network referred to as the Node Network)
2/ Pod-to-Pod connectivity (every pod gets an IP address; these addresses are allocated from a Pod Network)
3/ Services (Pod-to-Service communication within a cluster, which uses a Service Network)
4/ External access to a Pod or Service via a Load Balancer (this is handled by using an External-IP, and this network is not managed by Kubernetes)
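This is a minimal sketch, assuming kubectl access to the cluster; the exact output depends on the distribution and the CNI in use:

```sh
# Node Network: the nodes' IP addresses
kubectl get nodes -o wide

# Pod Network: the podCIDR allocated to each node (some CNIs manage their own address pools instead)
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'

# Service Network: Cluster-IPs (and any External-IPs) currently in use
kubectl get svc -A
```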
The text in the subsequent sections is based on deployments that use a kernel-based CNI (Container Network Interface) plugin.
Pod to Pod Communication
Communication inside a POD
In Kubernetes, Pods are the smallest deployable units of computing. A Pod is a group of one or more containers with shared storage and network resources. Let's start by taking a look at how the containers inside a pod communicate with each other.
When creating a pod, Kubernetes first creates a pause container on the node. The pause container acquires the pod's IP address and sets up the network namespace for all other containers that join the pod. All other containers in the pod, called application containers, only need to join that network namespace with --net=container:<id> when they are created. After that, they all run in the same network namespace. Any container-to-container communication inside the pod is done through the loopback interface (localhost) or via a shared filesystem.
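The same mechanism can be sketched outside of Kubernetes with plain Docker; the image tag and container names here are only illustrative:

```sh
# Start a "pause"-style container that owns the network namespace
docker run -d --name pause registry.k8s.io/pause:3.9

# Start an application container that joins the pause container's network namespace
docker run -d --name app --net=container:pause busybox sleep 3600

# Both containers now share the same interfaces, IP address and localhost
docker exec app ip addr show eth0
```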
Communicating with a pod inside the same host
Consider two pods that are deployed on the same Worker Node. Inside each pod's network namespace, the eth0 interface of the pod is assigned a /32 address from the Pod Network range reserved for that node, and the default gateway is set to 169.254.1.1.
The eth0 interface of the pod is paired with a veth interface in the root namespace. When POD-1 sends an ARP request to its default gateway, the node responds. The kernel in the root namespace then routes the traffic between POD-1 and POD-2.
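A rough sketch of what this looks like with a CNI that uses this /32-plus-proxy-ARP pattern (Calico, for example); the addresses and interface names below are purely illustrative:

```sh
# Inside POD-1's network namespace
ip addr show eth0     # e.g. inet 10.244.1.12/32
ip route              # default via 169.254.1.1 dev eth0

# In the node's root namespace, the CNI installs a /32 host route per local pod
ip route | grep 10.244.1
# 10.244.1.12 dev veth-pod1 scope link
# 10.244.1.13 dev veth-pod2 scope link
```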
Communicating with a pod across hosts — Overlay Mode
Next, let us look at POD-1 communicating with POD-2 running on a different node. To support this in Overlay Mode, the nodes maintain a full mesh of tunnels (VXLAN or IP-in-IP), and a forwarding table is built from the combination of each Node-IP and the podCIDR allocated to that node. This table drives the encapsulation and routing decisions. An optional interface, depicted as "tun" below, may be created by the CNI in the root namespace to handle the packet encapsulation and routing.
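An illustrative forwarding table on Node-1 (podCIDR 10.244.1.0/24) in Overlay Mode; the device names and addresses depend on the CNI and are assumptions here:

```sh
ip route
# 10.244.1.0/24 dev cni0 scope link                  # local podCIDR, delivered via the veth pairs
# 10.244.2.0/24 via 192.168.10.12 dev tunl0 onlink   # Node-2's podCIDR, encapsulated towards its Node-IP
# 10.244.3.0/24 via 192.168.10.13 dev tunl0 onlink   # Node-3's podCIDR
```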
Communicating with a pod across hosts - Direct Routing Mode
In this design, there are no tunnels. Instead, the underlay (the physical network) is made aware of the podCIDRs via route advertisement, thereby achieving a simple routed traffic flow with no encapsulation.
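The equivalent sketch for Direct Routing Mode; the podCIDR routes simply point at the other nodes' IPs on the shared underlay, and the physical network learns the podCIDRs (e.g. via BGP), so no encapsulation is needed (addresses again illustrative):

```sh
ip route
# 10.244.2.0/24 via 192.168.10.12 dev eth0   # Node-2's podCIDR, plain routed traffic
# 10.244.3.0/24 via 192.168.10.13 dev eth0   # Node-3's podCIDR
```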
Communicating with a pod from the Node network
Let's talk about traffic that originates in the Node Network and needs to reach a pod. There are two cases here:
1/ Pods running in the root namespace (privileged pods) - these pods get an IP address from the node network
2/ The nodes themselves
To handle this requirement, a forwarding table is again built from the combination of each node's tunnel IP address and the podCIDR allocated to that node, and the kernel in the root namespace takes care of routing the traffic.
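For example, asking the kernel in the root namespace how it would reach a remote pod shows the same forwarding table at work; the output is illustrative for an overlay deployment:

```sh
ip route get 10.244.2.7
# 10.244.2.7 via 192.168.10.12 dev tunl0 src 10.244.1.0   # src is this node's tunnel IP
```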
Communicating from a Pod to External Destinations
In most deployments the Pod Network uses private IP addressing. In order to facilitate communication with destinations outside the cluster, the node can be set up to perform source NAT (SNAT) to translate the Pod-IP to the Node-IP. This would not be required if the pods were using routable IPs and the underlay or physical network knew how to reach the Pod Network.
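Conceptually, the SNAT rule a node (or its CNI) installs looks something like the following; the Pod Network CIDR is an assumption, and real CNIs use their own chains and additional match conditions:

```sh
# Masquerade pod-sourced traffic that leaves the Pod Network (i.e. exits the cluster)
iptables -t nat -A POSTROUTING -s 10.244.0.0/16 ! -d 10.244.0.0/16 -j MASQUERADE
```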
Kubernetes Service
A Kubernetes Service is a construct that solves the dilemma of having to keep up with every transient IP address assigned to pods. It enables Pod-to-Pod communication inside the cluster (East-West) in the following manner:
•Define a single IP/port combination that provides access to a pool of pods.
•Provide a mechanism to load balance traffic across the backend pods.
Type=Cluster-IP
When a Kubernetes Service is created, it gets an IP address from the Service Network, referred to as the Cluster-IP. Any pod that needs to consume the service sends traffic to the Cluster-IP. The backends for a Kubernetes Service are derived from pod selectors, which in turn build the Endpoints. Endpoints is a Kubernetes resource that lists the pod IP addresses and ports where the application runs.
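A minimal sketch of such a Service; the selector is assumed to match the labels (app: web) on a hypothetical set of backend pods:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  type: ClusterIP        # the default, shown here for clarity
  selector:
    app: web             # pods carrying this label become the backends
  ports:
    - port: 80           # the port exposed on the Cluster-IP
      targetPort: 8080   # the port the application listens on inside the pod
```

Running kubectl get endpoints web would then list the matching Pod-IPs and ports that back this Service.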
kube-proxy is a network proxy that runs on each node in the cluster and implements/maintains network rules on the nodes as part of the Kubernetes Service. These network rules are maintained in iptables and take the following actions:
Destination NAT (DNAT): Cluster-IP to target Pod-IP, on the node where the traffic originated
Source NAT (SNAT): target Pod-IP to Cluster-IP, on the node where the traffic originated (this is for the return traffic)
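As a rough illustration of those rules, the iptables NAT chains programmed by kube-proxy for a Service can be inspected as shown below; the chain suffixes, IPs and probabilities are illustrative:

```sh
# The Cluster-IP is matched in KUBE-SERVICES and jumps to a per-service chain
iptables -t nat -L KUBE-SERVICES -n | grep 10.96.0.50
# KUBE-SVC-ABCDEF  tcp  --  0.0.0.0/0  10.96.0.50  /* default/web cluster IP */ tcp dpt:80

# The per-service chain picks one endpoint chain, which performs the DNAT to a Pod-IP
iptables -t nat -L KUBE-SVC-ABCDEF -n
# KUBE-SEP-AAA  statistic mode random probability 0.50000   -> DNAT to 10.244.1.12:8080
# KUBE-SEP-BBB                                               -> DNAT to 10.244.2.7:8080
```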
Exposing the Kubernetes Service Externally
The Service Network is largely realized on the worker nodes via iptables or IP Virtual Server (IPVS) rules and is typically routable only within the cluster. However, this network can be advertised / made routable externally with some CNIs, e.g. via BGP host routes for individual service IPs, or by advertising the entire Service Network, with the worker nodes that host a selected pod for a service advertised as the ECMP next hops.
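As one example, Calico can be told to advertise the Service Network over BGP with a configuration along these lines; the CIDR shown is the common default service range and is an assumption:

```yaml
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  serviceClusterIPs:
    - cidr: 10.96.0.0/12   # advertise the whole Service Network to the BGP peers
```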
But let's discuss some traditional methods of exposing Kubernetes services for North-South communication.
Type=NodePort
This is the simplest form of exposing a Kubernetes service externally. NodePort exposes the service on each node's IP at a static port (hence the name NodePort).
There are several address translations that happen in the following order:
- Destination NAT (DNAT): Node-IP to Cluster-IP
- Source NAT (SNAT): Client-IP to Node-IP (this is the default behavior)
Next, the kube-proxy network rules kick in and the Cluster-IP is translated to a Pod-IP. The pod is picked by the load-balancing rules (random selection in the default iptables mode, round robin by default in IPVS mode). The final step is to route the traffic to the pod in the cluster. Due to the address translations in the North-South direction, the return traffic will follow a symmetric path.
In the above steps, the Client-IP is lost, as it is replaced by the Node-IP. The Client-IP can be preserved (by setting externalTrafficPolicy: Local on the Service), but in that case kube-proxy only forwards requests to local endpoints (pods running on the node where the traffic arrived).
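A sketch of a NodePort Service; nodePort is optional (Kubernetes picks one from the 30000-32767 range if it is omitted), and externalTrafficPolicy: Local is the setting mentioned above that preserves the Client-IP by using only local endpoints:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-nodeport
spec:
  type: NodePort
  selector:
    app: web
  ports:
    - port: 80
      targetPort: 8080
      nodePort: 30080              # the static port opened on every node
  externalTrafficPolicy: Local     # optional: preserve the Client-IP
```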
Type=LoadBalancer
When a Kubernetes Service is exposed via a Load Balancer (LB), an External-IP is assigned to the LB, which serves as the entry point for accessing the service from outside. The External-IP is configured and handled by the Load Balancer. Kubernetes offers this type of integration with cloud providers that offer Load Balancers as a Service (LBaaS).
Once the traffic hits the LB, it is load balanced to one of the nodes in the cluster. Subsequent NAT operations happen in the same order as described for the NodePort case.
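A sketch of the Service definition for this case; on a cloud platform, the LBaaS integration provisions the load balancer and reports its address back as the Service's External-IP:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-lb
spec:
  type: LoadBalancer
  selector:
    app: web
  ports:
    - port: 80
      targetPort: 8080
```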
With recent releases of Kubernetes, there is a way to avoid using the Node-IPs as the targets for the Load Balancer. This is possible for cases where the pods have routable IP addresses. It also means there must be an alternative mechanism to update the Load Balancer configuration when pods come and go.
Alternatives to Kubernetes Service
One solution is to use an Ingress Controller, which functions like a reverse proxy and is deployed as a pod in the cluster. The Ingress Controller can use the Service Cluster-IP or directly use the Endpoints/Pods as its backends. Any pod can hit the IP of the Ingress Controller to consume applications.
The Ingress Controller itself can be exposed using the Kubernetes Service construct and an external load balancer, for example.
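A sketch of an Ingress resource that would route HTTP traffic to the (assumed) web Service; an Ingress Controller such as ingress-nginx must be running in the cluster for the rule to take effect, and the hostname and class name are assumptions:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-ingress
spec:
  ingressClassName: nginx          # assumes an IngressClass named "nginx" exists
  rules:
    - host: web.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web          # the backing Service
                port:
                  number: 80
```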
Service Mesh
Side Car Containers
A sidecar is a utility container whose main function is to support the primary container. One of the common applications of a sidecar container is to work as a proxy, which can do the following:
- Security: Decrypt incoming requests & authenticate clients via mTLS
- SPIFFE: "Secure Production Identity Framework for Everyone" standardizes X.509 subjects
- Layer 7 segmentation
In a service mesh, service deployments are modified to include a dedicated "sidecar" proxy whose job is to handle the complexities of service-to-service communication. The set of sidecar proxies in a service mesh is referred to as its "data plane". The APIs used to control the behavior of the sidecar proxies (the policies that configure the data plane) are referred to as its "control plane".
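Conceptually, a pod in the data plane looks like the sketch below, with the proxy declared alongside the application; in a real mesh such as Istio the sidecar is normally injected automatically rather than written by hand, and the image, ports and config here are only illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-with-sidecar
spec:
  containers:
    - name: web                        # the primary application container
      image: nginx
      ports:
        - containerPort: 8080
    - name: proxy                      # the sidecar proxy (illustrative image)
      image: envoyproxy/envoy:v1.29.0
      args: ["-c", "/etc/envoy/envoy.yaml"]   # a proxy config would need to be mounted here
```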
Ingress Gateway
The question remains of how to access the service mesh from outside. Similar to the Ingress Controller, the Ingress Gateway serves as the entry point for all services running within the mesh and is typically exposed via a Load Balancer. Note that the Ingress Gateway is itself another pod and comes with its own sidecar proxy.
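In Istio, for example, exposing services through the Ingress Gateway is expressed with a Gateway resource bound to the gateway pods; the hostname and selector labels below are assumptions:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: web-gateway
spec:
  selector:
    istio: ingressgateway        # selects the Ingress Gateway pods
  servers:
    - port:
        number: 80
        name: http
        protocol: HTTP
      hosts:
        - "web.example.com"
```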
Kubernetes Clusters on BareMetal
Finally, let's look at the case where the Kubernetes cluster is deployed on bare-metal servers (just a bunch of servers that act as Master Nodes and Worker Nodes). The external Load Balancer functionality is provided by private and public cloud platforms when the Kubernetes nodes are deployed on those platforms. That raises the question: how is this problem addressed for bare-metal clusters? The solution is provided by MetalLB, which can be configured in Layer 2 mode or BGP mode.
In Layer 2 mode, a Virtual IP address (Metal-VIP) is defined and deployed in an active-standby fashion. The Metal-VIP could be from the node network. The Metal-VIP is translated to the Node-IP, and from there on the same process follows as described earlier.
In BGP mode, multiple Virtual IP addresses (Metal-VIPs) are assigned and advertised via BGP to attract traffic in an active-active fashion.
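A sketch of a MetalLB configuration using its CRDs (MetalLB v0.13 and later); the address range is an assumption and would normally come from the node network or a routed block:

```yaml
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: example-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.10.240-192.168.10.250
---
# Layer 2 mode: the VIPs are announced via ARP from one node at a time (active-standby);
# for BGP mode, a BGPPeer and a BGPAdvertisement resource would be used instead (active-active).
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: example-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - example-pool
```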