
Kubernetes Knowledge Overview - Core Components

Pod#

In Kubernetes, the primary attributes of almost all resources are the same, mainly consisting of five parts:

  1. apiVersion <string> The API version, defined internally by Kubernetes; the available versions can be queried with kubectl api-versions.
  2. kind <string> The resource type, defined internally by Kubernetes; the available types can be queried with kubectl api-resources.
  3. metadata <object> Metadata, mainly for resource identification and description, commonly includes name, namespace, labels, etc.
  4. spec <object> Description, which is the most important part of the configuration, contains detailed descriptions of various resource configurations.
  5. status <object> Status information, the contents of which do not need to be defined and are automatically generated by Kubernetes.

Pod Lifecycle#

The time range from the creation to the termination of a Pod object is called the Pod's lifecycle. The main processes of its lifecycle are as follows:

  1. Pod creation
  2. Running initialization containers
  3. Running main containers
    (1) Start hooks, termination hooks
    (2) Liveness probes, readiness probes
  4. Pod termination

Throughout its lifecycle, a Pod can be in one of five states, as follows:

  1. Pending: The apiserver has created the Pod resource object, but it has not yet been scheduled or is still in the process of downloading the image.
  2. Running: The Pod has been scheduled to a node, and all containers have been created by kubelet.
  3. Succeeded: All containers in the Pod have successfully terminated and will not be restarted.
  4. Failed: All containers have terminated, but at least one container has terminated with a failure, meaning it returned a non-zero exit status.
  5. Unknown: The apiserver cannot retrieve the status information of the Pod normally, usually due to network communication failure.

Pod Creation Process#

The Pod configuration is submitted to the apiserver via kubectl; the apiserver converts the Pod information, stores it in etcd, and returns an acknowledgment to the client. The scheduler watches the apiserver for new Pods, uses its algorithms to assign a host to the Pod, and updates the result in the apiserver. The kubelet on the assigned node picks up the updated information, creates the containers, and reports the result back to the apiserver, which stores the final state in etcd, completing the Pod creation.

  1. The user submits the Pod information to be created to the apiserver via kubectl or other API clients.
  2. The apiserver begins generating the Pod object information and stores it in etcd, then returns confirmation information to the client.
  3. The apiserver begins reflecting changes to the Pod object in etcd, and other components use the watch mechanism to track changes on the apiserver.
  4. The scheduler discovers that a new Pod object needs to be created, starts allocating hosts for the Pod, and updates the result information to the apiserver.
  5. The kubelet on the node detects that a Pod has been scheduled, attempts to call Docker to start the container, and sends the result back to the apiserver.
  6. The apiserver stores the received Pod status information in etcd.

Pod Termination Process#

The user sends a command to delete the Pod; the apiserver accepts it and updates the object, and the Pod's status changes to Terminating. The kubelet notices this and starts the Pod shutdown process. Meanwhile, the endpoint controller sees that the Pod is being shut down and removes it from the endpoint lists of the matching Service resources, and the Pod stops running. The kubelet then asks the apiserver to set the Pod's grace period to 0 to complete the deletion, the apiserver records the final state in etcd, and the Pod deletion is complete.

  1. The user sends a command to the apiserver to delete the Pod object.
  2. The Pod object in the apiserver is updated with the time after which it will be considered dead, together with a grace period (default 30s).
  3. The Pod is marked as terminating.
  4. The kubelet starts the Pod shutdown process as soon as it detects that the Pod object has changed to the terminating state.
  5. The endpoint controller removes the Pod object from the endpoint list of all matching service resources when it detects the Pod object's shutdown behavior.
  6. If the current Pod object defines a preStop hook handler, it will be executed synchronously as soon as it is marked as terminating.
  7. The container processes in the Pod object receive a stop signal.
  8. After the grace period ends, if there are still running processes in the Pod, the Pod object will receive an immediate termination signal.
  9. The kubelet requests the apiserver to set the grace period of this Pod resource to 0 to complete the deletion operation; at this point, the Pod is no longer visible to the user.

Initialization Containers#

Initialization containers are containers that run before the main container of the Pod starts, mainly to perform some preparatory work for the main container. They have two main characteristics:

  1. Initialization containers must run to completion; if an initialization container fails, Kubernetes restarts it repeatedly until it succeeds.
  2. Initialization containers must execute in the defined order; only after the previous one succeeds can the next one run.

Initialization containers have many application scenarios; here are the most common ones:

  1. Providing tools or custom code that are not available in the main container image.
  2. Initialization containers must start and run to completion before application containers, so they can be used to delay the startup of application containers until their dependent conditions are met.

Next, let's create a case to simulate the following requirement:
Assuming we want to run Nginx in the main container, but we need to be able to connect to the servers where MySQL and Redis are located before running Nginx.
To simplify testing, we predefine the IP addresses of MySQL and Redis as 192.168.18.103 and 192.168.18.104, respectively (note that these two IPs cannot be pinged, as these IPs do not exist in the environment).
Create a file named pod-initcontainer.yaml with the following content:

apiVersion: v1
kind: Pod
metadata:
  name: pod-initcontainer
  namespace: dev
  labels:
    user: xudaxian
spec:
  containers: # Container configuration
    - name: nginx
      image: nginx:1.17.1
      imagePullPolicy: IfNotPresent
      ports:
        - name: nginx-port
          containerPort: 80
          protocol: TCP
      resources:
        limits:
          cpu: "2"
          memory: "10Gi"
        requests:
          cpu: "1"
          memory: "10Mi"
  initContainers: # Initialization container configuration
    - name: test-mysql
      image: busybox:1.30
      command: ["sh","-c","until ping 192.168.18.103 -c 1;do echo waiting for mysql ...;sleep 2;done;"]
      securityContext:
        privileged: true # Run the container in privileged mode
    - name: test-redis
      image: busybox:1.30
      command: ["sh","-c","until ping 192.168.18.104 -c 1;do echo waiting for redis ...;sleep 2;done;"]

After executing the command, if test-mysql fails to create, subsequent containers cannot be created either. After modifying the IP to an accessible one and re-executing the command, they will be created successfully in order.

Hook Functions#

Kubernetes provides two hook functions for the main container, one that runs after it starts and one that runs before it stops:

  • postStart: Executes after the container is created; if it fails, the container will be restarted.

  • preStop: Executes before the container terminates; after it completes, the container will successfully terminate. The deletion operation of the container will be blocked until it completes.

Hook handlers support defining actions using the following three methods:

  • exec command: Execute a command once inside the container.
  .......
    lifecycle:
       postStart: 
          exec:
             command:
               - cat
               - /tmp/healthy
  .......
  • tcpSocket: Attempts to access a specified socket in the current container.
  .......
     lifecycle:
        postStart:
           tcpSocket:
              port: 8080
  .......
  • httpGet: Initiates an HTTP request to a certain URL in the current container.
  ....... 
     lifecycle:
        postStart:
           httpGet:
              path: / # URI address
              port: 80 # Port number
              host: 192.168.109.100 # Host address  
              scheme: HTTP # Supported protocols, http or https
  .......

Container Probes#

Container probes are used to check whether the application instances in the container are working properly, serving as a traditional mechanism to ensure business availability. If the probe indicates that the instance's state does not meet expectations, Kubernetes will "remove" the problematic instance, preventing it from handling business traffic. Kubernetes provides two types of probes to implement container probing:

  • liveness probes: Used to detect whether the application instance is currently running normally; if not, k8s will restart the container.

  • readiness probes: Used to detect whether the application instance can accept requests; if not, k8s will not forward traffic.

livenessProbe: Liveness probe, determines whether to restart the container.
readinessProbe: Readiness probe, determines whether to forward requests to the container.

Kubernetes introduced the startupProbe in version 1.16, used to determine whether the application in the container has started. If a startupProbe is configured, the other probes are disabled until it succeeds; once it succeeds, the startup probe does not run again and the other probes take over.
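
A sketch of what a startupProbe block could look like inside a container spec (the thresholds here are illustrative assumptions):

  ……
     startupProbe:
        httpGet:
           path: /
           port: 80
        failureThreshold: 30 # Allow up to 30 failed checks ...
        periodSeconds: 10 # ... performed 10 seconds apart, i.e. up to 300s for the application to start
  ……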

All of these probes currently support three probing methods:

  • exec command: Execute a command once inside the container; if the command's exit code is 0, the program is considered normal; otherwise, it is not.
  ……
    livenessProbe:
       exec:
          command:
            - cat
            - /tmp/healthy
  ……
  • tcpSocket: Attempts to access a port of a user container; if a connection can be established, the program is considered normal; otherwise, it is not.
  ……
     livenessProbe:
        tcpSocket:
           port: 8080
  ……
  • httpGet: Calls the URL of the web application in the container; if the returned status code is between 200 and 399, the program is considered normal; otherwise, it is not.
……
   livenessProbe:
      httpGet:
         path: / # URI address
         port: 80 # Port number
         host: 127.0.0.1 # Host address
         scheme: HTTP # Supported protocols, http or https
……

Restart Policy#

In container probing, once a liveness probe fails, Kubernetes restarts the affected container; whether and how containers are restarted is actually determined by the Pod's restart policy, which has three options:

  • Always: Restarts the container whenever it terminates, regardless of exit code; this is the default value.
  • OnFailure: Restarts the container when it terminates and the exit code is not 0.
  • Never: Does not restart the container regardless of its state.

The restart policy applies to all containers in the Pod. The first restart of a container happens immediately when needed; subsequent restarts are delayed by kubelet with increasing intervals of 10s, 20s, 40s, 80s, 160s, and 300s, where 300s is the maximum delay.
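
The restart policy is set per Pod via spec.restartPolicy; a minimal sketch (the name and image are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: pod-restartpolicy
  namespace: dev
spec:
  restartPolicy: OnFailure # Always (default) / OnFailure / Never
  containers:
    - name: nginx
      image: nginx:1.17.1
      imagePullPolicy: IfNotPresent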

Pod Scheduling#

By default, which Node a Pod runs on is decided by the Scheduler component using its scheduling algorithms, and this process cannot be controlled manually. In practice, however, this is often not enough: we frequently want to control which Pods land on which Nodes. This requires understanding Kubernetes' scheduling rules for Pods, which fall into four major categories:

  • Automatic scheduling: The Node on which it runs is entirely determined by the Scheduler through a series of algorithm calculations.
  • Directed scheduling: NodeName, NodeSelector.
  • Affinity scheduling: NodeAffinity, PodAffinity, PodAntiAffinity.
  • Taints (toleration) scheduling: Taints, Toleration.

Directed Scheduling#

Directed scheduling refers to using the nodeName or nodeSelector declared on the Pod to schedule the Pod to the desired Node. Note that this scheduling is mandatory, meaning that even if the target Node does not exist, it will still attempt to schedule to it, but the Pod will fail to run.

nodeName#

nodeName is used to force the Pod to be scheduled on a specified Node by name. This method actually skips the scheduling logic of the Scheduler and directly schedules the Pod to the specified node by name.
Create a file named pod-nodename.yaml with the following content:

apiVersion: v1
kind: Pod
metadata:
  name: pod-nodename
  namespace: dev
  labels:
    user: xudaxian
spec:
  containers: # Container configuration
    - name: nginx
      image: nginx:1.17.1
      imagePullPolicy: IfNotPresent
      ports:
        - name: nginx-port
          containerPort: 80
          protocol: TCP
  nodeName: k8s-node1 # Specify scheduling to the k8s-node1 node

nodeSelector#

nodeSelector is used to schedule the Pod to Node nodes that have specific labels added. It is implemented through Kubernetes' label-selector mechanism. In other words, before the Pod is created, the Scheduler will use the MatchNodeSelector scheduling strategy to perform label matching, find the target node, and then schedule the Pod to the target node. This matching rule is mandatory.

apiVersion: v1
kind: Pod
metadata:
  name: pod-nodeselector
  namespace: dev
spec:
  containers: # Container configuration
    - name: nginx
      image: nginx:1.17.1
      imagePullPolicy: IfNotPresent
      ports:
        - name: nginx-port
          containerPort: 80
          protocol: TCP
  nodeSelector:
    nodeenv: pro # Specify scheduling to the Node node with nodeenv=pro
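
For this example to be schedulable, the target Node must already carry the nodeenv=pro label; assuming a node named k8s-node1, it could be added like this:

kubectl label nodes k8s-node1 nodeenv=pro
kubectl get nodes --show-labels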

Affinity Scheduling#

Although the two methods of directed scheduling are very convenient to use, they also have certain issues, namely that if there are no Nodes that meet the conditions, the Pod will not run, even if there are available Nodes in the cluster. This limits its use cases.
To address this issue, Kubernetes also provides affinity scheduling (Affinity). It extends nodeSelector: Nodes that satisfy the conditions are preferred, but if none are available the Pod can still be scheduled to Nodes that do not satisfy them, making scheduling more flexible. Affinity is mainly divided into three categories:

  • nodeAffinity (Node affinity): Targets Nodes and solves the problem of which Nodes the Pod can be scheduled to.
  • podAffinity (Pod affinity): Targets Pods and solves the problem of which existing Pods can be deployed in the same topology domain as the new Pod.
  • podAntiAffinity (Pod anti-affinity): Targets Pods and solves the problem of which existing Pods cannot be deployed in the same topology domain as the new Pod.

Explanation of the usage scenarios for affinity and anti-affinity:

  • Affinity: If two applications frequently interact, it is necessary to use affinity to keep the two applications as close as possible to reduce performance loss due to network communication.
  • Anti-affinity: When applications are deployed in multiple replicas, it is necessary to use anti-affinity to spread the application instances across different Nodes to improve service availability.

nodeAffinity (Node Affinity)#

Check the optional configuration items for nodeAffinity:

pod.spec.affinity.nodeAffinity
  requiredDuringSchedulingIgnoredDuringExecution  Node nodes must meet all specified rules to be scheduled, equivalent to a hard limit
    nodeSelectorTerms  Node selection list
      matchFields   Node selector requirements listed by node fields  
      matchExpressions   Node selector requirements listed by node labels (recommended)
        key    Key
        values Value
        operator Relationship operator supports Exists, DoesNotExist, In, NotIn, Gt, Lt
  preferredDuringSchedulingIgnoredDuringExecution Prefer to schedule to Nodes that meet specified rules, equivalent to a soft limit (preference)     
    preference   A node selector item associated with a corresponding weight
      matchFields Node selector requirements listed by node fields
      matchExpressions Node selector requirements listed by node labels (recommended)
        key Key
        values Value
        operator Relationship operator supports In, NotIn, Exists, DoesNotExist, Gt, Lt  
    weight Preference weight, in the range of 1-100.

Explanation of the use of relationship operators:

- matchExpressions:
    - key: nodeenv # Match nodes with the key nodeenv
      operator: Exists   
    - key: nodeenv # Match nodes with the key nodeenv and value "xxx" or "yyy"
      operator: In    
      values: ["xxx","yyy"]
    - key: nodeenv # Match nodes whose nodeenv value is greater than "xxx"
      operator: Gt
      values: ["xxx"]

Demonstration of requiredDuringSchedulingIgnoredDuringExecution:
Create a file named pod-nodeaffinity-required.yaml with the following content:

apiVersion: v1
kind: Pod
metadata:
  name: pod-nodeaffinity-required
  namespace: dev
spec:
  containers: # Container configuration
    - name: nginx
      image: nginx:1.17.1
      imagePullPolicy: IfNotPresent
      ports:
        - name: nginx-port
          containerPort: 80
          protocol: TCP
  affinity: # Affinity configuration
    nodeAffinity: # Node affinity configuration
      requiredDuringSchedulingIgnoredDuringExecution: # Node nodes must meet all specified rules to be scheduled, equivalent to a hard rule, similar to directed scheduling
        nodeSelectorTerms: # Node selection list
          - matchExpressions:
              - key: nodeenv # Match nodes with the key nodeenv and value "xxx" or "yyy"
                operator: In
                values:
                  - "xxx"
                  - "yyy"
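
For comparison, a minimal sketch of the soft-limit form (preferredDuringSchedulingIgnoredDuringExecution), reusing the same nodeenv label; if no Node matches, the Pod is still scheduled elsewhere:

apiVersion: v1
kind: Pod
metadata:
  name: pod-nodeaffinity-preferred
  namespace: dev
spec:
  containers: # Container configuration
    - name: nginx
      image: nginx:1.17.1
      imagePullPolicy: IfNotPresent
  affinity: # Affinity configuration
    nodeAffinity: # Node affinity configuration
      preferredDuringSchedulingIgnoredDuringExecution: # Soft limit: prefer matching Nodes, fall back to others
        - weight: 1 # Preference weight, 1-100
          preference:
            matchExpressions:
              - key: nodeenv # Prefer Nodes whose nodeenv label is "xxx" or "yyy"
                operator: In
                values:
                  - "xxx"
                  - "yyy"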

Notes on nodeAffinity:

  • If both nodeSelector and nodeAffinity are defined, both conditions must be met for the Pod to run on the specified Node.
  • If nodeAffinity specifies multiple nodeSelectorTerms, only one needs to match successfully.
  • If there are multiple matchExpressions in a nodeSelectorTerms, a node must meet all of them to match successfully.
  • If the labels of the Node where a Pod is located change during the Pod's runtime and no longer meet the Pod's nodeAffinity requirements, the system will ignore this change.

podAffinity (Pod Affinity)#

podAffinity mainly implements the function of allowing newly created Pods to be deployed in the same area as existing Pods.
Optional configuration items for PodAffinity:

pod.spec.affinity.podAffinity
  requiredDuringSchedulingIgnoredDuringExecution  Hard limit
    namespaces Specifies the namespace of the reference Pod
    topologyKey Specifies the scheduling scope
    labelSelector Label selector
      matchExpressions  Selector requirements listed by Pod labels (recommended)
        key    Key
        values Value
        operator Relationship operator supports In, NotIn, Exists, DoesNotExist.
      matchLabels    Content mapped by multiple matchExpressions  
  preferredDuringSchedulingIgnoredDuringExecution Soft limit    
    podAffinityTerm  Options
      namespaces
      topologyKey
      labelSelector
         matchExpressions 
            key    Key  
            values Value  
            operator
         matchLabels 
    weight Preference weight, in the range of 1-100

topologyKey is used to specify the scope of scheduling, for example:

  • If specified as kubernetes.io/hostname, it distinguishes based on Node nodes.
  • If specified as beta.kubernetes.io/os, it distinguishes based on the operating system type of Node nodes.

Demonstration of requiredDuringSchedulingIgnoredDuringExecution.
Create a file named pod-podaffinity-required.yaml with the following content:

apiVersion: v1
kind: Pod
metadata:
  name: pod-podaffinity-required
  namespace: dev
spec:
  containers: # Container configuration
    - name: nginx
      image: nginx:1.17.1
      imagePullPolicy: IfNotPresent
      ports:
        - name: nginx-port
          containerPort: 80
          protocol: TCP
  affinity: # Affinity configuration
    podAffinity: # Pod affinity
      requiredDuringSchedulingIgnoredDuringExecution: # Hard limit
        - labelSelector:
            matchExpressions: # This Pod must be on the same Node as Pods with the label podenv=xxx or podenv=yyy; clearly, there are no such Pods
              - key: podenv
                operator: In
                values:
                  - "xxx"
                  - "yyy"
          topologyKey: kubernetes.io/hostname

podAntiAffinity (Pod Anti-Affinity)#

podAntiAffinity mainly implements the function of preventing newly created Pods from being deployed in the same area as existing Pods.
Its configuration method is the same as podAffinity.

apiVersion: v1
kind: Pod
metadata:
  name: pod-podantiaffinity-required
  namespace: dev
spec:
  containers: # Container configuration
    - name: nginx
      image: nginx:1.17.1
      imagePullPolicy: IfNotPresent
      ports:
        - name: nginx-port
          containerPort: 80
          protocol: TCP
  affinity: # Affinity configuration
    podAntiAffinity: # Pod anti-affinity
      requiredDuringSchedulingIgnoredDuringExecution: # Hard limit
        - labelSelector:
            matchExpressions:
              - key: podenv
                operator: In
                values:
                  - "pro"
          topologyKey: kubernetes.io/hostname

Taints and Tolerations#

Taints#

The previous scheduling methods are all based on the Pod's perspective, determining whether to schedule the Pod to a specified Node by adding attributes to the Pod. We can also approach this from the Node's perspective by adding taint attributes to the Node to decide whether to allow Pod scheduling.
Once a Node is tainted, it creates an exclusion relationship with Pods, thereby rejecting Pods from scheduling in, and can even evict existing Pods.
The format of a taint is key=value:effect, where key and value are the taint's label and effect describes the taint's action, supporting the following three options:

  • PreferNoSchedule: Kubernetes will try to avoid scheduling Pods to Nodes with this taint unless there are no other nodes available.
  • NoSchedule: Kubernetes will not schedule Pods to Nodes with this taint, but it will not affect Pods that already exist on the current Node.
  • NoExecute: Kubernetes will not schedule Pods to Nodes with this taint and will also evict any existing Pods on the Node.
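
Taints are managed with kubectl; a quick sketch, assuming a node named k8s-node1 and a taint key of "tag":

# Add a taint
kubectl taint nodes k8s-node1 tag=value:NoSchedule
# Remove the taint with that key and effect
kubectl taint nodes k8s-node1 tag:NoSchedule-
# Remove all taints with the key "tag"
kubectl taint nodes k8s-node1 tag-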

Tolerations#

The above describes the effect of taints; we can add taints to Nodes to reject Pods from scheduling there. However, if we want a Pod to be scheduled to a Node with taints, we need to use tolerations.

Taints are rejections, and tolerations are ignores; Nodes reject Pod scheduling through taints, and Pods ignore rejections through tolerations.

The detailed configuration of tolerations:

kubectl explain pod.spec.tolerations
......
FIELDS:
  key       # Corresponds to the key of the taint to tolerate; empty means match all keys
  value     # Corresponds to the value of the taint to tolerate
  operator  # Key-value operator, supports Equal (default) and Exists
  effect    # Corresponds to the effect of the taint; empty means match all effects
  tolerationSeconds   # Toleration time, effective when the effect is NoExecute, indicating the duration the Pod can stay on the Node

When the operator is Equal, if a Node has multiple Taints, each Taint must be tolerated for the Pod to be deployed.
When the operator is Exists, there are three ways to write:

  • Tolerate the specified taint, with the specified effect:

  tolerations: # Tolerations
    - key: "tag" # The key of the taint to tolerate
      operator: Exists # Operator
      effect: NoExecute # Add toleration rules; this must match the taint rules

  • Tolerate the specified taint, regardless of the specific effect:

  tolerations: # Tolerations
    - key: "tag" # The key of the taint to tolerate
      operator: Exists # Operator

  • Tolerate all taints (use with caution):

  tolerations: # Tolerations
    - operator: Exists # Operator

Pod Controllers#

In Kubernetes, Pods can be divided into two categories based on how they are created:

  • Standalone Pods: Pods created directly, without a controller; once deleted, they are gone and will not be recreated.
  • Controller-created Pods: Pods created through Pod controllers; these Pods will be automatically recreated after deletion.

Pod controllers serve as an intermediate layer for managing Pods. Once a Pod controller is used, we only need to tell the Pod controller how many Pods of what type we want, and it will create Pods that meet the conditions and ensure that each Pod is in the expected state. If a Pod fails during runtime, the controller will restart or rebuild the Pod based on the specified policy.
Kubernetes has many types of Pod controllers, each suitable for its own scenarios. Common ones include:

  • ReplicationController: A relatively primitive Pod controller that has been deprecated and replaced by ReplicaSet.
  • ReplicaSet: Ensures that a specified number of Pods are running and supports changes in the number of Pods and image versions.
  • Deployment: Controls Pods through ReplicaSet and supports rolling upgrades and version rollback.
  • Horizontal Pod Autoscaler: Automatically adjusts the number of Pods based on load, scaling the workload horizontally.
  • DaemonSet: Runs a replica on each specified Node in the cluster, generally used for daemon-like tasks.
  • Job: The Pods it creates exit immediately after completing their tasks, used for one-time tasks.
  • CronJob: The Pods it creates execute periodically, used for periodic tasks.
  • StatefulSet: Manages stateful applications.

ReplicaSet (RS)#

The main function of ReplicaSet is to ensure that a certain number of Pods can run normally. It continuously monitors the running status of these Pods, and once a Pod fails, it will restart or rebuild it. It also supports scaling the number of Pods.

The resource manifest file for ReplicaSet:

apiVersion: apps/v1 # Version number 
kind: ReplicaSet # Type 
metadata: # Metadata 
  name: # RS name
  namespace: # Namespace 
  labels: # Labels 
    controller: rs 
spec: # Detailed description 
  replicas: 3 # Number of replicas 
  selector: # Selector, specifies which Pods this controller manages
    matchLabels: # Labels matching rules 
      app: nginx-pod 
    matchExpressions: # Expressions matching rules 
      - {key: app, operator: In, values: [nginx-pod]} 
  template: # Template, when the number of replicas is insufficient, Pods will be created based on this template
    metadata:
      labels:
        app: nginx-pod
    spec:
      containers:
        - name: nginx
          image: nginx:1.17.1
          ports:
            - containerPort: 80

Here, the new configuration items to understand are several options under spec:

  • replicas: Specifies the number of replicas, which is essentially the number of Pods created by the current RS, defaulting to 1.
  • selector: The selector establishes the relationship between the Pod controller and the Pods, using the Label Selector mechanism (defining Labels on the Pod module and defining selectors on the controller indicates which Pods the current controller can manage).
  • template: The template is the definition used by the current controller to create Pods.

Deployment (Deploy)#

To better solve service orchestration issues, Kubernetes introduced the Deployment controller starting from version v1.2. It is worth mentioning that the Deployment controller does not directly manage Pods but indirectly manages them through ReplicaSets, meaning that Deployment has more powerful functions than ReplicaSets.

The main functions of Deployment include:

  • Supports all functions of ReplicaSet.
  • Supports stopping and continuing deployments.
  • Supports rolling updates and version rollback.

The resource manifest for Deployment:

apiVersion: apps/v1 # Version number 
kind: Deployment # Type 
metadata: # Metadata 
  name: # Deployment name 
  namespace: # Namespace 
  labels: # Labels 
    controller: deploy 
spec: # Detailed description 
  replicas: 3 # Number of replicas 
  revisionHistoryLimit: 3 # Retain historical versions, default is 10 
  paused: false # Pause deployment, default is false 
  progressDeadlineSeconds: 600 # Deployment timeout (s), default is 600 
  strategy: # Strategy 
    type: RollingUpdate # Rolling update strategy 
    rollingUpdate: # Rolling update 
      maxSurge: 30% # Maximum additional replicas that can exist, can be a percentage or an integer 
      maxUnavailable: 30% # Maximum number of Pods that can be unavailable during the update, can be a percentage or an integer 
  selector: # Selector, specifies which Pods this controller manages 
    matchLabels: # Labels matching rules 
      app: nginx-pod 
    matchExpressions: # Expressions matching rules 
      - {key: app, operator: In, values: [nginx-pod]} 
  template: # Template, when the number of replicas is insufficient, Pods will be created based on this template 
    metadata: 
      labels: 
        app: nginx-pod 
    spec: 
      containers: 
      - name: nginx 
        image: nginx:1.17.1 
        ports: 
        - containerPort: 80

Deployment supports two image update strategies: Recreate update and Rolling update (default), which can be configured through the strategy option.

strategy: Specifies the strategy for replacing old Pods with new Pods, supporting two attributes
  type: Specifies the strategy type, supporting two strategies
    Recreate: All existing Pods will be killed before creating new Pods
    RollingUpdate: Rolling update, which kills some and starts some; during the update process, there are two versions of Pods
  rollingUpdate: Effective when type is RollingUpdate, used to set parameters for rollingUpdate, supporting two attributes:
    maxUnavailable: Specifies the maximum number of Pods that can be unavailable during the upgrade, default is 25%.
    maxSurge: Specifies the maximum number of Pods that can exceed the expected number during the upgrade, default is 25%.

Deployment supports pause and resume functions during version upgrades, as well as version rollback and many other functions. Let's look at these in detail:

# Version upgrade related functions
kubectl rollout <subcommand> deploy xx  # Supports the following subcommands
# status Displays the current upgrade status
# history Displays upgrade history
# pause Pauses the version upgrade process
# resume Continues the paused version upgrade process
# restart Restarts the version upgrade process
# undo Rolls back to the previous version (can use --to-revision to roll back to a specified version)

The reason why Deployment can achieve version rollback is that it records the historical ReplicaSets. Once a rollback is desired, it only needs to reduce the current version's Pod count to 0 and increase the target count of the rollback version.

Canary Release#

Deployment supports control during the update process, such as pausing the update operation (pause) or continuing the update operation (resume).
For example, after a batch of new Pod resources are created, the update process is immediately paused. At this point, only a portion of the new version of the application exists, while the majority is still the old version. Then, a small portion of user requests are routed to the new version of the Pod application, and the system continues to observe whether it runs stably as expected. If there are no issues, the remaining Pod resources will continue to be rolled out; otherwise, an immediate rollback operation will be performed.
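
A hedged sketch of how this could look with kubectl, assuming a Deployment named pc-deployment in the dev namespace whose container is named nginx:

# Update the image and immediately pause the rollout
kubectl set image deployment pc-deployment nginx=nginx:1.17.2 -n dev
kubectl rollout pause deployment pc-deployment -n dev
# At this point only part of the Pods run the new version; observe the rollout state
kubectl rollout status deployment pc-deployment -n dev
# If everything looks good, resume the rollout; otherwise roll back with "kubectl rollout undo"
kubectl rollout resume deployment pc-deployment -n dev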

Horizontal Pod Autoscaler (HPA)#

We can manually execute the kubectl scale command to achieve scaling of Pods, but this clearly does not align with Kubernetes' goal of automation and intelligence. Kubernetes aims to automatically adjust the number of Pods based on monitoring the usage of Pods, leading to the creation of the HPA controller.
HPA can obtain the utilization of each Pod, compare it with the metrics defined in HPA, calculate the specific value that needs to be scaled, and finally adjust the number of Pods. In fact, HPA, like the previous Deployment, is also a type of Kubernetes resource object that determines whether to adjust the target Pod's replica count based on tracking and analyzing the load changes of the target Pods.

If there is no program in the cluster to collect resource usage, you can choose to install metrics-server.

Test example:

apiVersion: autoscaling/v1 # Version number
kind: HorizontalPodAutoscaler # Type
metadata: # Metadata
  name: pc-hpa # Name of the HPA
  namespace: dev # Namespace
spec:
  minReplicas: 1 # Minimum number of Pods
  maxReplicas: 10 # Maximum number of Pods
  targetCPUUtilizationPercentage: 3 # CPU utilization metric
  scaleTargetRef:  # Specify the information of the Nginx to be controlled
    apiVersion: apps/v1
    kind: Deployment
    name: nginx

DaemonSet (DS)#

The DaemonSet type of controller ensures that a replica runs on every (or specified) node in the cluster, generally suitable for log collection, node monitoring, etc. In other words, if a Pod provides node-level functionality (needed and only needed on each node), then this type of Pod is suitable for creation using the DaemonSet controller.

Characteristics of the DaemonSet controller:

  • Each time a new node is added to the cluster, the specified Pod replica will also be added to that node.
  • When a node is removed from the cluster, the Pod will also be garbage collected.

The resource manifest for DaemonSet:

apiVersion: apps/v1 # Version number
kind: DaemonSet # Type
metadata: # Metadata
  name: # Name
  namespace: # Namespace
  labels: # Labels
    controller: daemonset
spec: # Detailed description
  revisionHistoryLimit: 3 # Retain historical versions
  updateStrategy: # Update strategy
    type: RollingUpdate # Rolling update strategy
    rollingUpdate: # Rolling update
      maxUnavailable: 1 # Maximum number of Pods that can be unavailable, can be a percentage or an integer
  selector: # Selector, specifies which Pods this controller manages
    matchLabels: # Labels matching rules
      app: nginx-pod
    matchExpressions: # Expressions matching rules
      - key: app
        operator: In
        values:
          - nginx-pod
  template: # Template, when the number of replicas is insufficient, Pods will be created based on this template
     metadata:
       labels:
         app: nginx-pod
     spec:
       containers:
         - name: nginx
           image: nginx:1.17.1
           ports:
             - containerPort: 80

Job#

Job is mainly responsible for batch processing of short-lived one-time tasks.
Characteristics of Job:

  • When a Pod created by Job successfully finishes execution, Job will record the number of successfully finished Pods.
  • When the number of successfully finished Pods reaches the specified number, Job will complete execution.

Job can ensure that the specified number of Pods complete execution.

The resource manifest for Job:

apiVersion: batch/v1 # Version number
kind: Job # Type
metadata: # Metadata
  name:  # Name
  namespace:  # Namespace
  labels: # Labels
    controller: job
spec: # Detailed description
  completions: 1 # Specifies the total number of times the Job needs to successfully run Pods, default is 1
  parallelism: 1 # Specifies the number of Pods that should run concurrently at any given time, default is 1
  activeDeadlineSeconds: 30 # Specifies the time limit for the Job to run; if it exceeds this time and has not finished, the system will attempt to terminate it
  backoffLimit: 6 # Specifies the number of retries after Job failure, default is 6
  manualSelector: true # Whether to use selector to select Pods, default is false
  selector: # Selector, specifies which Pods this controller manages
    matchLabels: # Labels matching rules
      app: counter-pod
    matchExpressions: # Expressions matching rules
      - key: app
        operator: In
        values:
          - counter-pod
  template: # Template, when the number of replicas is insufficient, Pods will be created based on this template
     metadata:
       labels:
         app: counter-pod
     spec:
       restartPolicy: Never # Restart policy can only be set to Never or OnFailure
       containers:
         - name: counter
           image: busybox:1.30
           command: ["/bin/sh","-c","for i in 9 8 7 6 5 4 3 2 1;do echo $i;sleep 20;done"]

Explanation of the restart policy in the template:

  • If set to OnFailure, the Job will restart the container when the Pod fails, rather than creating a new Pod, and the failed count remains unchanged.
  • If set to Never, the Job will create a new Pod when the Pod fails, and the failed Pod will not disappear or restart, incrementing the failed count by 1.
  • If set to Always, it means it will keep restarting, meaning the Pod task will be executed repeatedly, which conflicts with the definition of Job, so it cannot be set to Always.

CronJob (CJ)#

The CronJob controller manages Job controller resources, which in turn manage Pod resource objects. A Job runs its task immediately after the Job resource is created, whereas CronJob can control the execution time points and repetition in a way similar to scheduled (cron) tasks on Linux. In other words, CronJob can run job tasks repeatedly at specific points in time.

The resource manifest file for CronJob:

apiVersion: batch/v1beta1 # Version number
kind: CronJob # Type       
metadata: # Metadata
  name: # CronJob name 
  namespace: # Namespace 
  labels: # Labels
    controller: cronjob
spec: # Detailed description
  schedule: # Cron format job scheduling execution time point, used to control when the task is executed
  concurrencyPolicy: # Concurrency execution policy, used to define whether and how to run the next job when the previous job is still running
  failedJobsHistoryLimit: # Number of historical records to retain for failed job executions, default is 1
  successfulJobsHistoryLimit: # Number of historical records to retain for successful job executions, default is 3
  startingDeadlineSeconds: # Deadline (in seconds) for starting the job if its scheduled time is missed
  jobTemplate: # Job controller template, used to generate job objects for the cronjob controller; below is actually the definition of the job
    metadata:
    spec:
      completions: 1
      parallelism: 1
      activeDeadlineSeconds: 30
      backoffLimit: 6
      manualSelector: true
      selector:
        matchLabels:
          app: counter-pod
        matchExpressions: # Rules
          - {key: app, operator: In, values: [counter-pod]}
      template:
        metadata:
          labels:
            app: counter-pod
        spec:
          restartPolicy: Never 
          containers:
          - name: counter
            image: busybox:1.30
            command: ["/bin/sh","-c","for i in 9 8 7 6 5 4 3 2 1; do echo $i;sleep 20;done"]

Several options that need to be explained in detail:
schedule: Cron expression used to specify the execution time of the task
*/1 * * * *

Minute values range from 0 to 59.
Hour values range from 0 to 23.
Day values range from 1 to 31.
Month values range from 1 to 12.
Week values range from 0 to 6, where 0 represents Sunday.
Multiple times can be separated by commas; ranges can be given using hyphens; * can be used as a wildcard; / represents every...

concurrencyPolicy:
Allow: Allows Jobs to run concurrently (default)
Forbid: Prohibits concurrent running; if the previous run is not completed, the next run will be skipped.
Replace: Cancels the currently running job and replaces it with a new job.

StatefulSet (Stateful)#

Stateless applications:

  • Consider all Pods to be the same.
  • There are no order requirements.
  • There is no need to consider which Node the Pods run on.
  • Scaling and expanding can be done arbitrarily.

Stateful applications:

  • There are order requirements.
  • Each Pod is considered unique.
  • It is necessary to consider which Node the Pods run on.
  • Scaling and expanding must be done in order.
  • Each Pod must be independent, maintaining the startup order and uniqueness of Pods.

StatefulSet is a workload controller provided by Kubernetes for managing stateful applications.
StatefulSet deployment requires a Headless Service.

Why is a Headless Service needed?

  • With a Deployment, each Pod name is a random, unordered string, and Pods can be replaced arbitrarily. A StatefulSet, by contrast, requires ordered, stable Pod names: each Pod cannot be arbitrarily replaced, and a Pod keeps the same name after being rebuilt.
  • Since Pod IPs are variable, they are identified by Pod names. The Pod name is a unique identifier for the Pod and must be persistently stable and valid. This is where a Headless Service comes in, as it can give each Pod a unique name.

StatefulSet is commonly used to deploy RabbitMQ clusters, Zookeeper clusters, MySQL clusters, Eureka clusters, etc.

Demonstration example:

apiVersion: v1
kind: Service
metadata:
  name: service-headless
  namespace: dev
spec:
  selector:
    app: nginx-pod
  clusterIP: None # Set clusterIP to None to create a headless Service
  type: ClusterIP
  ports:
    - port: 80 # Service port
      targetPort: 80 # Pod port
...

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: pc-statefulset
  namespace: dev
spec:
  replicas: 3
  serviceName: service-headless
  selector:
    matchLabels:
      app: nginx-pod
  template:
    metadata:
      labels:
        app: nginx-pod
    spec:
      containers:
        - name: nginx
          image: nginx:1.17.1
          ports:
            - containerPort: 80

Service#

In Kubernetes, Pods are the carriers of applications, and we can access applications via the Pod's IP. However, the Pod's IP address is not fixed, which means it is inconvenient to directly access services using the Pod's IP.
To solve this problem, Kubernetes provides the Service resource, which aggregates multiple Pods that provide the same service and provides a unified entry address. By accessing the entry address of the Service, you can access the underlying Pod services.

In many cases, a Service is just a concept; the real work is done by the kube-proxy process that runs on every Node. When a Service is created, its information is written to etcd through the API Server; kube-proxy discovers Service changes through its watch mechanism and converts the latest Service information into the corresponding access rules.

Kube-proxy currently supports three working modes:

  • userspace mode:

    • In userspace mode, kube-proxy creates a listening port for each Service. Requests sent to the Cluster IP are redirected to the port listened to by kube-proxy via iptables rules. Kube-proxy selects a Pod providing the service based on the LB algorithm (load balancing algorithm) and establishes a connection to forward the request to the Pod.

    • In this mode, kube-proxy acts as a layer 4 load balancer. Since kube-proxy runs in userspace, data copying between the kernel and user space increases during forwarding processing, making it stable but very inefficient.

  • iptables mode:

    • In iptables mode, kube-proxy creates corresponding iptables rules for each Pod behind the Service, directly redirecting requests sent to the Cluster IP to a Pod's IP.

    • In this mode, kube-proxy does not act as a layer 4 load balancer; it only creates iptables rules. The advantage of this mode is that it is more efficient than userspace mode, but it cannot provide flexible LB strategies and cannot retry when backend Pods are unavailable.

  • ipvs mode:

    • The ipvs mode is similar to iptables; kube-proxy monitors Pod changes and creates corresponding ipvs rules. IPVS is more efficient in forwarding than iptables and supports more LB (load balancing) algorithms.

Service Types#

The resource manifest for Service:

apiVersion: v1 # Version
kind: Service # Type
metadata: # Metadata
  name: # Resource name
  namespace: # Namespace
spec:
  selector: # Label selector, used to determine which Pods the current Service proxies
    app: nginx
  type: NodePort # Type of Service, specifies the access method of the Service
  clusterIP: # Virtual service IP address
  sessionAffinity: # Session affinity, supports ClientIP and None options, default is None
  ports: # Port information
    - port: 8080 # Service port
      protocol: TCP # Protocol
      targetPort: # Pod port
      nodePort:  # Host port

Explanation of spec.type:

  • ClusterIP: The default value, which is a virtual IP automatically assigned by the Kubernetes system, can only be accessed within the cluster.
  • NodePort: Exposes the Service through a specified port on the Node, allowing access to the service from outside the cluster.
  • LoadBalancer: Uses an external load balancer to distribute load to the service; note that this mode requires support from external cloud environments.
  • ExternalName: Introduces external services into the cluster, accessed directly using this Service.

ClusterIP Type Service#

Endpoint (not commonly used)

  • Endpoint is a resource object in Kubernetes, stored in etcd, used to record the access addresses of all Pods corresponding to a service. It is generated based on the selector described in the service configuration file.
  • A service consists of a set of Pods, and these Pods are exposed through Endpoints. In other words, the connection between the service and Pods is implemented through Endpoints.

Load Balancing Strategy

Access to the Service is distributed to the backend Pods. Currently, Kubernetes provides two load balancing strategies:

  • If not defined, kube-proxy's default strategy is used, such as random or round-robin.
  • Session persistence mode based on client addresses, meaning that all requests initiated from the same client will be forwarded to a fixed Pod. This is friendly for traditional session-based authentication projects, and this mode can be added in spec with the sessionAffinity: ClientIP option.
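
A minimal sketch of a ClusterIP Service with session affinity enabled (the Pod label app: nginx-pod follows the earlier examples):

apiVersion: v1
kind: Service
metadata:
  name: service-clusterip
  namespace: dev
spec:
  selector:
    app: nginx-pod
  type: ClusterIP
  sessionAffinity: ClientIP # Forward requests from the same client IP to the same Pod
  ports:
    - port: 80 # Service port
      targetPort: 80 # Pod port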

Headless Type Service#

In some scenarios, developers may not want to use the load balancing functionality provided by Service and prefer to control the load balancing strategy themselves. For this situation, Kubernetes provides Headless Service, which does not allocate Cluster IP. If you want to access the Service, you can only query it through the Service's domain name.

NodePort Type Service#

In the previous examples, the IP address of the created Service could only be accessed within the cluster. If you want to expose the Service for external use, you need to use another type of Service called NodePort. NodePort works by mapping the Service's port to a port on each Node, so that the Service can be accessed from outside the cluster via NodeIP:NodePort.
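
A minimal NodePort Service sketch (the nodePort value is an assumption and must fall within the cluster's node port range, 30000-32767 by default):

apiVersion: v1
kind: Service
metadata:
  name: service-nodeport
  namespace: dev
spec:
  selector:
    app: nginx-pod
  type: NodePort # Expose the Service on a port of every Node
  ports:
    - port: 80 # Service port
      targetPort: 80 # Pod port
      nodePort: 30002 # Node port; if omitted, one is assigned automatically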

LoadBalancer Type Service#

The LoadBalancer type builds on NodePort: an external load balancer sits in front of the cluster nodes and distributes requests to the NodePorts of the nodes; note that this mode requires support from an external (usually cloud) environment.

ExternalName Type Service#

The ExternalName type Service is used to introduce external services into the cluster. It specifies an address of a service through the externalName attribute, allowing access to this Service from within the cluster to reach the external service.
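
A minimal ExternalName Service sketch (the external address is a placeholder):

apiVersion: v1
kind: Service
metadata:
  name: service-externalname
  namespace: dev
spec:
  type: ExternalName
  externalName: www.example.com # Address of the external service to introduce into the cluster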

Ingress#

We already know that the main ways to expose services outside the cluster are NodePort and LoadBalancer, but both of these methods have certain drawbacks:

  • The drawback of the NodePort method is that it occupies many ports on the cluster machines, which becomes increasingly evident as the number of cluster services increases.
  • The drawback of LoadBalancer is that each Service requires a LB, which is wasteful, cumbersome, and requires support from devices outside Kubernetes.

Based on this situation, Kubernetes provides the Ingress resource object, which can meet the need to expose multiple Services with just one NodePort or one LB. Its working mechanism is described below.

In fact, Ingress is equivalent to a layer 7 load balancer, an abstraction of reverse proxy in Kubernetes. Its working principle is similar to Nginx, which can be understood as Ingress establishing many mapping rules. The Ingress Controller listens for these configuration rules and converts them into Nginx reverse proxy configurations, then provides services externally.

  • Ingress: An object in Kubernetes that defines the rules for how requests are forwarded to Services.
  • Ingress Controller: A program that implements reverse proxy and load balancing, parses the rules defined by Ingress, and forwards requests according to the configured rules. There are many ways to implement this, such as Nginx, Contour, Haproxy, etc.

The working principle of Ingress (using Nginx) is as follows:

  • Users write Ingress rules, specifying which domain name corresponds to which Service in the Kubernetes cluster.
  • The Ingress controller dynamically perceives changes in Ingress service rules and generates a corresponding Nginx reverse proxy configuration.
  • The Ingress controller writes the generated Nginx configuration into a running Nginx service and updates it dynamically.
  • At this point, the actual work is done by Nginx, which is internally configured with the user-defined request rules.

Ingress supports HTTP and HTTPS proxy.
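
A hedged sketch of an HTTP Ingress rule (fields follow the networking.k8s.io/v1 schema; the host name, Service name, and ingress class are assumptions):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ingress-http
  namespace: dev
spec:
  ingressClassName: nginx # Which Ingress Controller handles these rules
  rules:
    - host: nginx.example.com # Requests for this domain name ...
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: nginx-service # ... are forwarded to this Service
                port:
                  number: 80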

Data Storage#

As mentioned earlier, the lifecycle of containers can be very short, being frequently created and destroyed. When a container is destroyed, the data stored in it will also be cleared. This outcome is undesirable for users in certain situations. To persistently save container data, Kubernetes introduces the concept of Volume.

A Volume is a shared directory in a Pod that can be accessed by multiple containers. It is defined on the Pod and then mounted to specific file directories by multiple containers within a Pod. Kubernetes uses Volumes to achieve data sharing between different containers in the same Pod and to persistently store data. The lifecycle of a Volume is not tied to the lifecycle of individual containers in the Pod; when a container terminates or restarts, the data in the Volume will not be lost.

Kubernetes supports various types of Volumes, with the following being the most common:

  • Basic storage: EmptyDir, HostPath, NFS
  • Advanced storage: PV, PVC
  • Configuration storage: ConfigMap, Secret

Basic Storage#

EmptyDir#

EmptyDir is the most basic type of Volume; an EmptyDir is an empty directory on the Host.

EmptyDir is created when the Pod is assigned to a Node, its initial content is empty, and there is no need to specify a corresponding directory file on the host, as Kubernetes will automatically allocate a directory. When the Pod is destroyed, the data in EmptyDir will also be permanently deleted. The uses of EmptyDir include:

  • Temporary space, such as for temporary directories required by certain applications during runtime, which do not need to be permanently retained.
  • A directory for one container to obtain data from another container (multi-container shared directory).

Next, let's use EmptyDir through a case of file sharing between containers.

In a Pod, prepare two containers, nginx and busybox, and declare a Volume mounted to the directories of both containers. The nginx container is responsible for writing logs to the Volume, while busybox reads the log content to the console.
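
A minimal sketch of that case (the mount paths and log file name are assumptions):

apiVersion: v1
kind: Pod
metadata:
  name: volume-emptydir
  namespace: dev
spec:
  containers:
    - name: nginx
      image: nginx:1.17.1
      ports:
        - containerPort: 80
      volumeMounts: # Mount logs-volume into the nginx container at /var/log/nginx
        - name: logs-volume
          mountPath: /var/log/nginx
    - name: busybox
      image: busybox:1.30
      command: ["/bin/sh", "-c", "tail -f /logs/access.log"] # Read the shared log and print it to the console
      volumeMounts: # Mount logs-volume into the busybox container at /logs
        - name: logs-volume
          mountPath: /logs
  volumes: # Declare the shared volume
    - name: logs-volume
      emptyDir: {}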

HostPath#

As mentioned in the previous section, data in EmptyDir will not be persisted; it will be destroyed with the Pod. If you want to simply persist data to the host, you can choose HostPath.

HostPath mounts an actual directory from the Node host into the Pod for container use. This design ensures that even if the Pod is destroyed, the data can still exist on the Node host.
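
A minimal HostPath sketch (the host directory is an assumption):

apiVersion: v1
kind: Pod
metadata:
  name: volume-hostpath
  namespace: dev
spec:
  containers:
    - name: nginx
      image: nginx:1.17.1
      volumeMounts:
        - name: logs-volume
          mountPath: /var/log/nginx
  volumes:
    - name: logs-volume
      hostPath:
        path: /root/logs # Directory on the Node host
        type: DirectoryOrCreate # Create the directory if it does not already exist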

NFS#

While HostPath can solve the problem of data persistence, if a Node fails and the Pod is moved to another Node, issues will arise. At this point, a separate network storage system is needed, with NFS and CIFS being commonly used.

NFS is a network file storage system; you can set up an NFS server and directly connect the storage in the Pod to the NFS system. This way, regardless of how the Pod moves between nodes, as long as the Node can connect to NFS, the data can be successfully accessed.
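
A minimal NFS volume sketch (the server address and exported path are assumptions; an NFS client must be installed on every Node):

apiVersion: v1
kind: Pod
metadata:
  name: volume-nfs
  namespace: dev
spec:
  containers:
    - name: nginx
      image: nginx:1.17.1
      volumeMounts:
        - name: logs-volume
          mountPath: /var/log/nginx
  volumes:
    - name: logs-volume
      nfs:
        server: 192.168.18.100 # NFS server address
        path: /root/data/nginx # Exported directory on the NFS server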

Advanced Storage#

Having learned to use NFS for storage, it requires users to set up the NFS system and configure it in YAML. Since Kubernetes supports many storage systems, it is unrealistic to expect users to master all of them. To abstract the underlying storage implementation details and facilitate user use, Kubernetes introduces two resource objects: PV and PVC.

  • PV (Persistent Volume) refers to a persistent volume, which is an abstraction of underlying shared storage. Generally, PV is created and configured by Kubernetes administrators and is related to specific shared storage technologies, interfacing with shared storage through plugins.
  • PVC (Persistent Volume Claim) refers to a persistent volume claim, which is a declaration of storage requirements by users. In other words, PVC is essentially a resource request made by users to the Kubernetes system.

With PV and PVC, the work can be further subdivided:

  • Storage: Maintained by storage engineers.
  • PV: Maintained by Kubernetes administrators.
  • PVC: Maintained by Kubernetes users.

PV#

PV is an abstraction of storage resources. Below is the resource manifest file:

apiVersion: v1  
kind: PersistentVolume
metadata:
  name: pv2
spec:
  nfs: # Storage type, corresponding to the underlying actual storage
  capacity:  # Storage capacity, currently only supports setting storage space
    storage: 2Gi
  accessModes:  # Access modes
  storageClassName: # Storage class
  persistentVolumeReclaimPolicy: # Reclamation policy

Key configuration parameters for PV:

  • Storage Type

    The type of underlying actual storage; Kubernetes supports various storage types, and the configuration for each type varies.

  • Storage Capacity (capacity)

    Currently only supports setting storage space (storage=1Gi), but may include configurations for IOPS, throughput, etc., in the future.

  • Access Modes (accessModes)

    Describes the access permissions for user applications to the storage resources. Access permissions include the following methods:

    • ReadWriteOnce (RWO): Read-write permission, but can only be mounted by a single node.
    • ReadOnlyMany (ROX): Read-only permission, can be mounted by multiple nodes.
    • ReadWriteMany (RWX): Read-write permission, can be mounted by multiple nodes.

    Note that different underlying storage types may support different access modes.

  • Reclamation Policy (persistentVolumeReclaimPolicy)

    The handling method for PV when it is no longer in use. Currently supports three policies:

    • Retain (keep): Retain data; requires manual cleanup by the administrator.
    • Recycle (recycle): Clear data in the PV, equivalent to executing rm -rf /thevolume/*.
    • Delete (delete): Perform deletion operations on the backend storage associated with the PV; this is common in cloud service providers' storage services.

    Note that different underlying storage types may support different reclamation policies.

  • Storage Class

    PV can specify a storage class through the storageClassName parameter.

    • PV with a specific class can only be bound to PVCs that request that class.
    • PVs without a specified class can only be bound to PVCs that do not request any class.
  • Status (status)

    A PV may be in one of four different stages during its lifecycle:

    • Available: Indicates available status; it has not yet been bound to any PVC.
    • Bound: Indicates that the PV has been bound to a PVC.
    • Released: Indicates that the PVC has been deleted, but the resource has not yet been reclaimed by the cluster.
    • Failed: Indicates that the automatic reclamation of the PV has failed.
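
A concrete sketch that fills in the skeleton above with an NFS backend (the server address and path are assumptions):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv1
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: 192.168.18.100 # NFS server address
    path: /root/data/pv1 # Exported directory backing this PV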

PVC#

PVC is a request for resources, used to declare requirements for storage space, access modes, and storage classes. Below is the resource manifest file:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc
  namespace: dev
spec:
  accessModes: # Access modes
  selector: # Use labels to select PVs
  storageClassName: # Storage class
  resources: # Request space
    requests:
      storage: 5Gi

Key configuration parameters for PVC:

  • Access Modes (accessModes)

    Describes the access permissions for user applications to the storage resources.

  • Selection Criteria (selector)

    Through the setting of Label Selector, PVC can filter existing PVs in the system.

  • Storage Class (storageClassName)

    PVC can specify the required backend storage class when defined; only PVs with that class can be selected by the system.

  • Resource Requests (resources)

    Describes the request for storage resources.
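
Putting these parameters together, a minimal PVC sketch that could bind to the NFS PV above looks roughly like this (the name pvc1 and the requested size are assumptions for illustration):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc1
  namespace: dev
spec:
  accessModes:
  - ReadWriteMany        # Must be satisfiable by the target PV
  resources:
    requests:
      storage: 1Gi       # Must not exceed the PV's capacity

Once created, kubectl get pvc -n dev shows whether the claim is Bound to a PV or still Pending.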

Lifecycle#

PVC and PV correspond one-to-one, and the interaction between PV and PVC follows the lifecycle below:

  • Resource Supply: Administrators manually create underlying storage and PV.

  • Resource Binding: Users create PVCs, and Kubernetes is responsible for finding PVs that meet the PVC's declaration and binding them.

    After users define PVCs, the system will select a PV that meets the storage resource request from the existing PVs.

    • Once found, the PV will be bound to the user-defined PVC, allowing the user's application to use this PVC.
    • If none are found, the PVC will remain in a Pending state indefinitely until a system administrator creates a PV that meets its requirements.

    Once a PV is bound to a PVC, it will be exclusively owned by that PVC and cannot be bound to other PVCs.

  • Resource Usage: Users can use PVC in Pods like Volumes.

    Pods mount the PVC to a path inside the container through a volume definition (see the sketch after this list).

  • Resource Release: Users delete PVCs to release PVs.

    When storage resources are no longer needed, users can delete PVCs. The PV bound to that PVC will be marked as "released," but it cannot be immediately bound to other PVCs. The data previously written by the PVC may still be left on the storage device, and only after clearing can the PV be reused.

  • Resource Reclamation: Kubernetes reclaims resources based on the reclamation policy set for PV.

    Administrators can set reclamation policies for PVs to determine how to handle leftover data after the PVC bound to them is released. Only after the storage space of the PV is reclaimed can it be bound and used by new PVCs.
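
For the resource-usage step above, a Pod references the PVC through a volume of type persistentVolumeClaim. A minimal sketch, assuming the pvc1 claim from the earlier example:

apiVersion: v1
kind: Pod
metadata:
  name: pod-pvc
  namespace: dev
spec:
  containers:
  - name: busybox
    image: busybox:1.30
    command: ["/bin/sh", "-c", "tail -f /dev/null"]
    volumeMounts:
    - name: volume
      mountPath: /root/        # Path inside the container backed by the PVC
  volumes:
  - name: volume
    persistentVolumeClaim:
      claimName: pvc1          # Assumed PVC name from the sketch above
      readOnly: false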

23

Configuration Storage#

ConfigMap#

ConfigMap is a special type of storage volume mainly used to store configuration information.

Create a configmap.yaml file with the following content:

apiVersion: v1
kind: ConfigMap
metadata:
  name: configmap
  namespace: dev
data:
  info: |
    username:admin
    password:123456

Next, create the ConfigMap using this configuration file:

# Create configmap
[root@k8s-master01 ~]# kubectl create -f configmap.yaml
configmap/configmap created

# View configmap details
[root@k8s-master01 ~]# kubectl describe cm configmap -n dev
Name:         configmap
Namespace:    dev
Labels:       <none>
Annotations:  <none>

Data
====
info:
----
username:admin
password:123456

Events:  <none>

Next, create a pod-configmap.yaml file to mount the created ConfigMap:

apiVersion: v1
kind: Pod
metadata:
  name: pod-configmap
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
    volumeMounts: # Mount the configmap to the directory
    - name: config
      mountPath: /configmap/config
  volumes: # Reference the configmap
  - name: config
    configMap:
      name: configmap
# Create pod
[root@k8s-master01 ~]# kubectl create -f pod-configmap.yaml
pod/pod-configmap created

# View pod
[root@k8s-master01 ~]# kubectl get pod pod-configmap -n dev
NAME            READY   STATUS    RESTARTS   AGE
pod-configmap   1/1     Running   0          6s

# Enter the container
[root@k8s-master01 ~]# kubectl exec -it pod-configmap -n dev /bin/sh
# cd /configmap/config/
# ls
info
# more info
username:admin
password:123456

# You can see that the mapping has been successful; each configmap is mapped as a directory.
# The key represents the file, and the value represents the content of the file.
# If the content of the configmap is updated at this point, the values in the container will also be updated dynamically.
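
Besides mounting it as a volume, a ConfigMap key can also be injected as an environment variable. Below is a minimal sketch using the configmap created above; note that, unlike volume mounts, environment variables are not refreshed when the ConfigMap changes:

apiVersion: v1
kind: Pod
metadata:
  name: pod-configmap-env
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
    env:
    - name: INFO               # Environment variable name inside the container (assumed)
      valueFrom:
        configMapKeyRef:
          name: configmap      # ConfigMap created above
          key: info            # Key whose value becomes the variable's value
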
Secret#

In Kubernetes, there is another object very similar to ConfigMap, called Secret. It is mainly used to store sensitive information, such as passwords, keys, certificates, etc.

First, encode the data using base64:

[root@k8s-master01 ~]# echo -n 'admin' | base64 # Prepare username
YWRtaW4=
[root@k8s-master01 ~]# echo -n '123456' | base64 # Prepare password
MTIzNDU2

Next, write a secret.yaml file and create the Secret:

apiVersion: v1
kind: Secret
metadata:
  name: secret
  namespace: dev
type: Opaque
data:
  username: YWRtaW4=
  password: MTIzNDU2
# Create secret
[root@k8s-master01 ~]# kubectl create -f secret.yaml
secret/secret created

# View secret details
[root@k8s-master01 ~]# kubectl describe secret secret -n dev
Name:         secret
Namespace:    dev
Labels:       <none>
Annotations:  <none>
Type:  Opaque
Data
====
password:  6 bytes
username:  5 bytes

Create a pod-secret.yaml file to mount the created Secret:

apiVersion: v1
kind: Pod
metadata:
  name: pod-secret
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
    volumeMounts: # Mount the secret to the directory
    - name: config
      mountPath: /secret/config
  volumes:
  - name: config
    secret:
      secretName: secret
# Create pod
[root@k8s-master01 ~]# kubectl create -f pod-secret.yaml
pod/pod-secret created

# View pod
[root@k8s-master01 ~]# kubectl get pod pod-secret -n dev
NAME            READY   STATUS    RESTARTS   AGE
pod-secret      1/1     Running   0          2m28s

# Enter the container and check the secret information, finding that it has been automatically decoded
[root@k8s-master01 ~]# kubectl exec -it pod-secret /bin/sh -n dev
/ # ls /secret/config/
password  username
/ # more /secret/config/username
admin
/ # more /secret/config/password
123456

Thus, we have used a Secret to store and mount sensitive information. Keep in mind that base64 is only an encoding, not encryption, so access to Secrets should still be restricted.
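
As an aside, a Secret can also be written with the stringData field, in which values are given in plain text and encoded to base64 by the API server automatically, avoiding the manual base64 step. A minimal sketch equivalent to the Secret above (the name secret-string is an assumption):

apiVersion: v1
kind: Secret
metadata:
  name: secret-string
  namespace: dev
type: Opaque
stringData:            # Plain-text values; stored base64-encoded by the API server
  username: admin
  password: "123456"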

Security Authentication#

Kubernetes, as a management tool for distributed clusters, considers ensuring the security of the cluster to be one of its important tasks. The so-called security is essentially ensuring authentication and authorization operations for various clients of Kubernetes.

Clients

In a Kubernetes cluster, there are generally two types of clients:

  • User Account: Generally, user accounts managed by services outside of Kubernetes.
  • Service Account: Accounts managed by Kubernetes, used to provide identity for service processes in Pods when accessing Kubernetes.

24

Authentication, Authorization, and Admission Control

ApiServer is the only entry point for accessing and managing resource objects. Any request to access ApiServer must go through the following three processes:

  • Authentication: Identity verification; only the correct account can pass authentication.
  • Authorization: Determines whether the user has permission to perform specific actions on the accessed resources.
  • Admission Control: Used to supplement the authorization mechanism to achieve more refined access control functions.

25

Authentication Management#

The key point of security in a Kubernetes cluster is how to identify and authenticate client identities. It provides three methods for client identity authentication:

  • HTTP Basic Authentication: Authentication via username + password.

    This authentication method encodes the "username:password" string using the BASE64 algorithm and sends it to the server in the Authorization field of the HTTP request header. The server decodes it upon receipt, retrieves the username and password, and then performs the user identity authentication process.

  • HTTP Token Authentication: Identifies legitimate users through a Token.

    This authentication method uses a long, hard-to-guess string, the Token, to represent the client's identity. Each Token corresponds to a username. When the client initiates an API call, it includes the Token in the HTTP Authorization header; the API Server compares the received Token with the tokens stored on the server and then performs user identity authentication (a kubeconfig sketch follows this list).

  • HTTPS Certificate Authentication: A two-way digital certificate authentication method based on CA root certificate signatures.

    This authentication method is the most secure but also the most cumbersome to operate.
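
For Token authentication, a client such as kubectl usually carries the token through its kubeconfig. A minimal sketch of the relevant user entry, where the user name and the token value are placeholders:

# Excerpt of a kubeconfig file (users section only)
apiVersion: v1
kind: Config
users:
- name: token-user             # Assumed entry name
  user:
    token: <paste-token-here>  # Sent as "Authorization: Bearer <token>" on each request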

26

HTTPS authentication generally consists of three processes:

  1. Certificate application and issuance.

    Both parties in HTTPS communication apply for certificates from a CA organization, which issues root certificates, server certificates, and private keys to the applicant.

  2. Mutual authentication between client and server.

(1) The client initiates a request to the server, and the server returns its certificate. The client verifies this certificate against the CA root certificate it holds; if the verification succeeds, it trusts the server and extracts the server's public key from the certificate.
(2) The client then sends its own certificate to the server. The server verifies it against the CA root certificate in the same way to confirm that the client is legitimate.

  3. Communication between the server and client.

After the server and client negotiate the encryption scheme, the client generates a random session key, encrypts it with the server's public key, and sends it to the server. The server decrypts it with its private key, and all subsequent communication between the two parties is encrypted with this random key.

Note: Kubernetes allows multiple authentication methods to be configured simultaneously; as long as any one method passes authentication, it is sufficient.

Authorization Management#

Authorization occurs after successful authentication. Once authentication is successful, Kubernetes will determine whether the user has permission to access the resources based on pre-defined authorization policies. This process is called authorization.

Each request sent to the ApiServer carries information about the user and resources: for example, the user sending the request, the request path, the request action, etc. Authorization compares this information with the authorization policies, and if it meets the policy, the authorization is considered successful; otherwise, an error is returned.

The API Server currently supports the following authorization policies:

  • AlwaysDeny: Denies all requests, generally used for testing.
  • AlwaysAllow: Allows all requests, equivalent to no authorization process (the default policy in Kubernetes).
  • ABAC: Attribute-Based Access Control, which uses user-defined authorization rules to match and control user requests.
  • Webhook: Authorizes users by calling an external REST service.
  • Node: A special mode used to control access to requests made by kubelet.
  • RBAC: Role-Based Access Control (the default option under kubeadm installation).
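
Which of these policies are in effect is controlled by the --authorization-mode flag of kube-apiserver. On a kubeadm cluster this is typically visible in the apiserver's static Pod manifest; an illustrative excerpt, not the full file:

# /etc/kubernetes/manifests/kube-apiserver.yaml (excerpt)
spec:
  containers:
  - command:
    - kube-apiserver
    - --authorization-mode=Node,RBAC   # Node and RBAC authorizers enabled together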

RBAC (Role-Based Access Control) is mainly about describing one thing: which objects are granted which permissions.

This involves the following concepts:

  • Objects: User, Groups, ServiceAccount.
  • Roles: A collection of actions (permissions) defined on resources.
  • Bindings: Binding the defined roles to users.

27

RBAC introduces four top-level resource objects:

  • Role, ClusterRole: Roles used to specify a set of permissions.
  • RoleBinding, ClusterRoleBinding: Role bindings used to assign roles (permissions) to objects.

Role, ClusterRole

A role is a collection of permissions. Permissions in RBAC are purely additive (whitelist style); there are no deny rules.

# Role can only authorize resources within the namespace and requires specifying the namespace
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  namespace: dev
  name: authorization-role
rules:
- apiGroups: [""]  # Supported API group list; "" empty string indicates core API group
  resources: ["pods"] # Supported resource object list
  verbs: ["get", "watch", "list"] # Allowed operations on resource objects
# ClusterRole can authorize resources at the cluster level, across namespaces, and non-resource types
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: authorization-clusterrole
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "watch", "list"]

The parameters under rules deserve a brief explanation:

  • apiGroups: Supported API group list, e.g.
    "", "apps", "autoscaling", "batch"
  • resources: Supported resource object list, e.g.
    "services", "endpoints", "pods", "secrets", "configmaps", "crontabs", "deployments", "jobs",
    "nodes", "rolebindings", "clusterroles", "daemonsets", "replicasets", "statefulsets",
    "horizontalpodautoscalers", "replicationcontrollers", "cronjobs"
  • verbs: List of operations allowed on the resource objects, e.g.
    "get", "list", "watch", "create", "update", "patch", "delete", "exec"

RoleBinding, ClusterRoleBinding

Role bindings are used to bind a role to a target object. The binding target can be User, Group, or ServiceAccount.

# RoleBinding can bind subjects within the same namespace to a specific Role, granting that subject the permissions defined by that Role
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: authorization-role-binding
  namespace: dev
subjects:
- kind: User
  name: heima
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: authorization-role
  apiGroup: rbac.authorization.k8s.io
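
A ClusterRoleBinding works the same way but at cluster scope. A minimal sketch that binds the user heima to the authorization-clusterrole defined earlier (the binding name is an assumption):

# ClusterRoleBinding binds a subject to a ClusterRole across the whole cluster
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: authorization-clusterrole-binding
subjects:
- kind: User
  name: heima
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: authorization-clusterrole
  apiGroup: rbac.authorization.k8s.io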