Dynamic Kubelet Configuration

Table of Contents

Kubelet, at launch time, loads configuration files from pre-specified files. Changed configurations are not applied into the running Kubelet process during runtime, hence manual restarting Kubelet is required after modification.

Dynamic Kubelet configuration eliminates this burden, making Kubelet monitors its configuration changes and restarts when it is updated¹. It uses Kubernetes a ConfigMap object.

Kubelet Flags for Dynamic Configuration #

Dynamic kubelet configuration is not enabled by default. To be specific, one of required configurations is missing; the following flags for Kubelet are required for dynamic configuration:

--feature-gates="DynamicKubeletConfig=true": enabled by default since Kubernetes 1.11 ², so nothing should be done manually.
--dynamic-config-dir=<path>: this flag for Kubelet is required to enable dynamic kubelet configuration.

Why `--dynamic-config-dir` flag is required? #

Kubelet reconfigure diagram. Source: Kubernetes: Dynamic Kubelet Configuration

The figure above illustrates how Kubelet uses dynamic Kubelet reconfiguration. At step 4~5, Kubelet “checkpoints” the modified configmap from the apiserver to its local storage, and loads the data after restarted. Here, the checkpointed configmap should be stored at somewhere, and --dynamic-config-dir is the one. When you specify --dynamic-config-dir, Kubelet automatically initializes itself with config checkpoint:

Aug 21 18:54:30 localhost.worker kubelet[107800]: I0821 18:54:30.679461  107800 watch.go:128] kubelet config controller: assigned ConfigMap was updated
Aug 21 18:54:34 localhost.worker kubelet[107800]: I0821 18:54:34.776572  107800 download.go:181] kubelet config controller: checking in-memory store for /api/v1/namespaces/kube-system/configmaps/k8s-worker16
Aug 21 18:54:34 localhost.worker kubelet[107800]: I0821 18:54:34.776605  107800 download.go:187] kubelet config controller: found /api/v1/namespaces/kube-system/configmaps/k8s-worker16 in in-memory store, UID: 5808539f-48e3-4e47-8ce6-c261c48fa1d9, ResourceVersion: 437429
Aug 21 18:54:34 localhost.worker kubelet[107800]: I0821 18:54:34.806121  107800 configsync.go:205] kubelet config controller: Kubelet restarting to use /api/v1/namespaces/kube-system/configmaps/k8s-worker16, UID: 5808539f-48e3-4e47-8ce6-c261c48fa1d9, ResourceVersion: 437429, KubeletConfigKey: kubelet
Aug 21 18:54:34 localhost.worker systemd[1]: kubelet.service: Succeeded.
Aug 21 18:54:45 localhost.worker systemd[1]: kubelet.service: Scheduled restart job, restart counter is at 19.
Aug 21 18:54:45 localhost.worker systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
Aug 21 18:54:45 localhost.worker systemd[1]: Started kubelet: The Kubernetes Node Agent.
Aug 21 18:54:45 localhost.worker kubelet[108430]: I0821 18:54:45.115977  108430 controller.go:101] kubelet config controller: starting controller
Aug 21 18:54:45 localhost.worker kubelet[108430]: I0821 18:54:45.116038  108430 controller.go:267] kubelet config controller: ensuring filesystem is set up correctly
Aug 21 18:54:45 localhost.worker kubelet[108430]: I0821 18:54:45.116045  108430 fsstore.go:59] kubelet config controller: initializing config checkpoints directory "/var/lib/kubelet/dynamic-config/store"
...

Dynamic Configuration Setup #

Dynamic Kubelet Configuration uses ConfigMap objects. Therefore, a ConfigMap object for Kubelet configurations and related authorized access permission is required.

Creating a ConfigMap #

Dynamic Kubelet onfiguration cannot be used with the existing configurations given by flags and --config file. In other words, even if a configuration is set by the flags or --config file, this configurations are not enabled and only Dynamic configurations are applied. For this reason, ¹ recommends to start from the current configuration.

$ kubectl proxy --port=8001 &
$ NODE_NAME="the-name-of-the-node-you-are-reconfiguring"; curl -sSL "http://localhost:8001/api/v1/nodes/${NODE_NAME}/proxy/configz" | jq '.kubeletconfig|.kind="KubeletConfiguration"|.apiVersion="kubelet.config.k8s.io/v1beta1"' > kubelet_configz_${NODE_NAME}
$ kubectl (-n kube-system) create configmap node-config --from-file=kubelet=kubelet_configz_${NODE_NAME}

It only works at control plane nodes, and currently I could not figure out which permission is required for this operation.

I found that the configmap does not need to be the namespace kube-system. Also works if it is stored in the default namespace.

Adding more Configurations into the ConfigMap #

You should see your ConfigMap object with the following command:

$ kubectl get configmaps (-n NAMESPACE)

Now you can edit the ConfigMap object with edit subcommand:

$ kubectl edit configmaps node-config (-n NAMESPACE)

The format of the ConfigMap object looks like this:

apiVersion: v1
kind: ConfigMap
metadta: <not important>
name: node-config
namespace: default
data:
    kubelet: |     # <- Do not omit '|'.
        {
            "kind": "KubeletConfiguration",
            "apiVersion": "kubelet.config.k8s.io/v1beta1",
            <other Kubelet configurations in JSON format>
        }

The ConfigMap object itself follows YAML format, but its content in data.kubelet foolows JSON format. Do not be confused. Available Kubelet configuration key/value can be investigated in [here]

In my case, I want to try to restrict allocatable resources via dynamic Kubernetes configuration. I added several Kubelet flags in the ConfigMap object:

<omitted>
data:
    kubelet: |
        {
            "kind": "KubeletConfiguration",
            "apiVersion": "kubelet.config.k8s.io/v1beta1",
            <other default configurations are omitted>,
            "systemReserved": {
                "cpu": "2",
                "memory": "2Gi"
            },
            "kubeReserved": {
                "cpu": "1",
                "memory": "1Gi"
            },
            "systemReservedCgroup": "/system.slice",  <-- not necessary
            "kubeReservedCgroup": "/kubepods.slice",  <-- not necessary
            "cgroupPerQOS": true,
            "enforceNodeAllocatable": ["pods"],
        }

Note that systemReservedCgroup and kubeReservedCgroup are not necessary, since I do not put "system-reserved" and "kube-reserved" in enforceNodeAllocatable.

Note that cgroupPerQOS should be true for resource restriction on Kubernetes Pods, and enforceNodeAllocatable must not contain neither system-reserved nore kube-reserved string elements, otherwise you will see the following wierd error in the Kubelet log. I am not sure whether it is a bug.
Aug 26 16:55:47 localhost.worker kubelet[388430]: F0826 16:55:47.291827  388430 kubelet.go:1384] Failed to start ContainerManager Failed to enforce System Reserved Cgroup Limits on "/system.slice": ["system"] cgroup does not exist      

You can see Kubelet method that creates /kubepods.slice cgroup in [here], or as follows.

// enforceNodeAllocatableCgroups enforce Node Allocatable Cgroup settings.
func (cm *containerManagerImpl) enforceNodeAllocatableCgroups() error {
    nc := cm.NodeConfig.NodeAllocatableConfig

    // We need to update limits on node allocatable cgroup no matter what because
    // default cpu shares on cgroups are low and can cause cpu starvation.
    nodeAllocatable := cm.internalCapacity
    // Use Node Allocatable limits instead of capacity if the user requested enforcing node allocatable.
    if cm.CgroupsPerQOS && nc.EnforceNodeAllocatable.Has(kubetypes.NodeAllocatableEnforcementKey) {
        nodeAllocatable = cm.getNodeAllocatableInternalAbsolute()
    }

    klog.V(4).Infof("Attempting to enforce Node Allocatable with config: %+v", nc)

    cgroupConfig := &CgroupConfig{
        Name:               cm.cgroupRoot,
        ResourceParameters: getCgroupConfig(nodeAllocatable),
    }

    // Using ObjectReference for events as the node maybe not cached; refer to #42701 for detail.
    nodeRef := &v1.ObjectReference{
        Kind:      "Node",
        Name:      cm.nodeInfo.Name,
        UID:       types.UID(cm.nodeInfo.Name),
        Namespace: "",
    }

    // If Node Allocatable is enforced on a node that has not been drained or is updated on an existing node to a lower value,
    // existing memory usage across pods might be higher than current Node Allocatable Memory Limits.
    // Pod Evictions are expected to bring down memory usage to below Node Allocatable limits.
    // Until evictions happen retry cgroup updates.
    // Update limits on non root cgroup-root to be safe since the default limits for CPU can be too low.
    // Check if cgroupRoot is set to a non-empty value (empty would be the root container)
    if len(cm.cgroupRoot) > 0 {
        go func() {
            for {
                err := cm.cgroupManager.Update(cgroupConfig)
                if err == nil {
                    cm.recorder.Event(nodeRef, v1.EventTypeNormal, events.SuccessfulNodeAllocatableEnforcement, "Updated Node Allocatable limit across pods")
                    return
                }
                message := fmt.Sprintf("Failed to update Node Allocatable Limits %q: %v", cm.cgroupRoot, err)
                cm.recorder.Event(nodeRef, v1.EventTypeWarning, events.FailedNodeAllocatableEnforcement, message)
                time.Sleep(time.Minute)
            }
        }()
    }

    ...
}

If cm.cgroupManager.Update(cgroupConfig) returns nil, the following event is logged in the target node:

$ kubectl describe nodes k8s-worker16
...
  Normal   KubeletConfigChanged              33m   kubelet, k8s-worker16  Kubelet restarting to use /api/v1/namespaces/default/configmaps/node-config, UID: a8b4da9d-0b6e-478d-9e35-4a3696f5778c, ResourceVersion: 1464790, KubeletConfigKey: kubelet
  Normal   Starting                          32m   kubelet, k8s-worker16  Starting kubelet.
  Normal   NodeAllocatableEnforced           32m   kubelet, k8s-worker16  Updated Node Allocatable limit across pods

Specifying Node to use the ConfigMap #

One more thing to do is to add spec to Kubernetes Node object to indicate which ConfigMap object should be used for dynamic Kubelet configuration for which node.

Oen an editor with the following command to specify it:

$ kubectl edit node ${NODE_NAME}

You have nothing in spec section if it is the first time to edit it. Add configSource element as follows (note that it is YAML format):

spec:
    configSource:
        kubeletConfigKey: kubelet
        name: node-config
        namespace: default

each of which key means:

kubeletConfigKey: the key in data dictionary in the ConfigMap object. In our case, we specified it as kubelet (kubelet: | { ... }).
name: name of the ConfigMap object.
namespace: namespace that the ConfigMap object is in.

Once you specify the spec, the Kubelet in the target node will automatically restart to apply dynamic configuration.

Verifying Dynamic Kubelet Configuration #

Check the result with $ kubectl get nodes ${NODE_NAME} -o yaml.

First, check whether the spec is successfully applied:

managedFields:
    - apiVersion: v1
    - fieldsType: FieldsV1
    fieldsV1:
        f:spec:
            f:configSource:
            ...
spec:
    configSource:
        kubeletConfigKey: kubelet
        name: node-config
        namespace: default

Second, the Kubelet updates its status whether the configuration is successfully applied or not:

status:
    addresses: <omitted>
    allocatable: <omitted>
    capacity: <omitted>
    conditions: <omitted>
        active:              <-- must exist
            configMap:
                kubeletConfigKey: kubelet
                name: node-config 
                namespace: default
                resourceVersion: "1024519"
                uid: a8b4da9d-0b6e-478d-9e35-4a3696f5778c
        assigned:
            configMap:
                kubeletConfigKey: kubelet
                name: node-config 
                namespace: default
                resourceVersion: "1024519"
                uid: a8b4da9d-0b6e-478d-9e35-4a3696f5778c
        lastKnownGood:
            configMap:
                kubeletConfigKey: kubelet
                name: node-config 
                namespace: default
                resourceVersion: "1024519"
                uid: a8b4da9d-0b6e-478d-9e35-4a3696f5778c

You must check that status.conditions.active fields exist, and there is no error field in status.conditions.assigned. If an error occured when applying the specified configuration, the Kubelet restarts with status.conditions.lastKnownGood config hence status.conditions.active.configMap.resourceVersion may different from that in status.conditions.assigned.

If an error occured, a systemd journal log indicates what would be an error:

## This error is logged when you missed data.kubelet.kind.
Aug 24 13:16:04 localhost.worker kubelet[278409]: E0824 13:16:04.368361  278409 controller.go:151] kubelet config controller: failed to load config, see Kubelet log for details, error: failed to decode: Object 'Kind' is missing in '{

Also, my configuration contains resource restriction; it is applied as intended:

$ kubectl get nodes ${NODE_NAME} -o yaml
status:
    allocatable:    
        cpu: "1"
        ephemeral-storage: "66100978374"
        hugepages-1Gi: "0"
        hugepages-2Mi: "0"
        memory: 4775280Ki
        pods: "110"
    capacity:
        cpu: "4"
        ephemeral-storage: 71724152Ki
        hugepages-1Gi: "0"
        hugepages-2Mi: "0"
        memory: 8023408Ki
        pods: "110"

In my configuration, I restricted CPU and memory resource, so be sure to check whether it actually is applied to Linux cgroup:

/sys/fs/cgroup/memory/kubepods.slice $ cat memory.limit_in_bytes
4994744320 // Similar to allocatable memory 4775280Ki

Authentication and Authorization #

During the setup, the Kubelet queries a ConfigMap object from the apiserver, which requires authentication. According to ¹, the node authorizer now automatically configures an RBAC rules so that there is nothing to do.

Previously, you were required to manually create RBAC rules to allow Nodes to access their assigned ConfigMaps. The Node Authorizer now automatically configures these rules.

Node authorizor means that default authorization mode contains Node ^[node-auth]: SystemReservedEnforcementKey

ps -ef | grep kube-apiserver
root       91113   91100  2 Aug19 ?        02:29:37 kube-apiserver ... --authorization-mode=Node,RBAC

so that any users in the system:nodes group (username system:node:<nodeName>) are allowed to:

read services, endpoints, nodes, pods, secrets, configmaps, and persistent volumes (bound to the Kubelet’s node)
write nodes (+status), pods (+status), and events.

Dynamic Kubelet configuration requires permission for (1) read configmaps, and (2) write node status.

Note that getting an object and listing objects are different. kubectl get configmaps node-config should be allowed, while kubectl get configmaps (get all configmaps, which means listing configmaps) are still denied.