Fix conflict between Calico and NetworkManager

* Observed frequent kube-scheduler and controller-manager restarts with Calico as the CNI provider. The root cause was unclear since the control plane was functional and pod-to-pod network connectivity tests passed
* Root cause: Calico sets up cali* and tunl* network interfaces for containers on hosts. NetworkManager tries to manage these interfaces and periodically disconnected veth pairs. Logs did not surface the issue since it's not an error per se, just Calico and NetworkManager dueling for control. Kubernetes correctly restarted pods failing health checks and ensured 2 replicas were running, so the control plane functioned mostly normally. Pod-to-pod connectivity was only affected occasionally, which made this painful to debug
* Solution: Configure NetworkManager to ignore the Calico interfaces per Calico's recommendation. Cloud-init writes files after NetworkManager starts, so a restart is required on first boot. On subsequent boots, the file is already present so no restart is needed
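
To make the effect of the `unmanaged-devices` setting concrete, here is a minimal sketch of which interface names the two patterns cover. It approximates NetworkManager's glob matching with Python's `fnmatch`; it is not NetworkManager's own parser, and the helper names `parse_patterns` and `is_unmanaged` as well as the sample interface names are made up for illustration.

```python
from fnmatch import fnmatchcase

# Value copied from the typhoon.conf file added by this commit.
UNMANAGED_DEVICES = "interface-name:cali*;interface-name:tunl*"

PREFIX = "interface-name:"


def parse_patterns(spec):
    """Extract the interface-name globs from an unmanaged-devices value."""
    patterns = []
    for entry in spec.split(";"):
        entry = entry.strip()
        if entry.startswith(PREFIX):
            patterns.append(entry[len(PREFIX):])
    return patterns


def is_unmanaged(ifname, spec=UNMANAGED_DEVICES):
    """Return True if ifname matches any glob (approximation via fnmatch)."""
    return any(fnmatchcase(ifname, pattern) for pattern in parse_patterns(spec))


if __name__ == "__main__":
    # Hypothetical interface names, chosen only to exercise both patterns.
    for name in ["cali0123456789a", "tunl0", "eth0", "docker0"]:
        state = "unmanaged" if is_unmanaged(name) else "managed"
        print(f"{name}: {state}")
```

On a running host, `nmcli device status` should report the Calico interfaces as unmanaged once NetworkManager has been restarted with the new configuration.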
2025-09-09 13:58:27 +02:00 · 2018-04-25 21:33:26 -07:00
parent 0a7fab56e2
commit 567e18f015
6 changed files with 42 additions and 0 deletions
--- a/google-cloud/fedora-atomic/kubernetes/cloudinit/controller.yaml.tmpl
+++ b/google-cloud/fedora-atomic/kubernetes/cloudinit/controller.yaml.tmpl
@@ -72,6 +72,12 @@ write_files:
    content: |
      ${kubeconfig}
  - path: /var/lib/bootkube/.keep
+  - path: /etc/NetworkManager/conf.d/typhoon.conf
+    content: |
+      [main]
+      plugins=keyfile
+      [keyfile]
+      unmanaged-devices=interface-name:cali*;interface-name:tunl*
  - path: /etc/selinux/config
    owner: root:root
    permissions: '0644'
@@ -85,6 +91,7 @@ bootcmd:
  - [modprobe, ip_vs]
 runcmd:
  - [systemctl, daemon-reload]
+  - [systemctl, restart, NetworkManager]
  - "atomic install --system --name=etcd quay.io/poseidon/etcd:v3.3.3"
  - "atomic install --system --name=kubelet quay.io/poseidon/kubelet:v1.10.1"
  - "atomic install --system --name=bootkube quay.io/poseidon/bootkube:v0.12.0"