From 4cfafeaa07b5f18272906c1655eeeb06676c02e1 Mon Sep 17 00:00:00 2001
From: Dalton Hubble
Date: Thu, 18 Jun 2020 23:08:01 -0700
Subject: [PATCH] Fix Kubelet starting before hostname set on FCOS AWS

* Fedora CoreOS `kubelet.service` can start before the hostname
is set. Kubelet reads the hostname to determine the node name to
register. If the hostname was read as localhost, Kubelet keeps
trying to register as localhost, which is the problem
* This race manifests as a node that appears NotReady: the Kubelet
is trying to register as localhost, while the host itself (by then)
has an AWS-provided hostname. Restarting kubelet.service is a
manual fix so Kubelet re-reads the hostname
* This race could only be reproduced on AWS, not on Google Cloud or
Azure despite attempts. Bare-metal and DigitalOcean differ and
use hostname-override (e.g. afterburn) so they're not affected
* Wait for nodes to have a non-localhost hostname in the oneshot
that awaits /etc/resolv.conf. Typhoon has no valid cases for a
node hostname being localhost (not even single-node clusters)

Related OpenShift: https://github.com/openshift/machine-config-operator/pull/1813
Close https://github.com/poseidon/typhoon/issues/765
---
 CHANGES.md                                           | 5 +++++
 aws/fedora-coreos/kubernetes/fcc/controller.yaml     | 3 ++-
 aws/fedora-coreos/kubernetes/workers/fcc/worker.yaml | 3 ++-
 3 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/CHANGES.md b/CHANGES.md
index 731e85e5..f35e44fe 100644
--- a/CHANGES.md
+++ b/CHANGES.md
@@ -27,6 +27,11 @@ Notable changes between versions.
 
 * Use `strict` Fedora CoreOS Config (FCC) snippet parsing ([#755](https://github.com/poseidon/typhoon/pull/755))
 
+#### AWS
+
+* Fix Kubelet service race with hostname update ([#766](https://github.com/poseidon/typhoon/pull/766))
+  * Wait for a hostname to avoid Kubelet trying to register as `localhost`
+
 ### Flatcar Linux
 
 * Use `strict` Container Linux Config (CLC) snippet parsing ([#755](https://github.com/poseidon/typhoon/pull/755))
diff --git a/aws/fedora-coreos/kubernetes/fcc/controller.yaml b/aws/fedora-coreos/kubernetes/fcc/controller.yaml
index 5e0119b2..2bc88e1e 100644
--- a/aws/fedora-coreos/kubernetes/fcc/controller.yaml
+++ b/aws/fedora-coreos/kubernetes/fcc/controller.yaml
@@ -38,11 +38,12 @@ systemd:
       enabled: true
       contents: |
         [Unit]
-        Description=Wait for DNS entries
+        Description=Wait for DNS and hostname
         Before=kubelet.service
         [Service]
         Type=oneshot
         RemainAfterExit=true
+        ExecStartPre=/bin/sh -c 'while [ `hostname -s` == "localhost" ]; do sleep 1; done;'
         ExecStart=/bin/sh -c 'while ! /usr/bin/grep '^[^#[:space:]]' /etc/resolv.conf > /dev/null; do sleep 1; done'
         [Install]
         RequiredBy=kubelet.service
diff --git a/aws/fedora-coreos/kubernetes/workers/fcc/worker.yaml b/aws/fedora-coreos/kubernetes/workers/fcc/worker.yaml
index 3febec95..a5645dea 100644
--- a/aws/fedora-coreos/kubernetes/workers/fcc/worker.yaml
+++ b/aws/fedora-coreos/kubernetes/workers/fcc/worker.yaml
@@ -9,11 +9,12 @@ systemd:
       enabled: true
       contents: |
         [Unit]
-        Description=Wait for DNS entries
+        Description=Wait for DNS and hostname
         Before=kubelet.service
         [Service]
         Type=oneshot
         RemainAfterExit=true
+        ExecStartPre=/bin/sh -c 'while [ `hostname -s` == "localhost" ]; do sleep 1; done;'
         ExecStart=/bin/sh -c 'while ! /usr/bin/grep '^[^#[:space:]]' /etc/resolv.conf > /dev/null; do sleep 1; done'
         [Install]
         RequiredBy=kubelet.service
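
The hostname wait added in `ExecStartPre` can be exercised outside systemd. The following is a minimal sketch, not part of the patch: `wait_for_hostname`, `HOSTNAME_CMD`, `SLEEP_SECONDS`, and the attempt limit are hypothetical names introduced for illustration. Unlike the unit, which loops until the hostname changes, this variant gives up after a bounded number of attempts, and it uses the POSIX `=` comparison instead of `==`.

```shell
#!/bin/sh
# Sketch only: poll until the short hostname is no longer "localhost".
# HOSTNAME_CMD, SLEEP_SECONDS and the attempt limit are illustrative
# assumptions; the patch itself loops without a limit.
wait_for_hostname() {
  max_attempts="${1:-60}"
  attempt=0
  # HOSTNAME_CMD is left unquoted on purpose so "hostname -s" splits into words.
  while [ "$(${HOSTNAME_CMD:-hostname -s})" = "localhost" ]; do
    attempt=$((attempt + 1))
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "hostname still localhost after $max_attempts attempts" >&2
      return 1
    fi
    sleep "${SLEEP_SECONDS:-1}"
  done
  return 0
}
```

Calling `wait_for_hostname 60` returns 0 as soon as `hostname -s` reports something other than `localhost`, and returns 1 if it is still `localhost` after 60 checks. Overriding `HOSTNAME_CMD` (for example with `echo localhost`) allows a dry run on a machine whose hostname is already set.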