typhoon

mirror of https://github.com/puppetmaster/typhoon.git synced 2025-10-05 10:34:38 +02:00

Author	SHA1	Message	Date
Dalton Hubble	430d139a5b	Remove os_image variable on Google Cloud Fedora CoreOS * In v1.18.3, the `os_stream` variable was added to select a Fedora CoreOS image stream (stable, testing, next) on AWS and Google Cloud (which publish official streams) * Remove `os_image` variable deprecated in v1.18.3. Manually uploaded images are no longer needed	2020-06-29 22:57:11 -07:00
Dalton Hubble	7bce15975c	Update Kubernetes from v1.18.4 to v1.18.5 * https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.18.md#v1185	2020-06-27 13:52:18 -07:00
Dalton Hubble	1f83ae7dbb	Update Calico from v3.14.1 to v3.15.0 * https://docs.projectcalico.org/v3.15/release-notes/	2020-06-26 02:40:12 -07:00
Dalton Hubble	d27f367004	Update Cilium from v1.8.0-rc4 to v1.8.0 * https://github.com/cilium/cilium/releases/tag/v1.8.0	2020-06-22 22:26:49 -07:00
Dalton Hubble	e9c8520359	Add experimental Cilium CNI provider * Accept experimental CNI `networking` mode "cilium" * Run Cilium v1.8.0-rc4 with overlay vxlan tunnels and a minimal set of features. We're interested in: * IPAM: Divide pod_cidr into /24 subnets per node * CNI networking pod-to-pod, pod-to-external * BPF masquerade * NetworkPolicy as defined by Kubernetes (no L7 Policy) * Continue using kube-proxy with Cilium probe mode * Firewall changes: * Require UDP 8472 for vxlan (Linux kernel default) between nodes * Optional ICMP echo(8) between nodes for host reachability (health) * Optional TCP 4240 between nodes for endpoint reachability (health) Known Issues: * Containers with `hostPort` don't listen on all host addresses, these workloads must use `hostNetwork` for now https://github.com/cilium/cilium/issues/12116 * Erroneous warning on Fedora CoreOS https://github.com/cilium/cilium/issues/10256 Note: This is experimental. It is not listed in docs and may be changed or removed without a deprecation notice Related: * https://github.com/poseidon/terraform-render-bootstrap/pull/192 * https://github.com/cilium/cilium/issues/12217	2020-06-21 20:41:53 -07:00
Dalton Hubble	90e23f5822	Rename controller node label and NoSchedule taint * Remove node label `node.kubernetes.io/master` from controller nodes * Use `node.kubernetes.io/controller` (present since v1.9.5, [#160](https://github.com/poseidon/typhoon/pull/160)) to node select controllers * Rename controller NoSchedule taint from `node-role.kubernetes.io/master` to `node-role.kubernetes.io/controller` * Tolerate the new taint name for workloads that may run on controller nodes and stop tolerating `node-role.kubernetes.io/master` taint	2020-06-19 00:12:13 -07:00
Dalton Hubble	c25c59058c	Update Kubernetes from v1.18.3 to v1.18.4 * https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.18.md#v1184	2020-06-17 19:53:19 -07:00
Dalton Hubble	413585681b	Remove unused Kubelet lock-file and exit-on-lock-contention * Kubelet `--lock-file` and `--exit-on-lock-contention` date back to usage of bootkube and at one point running Kubelet in a "self-hosted" style whereby an on-host Kubelet (rkt) started pods, but then a Kubelet DaemonSet was scheduled and able to take over (hence self-hosted). `lock-file` and `exit-on-lock-contention` flags supported this pivot. The pattern has been out of favor (in bootkube too) for years because of dueling Kubelet complexity * Typhoon runs Kubelet as a container via an on-host systemd unit using podman (Fedora CoreOS) or rkt (Flatcar Linux). In fact, Typhoon no longer uses bootkube or control plane pivot (let alone Kubelet pivot) and uses static pods since v1.16.0 * https://github.com/poseidon/typhoon/pull/536	2020-06-12 00:06:41 -07:00
Dalton Hubble	96711d7f17	Remove unused Kubelet cert / key Terraform state * Generated Kubelet TLS certificate and key are no longer used or distributed to machines since Kubelet TLS bootstrap is used instead. Remove the certificate and key from state	2020-06-11 21:24:36 -07:00
Dalton Hubble	a287920169	Use strict mode for Container Linux Configs * Enable terraform-provider-ct `strict` mode for parsing Container Linux Configs and snippets * Fix Container Linux Config systemd unit syntax `enable` (old) to `enabled` * Align with Fedora CoreOS which uses strict mode already	2020-06-09 23:00:36 -07:00
Dalton Hubble	20bfd69780	Change Kubelet container image publishing * Build Kubelet container images internally and publish to Quay and Dockerhub (new) as an alternative in case of registry outage or breach * Use our infra to provide single and multi-arch (default) Kubelet images for possible future use * Docs: Show how to use alternative Kubelet images via snippets and a systemd dropin (builds on #737) Changes: * Update docs with changes to Kubelet image building * If you prefer to trust images built by Quay/Dockerhub, automated image builds are still available with unique tags (albeit with some limitations): * Quay automated builds are tagged `build-{short_sha}` (limit: only amd64) * Dockerhub automated builds are tagged `build-{tag}` and `build-master` (limit: only amd64, no shas) Links: * Kubelet: https://github.com/poseidon/kubelet * Docs: https://typhoon.psdn.io/topics/security/#container-images * Registries: * quay.io/poseidon/kubelet * docker.io/psdn/kubelet	2020-05-30 23:34:23 -07:00
Dalton Hubble	ba44408b76	Update Calico from v3.14.0 to v3.14.1 * https://docs.projectcalico.org/v3.14/release-notes/	2020-05-30 22:08:37 -07:00
Dalton Hubble	283e14f3e0	Update recommended Terraform provider versions * Sync Terraform provider plugin versions to those actively used internally * Fix terraform fmt	2020-05-22 01:12:53 -07:00
Dalton Hubble	e72f916c8d	Update etcd from v3.4.8 to v3.4.9 * https://github.com/etcd-io/etcd/blob/master/CHANGELOG-3.4.md#v349-2020-05-20	2020-05-22 00:52:20 -07:00
Dalton Hubble	ecae6679ff	Update Kubernetes from v1.18.2 to v1.18.3 * https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.18.md	2020-05-20 20:37:39 -07:00
Dalton Hubble	4760543356	Set Kubelet image via kubelet.service KUBELET_IMAGE * Write the systemd kubelet.service to use `KUBELET_IMAGE` as the Kubelet. This provides a nice way to use systemd dropins to temporarily override the image (e.g. during a registry outage) Note: Only Typhoon Kubelet images and registries are supported.	2020-05-19 22:39:53 -07:00
Dalton Hubble	09eb208b4e	Fix Fedora CoreOS on GCP proposing controller recreate * With Fedora CoreOS image stream support (#727), the latest resolved image will change over the lifecycle of a cluster. * Fix issue where an image diff proposed replacing a Fedora CoreOS controller on GCP, introduced in #727 (unreleased) * Also ignore image diffs to the GCP managed instance group of workers. This aligns with worker AMI diffs being ignored on AWS and similar on Azure, since workers update themselves. Background: * Controller nodes should strictly not be recreated by Terraform, they are stateful (etcd) and should not be replaced * Across cloud platforms, OS image diffs are ignored since both Flatcar Linux and Fedora CoreOS nodes update themselves. For workers, user-data or disk size diffs (where relevant) are allowed to recreate worker templates/configs since these are considered to be user-initiated declarations that a reprovision should be done	2020-05-19 21:41:51 -07:00
Dalton Hubble	8d024d22ad	Update etcd from v3.4.7 to v3.4.8 * https://github.com/etcd-io/etcd/blob/master/CHANGELOG-3.4.md#v348-2020-05-18	2020-05-18 23:50:46 -07:00
Dalton Hubble	a18bd0a707	Highlight SELinux enforcing mode in features	2020-05-13 21:57:38 -07:00
Dalton Hubble	a2db4fa8c4	Update Calico from v3.13.3 to v3.14.0 * https://docs.projectcalico.org/v3.14/release-notes/	2020-05-09 16:05:30 -07:00
Dalton Hubble	358854e712	Fix Calico install-cni crash loop on Pod restarts * Set a consistent MCS level/range for Calico install-cni * Note: Rebooting a node was a workaround, because Kubelet relabels /etc/kubernetes(/cni/net.d) Background: * On SELinux enforcing systems, the Calico CNI install-cni container ran with default SELinux context and a random MCS pair. install-cni places CNI configs by first creating a temporary file and then moving it into place, which means the file MCS categories depend on the container's SELinux context. * A calico-node Pod restart creates a new install-cni container with a different MCS pair that cannot access the earlier written file (it places configs every time), causing the init container to error and calico-node to crash loop * https://github.com/projectcalico/cni-plugin/issues/874 ``` mv: inter-device move failed: '/calico.conf.tmp' to '/host/etc/cni/net.d/10-calico.conflist'; unable to remove target: Permission denied Failed to mv files. This may be caused by selinux configuration on the host, or something else. ``` Note, this isn't a host SELinux configuration issue. Related: * https://github.com/poseidon/terraform-render-bootstrap/pull/186	2020-05-09 16:01:44 -07:00
Dalton Hubble	b5dabcea31	Use Fedora CoreOS image streams on Google Cloud * Add `os_stream` variable to set a Fedora CoreOS stream to `stable` (default), `testing`, or `next` * Deprecate `os_image` variable. Remove docs about uploading Fedora CoreOS images manually, this is no longer needed * https://docs.fedoraproject.org/en-US/fedora-coreos/update-streams/ Rel: https://github.com/coreos/fedora-coreos-docs/pull/70	2020-05-08 01:23:12 -07:00
Dalton Hubble	fd044ee117	Enable Kubelet TLS bootstrap and NodeRestriction * Enable bootstrap token authentication on kube-apiserver * Generate the bootstrap.kubernetes.io/token Secret that may be used as a bootstrap token * Generate a bootstrap kubeconfig (with a bootstrap token) to be securely distributed to nodes. Each Kubelet will use the bootstrap kubeconfig to authenticate to kube-apiserver as `system:bootstrappers` and send a node-unique CSR for kube-controller-manager to automatically approve to issue a Kubelet certificate and kubeconfig (expires in 72 hours) * Add ClusterRoleBinding for bootstrap token subjects (`system:bootstrappers`) to have the `system:node-bootstrapper` ClusterRole * Add ClusterRoleBinding for bootstrap token subjects (`system:bootstrappers`) to have the csr nodeclient ClusterRole * Add ClusterRoleBinding for bootstrap token subjects (`system:bootstrappers`) to have the csr selfnodeclient ClusterRole * Enable NodeRestriction admission controller to limit the scope of Node or Pod objects a Kubelet can modify to those of the node itself * Ability for a Kubelet to delete its Node object is retained as preemptible nodes or those in auto-scaling instance groups need to be able to remove themselves on shutdown. This need continues to have precedence over any risk of a node deleting itself maliciously Security notes: 1. Issued Kubelet certificates authenticate as user `system:node:NAME` and group `system:nodes` and are limited in their authorization to perform API operations by Node authorization and NodeRestriction admission. Previously, a Kubelet's authorization was broader. This is the primary security motivation. 2. The bootstrap kubeconfig credential has the same sensitivity as the previous generated TLS client-certificate kubeconfig. It must be distributed securely to nodes. Its compromise still allows an attacker to obtain a Kubelet kubeconfig 3. Bootstrapping Kubelet kubeconfigs with a limited lifetime offers a slight security improvement. * An attacker who obtains the kubeconfig can likely obtain the bootstrap kubeconfig as well, to obtain the ability to renew their access * A compromised bootstrap kubeconfig could plausibly be handled by replacing the bootstrap token Secret, distributing the token to new nodes, and expiration. Whereas a compromised TLS-client certificate kubeconfig can't be revoked (no CRL). However, replacing a bootstrap token can be impractical in real cluster environments, so the limited lifetime is mostly a theoretical benefit. * Cluster CSR objects are visible via kubectl which is nice 4. Bootstrapping node-unique Kubelet kubeconfigs means Kubelet clients have more identity information, which can improve the utility of audits and future features Rel: https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet-tls-bootstrapping/ Rel: https://github.com/poseidon/terraform-render-bootstrap/pull/185	2020-04-28 19:35:33 -07:00
Dalton Hubble	38a6bddd06	Update Calico from v3.13.1 to v3.13.3 * https://docs.projectcalico.org/v3.13/release-notes/	2020-04-23 23:58:02 -07:00
Dalton Hubble	d8966afdda	Remove extraneous sudo from layout asset unpacking	2020-04-22 20:28:01 -07:00
Dalton Hubble	feac94605a	Fix bootstrap mount to use shared volume SELinux label * Race: During initial bootstrap, static control plane pods could hang with Permission denied to bootstrap secrets. A manual fix involved restarting Kubelet, which relabeled mounts. The race had no effect on subsequent reboots. * bootstrap.service runs podman with a private unshared mount of /etc/kubernetes/bootstrap-secrets which uses an SELinux MCS label with a category pair. However, bootstrap-secrets should be shared as it's mounted by Docker pods kube-apiserver, kube-scheduler, and kube-controller-manager. Restarting Kubelet was a manual fix because Kubelet relabels all /etc/kubernetes * Fix bootstrap Pod to use the shared volume label, which leaves bootstrap-secrets files with SELinux level s0 without MCS * Also allow failed bootstrap.service to be re-applied. This was missing on bare-metal and AWS	2020-04-19 16:31:32 -07:00
Dalton Hubble	bf22222f7d	Remove temporary workaround for v1.18.0 apply issue * In v1.18.0, kubectl apply would fail to apply manifests if any single manifest was unable to validate. For example, if a CRD and CR were defined in the same directory, apply would fail since the CR would be invalid as the CRD wouldn't exist * Typhoon temporary workaround was to separate CNI CRD manifests and explicitly apply them first. No longer needed in v1.18.1+ * Kubernetes v1.18.1 restored the prior behavior where kubectl apply applies as many valid manifests as it can. In the example above, the CRD would be applied and the CR could be applied if the kubectl apply was re-run (allowing for apply loops). * Upstream fix: https://github.com/kubernetes/kubernetes/pull/89864	2020-04-16 23:49:55 -07:00
Dalton Hubble	671eacb86e	Update Kubernetes from v1.18.1 to v1.18.2 * https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.18.md#changelog-since-v1181	2020-04-16 23:40:52 -07:00
Dalton Hubble	5c4a3f73d5	Add support for Fedora CoreOS on Azure * Add `azure/fedora-coreos/kubernetes` module	2020-04-12 16:35:49 -07:00
Dalton Hubble	76ab4c4c2a	Change `container-linux` module preference to Flatcar Linux * No change to Fedora CoreOS modules * For Container Linux AWS and Azure, change the `os_image` default from coreos-stable to flatcar-stable * For Container Linux GCP and DigitalOcean, change `os_image` to be required since users should upload a Flatcar Linux image and set the variable * For Container Linux bare-metal, recommend users change the `os_channel` to Flatcar Linux. No actual module change.	2020-04-11 14:52:30 -07:00
Dalton Hubble	73af2f3b7c	Update Kubernetes from v1.18.0 to v1.18.1 * https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.18.md#v1181	2020-04-08 19:41:48 -07:00
Dalton Hubble	17ea547723	Update etcd from v3.4.5 to v3.4.7 * https://github.com/etcd-io/etcd/releases/tag/v3.4.7 * https://github.com/etcd-io/etcd/releases/tag/v3.4.6	2020-04-06 21:09:25 -07:00
Dalton Hubble	3c1be7b0e0	Fix terraform fmt	2020-03-31 21:42:51 -07:00
Dalton Hubble	135c6182b8	Update flannel from v0.11.0 to v0.12.0 * https://github.com/coreos/flannel/releases/tag/v0.12.0	2020-03-31 18:31:59 -07:00
Dalton Hubble	c53dc66d4a	Rename Container Linux snippets variable for consistency * Rename controller_clc_snippets to controller_snippets (cloud platforms) * Rename worker_clc_snippets to worker_snippets (cloud platforms) * Rename clc_snippets to snippets (bare-metal)	2020-03-31 18:25:51 -07:00
Dalton Hubble	9960972726	Fix bootstrap regression when networking="flannel" * Fix bootstrap error for missing `manifests-networking/crdyaml` when `networking = "flannel"`. Clean up manifest-networking directory left during bootstrap * Regressed in v1.18.0 changes for Calico https://github.com/poseidon/typhoon/pull/675	2020-03-31 18:21:59 -07:00
Dalton Hubble	bac5acb3bd	Change default kube-system DaemonSet tolerations * Change kube-proxy, flannel, and calico-node DaemonSet tolerations to tolerate `node.kubernetes.io/not-ready` and `node-role.kubernetes.io/master` (i.e. controllers) explicitly, rather than tolerating all taints * kube-system DaemonSets will no longer tolerate custom node taints by default. Instead, custom node taints must be enumerated to opt-in to scheduling/executing the kube-system DaemonSets * Consider setting the daemonset_tolerations variable of terraform-render-bootstrap at a later date Background: Tolerating all taints ruled out use-cases where certain nodes might legitimately need to keep kube-proxy or CNI networking disabled Related: https://github.com/poseidon/terraform-render-bootstrap/pull/179	2020-03-31 01:00:45 -07:00
Dalton Hubble	70bdc9ec94	Allow bootstrap re-apply for Fedora CoreOS GCP * Problem: Fedora CoreOS images are manually uploaded to GCP. When a cluster is created with a stale image, Zincati immediately checks for the latest stable image, fetches, and reboots. In practice, this can unfortunately occur exactly during the initial cluster bootstrap phase. * Recommended: Upload the latest Fedora CoreOS image regularly * Mitigation: Allow a failed bootstrap.service run (which won't touch the done ConditionPathExists) to be re-run by running `terraform apply` again. Add a known issue to CHANGES * Update docs to show the current Fedora CoreOS stable version to reduce likelihood users see this issue Longer term ideas: * Ideal: Fedora CoreOS publishes a stable channel. Instances will always boot with the latest image in a channel. The problem disappears since it works the same way AWS does * Timer: Consider some timer-based approach to have zincati delay any system reboots for the first ~30 min of a machine's life. Possibly just configured on the controller node https://github.com/coreos/zincati/pull/251 * External coordination: For Container Linux, locksmith filled a similar role and was disabled to allow CLUO to coordinate reboots. By running atop Kubernetes, it was not possible for the reboot to occur before cluster bootstrap * Rely on https://github.com/coreos/zincati/issues/115 to delay the reboot since bootstrap involves an SSH session * Use path-based activation of zincati on controllers and set that path at the end of the bootstrap process Rel: https://github.com/coreos/fedora-coreos-tracker/issues/239	2020-03-28 18:12:31 -07:00
Dalton Hubble	fc686c8fc7	Fix delete-node.service kubectl service exec's * Fix delete-node service that runs on worker (cloud-only) shutdown to delete a Kubernetes node. Regressed in #669 (unreleased) * Use rkt `--exec` to invoke kubectl binary in the kubelet image * Use podman `--entrypoint` to invoke the kubectl binary in the kubelet image	2020-03-28 12:35:23 -07:00
Dalton Hubble	ef5f953e04	Set docker log driver to journald on Fedora CoreOS * Before Kubernetes v1.18.0, Kubelet only supported kubectl `--limit-bytes` with the Docker `json-file` log driver so the Fedora CoreOS default was overridden for conformance. See https://github.com/poseidon/typhoon/pull/642 * Kubelet v1.18+ implemented support for other docker log drivers, so the Fedora CoreOS default `journald` can be used again Rel: https://github.com/kubernetes/kubernetes/issues/86367	2020-03-26 22:06:45 -07:00
Dalton Hubble	d25f23e675	Update docs from Kubernetes v1.17.4 to v1.18.0	2020-03-25 20:28:30 -07:00
Dalton Hubble	f100a90d28	Update Kubernetes from v1.17.4 to v1.18.0 * https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.18.md	2020-03-25 17:51:50 -07:00
Dalton Hubble	590d941f50	Switch from upstream hyperkube image to individual images * Kubernetes plans to stop releasing the hyperkube container image * Upstream will continue to publish `kube-apiserver`, `kube-controller-manager`, `kube-scheduler`, and `kube-proxy` container images to `k8s.gcr.io` * Upstream will publish Kubelet only as a binary for distros to package, either as a DEB/RPM on traditional distros or a container image on container-optimized operating systems * Typhoon will package the upstream Kubelet (checksummed) and its dependencies as a container image for use on CoreOS Container Linux, Flatcar Linux, and Fedora CoreOS * Update the Typhoon container image security policy to list `quay.io/poseidon/kubelet` as an official distributed artifact Hyperkube: https://github.com/kubernetes/kubernetes/pull/88676 Kubelet Container Image: https://github.com/poseidon/kubelet Kubelet Quay Repo: https://quay.io/repository/poseidon/kubelet	2020-03-21 15:43:05 -07:00
Dalton Hubble	c3ef21dbf5	Update etcd from v3.4.4 to v3.4.5 * https://github.com/etcd-io/etcd/releases/tag/v3.4.5	2020-03-18 20:50:41 -07:00
Dalton Hubble	bc7902f40a	Update Kubernetes from v1.17.3 to v1.17.4 * https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.17.md#v1174	2020-03-13 00:06:41 -07:00
Dalton Hubble	70bf39bb9a	Update Calico from v3.12.0 to v3.13.1 * https://docs.projectcalico.org/v3.13/release-notes/	2020-03-12 23:00:38 -07:00
Dalton Hubble	51cee6d5a4	Change Container Linux etcd-member to fetch with docker:// * Quay has historically generated ACI signatures for images to facilitate rkt's notions of verification (it allowed authors to actually sign images, though `--trust-keys-from-https` is in use since etcd and most authors don't sign images). OCI standardization didn't adopt verification ideas and checking signatures has fallen out of favor. * Fix an issue where Quay no longer seems to be generating ACI signatures for new images (e.g. quay.io/coreos/etcd:v3.4.4) * Don't be alarmed by rkt `--insecure-options=image`. It refers to disabling image signature checking (i.e. docker pull doesn't check signatures either) * System containers for Kubelet and bootstrap have transitioned to the docker:// transport, so there is precedent and this brings all the system containers on Container Linux controllers into alignment	2020-03-02 19:57:45 -08:00
Dalton Hubble	87f9a2fc35	Add automatic worker deletion on Fedora CoreOS clouds * On clouds where workers can scale down or be preempted (AWS, GCP, Azure), shutdown runs delete-node.service to remove a node and prevent NotReady nodes from lingering * Add the delete-node.service that wasn't carried over from Container Linux and port it to use podman	2020-02-29 20:22:03 -08:00
Dalton Hubble	6de5cf5a55	Update etcd from v3.4.3 to v3.4.4 * https://github.com/etcd-io/etcd/releases/tag/v3.4.4	2020-02-29 16:19:29 -08:00
Dalton Hubble	e4d977bfcd	Fix worker_node_labels for initial Fedora CoreOS * Add Terraform strip markers to consume beginning and trailing whitespace in templated Kubelet arguments for podman (Fedora CoreOS only) * Fix initial `worker_node_labels` being quietly ignored on Fedora CoreOS cloud platforms that offer the feature * Close https://github.com/poseidon/typhoon/issues/650	2020-02-22 15:12:35 -08:00
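
Commit e9c8520359 above notes that the Cilium IPAM mode divides `pod_cidr` into /24 subnets per node. The arithmetic can be sketched roughly as follows, using only Python's standard `ipaddress` module. This is an illustration, not Cilium's actual allocator: the function name `per_node_subnets`, the example `pod_cidr` value `10.2.0.0/16`, the node names, and the sequential assignment order are all assumptions made for the example.

```python
import ipaddress


def per_node_subnets(pod_cidr, nodes, new_prefix=24):
    """Hand each node the next unused /new_prefix block inside pod_cidr.

    Hypothetical helper for illustration only. Real Cilium IPAM decides
    which block a node receives; this sketch simply assigns in order.
    """
    network = ipaddress.ip_network(pod_cidr)
    # subnets() lazily yields every /new_prefix block inside the network.
    blocks = network.subnets(new_prefix=new_prefix)
    assignments = {}
    for node in nodes:
        try:
            assignments[node] = next(blocks)
        except StopIteration:
            raise ValueError(
                f"{pod_cidr} has no /{new_prefix} block left for {node}"
            ) from None
    return assignments


if __name__ == "__main__":
    # Assumed example values, not taken from the commit log.
    example_nodes = ["controller-0", "worker-0", "worker-1"]
    for node, block in per_node_subnets("10.2.0.0/16", example_nodes).items():
        print(node, block)
```

Under these assumptions a /16 `pod_cidr` yields 256 possible /24 blocks, so the sketch raises `ValueError` once more nodes are requested than the range can hold.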

349 Commits