typhoon

mirror of https://github.com/puppetmaster/typhoon.git synced 2025-10-03 15:34:37 +02:00

Author	SHA1	Message	Date
Dalton Hubble	358854e712	Fix Calico install-cni crash loop on Pod restarts * Set a consistent MCS level/range for Calico install-cni * Note: Rebooting a node was a workaround, because Kubelet relabels /etc/kubernetes(/cni/net.d) Background: * On SELinux enforcing systems, the Calico CNI install-cni container ran with default SELinux context and a random MCS pair. install-cni places CNI configs by first creating a temporary file and then moving them into place, which means the file MCS categories depend on the container's SELinux context. * A calico-node Pod restart creates a new install-cni container with a different MCS pair that cannot access the earlier written file (it places configs every time), causing the init container to error and calico-node to crash loop * https://github.com/projectcalico/cni-plugin/issues/874 ``` mv: inter-device move failed: '/calico.conf.tmp' to '/host/etc/cni/net.d/10-calico.conflist'; unable to remove target: Permission denied Failed to mv files. This may be caused by selinux configuration on the host, or something else. ``` Note, this isn't a host SELinux configuration issue. Related: * https://github.com/poseidon/terraform-render-bootstrap/pull/186	2020-05-09 16:01:44 -07:00
Ben Drucker	317416b316	Use Terraform element wrap-around for AWS controllers subnet_id (#714) * Fix Terraform plan error when controller_count exceeds available AWS zones (e.g. 5 controllers)	2020-04-29 20:41:08 -07:00
Dalton Hubble	fd044ee117	Enable Kubelet TLS bootstrap and NodeRestriction * Enable bootstrap token authentication on kube-apiserver * Generate the bootstrap.kubernetes.io/token Secret that may be used as a bootstrap token * Generate a bootstrap kubeconfig (with a bootstrap token) to be securely distributed to nodes. Each Kubelet will use the bootstrap kubeconfig to authenticate to kube-apiserver as `system:bootstrappers` and send a node-unique CSR for kube-controller-manager to automatically approve to issue a Kubelet certificate and kubeconfig (expires in 72 hours) * Add ClusterRoleBinding for bootstrap token subjects (`system:bootstrappers`) to have the `system:node-bootstrapper` ClusterRole * Add ClusterRoleBinding for bootstrap token subjects (`system:bootstrappers`) to have the csr nodeclient ClusterRole * Add ClusterRoleBinding for bootstrap token subjects (`system:bootstrappers`) to have the csr selfnodeclient ClusterRole * Enable NodeRestriction admission controller to limit the scope of Node or Pod objects a Kubelet can modify to those of the node itself * Ability for a Kubelet to delete its Node object is retained as preemptible nodes or those in auto-scaling instance groups need to be able to remove themselves on shutdown. This need continues to have precedence over any risk of a node deleting itself maliciously Security notes: 1. Issued Kubelet certificates authenticate as user `system:node:NAME` and group `system:nodes` and are limited in their authorization to perform API operations by Node authorization and NodeRestriction admission. Previously, a Kubelet's authorization was broader. This is the primary security motivation. 2. The bootstrap kubeconfig credential has the same sensitivity as the previous generated TLS client-certificate kubeconfig. It must be distributed securely to nodes. Its compromise still allows an attacker to obtain a Kubelet kubeconfig 3. Bootstrapping Kubelet kubeconfigs with a limited lifetime offers a slight security improvement. * An attacker who obtains the kubeconfig can likely obtain the bootstrap kubeconfig as well, to obtain the ability to renew their access * A compromised bootstrap kubeconfig could plausibly be handled by replacing the bootstrap token Secret, distributing the token to new nodes, and expiration. Whereas a compromised TLS-client certificate kubeconfig can't be revoked (no CRL). However, replacing a bootstrap token can be impractical in real cluster environments, so the limited lifetime is mostly a theoretical benefit. * Cluster CSR objects are visible via kubectl which is nice 4. Bootstrapping node-unique Kubelet kubeconfigs means Kubelet clients have more identity information, which can improve the utility of audits and future features Rel: https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet-tls-bootstrapping/ Rel: https://github.com/poseidon/terraform-render-bootstrap/pull/185	2020-04-28 19:35:33 -07:00
Dalton Hubble	38a6bddd06	Update Calico from v3.13.1 to v3.13.3 * https://docs.projectcalico.org/v3.13/release-notes/	2020-04-23 23:58:02 -07:00
Dalton Hubble	d8966afdda	Remove extraneous sudo from layout asset unpacking	2020-04-22 20:28:01 -07:00
Dalton Hubble	feac94605a	Fix bootstrap mount to use shared volume SELinux label * Race: During initial bootstrap, static control plane pods could hang with Permission denied to bootstrap secrets. A manual fix involved restarting Kubelet, which relabeled mounts. The race had no effect on subsequent reboots. * bootstrap.service runs podman with a private unshared mount of /etc/kubernetes/bootstrap-secrets which uses an SELinux MCS label with a category pair. However, bootstrap-secrets should be shared as it's mounted by Docker pods kube-apiserver, kube-scheduler, and kube-controller-manager. Restarting Kubelet was a manual fix because Kubelet relabels all /etc/kubernetes * Fix bootstrap Pod to use the shared volume label, which leaves bootstrap-secrets files with SELinux level s0 without MCS * Also allow failed bootstrap.service to be re-applied. This was missing on bare-metal and AWS	2020-04-19 16:31:32 -07:00
Dalton Hubble	bf22222f7d	Remove temporary workaround for v1.18.0 apply issue * In v1.18.0, kubectl apply would fail to apply manifests if any single manifest was unable to validate. For example, if a CRD and CR were defined in the same directory, apply would fail since the CR would be invalid as the CRD wouldn't exist * Typhoon temporary workaround was to separate CNI CRD manifests and explicitly apply them first. No longer needed in v1.18.1+ * Kubernetes v1.18.1 restored the prior behavior where kubectl apply applies as many valid manifests as it can. In the example above, the CRD would be applied and the CR could be applied if the kubectl apply was re-run (allowing for apply loops). * Upstream fix: https://github.com/kubernetes/kubernetes/pull/89864	2020-04-16 23:49:55 -07:00
Dalton Hubble	671eacb86e	Update Kubernetes from v1.18.1 to v1.18.2 * https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.18.md#changelog-since-v1181	2020-04-16 23:40:52 -07:00
Dalton Hubble	76ab4c4c2a	Change `container-linux` module preference to Flatcar Linux * No change to Fedora CoreOS modules * For Container Linux AWS and Azure, change the `os_image` default from coreos-stable to flatcar-stable * For Container Linux GCP and DigitalOcean, change `os_image` to be required since users should upload a Flatcar Linux image and set the variable * For Container Linux bare-metal, recommend users change the `os_channel` to Flatcar Linux. No actual module change.	2020-04-11 14:52:30 -07:00
Dalton Hubble	73af2f3b7c	Update Kubernetes from v1.18.0 to v1.18.1 * https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.18.md#v1181	2020-04-08 19:41:48 -07:00
Dalton Hubble	17ea547723	Update etcd from v3.4.5 to v3.4.7 * https://github.com/etcd-io/etcd/releases/tag/v3.4.7 * https://github.com/etcd-io/etcd/releases/tag/v3.4.6	2020-04-06 21:09:25 -07:00
Dalton Hubble	3c1be7b0e0	Fix terraform fmt	2020-03-31 21:42:51 -07:00
Dalton Hubble	135c6182b8	Update flannel from v0.11.0 to v0.12.0 * https://github.com/coreos/flannel/releases/tag/v0.12.0	2020-03-31 18:31:59 -07:00
Dalton Hubble	9960972726	Fix bootstrap regression when networking="flannel" * Fix bootstrap error for missing `manifests-networking/crdyaml` when `networking = "flannel"` Cleanup manifest-networking directory left during bootstrap * Regressed in v1.18.0 changes for Calico https://github.com/poseidon/typhoon/pull/675	2020-03-31 18:21:59 -07:00
Dalton Hubble	bac5acb3bd	Change default kube-system DaemonSet tolerations * Change kube-proxy, flannel, and calico-node DaemonSet tolerations to tolerate `node.kubernetes.io/not-ready` and `node-role.kubernetes.io/master` (i.e. controllers) explicitly, rather than tolerating all taints * kube-system DaemonSets will no longer tolerate custom node taints by default. Instead, custom node taints must be enumerated to opt-in to scheduling/executing the kube-system DaemonSets * Consider setting the daemonset_tolerations variable of terraform-render-bootstrap at a later date Background: Tolerating all taints ruled out use-cases where certain nodes might legitimately need to keep kube-proxy or CNI networking disabled Related: https://github.com/poseidon/terraform-render-bootstrap/pull/179	2020-03-31 01:00:45 -07:00
Dalton Hubble	5fca08064b	Fix Fedora CoreOS AMI to filter for stable images * Fix issue observed in us-east-1 where AMI filters chose the latest testing channel release, rather than the stable channel * Fedora CoreOS AMI filter selects the latest image with a matching name, x86_64, and hvm, excluding dev images. Add a filter for "Fedora CoreOS stable", which seems to be the only distinguishing metadata indicating the channel	2020-03-28 12:57:45 -07:00
Dalton Hubble	ef5f953e04	Set docker log driver to journald on Fedora CoreOS * Before Kubernetes v1.18.0, Kubelet only supported kubectl `--limit-bytes` with the Docker `json-file` log driver so the Fedora CoreOS default was overridden for conformance. See https://github.com/poseidon/typhoon/pull/642 * Kubelet v1.18+ implemented support for other docker log drivers, so the Fedora CoreOS default `journald` can be used again Rel: https://github.com/kubernetes/kubernetes/issues/86367	2020-03-26 22:06:45 -07:00
Dalton Hubble	d25f23e675	Update docs from Kubernetes v1.17.4 to v1.18.0	2020-03-25 20:28:30 -07:00
Dalton Hubble	f100a90d28	Update Kubernetes from v1.17.4 to v1.18.0 * https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.18.md	2020-03-25 17:51:50 -07:00
Dalton Hubble	590d941f50	Switch from upstream hyperkube image to individual images * Kubernetes plans to stop releasing the hyperkube container image * Upstream will continue to publish `kube-apiserver`, `kube-controller-manager`, `kube-scheduler`, and `kube-proxy` container images to `k8s.gcr.io` * Upstream will publish Kubelet only as a binary for distros to package, either as a DEB/RPM on traditional distros or a container image on container-optimized operating systems * Typhoon will package the upstream Kubelet (checksummed) and its dependencies as a container image for use on CoreOS Container Linux, Flatcar Linux, and Fedora CoreOS * Update the Typhoon container image security policy to list `quay.io/poseidon/kubelet` as an official distributed artifact Hyperkube: https://github.com/kubernetes/kubernetes/pull/88676 Kubelet Container Image: https://github.com/poseidon/kubelet Kubelet Quay Repo: https://quay.io/repository/poseidon/kubelet	2020-03-21 15:43:05 -07:00
Dalton Hubble	c3ef21dbf5	Update etcd from v3.4.4 to v3.4.5 * https://github.com/etcd-io/etcd/releases/tag/v3.4.5	2020-03-18 20:50:41 -07:00
Dalton Hubble	bc7902f40a	Update Kubernetes from v1.17.3 to v1.17.4 * https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.17.md#v1174	2020-03-13 00:06:41 -07:00
Dalton Hubble	70bf39bb9a	Update Calico from v3.12.0 to v3.13.1 * https://docs.projectcalico.org/v3.13/release-notes/	2020-03-12 23:00:38 -07:00
Dalton Hubble	87f9a2fc35	Add automatic worker deletion on Fedora CoreOS clouds * On clouds where workers can scale down or be preempted (AWS, GCP, Azure), shutdown runs delete-node.service to remove a node and prevent NotReady nodes from lingering * Add the delete-node.service that wasn't carried over from Container Linux and port it to use podman	2020-02-29 20:22:03 -08:00
Dalton Hubble	6de5cf5a55	Update etcd from v3.4.3 to v3.4.4 * https://github.com/etcd-io/etcd/releases/tag/v3.4.4	2020-02-29 16:19:29 -08:00
Dalton Hubble	3250994c95	Use a route table with separate (rather than inline) routes * Allow users to extend the route table using a data reference and adding route resources (e.g. unusual peering setups) * Note: Internally connecting AWS clusters can reduce cross-cloud flexibility and inhibits blue-green cluster patterns. It is not recommended	2020-02-25 23:21:58 -08:00
Dalton Hubble	e4d977bfcd	Fix worker_node_labels for initial Fedora CoreOS * Add Terraform strip markers to consume beginning and trailing whitespace in templated Kubelet arguments for podman (Fedora CoreOS only) * Fix initial `worker_node_labels` being quietly ignored on Fedora CoreOS cloud platforms that offer the feature * Close https://github.com/poseidon/typhoon/issues/650	2020-02-22 15:12:35 -08:00
Dalton Hubble	4a38fb5927	Update CoreDNS from v1.6.6 to v1.6.7 * https://coredns.io/2020/01/28/coredns-1.6.7-release/	2020-02-18 21:46:19 -08:00
Suraj Deshmukh	c4e64a9d1b	Change Kubelet /var/lib/calico mount to read-only (#643) * Kubelet only requires read access to /var/lib/calico Signed-off-by: Suraj Deshmukh <surajd.service@gmail.com>	2020-02-18 21:40:58 -08:00
Dalton Hubble	49d3b9e6b3	Set docker log driver to json-file on Fedora CoreOS * Fix the last minor issue for Fedora CoreOS clusters to pass CNCF's Kubernetes conformance tests * Kubelet supports a seldom used feature `kubectl logs --limit-bytes=N` to trim a log stream to a desired length. Kubelet handles this in the CRI driver. The Kubelet docker shim only supports the limit bytes feature when Docker is configured with the default `json-file` logging driver * CNCF conformance tests started requiring limit-bytes be supported, indirectly forcing the log driver choice until either the Kubelet or the conformance tests are fixed * Fedora CoreOS defaults Docker to use `journald` (desired). For now, as a workaround to offer conformant clusters, the log driver can be set back to `json-file`. RHEL CoreOS likely won't have noticed the non-conformance since it's using the crio runtime * https://github.com/kubernetes/kubernetes/issues/86367 Note: When upstream has a fix, the aim is to drop the docker config override and use the journald default	2020-02-11 23:00:38 -08:00
Dalton Hubble	1243f395d1	Update Kubernetes from v1.17.2 to v1.17.3 * https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.17.md#v1173	2020-02-11 20:22:14 -08:00
Dalton Hubble	ca96a1335c	Update Calico from v3.11.2 to v3.12.0 * https://docs.projectcalico.org/release-notes/#v3120 * Remove reverse packet filter override, since Calico no longer relies on the setting * https://github.com/coreos/fedora-coreos-tracker/issues/219 * https://github.com/projectcalico/felix/pull/2189	2020-02-06 00:43:33 -08:00
Dalton Hubble	1cda5bcd2a	Update Kubernetes from v1.17.1 to v1.17.2 * https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.17.md#v1172	2020-01-21 18:27:39 -08:00
Dalton Hubble	7daabd28b5	Update Calico from v3.11.1 to v3.11.2 * https://docs.projectcalico.org/v3.11/release-notes/	2020-01-18 13:45:24 -08:00
Dalton Hubble	b642e3b41b	Update Kubernetes from v1.17.0 to v1.17.1 * https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.17.md#v1171	2020-01-14 20:21:36 -08:00
Dalton Hubble	ac786a2efc	Update AWS Fedora CoreOS AMI filter for fedora-coreos-31 * Select the most recent fedora-coreos-31 AMI on AWS, instead of the most recent fedora-coreos-30 AMI (Nov 27, 2019) * Evaluated with fedora-coreos-31.20200108.2.0-hvm	2020-01-14 20:06:14 -08:00
Dalton Hubble	ce0569e03b	Remove unneeded Kubelet /var/run mount on Fedora CoreOS * /var/run symlinks to /run (already mounted)	2020-01-11 15:15:39 -08:00
Dalton Hubble	43e05b9131	Enable kube-proxy metrics and allow Prometheus scrapes * Configure kube-proxy --metrics-bind-address=0.0.0.0 (default 127.0.0.1) to serve metrics on 0.0.0.0:10249 * Add firewall rules to allow Prometheus (resides on a worker) to scrape kube-proxy service endpoints on controllers or workers * Add a clusterIP: None service for kube-proxy endpoint discovery	2020-01-06 21:11:18 -08:00
Dalton Hubble	b2eb3e05d0	Disable Kubelet 127.0.0.1:10248 healthz endpoint * Kubelet runs a healthz server listening on 127.0.0.1:10248 by default. It's unused by Typhoon and can be disabled * https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/	2019-12-29 11:23:25 -08:00
Dalton Hubble	50db3d0231	Rename CLC files and favor Terraform list index syntax * Rename Container Linux Config (CLC) files to .yaml to align with Fedora CoreOS Config (FCC) files and for syntax highlighting * Replace common uses of Terraform `element` (which wraps around) with `list[index]` syntax to surface index errors	2019-12-28 12:14:01 -08:00
Dalton Hubble	11565ffa8a	Update Calico from v3.10.2 to v3.11.1 * https://docs.projectcalico.org/v3.11/release-notes/	2019-12-28 11:08:03 -08:00
Dalton Hubble	daa8d9d9ec	Update CoreDNS from v1.6.5 to v1.6.6 * https://coredns.io/2019/12/11/coredns-1.6.6-release/	2019-12-22 10:47:19 -05:00
Dalton Hubble	c0ce04e1de	Update Calico from v3.10.1 to v3.10.2 * https://docs.projectcalico.org/v3.10/release-notes/	2019-12-09 21:03:00 -08:00
Dalton Hubble	ed3550dce1	Update systemd services for the v1.17.x hyperkube * Binary asset locations within the upstream hyperkube image changed https://github.com/kubernetes/kubernetes/pull/84662 * Fix Container Linux and Flatcar Linux kubelet.service (rkt-fly with fairly dated CoreOS kubelet-wrapper) * Fix Fedora CoreOS kubelet.service (podman) * Fix Fedora CoreOS bootstrap.service * Fix delete-node kubectl usage for workers where nodes may delete themselves on shutdown (e.g. preemptible instances)	2019-12-09 18:39:17 -08:00
Dalton Hubble	de36d99afc	Update Kubernetes from v1.16.3 to v1.17.0 * https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.17.md/#v1170	2019-12-09 18:31:58 -08:00
Dalton Hubble	4fce9485c8	Reduce kube-controller-manager pod eviction timeout from 5m to 1m * Reduce time to delete pods on unready nodes from 5m to 1m * Present since v1.13.3, but mistakenly removed in v1.16.0 static pod control plane migration Related: * https://github.com/poseidon/terraform-render-bootstrap/pull/148 * https://github.com/poseidon/terraform-render-bootstrap/pull/164	2019-12-08 22:58:31 -08:00
Dalton Hubble	d9c7a9e049	Add/update docs for asset_dir and kubeconfig usage * Original tutorials favored including the platform (e.g. google-cloud) in modules (e.g. google-cloud-yavin). Prefer naming conventions where each module / cluster has a simple name (e.g. yavin) since the platform is usually redundant * Retain the example cluster naming themes per platform	2019-12-05 22:56:42 -08:00
Dalton Hubble	2837275265	Introduce cluster creation without local writes to asset_dir * Allow generated assets (TLS materials, manifests) to be securely distributed to controller node(s) via file provisioner (i.e. ssh-agent) as an assets bundle file, rather than relying on assets being locally rendered to disk in an asset_dir and then securely distributed * Change `asset_dir` from required to optional. Left unset, asset_dir defaults to "" and no assets will be written to files on the machine that runs terraform apply * Enhancement: Managed cluster assets are kept only in Terraform state, which supports different backends (GCS, S3, etcd, etc) and optional encryption. terraform apply accesses state, runs in-memory, and distributes sensitive materials to controllers without making use of local disk (simplifies use in CI systems) * Enhancement: Improve asset unpack and layout process to position etcd certificates and control plane certificates more cleanly, without unneeded secret materials Details: * Terraform file provisioner support for distributing directories of contents (with unknown structure) has been limited to reading from a local directory, meaning local writes to asset_dir were required. https://github.com/poseidon/typhoon/issues/585 discusses the problem and newer or upcoming Terraform features that might help. * Observation: Terraform provisioner support for single files works well, but iteration isn't viable. We're also constrained to Terraform language features on the apply side (no extra plugins, no shelling out) and CoreOS / Fedora tools on the receive side. * Take a map representation of the contents that would have been splayed out in asset_dir and pack/encode them into a single file format devised for easy unpacking. Use an awk one-liner on the receive side to unpack. In practice, this has worked well and it's rather nice that a single assets file is transferred by file provisioner (all or none) Rel: https://github.com/poseidon/terraform-render-bootstrap/pull/162	2019-12-05 01:24:50 -08:00
Dalton Hubble	4b485a9bf2	Fix recent deletion of bootstrap module pinned SHA * Fix deletion of bootstrap module pinned SHA, which was introduced recently through an automation mistake creating https://github.com/poseidon/typhoon/pull/589	2019-11-21 22:34:09 -08:00
Dalton Hubble	8a9e8595ae	Fix terraform fmt formatting	2019-11-13 23:44:02 -08:00
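
Note on commits 317416b316 and 50db3d0231: both turn on the difference between Terraform's `element` function, which wraps around the end of a list, and `list[index]` syntax, which fails on an out-of-range index. The Python sketch below only imitates those two behaviors for non-negative indexes. The subnet names and helper function names are hypothetical and do not come from the Typhoon modules.

```python
def element(items, index):
    """Imitate Terraform element(): the index wraps around modulo the list length."""
    if not items:
        raise ValueError("element() cannot be used with an empty list")
    return items[index % len(items)]


def strict_index(items, index):
    """Imitate Terraform list[index]: an out-of-range index is an error."""
    if index < 0 or index >= len(items):
        raise IndexError(f"index {index} is out of range for {len(items)} items")
    return items[index]


# Hypothetical example: 3 subnets (one per zone) but 5 controllers.
subnets = ["subnet-a", "subnet-b", "subnet-c"]
controller_count = 5

# Wrap-around assigns every controller a subnet, reusing zones as needed.
assigned = [element(subnets, i) for i in range(controller_count)]
print(assigned)
```

With strict indexing, the fourth controller (index 3) would raise an error instead of silently reusing the first subnet. That is the behavior wanted when an index mistake should surface, and the behavior to avoid when the controller count may legitimately exceed the number of zones.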

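Note on commit 2837275265: the message describes packing a map of asset contents into a single file that is easy to unpack on the receiving node. The sketch below is one possible shape of that idea, not the project's actual format. The `##### ` marker line and the `pack` and `unpack` helpers are assumptions made for illustration, and the commit itself uses an awk one-liner on the receive side rather than Python.

```python
MARKER = "##### "  # hypothetical delimiter; the real bundle format may differ


def pack(assets):
    """Pack a {path: text} map into one bundle string, one marker line per file."""
    lines = []
    for path in sorted(assets):
        lines.append(MARKER + path)
        lines.extend(assets[path].splitlines())
    return "\n".join(lines) + "\n"


def unpack(bundle):
    """Split a bundle back into a {path: text} map by scanning for marker lines."""
    files = {}
    current = None
    for line in bundle.splitlines():
        if line.startswith(MARKER):
            current = line[len(MARKER):]
            files[current] = []
        elif current is not None:
            files[current].append(line)
    # Each file is re-joined with a trailing newline.
    return {path: "\n".join(body) + "\n" for path, body in files.items()}


# Hypothetical asset names and contents, for illustration only.
assets = {
    "tls/ca.crt": "CERT LINE 1\nCERT LINE 2\n",
    "manifests/namespace.yaml": "kind: Namespace\n",
}
bundle = pack(assets)
restored = unpack(bundle)
```

This simple scheme assumes no content line begins with the marker and that every file is non-empty and ends with a newline. A single bundle also keeps the all-or-none transfer property the commit message mentions.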

94 Commits