typhoon

mirror of https://github.com/puppetmaster/typhoon.git synced 2025-10-04 04:24:37 +02:00

Author	SHA1	Message	Date
Dalton Hubble	fcbee12334	Fix race condition creating DigitalOcean firewall rules * DigitalOcean firewall rules should reference Terraform tag resources rather than using tag strings. Otherwise, terraform apply can fail (neeeds rerun) if a tag has not yet been created v1.18.2	2020-04-19 16:55:02 -07:00
Dalton Hubble	feac94605a	Fix bootstrap mount to use shared volume SELinux label * Race: During initial bootstrap, static control plane pods could hang with Permission denied to bootstrap secrets. A manual fix involved restarting Kubelet, which relabeled mounts The race had no effect on subsequent reboots. * bootstrap.service runs podman with a private unshared mount of /etc/kubernetes/bootstrap-secrets which uses an SELinux MCS label with a category pair. However, bootstrap-secrets should be shared as its mounted by Docker pods kube-apiserver, kube-scheduler, and kube-controller-manager. Restarting Kubelet was a manual fix because Kubelet relabels all /etc/kubernetes * Fix bootstrap Pod to use the shared volume label, which leaves bootstrap-secrets files with SELinux level s0 without MCS * Also allow failed bootstrap.service to be re-applied. This was missing on bare-metal and AWS	2020-04-19 16:31:32 -07:00
Dalton Hubble	2b1b918b43	Revert Flatcar Linux Azure to manual upload images * Initial support for Flatcar Linux on Azure used the Flatcar Linux Azure Marketplace images (e.g. `flatcar-stable`) in https://github.com/poseidon/typhoon/pull/664 * Flatcar Linux Azure Marketplace images have some unresolved items https://github.com/poseidon/typhoon/issues/703 * Until the Marketplace items are resolved, revert to requiring Flatcar Linux's images be manually uploaded (like GCP and DigitalOcean)	2020-04-18 15:40:57 -07:00
Dalton Hubble	bf22222f7d	Remove temporary workaround for v1.18.0 apply issue * In v1.18.0, kubectl apply would fail to apply manifests if any single manifest was unable to validate. For example, if a CRD and CR were defined in the same directory, apply would fail since the CR would be invalid as the CRD wouldn't exist * Typhoon temporary workaround was to separate CNI CRD manifests and explicitly apply them first. No longer needed in v1.18.1+ * Kubernetes v1.18.1 restored the prior behavior where kubectl apply applies as many valid manifests as it can. In the example above, the CRD would be applied and the CR could be applied if the kubectl apply was re-run (allowing for apply loops). * Upstream fix: https://github.com/kubernetes/kubernetes/pull/89864	2020-04-16 23:49:55 -07:00
Dalton Hubble	671eacb86e	Update Kubernetes from v1.18.1 to v1.18.2 * https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.18.md#changelog-since-v1181	2020-04-16 23:40:52 -07:00
Dalton Hubble	e2d4af43be	Fix Fedora CoreOS Azure MTU with Calico * With Calico VXLAN on Fedora CoreOS the 1450 MTU should be used	2020-04-12 23:20:04 -07:00
Dalton Hubble	5c4a3f73d5	Add support for Fedora CoreOS on Azure * Add `azure/fedora-coreos/kubernetes` module	2020-04-12 16:35:49 -07:00
Dalton Hubble	76ab4c4c2a	Change `container-linux` module preference to Flatcar Linux * No change to Fedora CoreOS modules * For Container Linx AWS and Azure, change the `os_image` default from coreos-stable to flatcar-stable * For Container Linux GCP and DigitalOcean, change `os_image` to be required since users should upload a Flatcar Linux image and set the variable * For Container Linux bare-metal, recommend users change the `os_channel` to Flatcar Linux. No actual module change.	2020-04-11 14:52:30 -07:00
Dalton Hubble	1627ecaf27	Fix docs TOC to include Fedora CoreOS DigitalOcean	2020-04-11 14:07:09 -07:00
Dalton Hubble	1420700bc0	Update CHANGES for v1.18.1 release * Change order of modules in the README v1.18.1	2020-04-11 13:23:49 -07:00
Dalton Hubble	80538e2953	Add support for Fedora CoreOS on DigitalOcean * Add `digital-ocean/fedora-coreos/kubernetes` module * DigitalOcean custom uploaded images do not permit droplet IPv6 networking	2020-04-09 23:55:29 -07:00
Dalton Hubble	73af2f3b7c	Update Kubernetes from v1.18.0 to v1.18.1 * https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.18.md#v1181	2020-04-08 19:41:48 -07:00
Dalton Hubble	17ea547723	Update etcd from v3.4.5 to v3.4.7 * https://github.com/etcd-io/etcd/releases/tag/v3.4.7 * https://github.com/etcd-io/etcd/releases/tag/v3.4.6	2020-04-06 21:09:25 -07:00
Dalton Hubble	2b5dfece93	Update Grafana from v6.7.1 to v6.7.2 * https://github.com/grafana/grafana/releases/tag/v6.7.2	2020-04-04 13:13:19 -07:00
Dalton Hubble	d47d40b517	Refresh Prometheus rules/alerts and Grafana dashboards * Refresh upstream Prometheus rules and alerts and Grafana dashboards * All Loki recording rules for convenience	2020-03-31 21:53:01 -07:00
Dalton Hubble	3c1be7b0e0	Fix terraform fmt	2020-03-31 21:42:51 -07:00
Dalton Hubble	bbbaf949f9	Fix UDP outbound and clock sync timeouts on Azure workers * Add "lb" outbound rule for worker TCP _and_ UDP traffic * Fix Azure worker nodes clock synchronization being inactive due to timeouts reaching the CoreOS / Flatcar NTP pool * Fix Azure worker nodes not providing outbount UDP connectivity Background: Azure provides VMs outbound connectivity either by having a public IP or via an SNAT masquerade feature bundled with their virtual load balancing abstraction (in contrast with, say, a NAT gateway). Azure worker nodes have only a private IP, but are associated with the cluster load balancer's backend pool and ingress frontend IP. Outbound traffic uses SNAT with this frontend IP. A subtle detail with Azure SNAT seems to be that since both inbound lb_rule's are TCP only, outbound UDP traffic isn't SNAT'd (highlights the reasons Azure shouldn't have conflated inbound load balancing with outbound SNAT concepts). However, adding a separate outbound rule and disabling outbound SNAT on our ingress lb_rule's we can tell Azure to continue load balancing as before, and support outbound SNAT for worker traffic of both the TCP and UDP protocol. Fixes clock synchronization timeouts: ``` systemd-timesyncd[786]: Timed out waiting for reply from 45.79.36.123:123 (3.flatcar.pool.ntp.org) ``` Azure controller nodes have their own public IP, so controllers (and etcd) nodes have not had clock synchronization or outbound UDP issues	2020-03-31 21:00:16 -07:00
Dalton Hubble	135c6182b8	Update flannel from v0.11.0 to v0.12.0 * https://github.com/coreos/flannel/releases/tag/v0.12.0	2020-03-31 18:31:59 -07:00
Dalton Hubble	c53dc66d4a	Rename Container Linux snippets variable for consistency * Rename controller_clc_snippets to controller_snippets (cloud platforms) * Rename worker_clc_snippets to worker_snippets (cloud platforms) * Rename clc_snippets to snippets (bare-metal)	2020-03-31 18:25:51 -07:00
Dalton Hubble	9960972726	Fix bootstrap regression when networking="flannel" * Fix bootstrap error for missing `manifests-networking/crdyaml` when `networking = "flannel"` Cleanup manifest-networking directory left during bootstrap * Regressed in v1.18.0 changes for Calico https://github.com/poseidon/typhoon/pull/675	2020-03-31 18:21:59 -07:00
Dalton Hubble	bac5acb3bd	Change default kube-system DaemonSet tolerations * Change kube-proxy, flannel, and calico-node DaemonSet tolerations to tolerate `node.kubernetes.io/not-ready` and `node-role.kubernetes.io/master` (i.e. controllers) explicitly, rather than tolerating all taints * kube-system DaemonSets will no longer tolerate custom node taints by default. Instead, custom node taints must be enumerated to opt-in to scheduling/executing the kube-system DaemonSets * Consider setting the daemonset_tolerations variable of terraform-render-bootstrap at a later date Background: Tolerating all taints ruled out use-cases where certain nodes might legitimately need to keep kube-proxy or CNI networking disabled Related: https://github.com/poseidon/terraform-render-bootstrap/pull/179	2020-03-31 01:00:45 -07:00
Dalton Hubble	70bdc9ec94	Allow bootstrap re-apply for Fedora CoreOS GCP * Problem: Fedora CoreOS images are manually uploaded to GCP. When a cluster is created with a stale image, Zincati immediately checks for the latest stable image, fetches, and reboots. In practice, this can unfortunately occur exactly during the initial cluster bootstrap phase. * Recommended: Upload the latest Fedora CoreOS image regularly * Mitigation: Allow a failed bootstrap.service run (which won't touch the done ConditionalPathExists) to be re-run by running `terraforma apply` again. Add a known issue to CHANGES * Update docs to show the current Fedora CoreOS stable version to reduce likelihood users see this issue Longer term ideas: * Ideal: Fedora CoreOS publishes a stable channel. Instances will always boot with the latest image in a channel. The problem disappears since it works the same way AWS does * Timer: Consider some timer-based approach to have zincati delay any system reboots for the first ~30 min of a machine's life. Possibly just configured on the controller node https://github.com/coreos/zincati/pull/251 * External coordination: For Container Linux, locksmith filled a similar role and was disabled to allow CLUO to coordinate reboots. By running atop Kubernetes, it was not possible for the reboot to occur before cluster bootstrap * Rely on https://github.com/coreos/zincati/issues/115 to delay the reboot since bootstrap involves an SSH session * Use path-based activation of zincati on controllers and set that path at the end of the bootstrap process Rel: https://github.com/coreos/fedora-coreos-tracker/issues/239 v1.18.0	2020-03-28 18:12:31 -07:00
Dalton Hubble	144bb9403c	Add support for Fedora CoreOS snippets * Refresh snippets customization docs * Requires terraform-provider-ct v0.5+	2020-03-28 16:15:04 -07:00
Dalton Hubble	5fca08064b	Fix Fedora CoreOS AMI to filter for stable images * Fix issue observed in us-east-1 where AMI filters chose the latest testing channel release, rather than the stable chanel * Fedora CoreOS AMI filter selects the latest image with a matching name, x86_64, and hvm, excluding dev images. Add a filter for "Fedora CoreOS stable", which seems to be the only distinguishing metadata indicating the channel	2020-03-28 12:57:45 -07:00
Dalton Hubble	fc686c8fc7	Fix delete-node.service kubectl service exec's * Fix delete-node service that runs on worker (cloud-only) shutdown to delete a Kubernetes node. Regressed in #669 (unreleased) * Use rkt `--exec` to invoke kubectl binary in the kubelet image * Use podman `--entrypoint` to invoke the kubectl binary in the kubelet image	2020-03-28 12:35:23 -07:00
Dalton Hubble	a1a5da6bc2	Add CoreOS Container Linux EOL recommendation to CHANGES * Recommend that users who have not yet tried Fedora CoreOS or Flatcar Linux do so. Likely, Container Linux will reach EOL and platform support / stability ratings will be in a mixed state. Nevertheless, folks should migrate by September.	2020-03-26 23:41:54 -07:00
Dalton Hubble	076b8e3c42	Update Prometheus from v2.17.0 to v2.17.1 * https://github.com/prometheus/prometheus/releases/tag/v2.17.1	2020-03-26 22:17:13 -07:00
Dalton Hubble	ef5f953e04	Set docker log driver to journald on Fedora CoreOS * Before Kubernetes v1.18.0, Kubelet only supported kubectl `--limit-bytes` with the Docker `json-file` log driver so the Fedora CoreOS default was overridden for conformance. See https://github.com/poseidon/typhoon/pull/642 * Kubelet v1.18+ implemented support for other docker log drivers, so the Fedora CoreOS default `journald` can be used again Rel: https://github.com/kubernetes/kubernetes/issues/86367	2020-03-26 22:06:45 -07:00
Dalton Hubble	d25f23e675	Update docs from Kubernetes v1.17.4 to v1.18.0	2020-03-25 20:28:30 -07:00
Dalton Hubble	f100a90d28	Update Kubernetes from v1.17.4 to v1.18.0 * https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.18.md	2020-03-25 17:51:50 -07:00
Dalton Hubble	c3bf8bcf96	Add Fedora CoreOS to issue template and docs * Update several Container Linux references to start referring to Flatcar Linux * Update docs and mentions of Fedora CoreOS	2020-03-25 00:36:15 -07:00
Dalton Hubble	5d1e4ad333	Deprecate asset_dir variable and remove docs * Remove docs for the `asset_dir` variable and deprecate it in CHANGES. It will be removed in an upcoming release * Typhoon v1.17.0 introduced a new mechanism for managing and distributing generated assets that stopped relying on writing out to disk. `asset_dir` became optional and defaulted to being unset / off (recommended)	2020-03-25 00:00:01 -07:00
Dalton Hubble	9f702c72d2	Rename DigitalOcean image variable to os_image * Rename variable `image` to `os_image` to match the naming used for the same purpose on other supported platforms (e.g. AWS, Azure, GCP)	2020-03-24 23:49:37 -07:00
Dalton Hubble	e556bc2167	Update Prometheus from v2.17.0-rc.3 to v2.17.0 * https://github.com/prometheus/prometheus/releases/tag/v2.17.0	2020-03-24 23:15:49 -07:00
Dalton Hubble	1bf4f3b801	Fix image tag for Container Linux AWS workers * #669 left one reference to the original SHA tagged image before the v1.17.4 image tag was applied	2020-03-21 15:44:33 -07:00
Dalton Hubble	590d941f50	Switch from upstream hyperkube image to individual images * Kubernetes plans to stop releasing the hyperkube container image * Upstream will continue to publish `kube-apiserver`, `kube-controller-manager`, `kube-scheduler`, and `kube-proxy` container images to `k8s.gcr.io` * Upstream will publish Kubelet only as a binary for distros to package, either as a DEB/RPM on traditional distros or a container image on container-optimized operating systems * Typhoon will package the upstream Kubelet (checksummed) and its dependencies as a container image for use on CoreOS Container Linux, Flatcar Linux, and Fedora CoreOS * Update the Typhoon container image security policy to list `quay.io/poseidon/kubelet`as an official distributed artifact Hyperkube: https://github.com/kubernetes/kubernetes/pull/88676 Kubelet Container Image: https://github.com/poseidon/kubelet Kubelet Quay Repo: https://quay.io/repository/poseidon/kubelet	2020-03-21 15:43:05 -07:00
Dalton Hubble	ddc1ff5348	Update Grafana from v6.6.2 to v6.7.1 * https://github.com/grafana/grafana/releases/tag/v6.7.1	2020-03-21 15:27:55 -07:00
Dalton Hubble	61557e89a6	Update Prometheus from v2.16.0 to v2.17.0-rc.3 * https://github.com/prometheus/prometheus/releases/tag/v2.17.0-rc.3	2020-03-19 22:38:05 -07:00
Dalton Hubble	c3ef21dbf5	Update etcd from v3.4.4 to v3.4.5 * https://github.com/etcd-io/etcd/releases/tag/v3.4.5	2020-03-18 20:50:41 -07:00
Dalton Hubble	2a5dddeb9d	Promote Fedora CoreOS AWS and Google Cloud * Promote Fedora CoreOS AWS to stable * Promote Fedora CoreOS GCP to beta v1.17.4	2020-03-16 22:12:26 -07:00
Dalton Hubble	75fb4e5d11	Remove Container Linux Update Operator (CLUO) addon * Stop providing example manifests for the Container Linux Update Operator (CLUO) * CLUO requires patches to support Kubernetes v1.16+, but the project and push access is rather unowned * CLUO hasn't been in active use in our clusters and won't be relevant beyond Container Linux. Not to say folks can't patch it and run it on their own. Examples just aren't provided here Related: https://github.com/coreos/container-linux-update-operator/pull/197	2020-03-16 22:05:17 -07:00
Dalton Hubble	1a139ef6f1	Update recommended Terraform versions and providers * Sync the documented Terraform versions and provider plugin versions to those that are actively used/tested by the author	2020-03-16 21:40:52 -07:00
Dalton Hubble	bc7902f40a	Update Kubernetes from v1.17.3 to v1.17.4 * https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.17.md#v1174	2020-03-13 00:06:41 -07:00
Dalton Hubble	70bf39bb9a	Update Calico from v3.12.0 to v3.13.1 * https://docs.projectcalico.org/v3.13/release-notes/	2020-03-12 23:00:38 -07:00
Dalton Hubble	4e1b8f22df	Add support for Flatcar Linux on Azure * Accept `os_image` "flatcar-stable" and "flatcar-beta" to use Kinvolk's Flatcar Linux images from the Azure Marketplace Note: Flatcar Linux Azure Marketplace images require terms be accepted before use	2020-03-12 22:52:48 -07:00
Dalton Hubble	ab7913a061	Accept initial worker node labels and taints map on bare-metal * Add `worker_node_labels` map from node name to a list of initial node label strings * Add `worker_node_taints` map from node name to a list of initial node taint strings * Unlike cloud platforms, bare-metal node labels and taints are defined via a map from node name to list of labels/taints. Bare-metal clusters may have heterogeneous hardware so per node labels and taints are accepted * Only worker node names are allowed. Workloads are not scheduled on controller nodes so altering their labels/taints isn't suitable ``` module "mercury" { ... worker_node_labels = { "node2" = ["role=special"] } worker_node_taints = { "node2" = ["role=special:NoSchedule"] } } ``` Related: https://github.com/poseidon/typhoon/issues/429	2020-03-09 00:12:02 -07:00
Dalton Hubble	7b0ea23cdc	Upgrade terraform-provider-azurerm to v2.0+ * Add support for `terraform-provider-azurerm` v2.0+. Require `terraform-provider-azurerm` v2.0+ and drop v1.x support since the Azure provider major release is not backwards compatible * Use Azure's new Linux VM and Linux VM Scale Set resources * Change controller's Azure disk caching to None * Associate subnets (in addition to NICs) with security groups (aesthetic) * If set, change `worker_priority` from `Low` to `Spot` (action required) Related: * https://www.terraform.io/docs/providers/azurerm/guides/2.0-upgrade-guide.html	2020-03-08 17:40:13 -07:00
Dalton Hubble	c4683c5bad	Refresh Prometheus alerts and Grafana dashboards * Add 2 min wait before KubeNodeUnreachable to be less noisy on premeptible clusters * Add a BlackboxProbeFailure alert for any failing probes for services annotated `prometheus.io/probe: true`	2020-03-02 20:08:37 -08:00
Dalton Hubble	51cee6d5a4	Change Container Linux etcd-member to fetch with docker:// * Quay has historically generated ACI signatures for images to facilitate rkt's notions of verification (it allowed authors to actually sign images, though `--trust-keys-from-https` is in use since etcd and most authors don't sign images). OCI standardization didn't adopt verification ideas and checking signatures has fallen out of favor. * Fix an issue where Quay no longer seems to be generating ACI signatures for new images (e.g. quay.io/coreos/etcd:v.3.4.4) * Don't be alarmed by rkt `--insecure-options=image`. It refers to disabling image signature checking (i.e. docker pull doesn't check signatures either) * System containers for Kubelet and bootstrap have transitioned to the docker:// transport, so there is precedent and this brings all the system containers on Container Linux controllers into alignment	2020-03-02 19:57:45 -08:00
Dalton Hubble	87f9a2fc35	Add automatic worker deletion on Fedora CoreOS clouds * On clouds where workers can scale down or be preempted (AWS, GCP, Azure), shutdown runs delete-node.service to remove a node a prevent NotReady nodes from lingering * Add the delete-node.service that wasn't carried over from Container Linux and port it to use podman	2020-02-29 20:22:03 -08:00

... 2 3 4 5 6 ...

1092 Commits