typhoon

mirror of https://github.com/puppetmaster/typhoon.git synced 2025-02-19 00:01:27 +01:00

Author	SHA1	Message	Date
Dalton Hubble	187bb17d39	Update Grafana from v7.0.0 to v7.0.1 * https://github.com/grafana/grafana/releases/tag/v7.0.1	2020-05-27 21:35:24 -07:00
Dalton Hubble	abc31c3711	Update node-exporter from v1.0.0-rc.1 to v1.0.0 * https://github.com/prometheus/node_exporter/releases/tag/v1.0.0	2020-05-27 21:33:03 -07:00
Dalton Hubble	e72f916c8d	Update etcd from v3.4.8 to v3.4.9 * https://github.com/etcd-io/etcd/blob/master/CHANGELOG-3.4.md#v349-2020-05-20	2020-05-22 00:52:20 -07:00
Dalton Hubble	c52f9f8d08	Upgrade docs packages and refresh content * Promote DigitalOcean from alpha to beta for Fedora CoreOS and Flatcar Linux * Upgrade mkdocs-material and PyPI packages for docs * Replace docs mentions of Container Linux with Flatcar Linux and move docs/cl to docs/flatcar-linux * Deprecate CoreOS Container Linux support. Its still usable for some time, but start removing docs	2020-05-20 23:31:26 -07:00
Dalton Hubble	3bdddc452c	Update Grafana from v7.0.0-beta2 to v7.0.0 * https://grafana.com/docs/grafana/latest/guides/whats-new-in-v7-0/	2020-05-18 23:42:32 -07:00
Dalton Hubble	ff4187a1fb	Use new Azure subnet to set address_prefixes list * Update Azure subnet `address_prefix` to `azure_prefixes` list * Fix warning that `address_prefix` is deprecated * Require `terraform-provider-azurerm` v2.8.0+ (action required) Rel: https://github.com/terraform-providers/terraform-provider-azurerm/pull/6493	2020-05-18 23:35:47 -07:00
Dalton Hubble	90edcd3d77	Update node-exporter from v1.0.0-rc.0 to v1.0.0-rc.1 * https://github.com/prometheus/node_exporter/releases/tag/v1.0.0-rc.1	2020-05-15 18:03:19 -07:00
Dalton Hubble	a927c7c790	Update kube-state-metrics from v1.9.5 to v1.9.6 * https://github.com/kubernetes/kube-state-metrics/releases/tag/v1.9.6	2020-05-15 17:42:24 -07:00
Dalton Hubble	d952576d2f	Update Grafana from v7.0.0-beta3 to v7.0.0 * https://github.com/grafana/grafana/releases/tag/7.0.0	2020-05-15 17:38:59 -07:00
Dalton Hubble	70e389f37f	Restore use of Flatcar Linux Azure Marketplace image * Switch Flatcar Linux Azure to use the Marketplace image from Kinvolk (offer `flatcar-container-linux-free`) * Accepting Azure Marketplace terms is still neccessary, update docs to show accepting the free offer rather than BYOL * Upstream Flatcar: https://github.com/flatcar-linux/Flatcar/issues/82 * Typhoon: https://github.com/poseidon/typhoon/issues/703	2020-05-13 22:50:24 -07:00
Dalton Hubble	01905b00bc	Support Fedora CoreOS OS image streams on AWS * Add `os_stream` variable to set the stream to stable (default), testing, or next * Remove unused os_image variable on Fedora CoreOS AWS	2020-05-13 21:45:12 -07:00
Dalton Hubble	f4194cd57a	Update Grafana from v7.0.0-beta2 to v7.0.0-beta.3 * https://github.com/grafana/grafana/releases/tag/v7.0.0-beta3	2020-05-09 17:50:40 -07:00
Dalton Hubble	a2db4fa8c4	Update Calico from v3.13.3 to v3.14.0 * https://docs.projectcalico.org/v3.14/release-notes/	2020-05-09 16:05:30 -07:00
Dalton Hubble	358854e712	Fix Calico install-cni crash loop on Pod restarts * Set a consistent MCS level/range for Calico install-cni * Note: Rebooting a node was a workaround, because Kubelet relabels /etc/kubernetes(/cni/net.d) Background: * On SELinux enforcing systems, the Calico CNI install-cni container ran with default SELinux context and a random MCS pair. install-cni places CNI configs by first creating a temporary file and then moving them into place, which means the file MCS categories depend on the containers SELinux context. * calico-node Pod restarts creates a new install-cni container with a different MCS pair that cannot access the earlier written file (it places configs every time), causing the init container to error and calico-node to crash loop * https://github.com/projectcalico/cni-plugin/issues/874 ``` mv: inter-device move failed: '/calico.conf.tmp' to '/host/etc/cni/net.d/10-calico.conflist'; unable to remove target: Permission denied Failed to mv files. This may be caused by selinux configuration on the host, or something else. ``` Note, this isn't a host SELinux configuration issue. Related: * https://github.com/poseidon/terraform-render-bootstrap/pull/186	2020-05-09 16:01:44 -07:00
Dalton Hubble	b5dabcea31	Use Fedora CoreOS image streams on Google Cloud * Add `os_stream` variable to set a Fedora CoreOS stream to `stable` (default), `testing`, or `next` * Deprecate `os_image` variable. Remove docs about uploading Fedora CoreOS images manually, this is no longer needed * https://docs.fedoraproject.org/en-US/fedora-coreos/update-streams/ Rel: https://github.com/coreos/fedora-coreos-docs/pull/70	2020-05-08 01:23:12 -07:00
Dalton Hubble	3f0a5d2715	Update Grafana from v7.0.0-beta1 to v7.0.0-beta2 * https://github.com/grafana/grafana/releases/tag/v7.0.0-beta2	2020-05-07 23:04:44 -07:00
Dalton Hubble	33173c0206	Update Prometheus from v2.18.0 to v2.18.1 * https://github.com/prometheus/prometheus/releases/tag/v2.18.1	2020-05-07 22:59:11 -07:00
Dalton Hubble	70f30d9c07	Update Prometheus from v2.18.0-rc.1 to v2.18.0 * https://github.com/prometheus/prometheus/releases/tag/v2.18.0	2020-05-05 22:31:11 -07:00
Dalton Hubble	6afc1643d9	Update nginx-ingress from v0.30.0 to v0.32.0 * Add support for IngressClass and RBAC authorization * Since our nginx ingress controller example uses the flag `--ingress-class=public`, add an IngressClass to go along with it Rel: https://kubernetes.io/docs/concepts/services-networking/ingress/#ingress-class	2020-05-03 23:24:19 -07:00
Dalton Hubble	e71e27e769	Update Prometheus from v2.17.2 to v2.18.0-rc.1 * https://github.com/prometheus/prometheus/releases/tag/v2.18.0-rc.1	2020-04-29 20:57:48 -07:00
Dalton Hubble	64035005d4	Update Grafana from v6.7.2 to v7.0.0-beta1 * https://github.com/grafana/grafana/releases/tag/v7.0.0-beta1	2020-04-29 20:53:30 -07:00
Dalton Hubble	fd044ee117	Enable Kubelet TLS bootstrap and NodeRestriction * Enable bootstrap token authentication on kube-apiserver * Generate the bootstrap.kubernetes.io/token Secret that may be used as a bootstrap token * Generate a bootstrap kubeconfig (with a bootstrap token) to be securely distributed to nodes. Each Kubelet will use the bootstrap kubeconfig to authenticate to kube-apiserver as `system:bootstrappers` and send a node-unique CSR for kube-controller-manager to automatically approve to issue a Kubelet certificate and kubeconfig (expires in 72 hours) * Add ClusterRoleBinding for bootstrap token subjects (`system:bootstrappers`) to have the `system:node-bootstrapper` ClusterRole * Add ClusterRoleBinding for bootstrap token subjects (`system:bootstrappers`) to have the csr nodeclient ClusterRole * Add ClusterRoleBinding for bootstrap token subjects (`system:bootstrappers`) to have the csr selfnodeclient ClusterRole * Enable NodeRestriction admission controller to limit the scope of Node or Pod objects a Kubelet can modify to those of the node itself * Ability for a Kubelet to delete its Node object is retained as preemptible nodes or those in auto-scaling instance groups need to be able to remove themselves on shutdown. This need continues to have precedence over any risk of a node deleting itself maliciously Security notes: 1. Issued Kubelet certificates authenticate as user `system:node:NAME` and group `system:nodes` and are limited in their authorization to perform API operations by Node authorization and NodeRestriction admission. Previously, a Kubelet's authorization was broader. This is the primary security motivation. 2. The bootstrap kubeconfig credential has the same sensitivity as the previous generated TLS client-certificate kubeconfig. It must be distributed securely to nodes. Its compromise still allows an attacker to obtain a Kubelet kubeconfig 3. Bootstrapping Kubelet kubeconfig's with a limited lifetime offers a slight security improvement. * An attacker who obtains the kubeconfig can likely obtain the bootstrap kubeconfig as well, to obtain the ability to renew their access * A compromised bootstrap kubeconfig could plausibly be handled by replacing the bootstrap token Secret, distributing the token to new nodes, and expiration. Whereas a compromised TLS-client certificate kubeconfig can't be revoked (no CRL). However, replacing a bootstrap token can be impractical in real cluster environments, so the limited lifetime is mostly a theoretical benefit. * Cluster CSR objects are visible via kubectl which is nice 4. Bootstrapping node-unique Kubelet kubeconfigs means Kubelet clients have more identity information, which can improve the utility of audits and future features Rel: https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet-tls-bootstrapping/ Rel: https://github.com/poseidon/terraform-render-bootstrap/pull/185	2020-04-28 19:35:33 -07:00
Dalton Hubble	38a6bddd06	Update Calico from v3.13.1 to v3.13.3 * https://docs.projectcalico.org/v3.13/release-notes/	2020-04-23 23:58:02 -07:00
Dalton Hubble	84ed0a31c3	Update Prometheus from v2.17.1 to v2.17.2 * https://github.com/prometheus/prometheus/releases/tag/v2.17.2	2020-04-20 18:09:24 -07:00
Dalton Hubble	fcbee12334	Fix race condition creating DigitalOcean firewall rules * DigitalOcean firewall rules should reference Terraform tag resources rather than using tag strings. Otherwise, terraform apply can fail (neeeds rerun) if a tag has not yet been created	2020-04-19 16:55:02 -07:00
Dalton Hubble	feac94605a	Fix bootstrap mount to use shared volume SELinux label * Race: During initial bootstrap, static control plane pods could hang with Permission denied to bootstrap secrets. A manual fix involved restarting Kubelet, which relabeled mounts The race had no effect on subsequent reboots. * bootstrap.service runs podman with a private unshared mount of /etc/kubernetes/bootstrap-secrets which uses an SELinux MCS label with a category pair. However, bootstrap-secrets should be shared as its mounted by Docker pods kube-apiserver, kube-scheduler, and kube-controller-manager. Restarting Kubelet was a manual fix because Kubelet relabels all /etc/kubernetes * Fix bootstrap Pod to use the shared volume label, which leaves bootstrap-secrets files with SELinux level s0 without MCS * Also allow failed bootstrap.service to be re-applied. This was missing on bare-metal and AWS	2020-04-19 16:31:32 -07:00
Dalton Hubble	2b1b918b43	Revert Flatcar Linux Azure to manual upload images * Initial support for Flatcar Linux on Azure used the Flatcar Linux Azure Marketplace images (e.g. `flatcar-stable`) in https://github.com/poseidon/typhoon/pull/664 * Flatcar Linux Azure Marketplace images have some unresolved items https://github.com/poseidon/typhoon/issues/703 * Until the Marketplace items are resolved, revert to requiring Flatcar Linux's images be manually uploaded (like GCP and DigitalOcean)	2020-04-18 15:40:57 -07:00
Dalton Hubble	671eacb86e	Update Kubernetes from v1.18.1 to v1.18.2 * https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.18.md#changelog-since-v1181	2020-04-16 23:40:52 -07:00
Dalton Hubble	5c4a3f73d5	Add support for Fedora CoreOS on Azure * Add `azure/fedora-coreos/kubernetes` module	2020-04-12 16:35:49 -07:00
Dalton Hubble	76ab4c4c2a	Change `container-linux` module preference to Flatcar Linux * No change to Fedora CoreOS modules * For Container Linx AWS and Azure, change the `os_image` default from coreos-stable to flatcar-stable * For Container Linux GCP and DigitalOcean, change `os_image` to be required since users should upload a Flatcar Linux image and set the variable * For Container Linux bare-metal, recommend users change the `os_channel` to Flatcar Linux. No actual module change.	2020-04-11 14:52:30 -07:00
Dalton Hubble	1420700bc0	Update CHANGES for v1.18.1 release * Change order of modules in the README	2020-04-11 13:23:49 -07:00
Dalton Hubble	80538e2953	Add support for Fedora CoreOS on DigitalOcean * Add `digital-ocean/fedora-coreos/kubernetes` module * DigitalOcean custom uploaded images do not permit droplet IPv6 networking	2020-04-09 23:55:29 -07:00
Dalton Hubble	73af2f3b7c	Update Kubernetes from v1.18.0 to v1.18.1 * https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.18.md#v1181	2020-04-08 19:41:48 -07:00
Dalton Hubble	17ea547723	Update etcd from v3.4.5 to v3.4.7 * https://github.com/etcd-io/etcd/releases/tag/v3.4.7 * https://github.com/etcd-io/etcd/releases/tag/v3.4.6	2020-04-06 21:09:25 -07:00
Dalton Hubble	2b5dfece93	Update Grafana from v6.7.1 to v6.7.2 * https://github.com/grafana/grafana/releases/tag/v6.7.2	2020-04-04 13:13:19 -07:00
Dalton Hubble	d47d40b517	Refresh Prometheus rules/alerts and Grafana dashboards * Refresh upstream Prometheus rules and alerts and Grafana dashboards * All Loki recording rules for convenience	2020-03-31 21:53:01 -07:00
Dalton Hubble	bbbaf949f9	Fix UDP outbound and clock sync timeouts on Azure workers * Add "lb" outbound rule for worker TCP _and_ UDP traffic * Fix Azure worker nodes clock synchronization being inactive due to timeouts reaching the CoreOS / Flatcar NTP pool * Fix Azure worker nodes not providing outbount UDP connectivity Background: Azure provides VMs outbound connectivity either by having a public IP or via an SNAT masquerade feature bundled with their virtual load balancing abstraction (in contrast with, say, a NAT gateway). Azure worker nodes have only a private IP, but are associated with the cluster load balancer's backend pool and ingress frontend IP. Outbound traffic uses SNAT with this frontend IP. A subtle detail with Azure SNAT seems to be that since both inbound lb_rule's are TCP only, outbound UDP traffic isn't SNAT'd (highlights the reasons Azure shouldn't have conflated inbound load balancing with outbound SNAT concepts). However, adding a separate outbound rule and disabling outbound SNAT on our ingress lb_rule's we can tell Azure to continue load balancing as before, and support outbound SNAT for worker traffic of both the TCP and UDP protocol. Fixes clock synchronization timeouts: ``` systemd-timesyncd[786]: Timed out waiting for reply from 45.79.36.123:123 (3.flatcar.pool.ntp.org) ``` Azure controller nodes have their own public IP, so controllers (and etcd) nodes have not had clock synchronization or outbound UDP issues	2020-03-31 21:00:16 -07:00
Dalton Hubble	135c6182b8	Update flannel from v0.11.0 to v0.12.0 * https://github.com/coreos/flannel/releases/tag/v0.12.0	2020-03-31 18:31:59 -07:00
Dalton Hubble	c53dc66d4a	Rename Container Linux snippets variable for consistency * Rename controller_clc_snippets to controller_snippets (cloud platforms) * Rename worker_clc_snippets to worker_snippets (cloud platforms) * Rename clc_snippets to snippets (bare-metal)	2020-03-31 18:25:51 -07:00
Dalton Hubble	9960972726	Fix bootstrap regression when networking="flannel" * Fix bootstrap error for missing `manifests-networking/crdyaml` when `networking = "flannel"` Cleanup manifest-networking directory left during bootstrap * Regressed in v1.18.0 changes for Calico https://github.com/poseidon/typhoon/pull/675	2020-03-31 18:21:59 -07:00
Dalton Hubble	bac5acb3bd	Change default kube-system DaemonSet tolerations * Change kube-proxy, flannel, and calico-node DaemonSet tolerations to tolerate `node.kubernetes.io/not-ready` and `node-role.kubernetes.io/master` (i.e. controllers) explicitly, rather than tolerating all taints * kube-system DaemonSets will no longer tolerate custom node taints by default. Instead, custom node taints must be enumerated to opt-in to scheduling/executing the kube-system DaemonSets * Consider setting the daemonset_tolerations variable of terraform-render-bootstrap at a later date Background: Tolerating all taints ruled out use-cases where certain nodes might legitimately need to keep kube-proxy or CNI networking disabled Related: https://github.com/poseidon/terraform-render-bootstrap/pull/179	2020-03-31 01:00:45 -07:00
Dalton Hubble	70bdc9ec94	Allow bootstrap re-apply for Fedora CoreOS GCP * Problem: Fedora CoreOS images are manually uploaded to GCP. When a cluster is created with a stale image, Zincati immediately checks for the latest stable image, fetches, and reboots. In practice, this can unfortunately occur exactly during the initial cluster bootstrap phase. * Recommended: Upload the latest Fedora CoreOS image regularly * Mitigation: Allow a failed bootstrap.service run (which won't touch the done ConditionalPathExists) to be re-run by running `terraforma apply` again. Add a known issue to CHANGES * Update docs to show the current Fedora CoreOS stable version to reduce likelihood users see this issue Longer term ideas: * Ideal: Fedora CoreOS publishes a stable channel. Instances will always boot with the latest image in a channel. The problem disappears since it works the same way AWS does * Timer: Consider some timer-based approach to have zincati delay any system reboots for the first ~30 min of a machine's life. Possibly just configured on the controller node https://github.com/coreos/zincati/pull/251 * External coordination: For Container Linux, locksmith filled a similar role and was disabled to allow CLUO to coordinate reboots. By running atop Kubernetes, it was not possible for the reboot to occur before cluster bootstrap * Rely on https://github.com/coreos/zincati/issues/115 to delay the reboot since bootstrap involves an SSH session * Use path-based activation of zincati on controllers and set that path at the end of the bootstrap process Rel: https://github.com/coreos/fedora-coreos-tracker/issues/239	2020-03-28 18:12:31 -07:00
Dalton Hubble	144bb9403c	Add support for Fedora CoreOS snippets * Refresh snippets customization docs * Requires terraform-provider-ct v0.5+	2020-03-28 16:15:04 -07:00
Dalton Hubble	5fca08064b	Fix Fedora CoreOS AMI to filter for stable images * Fix issue observed in us-east-1 where AMI filters chose the latest testing channel release, rather than the stable chanel * Fedora CoreOS AMI filter selects the latest image with a matching name, x86_64, and hvm, excluding dev images. Add a filter for "Fedora CoreOS stable", which seems to be the only distinguishing metadata indicating the channel	2020-03-28 12:57:45 -07:00
Dalton Hubble	a1a5da6bc2	Add CoreOS Container Linux EOL recommendation to CHANGES * Recommend that users who have not yet tried Fedora CoreOS or Flatcar Linux do so. Likely, Container Linux will reach EOL and platform support / stability ratings will be in a mixed state. Nevertheless, folks should migrate by September.	2020-03-26 23:41:54 -07:00
Dalton Hubble	076b8e3c42	Update Prometheus from v2.17.0 to v2.17.1 * https://github.com/prometheus/prometheus/releases/tag/v2.17.1	2020-03-26 22:17:13 -07:00
Dalton Hubble	ef5f953e04	Set docker log driver to journald on Fedora CoreOS * Before Kubernetes v1.18.0, Kubelet only supported kubectl `--limit-bytes` with the Docker `json-file` log driver so the Fedora CoreOS default was overridden for conformance. See https://github.com/poseidon/typhoon/pull/642 * Kubelet v1.18+ implemented support for other docker log drivers, so the Fedora CoreOS default `journald` can be used again Rel: https://github.com/kubernetes/kubernetes/issues/86367	2020-03-26 22:06:45 -07:00
Dalton Hubble	f100a90d28	Update Kubernetes from v1.17.4 to v1.18.0 * https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.18.md	2020-03-25 17:51:50 -07:00
Dalton Hubble	5d1e4ad333	Deprecate asset_dir variable and remove docs * Remove docs for the `asset_dir` variable and deprecate it in CHANGES. It will be removed in an upcoming release * Typhoon v1.17.0 introduced a new mechanism for managing and distributing generated assets that stopped relying on writing out to disk. `asset_dir` became optional and defaulted to being unset / off (recommended)	2020-03-25 00:00:01 -07:00
Dalton Hubble	9f702c72d2	Rename DigitalOcean image variable to os_image * Rename variable `image` to `os_image` to match the naming used for the same purpose on other supported platforms (e.g. AWS, Azure, GCP)	2020-03-24 23:49:37 -07:00

1 2 3 4 5 ...

656 Commits