typhoon

mirror of https://github.com/puppetmaster/typhoon.git synced 2024-12-26 06:19:33 +01:00

Author	SHA1	Message	Date
Dalton Hubble	e72f916c8d	Update etcd from v3.4.8 to v3.4.9 * https://github.com/etcd-io/etcd/blob/master/CHANGELOG-3.4.md#v349-2020-05-20	2020-05-22 00:52:20 -07:00
Dalton Hubble	c52f9f8d08	Upgrade docs packages and refresh content * Promote DigitalOcean from alpha to beta for Fedora CoreOS and Flatcar Linux * Upgrade mkdocs-material and PyPI packages for docs * Replace docs mentions of Container Linux with Flatcar Linux and move docs/cl to docs/flatcar-linux * Deprecate CoreOS Container Linux support. It's still usable for some time, but start removing docs	2020-05-20 23:31:26 -07:00
Dalton Hubble	3bdddc452c	Update Grafana from v7.0.0-beta2 to v7.0.0 * https://grafana.com/docs/grafana/latest/guides/whats-new-in-v7-0/	2020-05-18 23:42:32 -07:00
Dalton Hubble	ff4187a1fb	Use new Azure subnet to set address_prefixes list * Update Azure subnet `address_prefix` to `address_prefixes` list * Fix warning that `address_prefix` is deprecated * Require `terraform-provider-azurerm` v2.8.0+ (action required) Rel: https://github.com/terraform-providers/terraform-provider-azurerm/pull/6493	2020-05-18 23:35:47 -07:00
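A minimal sketch of the subnet change described in ff4187a1fb, assuming `terraform-provider-azurerm` v2.8.0+. The resource names, the referenced resource group and virtual network, and the CIDR are hypothetical placeholders, not values taken from the Typhoon module:

```hcl
resource "azurerm_subnet" "worker" {
  name                 = "worker"                              # hypothetical name
  resource_group_name  = azurerm_resource_group.cluster.name   # hypothetical reference
  virtual_network_name = azurerm_virtual_network.network.name  # hypothetical reference

  # Before: address_prefix = "10.0.8.0/21" (deprecated single string)
  # After: a list of prefixes
  address_prefixes = ["10.0.8.0/21"]
}
```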
Dalton Hubble	90edcd3d77	Update node-exporter from v1.0.0-rc.0 to v1.0.0-rc.1 * https://github.com/prometheus/node_exporter/releases/tag/v1.0.0-rc.1	2020-05-15 18:03:19 -07:00
Dalton Hubble	a927c7c790	Update kube-state-metrics from v1.9.5 to v1.9.6 * https://github.com/kubernetes/kube-state-metrics/releases/tag/v1.9.6	2020-05-15 17:42:24 -07:00
Dalton Hubble	d952576d2f	Update Grafana from v7.0.0-beta3 to v7.0.0 * https://github.com/grafana/grafana/releases/tag/7.0.0	2020-05-15 17:38:59 -07:00
Dalton Hubble	70e389f37f	Restore use of Flatcar Linux Azure Marketplace image * Switch Flatcar Linux Azure to use the Marketplace image from Kinvolk (offer `flatcar-container-linux-free`) * Accepting Azure Marketplace terms is still necessary, update docs to show accepting the free offer rather than BYOL * Upstream Flatcar: https://github.com/flatcar-linux/Flatcar/issues/82 * Typhoon: https://github.com/poseidon/typhoon/issues/703	2020-05-13 22:50:24 -07:00
Dalton Hubble	01905b00bc	Support Fedora CoreOS OS image streams on AWS * Add `os_stream` variable to set the stream to stable (default), testing, or next * Remove unused os_image variable on Fedora CoreOS AWS	2020-05-13 21:45:12 -07:00
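A sketch of how the `os_stream` variable from 01905b00bc might be set. The module name, the source path (written by analogy with the `azure/fedora-coreos/kubernetes` module path seen elsewhere in this log), and the `<release-tag>` placeholder are assumptions:

```hcl
module "example" {
  # Assumed source path; replace <release-tag> with the release you actually use.
  source = "git::https://github.com/poseidon/typhoon//aws/fedora-coreos/kubernetes?ref=<release-tag>"

  # ... other required variables omitted ...

  # One of "stable" (default), "testing", or "next".
  os_stream = "testing"
}
```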
Dalton Hubble	f4194cd57a	Update Grafana from v7.0.0-beta2 to v7.0.0-beta3 * https://github.com/grafana/grafana/releases/tag/v7.0.0-beta3	2020-05-09 17:50:40 -07:00
Dalton Hubble	a2db4fa8c4	Update Calico from v3.13.3 to v3.14.0 * https://docs.projectcalico.org/v3.14/release-notes/	2020-05-09 16:05:30 -07:00
Dalton Hubble	358854e712	Fix Calico install-cni crash loop on Pod restarts * Set a consistent MCS level/range for Calico install-cni * Note: Rebooting a node was a workaround, because Kubelet relabels /etc/kubernetes(/cni/net.d) Background: * On SELinux enforcing systems, the Calico CNI install-cni container ran with default SELinux context and a random MCS pair. install-cni places CNI configs by first creating a temporary file and then moving them into place, which means the file MCS categories depend on the container's SELinux context. * A calico-node Pod restart creates a new install-cni container with a different MCS pair that cannot access the earlier written file (it places configs every time), causing the init container to error and calico-node to crash loop * https://github.com/projectcalico/cni-plugin/issues/874 ``` mv: inter-device move failed: '/calico.conf.tmp' to '/host/etc/cni/net.d/10-calico.conflist'; unable to remove target: Permission denied Failed to mv files. This may be caused by selinux configuration on the host, or something else. ``` Note, this isn't a host SELinux configuration issue. Related: * https://github.com/poseidon/terraform-render-bootstrap/pull/186	2020-05-09 16:01:44 -07:00
Dalton Hubble	b5dabcea31	Use Fedora CoreOS image streams on Google Cloud * Add `os_stream` variable to set a Fedora CoreOS stream to `stable` (default), `testing`, or `next` * Deprecate `os_image` variable. Remove docs about uploading Fedora CoreOS images manually, this is no longer needed * https://docs.fedoraproject.org/en-US/fedora-coreos/update-streams/ Rel: https://github.com/coreos/fedora-coreos-docs/pull/70	2020-05-08 01:23:12 -07:00
Dalton Hubble	3f0a5d2715	Update Grafana from v7.0.0-beta1 to v7.0.0-beta2 * https://github.com/grafana/grafana/releases/tag/v7.0.0-beta2	2020-05-07 23:04:44 -07:00
Dalton Hubble	33173c0206	Update Prometheus from v2.18.0 to v2.18.1 * https://github.com/prometheus/prometheus/releases/tag/v2.18.1	2020-05-07 22:59:11 -07:00
Dalton Hubble	70f30d9c07	Update Prometheus from v2.18.0-rc.1 to v2.18.0 * https://github.com/prometheus/prometheus/releases/tag/v2.18.0	2020-05-05 22:31:11 -07:00
Dalton Hubble	6afc1643d9	Update nginx-ingress from v0.30.0 to v0.32.0 * Add support for IngressClass and RBAC authorization * Since our nginx ingress controller example uses the flag `--ingress-class=public`, add an IngressClass to go along with it Rel: https://kubernetes.io/docs/concepts/services-networking/ingress/#ingress-class	2020-05-03 23:24:19 -07:00
Dalton Hubble	e71e27e769	Update Prometheus from v2.17.2 to v2.18.0-rc.1 * https://github.com/prometheus/prometheus/releases/tag/v2.18.0-rc.1	2020-04-29 20:57:48 -07:00
Dalton Hubble	64035005d4	Update Grafana from v6.7.2 to v7.0.0-beta1 * https://github.com/grafana/grafana/releases/tag/v7.0.0-beta1	2020-04-29 20:53:30 -07:00
Dalton Hubble	fd044ee117	Enable Kubelet TLS bootstrap and NodeRestriction * Enable bootstrap token authentication on kube-apiserver * Generate the bootstrap.kubernetes.io/token Secret that may be used as a bootstrap token * Generate a bootstrap kubeconfig (with a bootstrap token) to be securely distributed to nodes. Each Kubelet will use the bootstrap kubeconfig to authenticate to kube-apiserver as `system:bootstrappers` and send a node-unique CSR for kube-controller-manager to automatically approve to issue a Kubelet certificate and kubeconfig (expires in 72 hours) * Add ClusterRoleBinding for bootstrap token subjects (`system:bootstrappers`) to have the `system:node-bootstrapper` ClusterRole * Add ClusterRoleBinding for bootstrap token subjects (`system:bootstrappers`) to have the csr nodeclient ClusterRole * Add ClusterRoleBinding for bootstrap token subjects (`system:bootstrappers`) to have the csr selfnodeclient ClusterRole * Enable NodeRestriction admission controller to limit the scope of Node or Pod objects a Kubelet can modify to those of the node itself * Ability for a Kubelet to delete its Node object is retained as preemptible nodes or those in auto-scaling instance groups need to be able to remove themselves on shutdown. This need continues to have precedence over any risk of a node deleting itself maliciously Security notes: 1. Issued Kubelet certificates authenticate as user `system:node:NAME` and group `system:nodes` and are limited in their authorization to perform API operations by Node authorization and NodeRestriction admission. Previously, a Kubelet's authorization was broader. This is the primary security motivation. 2. The bootstrap kubeconfig credential has the same sensitivity as the previous generated TLS client-certificate kubeconfig. It must be distributed securely to nodes. Its compromise still allows an attacker to obtain a Kubelet kubeconfig 3. Bootstrapping Kubelet kubeconfigs with a limited lifetime offers a slight security improvement. * An attacker who obtains the kubeconfig can likely obtain the bootstrap kubeconfig as well, to obtain the ability to renew their access * A compromised bootstrap kubeconfig could plausibly be handled by replacing the bootstrap token Secret, distributing the token to new nodes, and expiration. Whereas a compromised TLS-client certificate kubeconfig can't be revoked (no CRL). However, replacing a bootstrap token can be impractical in real cluster environments, so the limited lifetime is mostly a theoretical benefit. * Cluster CSR objects are visible via kubectl which is nice 4. Bootstrapping node-unique Kubelet kubeconfigs means Kubelet clients have more identity information, which can improve the utility of audits and future features Rel: https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet-tls-bootstrapping/ Rel: https://github.com/poseidon/terraform-render-bootstrap/pull/185	2020-04-28 19:35:33 -07:00
Dalton Hubble	38a6bddd06	Update Calico from v3.13.1 to v3.13.3 * https://docs.projectcalico.org/v3.13/release-notes/	2020-04-23 23:58:02 -07:00
Dalton Hubble	84ed0a31c3	Update Prometheus from v2.17.1 to v2.17.2 * https://github.com/prometheus/prometheus/releases/tag/v2.17.2	2020-04-20 18:09:24 -07:00
Dalton Hubble	fcbee12334	Fix race condition creating DigitalOcean firewall rules * DigitalOcean firewall rules should reference Terraform tag resources rather than using tag strings. Otherwise, terraform apply can fail (needs rerun) if a tag has not yet been created	2020-04-19 16:55:02 -07:00
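A sketch of the pattern fcbee12334 describes, with hypothetical resource names, tag name, and firewall rule. Referencing `digitalocean_tag.controllers.name` rather than a literal tag string gives Terraform an implicit dependency, so the tag is created before the firewall that uses it:

```hcl
resource "digitalocean_tag" "controllers" {
  name = "example-controller" # hypothetical tag name
}

resource "digitalocean_firewall" "rules" {
  name = "example" # hypothetical firewall name

  # Reference the tag resource, not the string "example-controller".
  tags = [digitalocean_tag.controllers.name]

  # Illustrative rule only; not the rule set used by Typhoon.
  inbound_rule {
    protocol    = "tcp"
    port_range  = "2379-2380"
    source_tags = [digitalocean_tag.controllers.name]
  }
}
```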
Dalton Hubble	feac94605a	Fix bootstrap mount to use shared volume SELinux label * Race: During initial bootstrap, static control plane pods could hang with Permission denied to bootstrap secrets. A manual fix involved restarting Kubelet, which relabeled mounts. The race had no effect on subsequent reboots. * bootstrap.service runs podman with a private unshared mount of /etc/kubernetes/bootstrap-secrets which uses an SELinux MCS label with a category pair. However, bootstrap-secrets should be shared as it's mounted by Docker pods kube-apiserver, kube-scheduler, and kube-controller-manager. Restarting Kubelet was a manual fix because Kubelet relabels all /etc/kubernetes * Fix bootstrap Pod to use the shared volume label, which leaves bootstrap-secrets files with SELinux level s0 without MCS * Also allow failed bootstrap.service to be re-applied. This was missing on bare-metal and AWS	2020-04-19 16:31:32 -07:00
Dalton Hubble	2b1b918b43	Revert Flatcar Linux Azure to manual upload images * Initial support for Flatcar Linux on Azure used the Flatcar Linux Azure Marketplace images (e.g. `flatcar-stable`) in https://github.com/poseidon/typhoon/pull/664 * Flatcar Linux Azure Marketplace images have some unresolved items https://github.com/poseidon/typhoon/issues/703 * Until the Marketplace items are resolved, revert to requiring Flatcar Linux's images be manually uploaded (like GCP and DigitalOcean)	2020-04-18 15:40:57 -07:00
Dalton Hubble	671eacb86e	Update Kubernetes from v1.18.1 to v1.18.2 * https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.18.md#changelog-since-v1181	2020-04-16 23:40:52 -07:00
Dalton Hubble	5c4a3f73d5	Add support for Fedora CoreOS on Azure * Add `azure/fedora-coreos/kubernetes` module	2020-04-12 16:35:49 -07:00
Dalton Hubble	76ab4c4c2a	Change `container-linux` module preference to Flatcar Linux * No change to Fedora CoreOS modules * For Container Linux AWS and Azure, change the `os_image` default from coreos-stable to flatcar-stable * For Container Linux GCP and DigitalOcean, change `os_image` to be required since users should upload a Flatcar Linux image and set the variable * For Container Linux bare-metal, recommend users change the `os_channel` to Flatcar Linux. No actual module change.	2020-04-11 14:52:30 -07:00
Dalton Hubble	1420700bc0	Update CHANGES for v1.18.1 release * Change order of modules in the README	2020-04-11 13:23:49 -07:00
Dalton Hubble	80538e2953	Add support for Fedora CoreOS on DigitalOcean * Add `digital-ocean/fedora-coreos/kubernetes` module * DigitalOcean custom uploaded images do not permit droplet IPv6 networking	2020-04-09 23:55:29 -07:00
Dalton Hubble	73af2f3b7c	Update Kubernetes from v1.18.0 to v1.18.1 * https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.18.md#v1181	2020-04-08 19:41:48 -07:00
Dalton Hubble	17ea547723	Update etcd from v3.4.5 to v3.4.7 * https://github.com/etcd-io/etcd/releases/tag/v3.4.7 * https://github.com/etcd-io/etcd/releases/tag/v3.4.6	2020-04-06 21:09:25 -07:00
Dalton Hubble	2b5dfece93	Update Grafana from v6.7.1 to v6.7.2 * https://github.com/grafana/grafana/releases/tag/v6.7.2	2020-04-04 13:13:19 -07:00
Dalton Hubble	d47d40b517	Refresh Prometheus rules/alerts and Grafana dashboards * Refresh upstream Prometheus rules and alerts and Grafana dashboards * Add Loki recording rules for convenience	2020-03-31 21:53:01 -07:00
Dalton Hubble	bbbaf949f9	Fix UDP outbound and clock sync timeouts on Azure workers * Add "lb" outbound rule for worker TCP _and_ UDP traffic * Fix Azure worker nodes clock synchronization being inactive due to timeouts reaching the CoreOS / Flatcar NTP pool * Fix Azure worker nodes not providing outbound UDP connectivity Background: Azure provides VMs outbound connectivity either by having a public IP or via an SNAT masquerade feature bundled with their virtual load balancing abstraction (in contrast with, say, a NAT gateway). Azure worker nodes have only a private IP, but are associated with the cluster load balancer's backend pool and ingress frontend IP. Outbound traffic uses SNAT with this frontend IP. A subtle detail with Azure SNAT seems to be that since both inbound lb_rule's are TCP only, outbound UDP traffic isn't SNAT'd (highlights the reasons Azure shouldn't have conflated inbound load balancing with outbound SNAT concepts). However, by adding a separate outbound rule and disabling outbound SNAT on our ingress lb_rule's, we can tell Azure to continue load balancing as before, and support outbound SNAT for worker traffic of both the TCP and UDP protocol. Fixes clock synchronization timeouts: ``` systemd-timesyncd[786]: Timed out waiting for reply from 45.79.36.123:123 (3.flatcar.pool.ntp.org) ``` Azure controller nodes have their own public IP, so controllers (and etcd) nodes have not had clock synchronization or outbound UDP issues	2020-03-31 21:00:16 -07:00
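A sketch of the approach described in bbbaf949f9: turn off outbound SNAT on the inbound TCP rule and add a separate outbound rule covering all protocols. Resource names, the frontend name, and the referenced load balancer, resource group, and backend pool are hypothetical. Argument names follow the azurerm v2.x provider as best understood here and should be checked against the provider version in use:

```hcl
resource "azurerm_lb_rule" "ingress-http" {
  name                           = "ingress-http"
  resource_group_name            = azurerm_resource_group.cluster.name # hypothetical reference
  loadbalancer_id                = azurerm_lb.cluster.id               # hypothetical reference
  frontend_ip_configuration_name = "ingress"                           # hypothetical frontend name
  protocol                       = "Tcp"
  frontend_port                  = 80
  backend_port                   = 80
  backend_address_pool_id        = azurerm_lb_backend_address_pool.worker.id

  # Inbound rule no longer provides implicit (TCP-only) outbound SNAT.
  disable_outbound_snat = true
}

resource "azurerm_lb_outbound_rule" "worker-outbound" {
  name                    = "worker"
  resource_group_name     = azurerm_resource_group.cluster.name
  loadbalancer_id         = azurerm_lb.cluster.id
  backend_address_pool_id = azurerm_lb_backend_address_pool.worker.id

  # "All" covers both TCP and UDP outbound traffic.
  protocol = "All"

  frontend_ip_configuration {
    name = "ingress"
  }
}
```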
Dalton Hubble	135c6182b8	Update flannel from v0.11.0 to v0.12.0 * https://github.com/coreos/flannel/releases/tag/v0.12.0	2020-03-31 18:31:59 -07:00
Dalton Hubble	c53dc66d4a	Rename Container Linux snippets variable for consistency * Rename controller_clc_snippets to controller_snippets (cloud platforms) * Rename worker_clc_snippets to worker_snippets (cloud platforms) * Rename clc_snippets to snippets (bare-metal)	2020-03-31 18:25:51 -07:00
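A sketch of the renamed variables from c53dc66d4a on a cloud platform module. The module name and file paths are hypothetical, and treating each variable as a list of snippet file contents is an assumption:

```hcl
module "example" {
  # ... source and other required variables omitted ...

  # Formerly controller_clc_snippets and worker_clc_snippets.
  controller_snippets = [file("./snippets/controller.yaml")]
  worker_snippets     = [file("./snippets/worker.yaml")]
}
```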
Dalton Hubble	9960972726	Fix bootstrap regression when networking="flannel" * Fix bootstrap error for missing `manifests-networking/crdyaml` when `networking = "flannel"` * Cleanup manifest-networking directory left during bootstrap * Regressed in v1.18.0 changes for Calico https://github.com/poseidon/typhoon/pull/675	2020-03-31 18:21:59 -07:00
Dalton Hubble	bac5acb3bd	Change default kube-system DaemonSet tolerations * Change kube-proxy, flannel, and calico-node DaemonSet tolerations to tolerate `node.kubernetes.io/not-ready` and `node-role.kubernetes.io/master` (i.e. controllers) explicitly, rather than tolerating all taints * kube-system DaemonSets will no longer tolerate custom node taints by default. Instead, custom node taints must be enumerated to opt-in to scheduling/executing the kube-system DaemonSets * Consider setting the daemonset_tolerations variable of terraform-render-bootstrap at a later date Background: Tolerating all taints ruled out use-cases where certain nodes might legitimately need to keep kube-proxy or CNI networking disabled Related: https://github.com/poseidon/terraform-render-bootstrap/pull/179	2020-03-31 01:00:45 -07:00
Dalton Hubble	70bdc9ec94	Allow bootstrap re-apply for Fedora CoreOS GCP * Problem: Fedora CoreOS images are manually uploaded to GCP. When a cluster is created with a stale image, Zincati immediately checks for the latest stable image, fetches, and reboots. In practice, this can unfortunately occur exactly during the initial cluster bootstrap phase. * Recommended: Upload the latest Fedora CoreOS image regularly * Mitigation: Allow a failed bootstrap.service run (which won't touch the done ConditionPathExists) to be re-run by running `terraform apply` again. Add a known issue to CHANGES * Update docs to show the current Fedora CoreOS stable version to reduce likelihood users see this issue Longer term ideas: * Ideal: Fedora CoreOS publishes a stable channel. Instances will always boot with the latest image in a channel. The problem disappears since it works the same way AWS does * Timer: Consider some timer-based approach to have zincati delay any system reboots for the first ~30 min of a machine's life. Possibly just configured on the controller node https://github.com/coreos/zincati/pull/251 * External coordination: For Container Linux, locksmith filled a similar role and was disabled to allow CLUO to coordinate reboots. By running atop Kubernetes, it was not possible for the reboot to occur before cluster bootstrap * Rely on https://github.com/coreos/zincati/issues/115 to delay the reboot since bootstrap involves an SSH session * Use path-based activation of zincati on controllers and set that path at the end of the bootstrap process Rel: https://github.com/coreos/fedora-coreos-tracker/issues/239	2020-03-28 18:12:31 -07:00
Dalton Hubble	144bb9403c	Add support for Fedora CoreOS snippets * Refresh snippets customization docs * Requires terraform-provider-ct v0.5+	2020-03-28 16:15:04 -07:00
Dalton Hubble	5fca08064b	Fix Fedora CoreOS AMI to filter for stable images * Fix issue observed in us-east-1 where AMI filters chose the latest testing channel release, rather than the stable channel * Fedora CoreOS AMI filter selects the latest image with a matching name, x86_64, and hvm, excluding dev images. Add a filter for "Fedora CoreOS stable", which seems to be the only distinguishing metadata indicating the channel	2020-03-28 12:57:45 -07:00
Dalton Hubble	a1a5da6bc2	Add CoreOS Container Linux EOL recommendation to CHANGES * Recommend that users who have not yet tried Fedora CoreOS or Flatcar Linux do so. Likely, Container Linux will reach EOL and platform support / stability ratings will be in a mixed state. Nevertheless, folks should migrate by September.	2020-03-26 23:41:54 -07:00
Dalton Hubble	076b8e3c42	Update Prometheus from v2.17.0 to v2.17.1 * https://github.com/prometheus/prometheus/releases/tag/v2.17.1	2020-03-26 22:17:13 -07:00
Dalton Hubble	ef5f953e04	Set docker log driver to journald on Fedora CoreOS * Before Kubernetes v1.18.0, Kubelet only supported kubectl `--limit-bytes` with the Docker `json-file` log driver so the Fedora CoreOS default was overridden for conformance. See https://github.com/poseidon/typhoon/pull/642 * Kubelet v1.18+ implemented support for other docker log drivers, so the Fedora CoreOS default `journald` can be used again Rel: https://github.com/kubernetes/kubernetes/issues/86367	2020-03-26 22:06:45 -07:00
Dalton Hubble	f100a90d28	Update Kubernetes from v1.17.4 to v1.18.0 * https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.18.md	2020-03-25 17:51:50 -07:00
Dalton Hubble	5d1e4ad333	Deprecate asset_dir variable and remove docs * Remove docs for the `asset_dir` variable and deprecate it in CHANGES. It will be removed in an upcoming release * Typhoon v1.17.0 introduced a new mechanism for managing and distributing generated assets that stopped relying on writing out to disk. `asset_dir` became optional and defaulted to being unset / off (recommended)	2020-03-25 00:00:01 -07:00
Dalton Hubble	9f702c72d2	Rename DigitalOcean image variable to os_image * Rename variable `image` to `os_image` to match the naming used for the same purpose on other supported platforms (e.g. AWS, Azure, GCP)	2020-03-24 23:49:37 -07:00
Dalton Hubble	e556bc2167	Update Prometheus from v2.17.0-rc.3 to v2.17.0 * https://github.com/prometheus/prometheus/releases/tag/v2.17.0	2020-03-24 23:15:49 -07:00
Dalton Hubble	590d941f50	Switch from upstream hyperkube image to individual images * Kubernetes plans to stop releasing the hyperkube container image * Upstream will continue to publish `kube-apiserver`, `kube-controller-manager`, `kube-scheduler`, and `kube-proxy` container images to `k8s.gcr.io` * Upstream will publish Kubelet only as a binary for distros to package, either as a DEB/RPM on traditional distros or a container image on container-optimized operating systems * Typhoon will package the upstream Kubelet (checksummed) and its dependencies as a container image for use on CoreOS Container Linux, Flatcar Linux, and Fedora CoreOS * Update the Typhoon container image security policy to list `quay.io/poseidon/kubelet` as an official distributed artifact Hyperkube: https://github.com/kubernetes/kubernetes/pull/88676 Kubelet Container Image: https://github.com/poseidon/kubelet Kubelet Quay Repo: https://quay.io/repository/poseidon/kubelet	2020-03-21 15:43:05 -07:00
Dalton Hubble	ddc1ff5348	Update Grafana from v6.6.2 to v6.7.1 * https://github.com/grafana/grafana/releases/tag/v6.7.1	2020-03-21 15:27:55 -07:00
Dalton Hubble	61557e89a6	Update Prometheus from v2.16.0 to v2.17.0-rc.3 * https://github.com/prometheus/prometheus/releases/tag/v2.17.0-rc.3	2020-03-19 22:38:05 -07:00
Dalton Hubble	c3ef21dbf5	Update etcd from v3.4.4 to v3.4.5 * https://github.com/etcd-io/etcd/releases/tag/v3.4.5	2020-03-18 20:50:41 -07:00
Dalton Hubble	2a5dddeb9d	Promote Fedora CoreOS AWS and Google Cloud * Promote Fedora CoreOS AWS to stable * Promote Fedora CoreOS GCP to beta	2020-03-16 22:12:26 -07:00
Dalton Hubble	75fb4e5d11	Remove Container Linux Update Operator (CLUO) addon * Stop providing example manifests for the Container Linux Update Operator (CLUO) * CLUO requires patches to support Kubernetes v1.16+, but the project and push access is rather unowned * CLUO hasn't been in active use in our clusters and won't be relevant beyond Container Linux. Not to say folks can't patch it and run it on their own. Examples just aren't provided here Related: https://github.com/coreos/container-linux-update-operator/pull/197	2020-03-16 22:05:17 -07:00
Dalton Hubble	bc7902f40a	Update Kubernetes from v1.17.3 to v1.17.4 * https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.17.md#v1174	2020-03-13 00:06:41 -07:00
Dalton Hubble	70bf39bb9a	Update Calico from v3.12.0 to v3.13.1 * https://docs.projectcalico.org/v3.13/release-notes/	2020-03-12 23:00:38 -07:00
Dalton Hubble	4e1b8f22df	Add support for Flatcar Linux on Azure * Accept `os_image` "flatcar-stable" and "flatcar-beta" to use Kinvolk's Flatcar Linux images from the Azure Marketplace Note: Flatcar Linux Azure Marketplace images require terms be accepted before use	2020-03-12 22:52:48 -07:00
Dalton Hubble	ab7913a061	Accept initial worker node labels and taints map on bare-metal * Add `worker_node_labels` map from node name to a list of initial node label strings * Add `worker_node_taints` map from node name to a list of initial node taint strings * Unlike cloud platforms, bare-metal node labels and taints are defined via a map from node name to list of labels/taints. Bare-metal clusters may have heterogeneous hardware so per node labels and taints are accepted * Only worker node names are allowed. Workloads are not scheduled on controller nodes so altering their labels/taints isn't suitable ``` module "mercury" { ... worker_node_labels = { "node2" = ["role=special"] } worker_node_taints = { "node2" = ["role=special:NoSchedule"] } } ``` Related: https://github.com/poseidon/typhoon/issues/429	2020-03-09 00:12:02 -07:00
Dalton Hubble	7b0ea23cdc	Upgrade terraform-provider-azurerm to v2.0+ * Add support for `terraform-provider-azurerm` v2.0+. Require `terraform-provider-azurerm` v2.0+ and drop v1.x support since the Azure provider major release is not backwards compatible * Use Azure's new Linux VM and Linux VM Scale Set resources * Change controller's Azure disk caching to None * Associate subnets (in addition to NICs) with security groups (aesthetic) * If set, change `worker_priority` from `Low` to `Spot` (action required) Related: * https://www.terraform.io/docs/providers/azurerm/guides/2.0-upgrade-guide.html	2020-03-08 17:40:13 -07:00
Dalton Hubble	c4683c5bad	Refresh Prometheus alerts and Grafana dashboards * Add 2 min wait before KubeNodeUnreachable to be less noisy on preemptible clusters * Add a BlackboxProbeFailure alert for any failing probes for services annotated `prometheus.io/probe: true`	2020-03-02 20:08:37 -08:00
Dalton Hubble	51cee6d5a4	Change Container Linux etcd-member to fetch with docker:// * Quay has historically generated ACI signatures for images to facilitate rkt's notions of verification (it allowed authors to actually sign images, though `--trust-keys-from-https` is in use since etcd and most authors don't sign images). OCI standardization didn't adopt verification ideas and checking signatures has fallen out of favor. * Fix an issue where Quay no longer seems to be generating ACI signatures for new images (e.g. quay.io/coreos/etcd:v3.4.4) * Don't be alarmed by rkt `--insecure-options=image`. It refers to disabling image signature checking (i.e. docker pull doesn't check signatures either) * System containers for Kubelet and bootstrap have transitioned to the docker:// transport, so there is precedent and this brings all the system containers on Container Linux controllers into alignment	2020-03-02 19:57:45 -08:00
Dalton Hubble	87f9a2fc35	Add automatic worker deletion on Fedora CoreOS clouds * On clouds where workers can scale down or be preempted (AWS, GCP, Azure), shutdown runs delete-node.service to remove a node and prevent NotReady nodes from lingering * Add the delete-node.service that wasn't carried over from Container Linux and port it to use podman	2020-02-29 20:22:03 -08:00
Dalton Hubble	6de5cf5a55	Update etcd from v3.4.3 to v3.4.4 * https://github.com/etcd-io/etcd/releases/tag/v3.4.4	2020-02-29 16:19:29 -08:00
Dalton Hubble	3250994c95	Use a route table with separate (rather than inline) routes * Allow users to extend the route table using a data reference and adding route resources (e.g. unusual peering setups) * Note: Internally connecting AWS clusters can reduce cross-cloud flexibility and inhibit blue-green cluster patterns. It is not recommended	2020-02-25 23:21:58 -08:00
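A sketch of extending the route table as 3250994c95 allows: look up the table with a data reference, then attach a separate route resource. The variable names, the lookup criteria, and the destination CIDR are hypothetical:

```hcl
# Hypothetical inputs identifying the cluster network and the peering connection.
variable "cluster_vpc_id" {}
variable "cluster_subnet_id" {}
variable "peering_connection_id" {}

# Data reference to the route table associated with one of the cluster's subnets.
data "aws_route_table" "cluster" {
  vpc_id    = var.cluster_vpc_id
  subnet_id = var.cluster_subnet_id
}

# A separate route resource added alongside the routes the module manages.
resource "aws_route" "peering" {
  route_table_id            = data.aws_route_table.cluster.id
  destination_cidr_block    = "10.1.0.0/16"
  vpc_peering_connection_id = var.peering_connection_id
}
```

With inline routes, an extra `aws_route` like this would conflict with the table's own definition; separate route resources avoid that.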
Dalton Hubble	f4d260645c	Update node-exporter from v0.18.1 to v1.0.0-rc.0 * Update mdadm alert rule; node-exporter adds `state` label to `node_md_disks` and removes `node_md_disks_active` * https://github.com/prometheus/node_exporter/releases/tag/v1.0.0-rc.0	2020-02-25 22:29:52 -08:00
Dalton Hubble	d9219a6722	Update nginx-ingress from v0.29.0 to v0.30.0 * https://github.com/kubernetes/ingress-nginx/releases/tag/nginx-0.30.0	2020-02-25 22:11:59 -08:00
Dalton Hubble	60c7eb85ee	Update nginx-ingress from v0.28.0 to v0.29.0 * https://github.com/kubernetes/ingress-nginx/releases/tag/nginx-0.29.0	2020-02-22 15:57:59 -08:00
Dalton Hubble	4c964b56a0	Update kube-state-metrics from v1.9.4 to v1.9.5 * https://github.com/kubernetes/kube-state-metrics/releases/tag/v1.9.5	2020-02-22 15:21:10 -08:00
Dalton Hubble	1fbd6835f2	Update Grafana from v6.6.1 to v6.6.2 * https://github.com/grafana/grafana/releases/tag/v6.6.2	2020-02-22 15:19:24 -08:00
Dalton Hubble	e4d977bfcd	Fix worker_node_labels for initial Fedora CoreOS * Add Terraform strip markers to consume beginning and trailing whitespace in templated Kubelet arguments for podman (Fedora CoreOS only) * Fix initial `worker_node_labels` being quietly ignored on Fedora CoreOS cloud platforms that offer the feature * Close https://github.com/poseidon/typhoon/issues/650	2020-02-22 15:12:35 -08:00
Dalton Hubble	4a38fb5927	Update CoreDNS from v1.6.6 to v1.6.7 * https://coredns.io/2020/01/28/coredns-1.6.7-release/	2020-02-18 21:46:19 -08:00
Dalton Hubble	7ca03e5219	Update Prometheus from v2.15.2 to v2.16.0 * https://github.com/prometheus/prometheus/releases/tag/v2.16.0	2020-02-14 12:10:56 -08:00
Dalton Hubble	362b3fac5c	Add guide for Typhoon with Flatcar Linux on DigitalOcean * Add docs on manually uploading a Flatcar Linux DigitalOcean bin image as a custom image and using a data reference * Set status of Flatcar Linux on DigitalOcean to alpha * IPv6 is not supported for DigitalOcean custom images	2020-02-14 12:08:58 -08:00
Dalton Hubble	32db59b9eb	Update CHANGELOG sections and links	2020-02-14 12:05:51 -08:00
Dalton Hubble	008817b0aa	Promote Fedora CoreOS AWS/bare-metal to beta * Remove alpha warnings from docs headers	2020-02-13 14:25:22 -08:00
Dalton Hubble	49d3b9e6b3	Set docker log driver to json-file on Fedora CoreOS * Fix the last minor issue for Fedora CoreOS clusters to pass CNCF's Kubernetes conformance tests * Kubelet supports a seldom used feature `kubectl logs --limit-bytes=N` to trim a log stream to a desired length. Kubelet handles this in the CRI driver. The Kubelet docker shim only supports the limit bytes feature when Docker is configured with the default `json-file` logging driver * CNCF conformance tests started requiring limit-bytes be supported, indirectly forcing the log driver choice until either the Kubelet or the conformance tests are fixed * Fedora CoreOS defaults Docker to use `journald` (desired). For now, as a workaround to offer conformant clusters, the log driver can be set back to `json-file`. RHEL CoreOS likely won't have noticed the non-conformance since it's using the crio runtime * https://github.com/kubernetes/kubernetes/issues/86367 Note: When upstream has a fix, the aim is to drop the docker config override and use the journald default	2020-02-11 23:00:38 -08:00
Dalton Hubble	1243f395d1	Update Kubernetes from v1.17.2 to v1.17.3 * https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.17.md#v1173	2020-02-11 20:22:14 -08:00
Dalton Hubble	846f11097f	Update Fedora CoreOS kernel arguments to align with upstream * Align bare-metal kernel arguments with upstream docs * Add missing initrd argument which can cause issues if not present. Fix #638 * Add tty0 and ttyS0 consoles (matches Container Linux) * Remove unused coreos.inst=yes Related: https://docs.fedoraproject.org/en-US/fedora-coreos/bare-metal/	2020-02-11 20:11:19 -08:00
Dalton Hubble	ba84f86dc7	Add guide for Typhoon with Flatcar Linux on Google Cloud * Add docs on manually uploading a Flatcar Linux GCE/GCP gzipped tarball image as a Compute Engine image for use with the Typhoon container-linux module * Set status of Flatcar Linux on Google Cloud to alpha	2020-02-11 19:38:40 -08:00
Dalton Hubble	34c3d7cc39	Update Grafana from v6.6.0 to v6.6.1 * https://github.com/grafana/grafana/releases/tag/v6.6.1	2020-02-08 14:50:33 -08:00
Dalton Hubble	ca96a1335c	Update Calico from v3.11.2 to v3.12.0 * https://docs.projectcalico.org/release-notes/#v3120 * Remove reverse packet filter override, since Calico no longer relies on the setting * https://github.com/coreos/fedora-coreos-tracker/issues/219 * https://github.com/projectcalico/felix/pull/2189	2020-02-06 00:43:33 -08:00
Dalton Hubble	e339fbd2b6	Update kube-state-metrics from v1.9.3 to v1.9.4 * https://github.com/kubernetes/kube-state-metrics/releases/tag/v1.9.4	2020-02-04 21:33:34 -08:00
Dalton Hubble	8cc303c9ac	Add module for Fedora CoreOS on Google Cloud * Add Typhoon Fedora CoreOS on Google Cloud as alpha * Add docs on uploading the Fedora CoreOS GCP gzipped tarball to Google Cloud storage to create a boot disk image	2020-02-01 15:21:40 -08:00
Dalton Hubble	b19ba16afa	Update nginx-ingress from v0.27.1 to v0.28.0 * https://github.com/kubernetes/ingress-nginx/releases/tag/nginx-0.28.0	2020-01-30 18:00:23 -08:00
Dalton Hubble	d127a7345c	Update Grafana from v6.5.3 to v6.6.0 * https://github.com/grafana/grafana/releases/tag/v6.6.0	2020-01-27 20:46:32 -08:00
Dalton Hubble	5643ad525f	Promote Fedora CoreOS from preview to alpha in docs * Add an announcement to the website as well	2020-01-23 08:47:18 -08:00
Dalton Hubble	d5b7ce8f27	Update kube-state-metrics from v1.9.2 to v1.9.3 * https://github.com/kubernetes/kube-state-metrics/releases/tag/v1.9.3	2020-01-23 00:03:16 -08:00
Dalton Hubble	1cda5bcd2a	Update Kubernetes from v1.17.1 to v1.17.2 * https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.17.md#v1172	2020-01-21 18:27:39 -08:00
Dalton Hubble	bda73264f7	Update nginx-ingress from v0.26.1 to v0.27.1 * Change runAsUser from 33 to 101 for new alpine-based image * https://github.com/kubernetes/ingress-nginx/releases/tag/nginx-0.27.0 * https://github.com/kubernetes/ingress-nginx/releases/tag/nginx-0.27.1	2020-01-20 15:22:16 -08:00
Dalton Hubble	dd930a2ff9	Update bare-metal Fedora CoreOS image location * Use Fedora CoreOS production download streams (change) * Use live PXE kernel and initramfs images * https://getfedora.org/coreos/download/ * Update docs example to use public images (cache is still recommended at large scale) and stable stream	2020-01-20 14:44:06 -08:00
Dalton Hubble	03ff3a9cf3	Update kube-state-metrics from v1.9.1 to v1.9.2 * https://github.com/kubernetes/kube-state-metrics/releases/tag/v1.9.2	2020-01-18 15:32:10 -08:00
Dalton Hubble	48703f9906	Update Grafana from v6.5.2 to v6.5.3 * https://github.com/grafana/grafana/releases/tag/v6.5.3	2020-01-18 15:30:39 -08:00
Dalton Hubble	7daabd28b5	Update Calico from v3.11.1 to v3.11.2 * https://docs.projectcalico.org/v3.11/release-notes/	2020-01-18 13:45:24 -08:00
Dalton Hubble	0e2fc89f78	Update kube-state-metrics from v1.9.0 to v1.9.1 * https://github.com/kubernetes/kube-state-metrics/releases/tag/v1.9.1	2020-01-11 14:15:55 -08:00
Dalton Hubble	b1f521fc4a	Allow terraform-provider-google v3.x plugin versions * Typhoon Google Cloud is compatible with `terraform-provider-google` v3.x releases * No v3.x specific features are used, so v2.19+ provider versions are still allowed, to ease migrations	2020-01-11 14:07:18 -08:00
Dalton Hubble	73588cfad3	Update Prometheus from v2.15.1 to v2.15.2 * https://github.com/prometheus/prometheus/releases/tag/v2.15.2	2020-01-06 22:08:34 -08:00
Dalton Hubble	bb586b60da	Reduce Prometheus addon's node-exporter tolerations * Change node-exporter DaemonSet tolerations from tolerating all possible NoSchedule taints to tolerating the master taint and the not ready taint (we'd like metrics regardless) * Users who add custom node taints must add their custom taints to the addon node-exporter DaemonSet. As an addon, it's expected that users copy and manipulate manifests out-of-band in their own systems	2020-01-06 21:24:24 -08:00
Dalton Hubble	43e05b9131	Enable kube-proxy metrics and allow Prometheus scrapes * Configure kube-proxy --metrics-bind-address=0.0.0.0 (default 127.0.0.1) to serve metrics on 0.0.0.0:10249 * Add firewall rules to allow Prometheus (resides on a worker) to scrape kube-proxy service endpoints on controllers or workers * Add a clusterIP: None service for kube-proxy endpoint discovery	2020-01-06 21:11:18 -08:00
Dalton Hubble	b2eb3e05d0	Disable Kubelet 127.0.0.1:10248 healthz endpoint * Kubelet runs a healthz server listening on 127.0.0.1:10248 by default. It's unused by Typhoon and can be disabled * https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/	2019-12-29 11:23:25 -08:00
Dalton Hubble	f1f4cd6fc0	Inline Container Linux kubelet.service, deprecate kubelet-wrapper * Change kubelet.service on Container Linux nodes to ExecStart Kubelet inline to replace the use of the host OS kubelet-wrapper script * Express rkt run flags and volume mounts in a clear, uniform way to make the Kubelet service easier to audit, manage, and understand * Eliminate reliance on a Container Linux kubelet-wrapper script * Typhoon for Fedora CoreOS developed a kubelet.service that similarly uses an inline ExecStart (except with podman instead of rkt) and a more minimal set of volume mounts. Adopt the volume improvements: * Change Kubelet /etc/kubernetes volume to read-only * Change Kubelet /etc/resolv.conf volume to read-only * Remove unneeded /var/lib/cni volume mount Background: * kubelet-wrapper was added in CoreOS around the time of Kubernetes v1.0 to simplify running a CoreOS-built hyperkube ACI image via rkt-fly. The script defaults are no longer ideal (e.g. rkt's notion of trust dates back to quay.io ACI image serving and signing, which informed the OCI standard images we use today, though they still lack rkt's signing ideas). * Shipping kubelet-wrapper was regretted at CoreOS, but remains in the distro for compatibility. The script is not updated to track hyperkube changes, but it is stable and kubelet.env overrides bridge most gaps * Typhoon Container Linux nodes have used kubelet-wrapper to rkt/rkt-fly run the Kubelet via the official k8s.gcr.io hyperkube image using overrides (new image registry, new image format, restart handling, new mounts, new entrypoint in v1.17). * Observation: Most of what it takes to run a Kubelet container is defined in Typhoon, not in kubelet-wrapper. The wrapper's value is now undermined by having to work around its dated defaults. Typhoon may be better served defining kubelet.service explicitly * Typhoon for Fedora CoreOS developed a kubelet.service without the use of a host OS kubelet-wrapper which is both clearer and eliminated some volume mounts	2019-12-29 11:17:26 -08:00
Dalton Hubble	11565ffa8a	Update Calico from v3.10.2 to v3.11.1 * https://docs.projectcalico.org/v3.11/release-notes/	2019-12-28 11:08:03 -08:00
Dalton Hubble	a4e843693f	Update Prometheus from v2.15.0 to v2.15.1 * https://github.com/prometheus/prometheus/releases/tag/v2.15.1	2019-12-26 09:12:55 -05:00
Dalton Hubble	f48e43c0b1	Update Prometheus from v2.14.0 to v2.15.0 * https://github.com/prometheus/prometheus/releases/tag/v2.15.0	2019-12-24 10:52:19 -05:00
Dalton Hubble	daa8d9d9ec	Update CoreDNS from v1.6.5 to v1.6.6 * https://coredns.io/2019/12/11/coredns-1.6.6-release/	2019-12-22 10:47:19 -05:00
Dalton Hubble	52d11096dc	Update kube-state-metrics from v1.9.0-rc.1 to v1.9.0 * https://github.com/kubernetes/kube-state-metrics/releases/tag/v1.9.0 * https://github.com/kubernetes/kube-state-metrics/releases/tag/v1.9.0-rc.1 * https://github.com/kubernetes/kube-state-metrics/releases/tag/v1.9.0-rc.0	2019-12-20 13:53:37 -08:00
Dalton Hubble	1b9fa2e688	Update Grafana from v6.5.1 to v6.5.2 * https://github.com/grafana/grafana/releases/tag/v6.5.2	2019-12-14 15:25:48 -08:00
Dalton Hubble	f69dc2ea0f	Update CHANGES and tutorial notes for release * Update recommended Terraform and provider plugin versions * Update the rough count of resources created per cluster since it's not been refreshed in a while (will vary based on cluster options)	2019-12-10 23:03:39 -08:00
Dalton Hubble	c0ce04e1de	Update Calico from v3.10.1 to v3.10.2 * https://docs.projectcalico.org/v3.10/release-notes/	2019-12-09 21:03:00 -08:00
Dalton Hubble	de36d99afc	Update Kubernetes from v1.16.3 to v1.17.0 * https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.17.md/#v1170	2019-12-09 18:31:58 -08:00
Dalton Hubble	4fce9485c8	Reduce kube-controller-manager pod eviction timeout from 5m to 1m * Reduce time to delete pods on unready nodes from 5m to 1m * Present since v1.13.3, but mistakenly removed in v1.16.0 static pod control plane migration Related: * https://github.com/poseidon/terraform-render-bootstrap/pull/148 * https://github.com/poseidon/terraform-render-bootstrap/pull/164	2019-12-08 22:58:31 -08:00
Dalton Hubble	178afe4a9b	Reduce apiserver metrics cardinality and extraneous labels * Stop mapping node labels to targets discovered via Kubernetes nodes (e.g. etcd, kubelet, cadvisor). It is rarely useful to store node labels (e.g. kubernetes.io/os=linux) on these metrics * kube-apiserver's apiserver_request_duration_seconds_bucket metric has a high cardinality that includes labels for the API group, verb, scope, resource, and component for each object type, including for each CRD. This one metric has ~10k time series in a typical cluster (between 10-40% of total) * Removing the apiserver request duration outright would make latency alerts a NoOp and break a Grafana apiserver panel. Instead, drop series that have a "group" label. Effectively, only request durations for core Kubernetes APIs will be kept (e.g. cardinality won't grow with each CRD added). This reduces the metric to ~2k unique series	2019-12-08 22:48:25 -08:00
Dalton Hubble	d9c7a9e049	Add/update docs for asset_dir and kubeconfig usage * Original tutorials favored including the platform (e.g. google-cloud) in modules (e.g. google-cloud-yavin). Prefer naming conventions where each module / cluster has a simple name (e.g. yavin) since the platform is usually redundant * Retain the example cluster naming themes per platform	2019-12-05 22:56:42 -08:00
Dalton Hubble	26674083b6	Update Grafana from v6.5.0 to v6.5.1 * https://github.com/grafana/grafana/releases/tag/v6.5.1	2019-11-28 14:11:25 -08:00
Dalton Hubble	030a4cec19	Update Grafana from v6.4.4 to v6.5.0 * https://grafana.com/docs/guides/whats-new-in-v6-5/	2019-11-25 22:45:58 -08:00
Dalton Hubble	ddea7dc452	Use new resource dashboards in Grafana deployment * kubernetes-mixin pod resource dashboards were split into two ConfigMap parts because they provide richer networking details * New dashboards have been used by the author at the global level, but were missing in the per-cluster Grafana tracked here	2019-11-25 22:27:11 -08:00
Dalton Hubble	525ae23305	Add node-exporter alerts and Grafana dashboard * Add Prometheus alerts from node-exporter * Add Grafana dashboard nodes.json, from node-exporter * Not adding recording rules, since those are only used by some node-exporter USE dashboards not being included	2019-11-16 13:47:20 -08:00
Dalton Hubble	19ee57dc04	Use GCP region_instance_group_manager version block format * terraform-provider-google v2.19.0 deprecates `instance_template` within `google_compute_region_instance_group_manager` in order to support a scheme with multiple version blocks. Adapt our single version to the new format to resolve deprecation warnings. * Fixes: Warning: "instance_template": [DEPRECATED] This field will be replaced by `version.instance_template` in 3.0.0 * Require terraform-provider-google v2.19.0+ (action required)	2019-11-13 17:41:13 -08:00
Dalton Hubble	0e4ee5efc9	Add small CPU resource requests to static pods * Set small CPU requests on static pods kube-apiserver, kube-controller-manager, and kube-scheduler to align with upstream tooling and for edge cases * Effectively, a practical case for these requests hasn't been observed. However, a small static pod CPU request may offer a slight benefit if a controller became overloaded and the below mechanisms were insufficient Existing safeguards: * Control plane nodes are tainted to isolate them from ordinary workloads. Even dense workloads can only compress CPU resources on worker nodes. * Control plane static pods use the highest priority class, so contention favors control plane pods (over say node-exporter) and CPU is compressible too. See: https://github.com/poseidon/terraform-render-bootstrap/pull/161	2019-11-13 17:18:45 -08:00
Dalton Hubble	a271b9f340	Update CoreDNS from v1.6.2 to v1.6.5 * Add health `lameduck` option 5s. Before CoreDNS shuts down, it will wait and report unhealthy for 5s to allow time for plugins to shutdown cleanly * Minor bug fixes over a few releases * https://coredns.io/2019/08/31/coredns-1.6.3-release/ * https://coredns.io/2019/09/27/coredns-1.6.4-release/ * https://coredns.io/2019/11/05/coredns-1.6.5-release/	2019-11-13 16:47:44 -08:00
Dalton Hubble	cb0598e275	Adopt Terraform v0.12 templatefile function * Update terraform-render-bootstrap module to adopt the Terraform v0.12 templatefile function feature to replace the use of terraform-provider-template's `template_dir` * Require Terraform v0.12.6+ which adds `for_each` Background: * `template_dir` was added to `terraform-provider-template` to add support for template directory rendering in CoreOS Tectonic Kubernetes distribution (~2017) * Terraform v0.12 introduced a native `templatefile` function and v0.12.6 introduced native `for_each` support (July 2019) that makes it possible to replace `template_dir` usage	2019-11-13 16:33:36 -08:00
Dalton Hubble	42b6df89c8	Update Prometheus from v2.14.0-rc.0 to v2.14.0 * https://github.com/prometheus/prometheus/releases/tag/v2.14.0	2019-11-13 13:41:11 -08:00
Dalton Hubble	d7061020ba	Update Kubernetes from v1.16.2 to v1.16.3 * https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.16.md#v1163	2019-11-13 13:05:15 -08:00
Dalton Hubble	a8b7792338	Update Grafana from v6.4.3 to v6.4.4 * https://github.com/grafana/grafana/releases/tag/v6.4.4	2019-11-07 12:00:25 -08:00
Dalton Hubble	a3807086d4	Update Prometheus from v2.13.1 to v2.14.0-rc.0 * Happy PromCon 2019! * https://github.com/prometheus/prometheus/releases/tag/v2.14.0-rc.0	2019-11-07 11:48:23 -08:00
Dalton Hubble	2c163503f1	Update etcd from v3.4.2 to v3.4.3 * etcd v3.4.3 builds with Go v1.12.12 instead of v1.12.9 and adds a few minor metrics fixes * https://github.com/etcd-io/etcd/compare/v3.4.2...v3.4.3	2019-11-07 11:41:01 -08:00
Dalton Hubble	0034a15711	Update Calico from v3.10.0 to v3.10.1 * https://docs.projectcalico.org/v3.10/release-notes/	2019-11-07 11:38:32 -08:00
Dalton Hubble	4775e9d0f7	Upgrade Calico v3.9.2 to v3.10.0 * Allow advertising Kubernetes service ClusterIPs to BGPPeer routers via a BGPConfiguration * Improve EdgeRouter docs about routes and BGP * https://docs.projectcalico.org/v3.10/release-notes/ * https://docs.projectcalico.org/v3.10/networking/advertise-service-ips	2019-10-27 14:13:41 -07:00
Dalton Hubble	d418045929	Switch kube-proxy from iptables mode to ipvs mode * Kubernetes v1.11 considered kube-proxy IPVS mode GA * Many problems were found #321 * Since then, major blockers seem to have been addressed	2019-10-27 00:37:41 -07:00
Dalton Hubble	de90cb9246	Remove kube-state-metrics addon-resizer * addon-resizer is outdated and has been dropped from kube-state-metrics examples. Those using it should look to the cluster-proportional-vertical-autoscaler. * Eliminate addon-resizer log spew * Remove associated Role and RoleBinding * Also fix kube-state-metrics readinessProbe port	2019-10-20 16:03:29 -07:00
Dalton Hubble	68da420adc	Refresh Prometheus rules/alerts and Grafana dashboards * Update Prometheus rules/alerts and Grafana dashboards * Remove dashboards that were moved to node-exporter, they may be added back later if valuable * Remove kube-prometheus based rules/alerts (ClockSkew alert)	2019-10-19 17:43:47 -07:00
Dalton Hubble	130c97f8eb	Update Prometheus from v2.13.0 to v2.13.1 * https://github.com/prometheus/prometheus/releases/tag/v2.13.1	2019-10-18 00:10:25 -07:00
Dalton Hubble	271d2f6b52	Update Grafana from v6.4.2 to v6.4.3 * https://github.com/grafana/grafana/releases/tag/v6.4.3	2019-10-18 00:08:39 -07:00
Dalton Hubble	0595915a19	Cleanup CHANGES notes	2019-10-15 23:25:45 -07:00
Dalton Hubble	e6bc5143aa	Default to Calico as the CNI provider on Azure/DigitalOcean * Change `networking` default from flannel to calico on Azure and DigitalOcean * AWS, bare-metal, and Google Cloud continue to default to Calico (as they have since v1.7.5) * Typhoon now defaults to using Calico and supporting NetworkPolicy on all platforms	2019-10-15 23:15:40 -07:00
Dalton Hubble	e4ac1027c8	Update Grafana from v6.4.1 to v6.4.2 * https://github.com/grafana/grafana/releases/tag/v6.4.2	2019-10-15 22:58:43 -07:00
Dalton Hubble	24fc440d83	Update Kubernetes from v1.16.1 to v1.16.2 * Update Calico from v3.9.1 to v3.9.2	2019-10-15 22:42:52 -07:00
Dalton Hubble	a6702573a2	Update etcd from v3.4.1 to v3.4.2 * https://github.com/etcd-io/etcd/releases/tag/v3.4.2	2019-10-15 00:06:15 -07:00
Dalton Hubble	5b9dab6659	Introduce list of detail objects for bare-metal machines * Define bare-metal `controllers` and `workers` as a complex type list(object{name=string, mac=string, domain=string}) to allow clusters with many machines to be defined more cleanly * Remove `controller_names` list variable * Remove `controller_macs` list variable * Remove `controller_domains` list variable * Remove `worker_names` list variable * Remove `worker_macs` list variable * Remove `worker_domains` list variable	2019-10-06 20:22:45 -07:00
Dalton Hubble	5196709fe0	Update docs, CHANGES, and mkdocs-material * Update mkdocs-material from v4.4.2 to v4.4.3 * Update recommended Terraform provider versions * Cleanup the changelog before release	2019-10-06 18:41:25 -07:00
Dalton Hubble	ab72f1ab2d	Update Prometheus from v2.12.0 to v2.13.0 * https://github.com/prometheus/prometheus/releases/tag/v2.13.0	2019-10-06 18:22:20 -07:00
Dalton Hubble	5ef4155e08	Detect most recent Fedora CoreOS AMI in region * Detect the most recent Fedora CoreOS AMI to allow usage of Fedora CoreOS in supported regions (previously just us-east-1) * Unpin the Fedora CoreOS AMI image which was pinned to images that had been checked. This does mean if Fedora publishes a broken image, it will be selected * Filter out "dev" images which have similar naming	2019-10-06 18:13:55 -07:00
Dalton Hubble	15c4b793c3	Use new Fedora CoreOS kernel/initrd/raw asset names * Fedora CoreOS changed the kernel, initramfs, and raw image asset download paths and names in 30.20191002.0	2019-10-06 17:31:21 -07:00
Dalton Hubble	36ed53924f	Add stricter types for bare-metal modules * Review variables available in bare-metal kubernetes modules for Container Linux and Fedora CoreOS * Deprecate cluster_domain_suffix variable * Remove deprecated container_linux_oem variable	2019-10-06 17:18:50 -07:00
Dalton Hubble	19de38b30d	Fix Prometheus etcd metrics scraping * Prometheus was configured to use kubernetes discovery of etcd targets based on nodes matching the node label node-role.kubernetes.io/controller=true * Kubernetes v1.16 stopped permitting node role labels node-role.kubernetes.io/* so Typhoon renamed these labels (no longer any association with roles) to node.kubernetes.io/controller=true * As a result, Prometheus didn't discover etcd targets, etcd metrics were missing, etcd alerts were ineffective, and the etcd Grafana dashboard was empty * Introduced: https://github.com/poseidon/typhoon/pull/543	2019-10-03 19:07:05 -07:00
Dalton Hubble	995824fa6d	Add stricter types for DigitalOcean module * Review variables available in DigitalOcean kubernetes module and sync with documentation * Promote Calico for DigitalOcean and Azure beyond experimental (it's the primary mode I've used since it was introduced)	2019-10-02 21:48:24 -07:00
Dalton Hubble	1c5ed84fc2	Update Kubernetes from v1.16.0 to v1.16.1 * https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.16.md#v1161	2019-10-02 21:31:55 -07:00
Dalton Hubble	ca7d62720e	Update Grafana from v6.3.6 to v6.4.1 * https://github.com/grafana/grafana/releases/tag/v6.4.1	2019-10-02 20:36:05 -07:00
Dalton Hubble	26f8d76755	Update kube-state-metrics from v1.7.2 to v1.8.0 * https://github.com/kubernetes/kube-state-metrics/releases/tag/v1.8.0	2019-10-01 20:50:33 -07:00
Dalton Hubble	fdd6882a87	Add stricter types to Azure modules * Review variables available in Azure kubernetes and workers modules and sync with documentation * Fix internal workers module default type to Standard_DS1_v2	2019-09-30 22:20:20 -07:00
Dalton Hubble	f82266ac8c	Add stricter types for GCP modules * Review variables available in google-cloud kubernetes and workers modules and in documentation	2019-09-30 22:04:35 -07:00
Dalton Hubble	7bcf2d7831	Update nginx-ingress from v0.25.1 to v0.26.1 * Add lifecycle hook to allow draining connections for up to 5 minutes	2019-09-30 22:01:07 -07:00
Dalton Hubble	96afa6a531	Update Calico from v3.8.2 to v3.9.1 * https://docs.projectcalico.org/v3.9/release-notes/	2019-09-29 11:22:53 -07:00
Dalton Hubble	a407ff72df	Add stricter types for AWS modules and update docs * Review variables available in AWS kubernetes and workers modules and documentation * Switching between spot and on-demand has worked since Terraform v0.12 * Generally, there are too many knobs. Less useful ones should be de-emphasized or removed * Remove `cluster_domain_suffix` documentation	2019-09-29 11:19:38 -07:00
Dalton Hubble	f453c54956	Update Grafana from v6.3.5 to v6.3.6 * https://github.com/grafana/grafana/releases/tag/v6.3.6	2019-09-28 15:13:46 -07:00
Dalton Hubble	3e34fb075b	Update etcd from v3.4.0 to v3.4.1 * https://github.com/etcd-io/etcd/releases/tag/v3.4.1	2019-09-28 15:09:57 -07:00
Dalton Hubble	9bfb1c5faf	Update docs and variable types for worker node_labels * Document worker pools `node_labels` variable to set the initial node labels for a homogeneous set of workers * Document `worker_node_labels` convenience variable to set the initial node labels for default worker nodes	2019-09-28 15:05:12 -07:00
Dalton Hubble	8703f2c3c5	Fix missing comma separator on bare-metal and DO * Introduced in bare-metal and DigitalOcean in #544 while addressing possible ordering race, but after the v1.16 upgrade validation	2019-09-23 11:05:26 -07:00
Dalton Hubble	078f084220	Update CHANGES and docs for v1.16.0 release	2019-09-22 17:37:23 -07:00
Dalton Hubble	9da3725738	Update Kubernetes from v1.15.3 to v1.16.0 * Drop `node-role.kubernetes.io/master` and `node-role.kubernetes.io/node` node labels * Kubelet (v1.16) now rejects the node labels used in the kubectl get nodes ROLES output * https://github.com/kubernetes/kubernetes/issues/75457	2019-09-18 22:53:06 -07:00
Dalton Hubble	b15c60fa2f	Update CHANGES for control plane static pod switch * Remove old references to bootkube / self-hosted	2019-09-09 22:48:48 -07:00
Dalton Hubble	4a7083d94a	Change Azure default controller_type and worker_type * Change default controller_type to Standard_B2s. A B2s is cheaper by $17/month and provides 2 vCPU, 4GB RAM (vs 1 vCPU, 3.5GB RAM) * Change default worker_type to Standard_DS1_v2. F1 was the previous generation. The DS1_v2 is newer, similar cost, more memory, and still supports Low Priority mode, if desired	2019-09-09 22:34:28 -07:00
Dalton Hubble	c20683067d	Update etcd from v3.3.15 to v3.4.0 * https://github.com/etcd-io/etcd/releases/tag/v3.4.0	2019-09-08 15:32:49 -07:00
Dalton Hubble	dc436b8fe9	Update Grafana from v6.3.4 to v6.3.5 * https://github.com/grafana/grafana/releases/tag/v6.3.5	2019-09-07 14:21:59 -07:00
Dalton Hubble	b74f470701	Recommend updating terraform-provider-ct from v0.3.2 to v0.4.0 * v0.4.0 adds a "strict" mode we'll start using in future and also adds support for Fedora CoreOS * https://github.com/poseidon/terraform-provider-ct/releases/tag/v0.4.0	2019-08-31 16:07:22 -07:00
Dalton Hubble	45bc52d156	Update Grafana from v6.3.3 to v6.3.4 * https://github.com/grafana/grafana/releases/tag/v6.3.4	2019-08-31 15:59:13 -07:00
Dalton Hubble	4d5f962d76	Update CoreDNS from v1.5.0 to v1.6.2 * https://coredns.io/2019/06/26/coredns-1.5.1-release/ * https://coredns.io/2019/07/03/coredns-1.5.2-release/ * https://coredns.io/2019/07/28/coredns-1.6.0-release/ * https://coredns.io/2019/08/02/coredns-1.6.1-release/ * https://coredns.io/2019/08/13/coredns-1.6.2-release/	2019-08-31 15:57:42 -07:00
Dalton Hubble	c42139beaa	Update etcd from v3.3.14 to v3.3.15 * No functional changes, just changes to vendoring tools (go modules -> glide). Still, update to v3.3.15 anyway * https://github.com/etcd-io/etcd/compare/v3.3.14...v3.3.15	2019-08-19 15:05:21 -07:00
Dalton Hubble	35c2763ab0	Update Kubernetes from v1.15.2 to v1.15.3 * https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.15.md/#v1153	2019-08-19 14:49:24 -07:00
Dalton Hubble	8f412e2f09	Update etcd from v3.3.13 to v3.3.14 * https://github.com/etcd-io/etcd/releases/tag/v3.3.14	2019-08-18 21:05:06 -07:00
Dalton Hubble	4ef2eb7e6b	Update Prometheus from v2.11.2 to v2.12.0 * https://github.com/prometheus/prometheus/releases/tag/v2.12.0	2019-08-18 20:59:44 -07:00
Dalton Hubble	99990e3cbb	Use stable IDs for etcd, CoreDNS, and Nginx dashboards * Use unique dashboard ID so that multiple replicas of Grafana serve dashboards with uniform paths * Fix issue where refreshing a dashboard served by one replica could show a 404 unless the request went to the same replica	2019-08-18 12:45:49 -07:00
Dalton Hubble	3c3708d58e	Update Calico from v3.8.1 to v3.8.2 * https://docs.projectcalico.org/v3.8/release-notes/	2019-08-16 15:38:23 -07:00
Dalton Hubble	0c45cd0f06	Update Grafana from v6.3.2 to v6.3.3 * https://github.com/grafana/grafana/releases/tag/v6.3.3	2019-08-16 14:40:47 -07:00
Dalton Hubble	976452825e	Update Prometheus from v2.11.0 to v2.11.2 * https://github.com/prometheus/prometheus/releases/tag/v2.11.2	2019-08-14 21:26:46 -07:00
Dalton Hubble	7bc5633c38	Update nginx-ingress from v0.25.0 to v0.25.1 * https://github.com/kubernetes/ingress-nginx/releases/tag/nginx-0.25.1	2019-08-14 21:26:46 -07:00
Dalton Hubble	6db11d5908	Enable AWS root block device encryption by default * terraform-provider-aws v2.23.0 allows AWS root block devices to enable encryption by default. * Require updating terraform-provider-aws to v2.23.0 or higher * Enable root EBS device encryption by default for controller instances and worker instances in auto-scaling groups For comparison: * Google Cloud persistent disks have been encrypted by default for years * Azure managed disk encryption is not ready yet (#486)	2019-08-07 21:13:44 -07:00
Dalton Hubble	eaea4d37a2	Update Grafana from v6.2.5 to v6.3.2 * https://github.com/grafana/grafana/releases/tag/v6.3.2 * https://github.com/grafana/grafana/releases/tag/v6.3.1 * https://github.com/grafana/grafana/releases/tag/v6.3.0	2019-08-07 20:01:18 -07:00
Dalton Hubble	457ad18daa	Update kube-state-metrics from v1.7.1 to v1.7.2 * Add a separate liveness and readiness probe * https://github.com/kubernetes/kube-state-metrics/releases/tag/v1.7.2	2019-08-07 20:00:24 -07:00
Dalton Hubble	f79568c02a	Add CHANGES section for v1.15.2 release	2019-08-06 09:01:22 -07:00
Dalton Hubble	10d4d9e565	Add Grafana dashboards for CoreDNS and Nginx Ingress Controller * Add a CoreDNS dashboard originally based on an upstream dashboard, but now customized according to preferences * Add an Nginx Ingress Controller based on an upstream dashboard, but customized according to preferences	2019-08-05 22:49:19 -07:00
Dalton Hubble	2227f2cc62	Update Kubernetes from v1.15.1 to v1.15.2 * https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.15.md#v1152	2019-08-05 08:48:57 -07:00
Dalton Hubble	dcd6733649	Update Calico from v3.8.0 to v3.8.1 * https://docs.projectcalico.org/v3.8/release-notes/	2019-07-27 15:31:13 -07:00
Dalton Hubble	b9ccfedfe5	Update CHANGES for v1.15.1 release	2019-07-21 11:58:56 -07:00
Dalton Hubble	68d8717924	Refresh Prometheus rules/alerts and Grafana dashboards * Refresh rules, alerts, and dashboards from upstreams	2019-07-21 11:29:34 -07:00
Dalton Hubble	c8df349e55	Fix to add all Azure controller nodes to address pool * Add all Azure controllers to the apiserver load balancer backend address pool * Previously, kube-apiserver availability relied on the 0th controller being up. Multi-controller was just providing etcd data redundancy	2019-07-21 10:38:17 -07:00
Dalton Hubble	e0be091acc	Update kube-state-metrics from v1.7.0 to v1.7.1 * https://github.com/kubernetes/kube-state-metrics/releases/tag/v1.7.1	2019-07-20 20:17:08 -07:00
Dalton Hubble	e0c7676a15	Update Kubernetes from v1.15.0 to v1.15.1 * https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.15.md#downloads-for-v1151	2019-07-19 01:21:08 -07:00
Dalton Hubble	6cd3e65267	Update kube-state-metrics from v1.7.0-rc.1 to v1.7.0 * https://github.com/kubernetes/kube-state-metrics/releases/tag/v1.7.0 * Add storageclasses and verticalpodautoscalers to ClusterRole	2019-07-19 00:14:47 -07:00
Dalton Hubble	dfa6bcfecf	Relax terraform-provider-ct version constraint * Allow updating terraform-provider-ct to any release beyond v0.3.2, but below v1.0. This relaxes the prior constraint that allowed only v0.3.y provider versions	2019-07-16 22:07:37 -07:00
Dalton Hubble	70f5cfd33e	Update kube-state-metrics from v1.6.0 to v1.7.0-rc.1 * https://github.com/kubernetes/kube-state-metrics/releases/tag/v1.7.0-rc.1 * https://github.com/kubernetes/kube-state-metrics/releases/tag/v1.7.0-rc.0	2019-07-13 13:13:57 -07:00
Dalton Hubble	9e91d7f011	Upgrade Calico from v3.7.4 to v3.8.0 * Enable CNI bandwidth plugin for traffic shaping * https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/network-plugins/#support-traffic-shaping	2019-07-11 21:01:41 -07:00
Dalton Hubble	eaf59bd33f	Update Prometheus from v2.11.0-rc.0 to v2.11.0 * https://github.com/prometheus/prometheus/releases/tag/v2.11.0	2019-07-09 21:33:24 -07:00
Dalton Hubble	40640f3697	Upgrade nginx-ingress from v0.24.1 to v0.25.0 * Support networking.k8s.io/v1beta1 apiVersion * Update RBAC cluster-role for networking.k8s.io/v1beta1 * https://github.com/kubernetes/ingress-nginx/releases/tag/nginx-0.25.0	2019-07-08 22:04:50 -07:00
Dalton Hubble	28ab746068	Update Prometheus from v2.10.0 to v2.11.0-rc.0 * https://github.com/prometheus/prometheus/releases/tag/v2.11.0-rc.0	2019-07-08 21:32:50 -07:00
Dalton Hubble	69d064bfdf	Run kube-apiserver with lower privilege user (nobody) * Run kube-apiserver as a non-root user (nobody). User no longer needs to bind low number ports. * On most platforms, the kube-apiserver load balancer listens on 6443 and fronts controllers with kube-apiserver pods using port 6443. Google Cloud TCP proxy load balancers cannot listen on 6443. However, GCP's load balancer can be made to listen on 443, while kube-apiserver uses 6443 across all platforms.	2019-07-08 20:52:00 -07:00
Dalton Hubble	7a69bae75e	Raise GCP network deletion timeout from 4m to 6m * Fix a GCP errata item https://github.com/poseidon/typhoon/wiki/Errata * Removal of a Google Cloud cluster often required 2 runs of `terraform apply` because network resource deletes time out after 4m. Raise the network deletion timeout to 6m to ensure apply only needs to be run once to remove a cluster	2019-07-06 13:15:33 -07:00
Dalton Hubble	3fcb04f68c	Improve apiserver backend service zone spanning * google_compute_backend_services use nested blocks to define backends (instance groups heterogeneous controllers) * Use Terraform v0.12.x dynamic blocks so the apiserver backend service can refer to (up to zone-many) controller instance groups * Previously, with Terraform v0.11.x, the apiserver backend service had to list a fixed set of backends to span controller nodes across zones in multi-controller setups. 3 backends were used because each GCP region offered at least 3 zones. Single-controller clusters had the cosmetic ugliness of unused instance groups * Allow controllers to span more than 3 zones if available in a region (e.g. currently only us-central1, with 4 zones) Related: * https://www.terraform.io/docs/providers/google/r/compute_backend_service.html * https://www.terraform.io/docs/configuration/expressions.html#dynamic-blocks	2019-07-05 19:46:26 -07:00
Dalton Hubble	8d373b5850	Update Calico from v3.7.3 to v3.7.4 * https://docs.projectcalico.org/v3.7/release-notes/	2019-07-02 20:18:02 -07:00
Dalton Hubble	9a395dbf88	Update Grafana from v6.2.4 to v6.2.5 * https://github.com/grafana/grafana/releases/tag/v6.2.5	2019-06-29 13:21:42 -07:00

704 Commits