Commit Graph

704 Commits

Author SHA1 Message Date
Dalton Hubble
e72f916c8d Update etcd from v3.4.8 to v3.4.9
* https://github.com/etcd-io/etcd/blob/master/CHANGELOG-3.4.md#v349-2020-05-20
2020-05-22 00:52:20 -07:00
Dalton Hubble
c52f9f8d08 Upgrade docs packages and refresh content
* Promote DigitalOcean from alpha to beta for Fedora
CoreOS and Flatcar Linux
* Upgrade mkdocs-material and PyPI packages for docs
* Replace docs mentions of Container Linux with Flatcar
Linux and move docs/cl to docs/flatcar-linux
* Deprecate CoreOS Container Linux support. Its still
usable for some time, but start removing docs
2020-05-20 23:31:26 -07:00
Dalton Hubble
3bdddc452c Update Grafana from v7.0.0-beta2 to v7.0.0
* https://grafana.com/docs/grafana/latest/guides/whats-new-in-v7-0/
2020-05-18 23:42:32 -07:00
Dalton Hubble
ff4187a1fb Use new Azure subnet to set address_prefixes list
* Update Azure subnet `address_prefix` to `azure_prefixes` list
* Fix warning that `address_prefix` is deprecated
* Require `terraform-provider-azurerm` v2.8.0+ (action required)

Rel: https://github.com/terraform-providers/terraform-provider-azurerm/pull/6493
2020-05-18 23:35:47 -07:00
Dalton Hubble
90edcd3d77 Update node-exporter from v1.0.0-rc.0 to v1.0.0-rc.1
* https://github.com/prometheus/node_exporter/releases/tag/v1.0.0-rc.1
2020-05-15 18:03:19 -07:00
Dalton Hubble
a927c7c790 Update kube-state-metrics from v1.9.5 to v1.9.6
* https://github.com/kubernetes/kube-state-metrics/releases/tag/v1.9.6
2020-05-15 17:42:24 -07:00
Dalton Hubble
d952576d2f Update Grafana from v7.0.0-beta3 to v7.0.0
* https://github.com/grafana/grafana/releases/tag/7.0.0
2020-05-15 17:38:59 -07:00
Dalton Hubble
70e389f37f Restore use of Flatcar Linux Azure Marketplace image
* Switch Flatcar Linux Azure to use the Marketplace image
from Kinvolk (offer `flatcar-container-linux-free`)
* Accepting Azure Marketplace terms is still neccessary,
update docs to show accepting the free offer rather than
BYOL

* Upstream Flatcar: https://github.com/flatcar-linux/Flatcar/issues/82
* Typhoon: https://github.com/poseidon/typhoon/issues/703
2020-05-13 22:50:24 -07:00
Dalton Hubble
01905b00bc Support Fedora CoreOS OS image streams on AWS
* Add `os_stream` variable to set the stream to stable (default),
testing, or next
* Remove unused os_image variable on Fedora CoreOS AWS
2020-05-13 21:45:12 -07:00
Dalton Hubble
f4194cd57a Update Grafana from v7.0.0-beta2 to v7.0.0-beta.3
* https://github.com/grafana/grafana/releases/tag/v7.0.0-beta3
2020-05-09 17:50:40 -07:00
Dalton Hubble
a2db4fa8c4 Update Calico from v3.13.3 to v3.14.0
* https://docs.projectcalico.org/v3.14/release-notes/
2020-05-09 16:05:30 -07:00
Dalton Hubble
358854e712 Fix Calico install-cni crash loop on Pod restarts
* Set a consistent MCS level/range for Calico install-cni
* Note: Rebooting a node was a workaround, because Kubelet
relabels /etc/kubernetes(/cni/net.d)

Background:

* On SELinux enforcing systems, the Calico CNI install-cni
container ran with default SELinux context and a random MCS
pair. install-cni places CNI configs by first creating a
temporary file and then moving them into place, which means
the file MCS categories depend on the containers SELinux
context.
* calico-node Pod restarts creates a new install-cni container
with a different MCS pair that cannot access the earlier
written file (it places configs every time), causing the
init container to error and calico-node to crash loop
* https://github.com/projectcalico/cni-plugin/issues/874

```
mv: inter-device move failed: '/calico.conf.tmp' to
'/host/etc/cni/net.d/10-calico.conflist'; unable to remove target:
Permission denied
Failed to mv files. This may be caused by selinux configuration on
the
host, or something else.
```

Note, this isn't a host SELinux configuration issue.

Related:

* https://github.com/poseidon/terraform-render-bootstrap/pull/186
2020-05-09 16:01:44 -07:00
Dalton Hubble
b5dabcea31 Use Fedora CoreOS image streams on Google Cloud
* Add `os_stream` variable to set a Fedora CoreOS stream
to `stable` (default), `testing`, or `next`
* Deprecate `os_image` variable. Remove docs about uploading
Fedora CoreOS images manually, this is no longer needed
* https://docs.fedoraproject.org/en-US/fedora-coreos/update-streams/

Rel: https://github.com/coreos/fedora-coreos-docs/pull/70
2020-05-08 01:23:12 -07:00
Dalton Hubble
3f0a5d2715 Update Grafana from v7.0.0-beta1 to v7.0.0-beta2
* https://github.com/grafana/grafana/releases/tag/v7.0.0-beta2
2020-05-07 23:04:44 -07:00
Dalton Hubble
33173c0206 Update Prometheus from v2.18.0 to v2.18.1
* https://github.com/prometheus/prometheus/releases/tag/v2.18.1
2020-05-07 22:59:11 -07:00
Dalton Hubble
70f30d9c07 Update Prometheus from v2.18.0-rc.1 to v2.18.0
* https://github.com/prometheus/prometheus/releases/tag/v2.18.0
2020-05-05 22:31:11 -07:00
Dalton Hubble
6afc1643d9 Update nginx-ingress from v0.30.0 to v0.32.0
* Add support for IngressClass and RBAC authorization
* Since our nginx ingress controller example uses the flag
`--ingress-class=public`, add an IngressClass to go along
with it

Rel: https://kubernetes.io/docs/concepts/services-networking/ingress/#ingress-class
2020-05-03 23:24:19 -07:00
Dalton Hubble
e71e27e769 Update Prometheus from v2.17.2 to v2.18.0-rc.1
* https://github.com/prometheus/prometheus/releases/tag/v2.18.0-rc.1
2020-04-29 20:57:48 -07:00
Dalton Hubble
64035005d4 Update Grafana from v6.7.2 to v7.0.0-beta1
* https://github.com/grafana/grafana/releases/tag/v7.0.0-beta1
2020-04-29 20:53:30 -07:00
Dalton Hubble
fd044ee117 Enable Kubelet TLS bootstrap and NodeRestriction
* Enable bootstrap token authentication on kube-apiserver
* Generate the bootstrap.kubernetes.io/token Secret that
may be used as a bootstrap token
* Generate a bootstrap kubeconfig (with a bootstrap token)
to be securely distributed to nodes. Each Kubelet will use
the bootstrap kubeconfig to authenticate to kube-apiserver
as `system:bootstrappers` and send a node-unique CSR for
kube-controller-manager to automatically approve to issue
a Kubelet certificate and kubeconfig (expires in 72 hours)
* Add ClusterRoleBinding for bootstrap token subjects
(`system:bootstrappers`) to have the `system:node-bootstrapper`
ClusterRole
* Add ClusterRoleBinding for bootstrap token subjects
(`system:bootstrappers`) to have the csr nodeclient ClusterRole
* Add ClusterRoleBinding for bootstrap token subjects
(`system:bootstrappers`) to have the csr selfnodeclient ClusterRole
* Enable NodeRestriction admission controller to limit the
scope of Node or Pod objects a Kubelet can modify to those of
the node itself
* Ability for a Kubelet to delete its Node object is retained
as preemptible nodes or those in auto-scaling instance groups
need to be able to remove themselves on shutdown. This need
continues to have precedence over any risk of a node deleting
itself maliciously

Security notes:

1. Issued Kubelet certificates authenticate as user `system:node:NAME`
and group `system:nodes` and are limited in their authorization
to perform API operations by Node authorization and NodeRestriction
admission. Previously, a Kubelet's authorization was broader. This
is the primary security motivation.

2. The bootstrap kubeconfig credential has the same sensitivity
as the previous generated TLS client-certificate kubeconfig.
It must be distributed securely to nodes. Its compromise still
allows an attacker to obtain a Kubelet kubeconfig

3. Bootstrapping Kubelet kubeconfig's with a limited lifetime offers
a slight security improvement.
  * An attacker who obtains the kubeconfig can likely obtain the
  bootstrap kubeconfig as well, to obtain the ability to renew
  their access
  * A compromised bootstrap kubeconfig could plausibly be handled
  by replacing the bootstrap token Secret, distributing the token
  to new nodes, and expiration. Whereas a compromised TLS-client
  certificate kubeconfig can't be revoked (no CRL). However,
  replacing a bootstrap token can be impractical in real cluster
  environments, so the limited lifetime is mostly a theoretical
  benefit.
  * Cluster CSR objects are visible via kubectl which is nice

4. Bootstrapping node-unique Kubelet kubeconfigs means Kubelet
clients have more identity information, which can improve the
utility of audits and future features

Rel: https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet-tls-bootstrapping/
Rel: https://github.com/poseidon/terraform-render-bootstrap/pull/185
2020-04-28 19:35:33 -07:00
Dalton Hubble
38a6bddd06 Update Calico from v3.13.1 to v3.13.3
* https://docs.projectcalico.org/v3.13/release-notes/
2020-04-23 23:58:02 -07:00
Dalton Hubble
84ed0a31c3 Update Prometheus from v2.17.1 to v2.17.2
* https://github.com/prometheus/prometheus/releases/tag/v2.17.2
2020-04-20 18:09:24 -07:00
Dalton Hubble
fcbee12334 Fix race condition creating DigitalOcean firewall rules
* DigitalOcean firewall rules should reference Terraform tag
resources rather than using tag strings. Otherwise, terraform
apply can fail (neeeds rerun) if a tag has not yet been created
2020-04-19 16:55:02 -07:00
Dalton Hubble
feac94605a Fix bootstrap mount to use shared volume SELinux label
* Race: During initial bootstrap, static control plane pods
could hang with Permission denied to bootstrap secrets. A
manual fix involved restarting Kubelet, which relabeled mounts
The race had no effect on subsequent reboots.
* bootstrap.service runs podman with a private unshared mount
of /etc/kubernetes/bootstrap-secrets which uses an SELinux MCS
label with a category pair. However, bootstrap-secrets should
be shared as its mounted by Docker pods kube-apiserver,
kube-scheduler, and kube-controller-manager. Restarting Kubelet
was a manual fix because Kubelet relabels all /etc/kubernetes
* Fix bootstrap Pod to use the shared volume label, which leaves
bootstrap-secrets files with SELinux level s0 without MCS
* Also allow failed bootstrap.service to be re-applied. This was
missing on bare-metal and AWS
2020-04-19 16:31:32 -07:00
Dalton Hubble
2b1b918b43 Revert Flatcar Linux Azure to manual upload images
* Initial support for Flatcar Linux on Azure used the Flatcar
Linux Azure Marketplace images (e.g. `flatcar-stable`) in
https://github.com/poseidon/typhoon/pull/664
* Flatcar Linux Azure Marketplace images have some unresolved
items https://github.com/poseidon/typhoon/issues/703
* Until the Marketplace items are resolved, revert to requiring
Flatcar Linux's images be manually uploaded (like GCP and
DigitalOcean)
2020-04-18 15:40:57 -07:00
Dalton Hubble
671eacb86e Update Kubernetes from v1.18.1 to v1.18.2
* https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.18.md#changelog-since-v1181
2020-04-16 23:40:52 -07:00
Dalton Hubble
5c4a3f73d5 Add support for Fedora CoreOS on Azure
* Add `azure/fedora-coreos/kubernetes` module
2020-04-12 16:35:49 -07:00
Dalton Hubble
76ab4c4c2a Change container-linux module preference to Flatcar Linux
* No change to Fedora CoreOS modules
* For Container Linx AWS and Azure, change the `os_image` default
from coreos-stable to flatcar-stable
* For Container Linux GCP and DigitalOcean, change `os_image` to
be required since users should upload a Flatcar Linux image and
set the variable
* For Container Linux bare-metal, recommend users change the
`os_channel` to Flatcar Linux. No actual module change.
2020-04-11 14:52:30 -07:00
Dalton Hubble
1420700bc0 Update CHANGES for v1.18.1 release
* Change order of modules in the README
2020-04-11 13:23:49 -07:00
Dalton Hubble
80538e2953 Add support for Fedora CoreOS on DigitalOcean
* Add `digital-ocean/fedora-coreos/kubernetes` module
* DigitalOcean custom uploaded images do not permit
droplet IPv6 networking
2020-04-09 23:55:29 -07:00
Dalton Hubble
73af2f3b7c Update Kubernetes from v1.18.0 to v1.18.1
* https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.18.md#v1181
2020-04-08 19:41:48 -07:00
Dalton Hubble
17ea547723 Update etcd from v3.4.5 to v3.4.7
* https://github.com/etcd-io/etcd/releases/tag/v3.4.7
* https://github.com/etcd-io/etcd/releases/tag/v3.4.6
2020-04-06 21:09:25 -07:00
Dalton Hubble
2b5dfece93 Update Grafana from v6.7.1 to v6.7.2
* https://github.com/grafana/grafana/releases/tag/v6.7.2
2020-04-04 13:13:19 -07:00
Dalton Hubble
d47d40b517 Refresh Prometheus rules/alerts and Grafana dashboards
* Refresh upstream Prometheus rules and alerts and Grafana
dashboards
* All Loki recording rules for convenience
2020-03-31 21:53:01 -07:00
Dalton Hubble
bbbaf949f9 Fix UDP outbound and clock sync timeouts on Azure workers
* Add "lb" outbound rule for worker TCP _and_ UDP traffic
* Fix Azure worker nodes clock synchronization being inactive
due to timeouts reaching the CoreOS / Flatcar NTP pool
* Fix Azure worker nodes not providing outbount UDP connectivity

Background:

Azure provides VMs outbound connectivity either by having a public
IP or via an SNAT masquerade feature bundled with their virtual
load balancing abstraction (in contrast with, say, a NAT gateway).

Azure worker nodes have only a private IP, but are associated with
the cluster load balancer's backend pool and ingress frontend IP.
Outbound traffic uses SNAT with this frontend IP. A subtle detail
with Azure SNAT seems to be that since both inbound lb_rule's are
TCP only, outbound UDP traffic isn't SNAT'd (highlights the reasons
Azure shouldn't have conflated inbound load balancing with outbound
SNAT concepts). However, adding a separate outbound rule and
disabling outbound SNAT on our ingress lb_rule's we can tell Azure
to continue load balancing as before, and support outbound SNAT for
worker traffic of both the TCP and UDP protocol.

Fixes clock synchronization timeouts:

```
systemd-timesyncd[786]: Timed out waiting for reply from
45.79.36.123:123 (3.flatcar.pool.ntp.org)
```

Azure controller nodes have their own public IP, so controllers (and
etcd) nodes have not had clock synchronization or outbound UDP issues
2020-03-31 21:00:16 -07:00
Dalton Hubble
135c6182b8 Update flannel from v0.11.0 to v0.12.0
* https://github.com/coreos/flannel/releases/tag/v0.12.0
2020-03-31 18:31:59 -07:00
Dalton Hubble
c53dc66d4a Rename Container Linux snippets variable for consistency
* Rename controller_clc_snippets to controller_snippets (cloud platforms)
* Rename worker_clc_snippets to worker_snippets (cloud platforms)
* Rename clc_snippets to snippets (bare-metal)
2020-03-31 18:25:51 -07:00
Dalton Hubble
9960972726 Fix bootstrap regression when networking="flannel"
* Fix bootstrap error for missing `manifests-networking/crd*yaml`
when `networking = "flannel"`
* Cleanup manifest-networking directory left during bootstrap
* Regressed in v1.18.0 changes for Calico https://github.com/poseidon/typhoon/pull/675
2020-03-31 18:21:59 -07:00
Dalton Hubble
bac5acb3bd Change default kube-system DaemonSet tolerations
* Change kube-proxy, flannel, and calico-node DaemonSet
tolerations to tolerate `node.kubernetes.io/not-ready`
and `node-role.kubernetes.io/master` (i.e. controllers)
explicitly, rather than tolerating all taints
* kube-system DaemonSets will no longer tolerate custom
node taints by default. Instead, custom node taints must
be enumerated to opt-in to scheduling/executing the
kube-system DaemonSets
* Consider setting the daemonset_tolerations variable
of terraform-render-bootstrap at a later date

Background: Tolerating all taints ruled out use-cases
where certain nodes might legitimately need to keep
kube-proxy or CNI networking disabled
Related: https://github.com/poseidon/terraform-render-bootstrap/pull/179
2020-03-31 01:00:45 -07:00
Dalton Hubble
70bdc9ec94 Allow bootstrap re-apply for Fedora CoreOS GCP
* Problem: Fedora CoreOS images are manually uploaded to GCP. When a
cluster is created with a stale image, Zincati immediately checks
for the latest stable image, fetches, and reboots. In practice,
this can unfortunately occur exactly during the initial cluster
bootstrap phase.

* Recommended: Upload the latest Fedora CoreOS image regularly
* Mitigation: Allow a failed bootstrap.service run (which won't touch
the done ConditionalPathExists) to be re-run by running `terraforma apply`
again. Add a known issue to CHANGES
* Update docs to show the current Fedora CoreOS stable version to
reduce likelihood users see this issue

 Longer term ideas:

* Ideal: Fedora CoreOS publishes a stable channel. Instances will always
boot with the latest image in a channel. The problem disappears since
it works the same way AWS does
* Timer: Consider some timer-based approach to have zincati delay any
system reboots for the first ~30 min of a machine's life. Possibly just
configured on the controller node https://github.com/coreos/zincati/pull/251
* External coordination: For Container Linux, locksmith filled a similar
role and was disabled to allow CLUO to coordinate reboots. By running
atop Kubernetes, it was not possible for the reboot to occur before
cluster bootstrap
* Rely on https://github.com/coreos/zincati/issues/115 to delay the
reboot since bootstrap involves an SSH session
* Use path-based activation of zincati on controllers and set that
path at the end of the bootstrap process

Rel: https://github.com/coreos/fedora-coreos-tracker/issues/239
2020-03-28 18:12:31 -07:00
Dalton Hubble
144bb9403c Add support for Fedora CoreOS snippets
* Refresh snippets customization docs
* Requires terraform-provider-ct v0.5+
2020-03-28 16:15:04 -07:00
Dalton Hubble
5fca08064b Fix Fedora CoreOS AMI to filter for stable images
* Fix issue observed in us-east-1 where AMI filters chose the
latest testing channel release, rather than the stable chanel
* Fedora CoreOS AMI filter selects the latest image with a
matching name, x86_64, and hvm, excluding dev images. Add a
filter for "Fedora CoreOS stable", which seems to be the only
distinguishing metadata indicating the channel
2020-03-28 12:57:45 -07:00
Dalton Hubble
a1a5da6bc2 Add CoreOS Container Linux EOL recommendation to CHANGES
* Recommend that users who have not yet tried Fedora CoreOS or
Flatcar Linux do so. Likely, Container Linux will reach EOL
and platform support / stability ratings will be in a mixed
state. Nevertheless, folks should migrate by September.
2020-03-26 23:41:54 -07:00
Dalton Hubble
076b8e3c42 Update Prometheus from v2.17.0 to v2.17.1
* https://github.com/prometheus/prometheus/releases/tag/v2.17.1
2020-03-26 22:17:13 -07:00
Dalton Hubble
ef5f953e04 Set docker log driver to journald on Fedora CoreOS
* Before Kubernetes v1.18.0, Kubelet only supported kubectl
`--limit-bytes` with the Docker `json-file` log driver so
the Fedora CoreOS default was overridden for conformance.
See https://github.com/poseidon/typhoon/pull/642
* Kubelet v1.18+ implemented support for other docker log
drivers, so the Fedora CoreOS default `journald` can be
used again

Rel: https://github.com/kubernetes/kubernetes/issues/86367
2020-03-26 22:06:45 -07:00
Dalton Hubble
f100a90d28 Update Kubernetes from v1.17.4 to v1.18.0
* https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.18.md
2020-03-25 17:51:50 -07:00
Dalton Hubble
5d1e4ad333 Deprecate asset_dir variable and remove docs
* Remove docs for the `asset_dir` variable and deprecate
it in CHANGES. It will be removed in an upcoming release
* Typhoon v1.17.0 introduced a new mechanism for managing
and distributing generated assets that stopped relying on
writing out to disk. `asset_dir` became optional and
defaulted to being unset / off (recommended)
2020-03-25 00:00:01 -07:00
Dalton Hubble
9f702c72d2 Rename DigitalOcean image variable to os_image
* Rename variable `image` to `os_image` to match the naming
used for the same purpose on other supported platforms (e.g.
AWS, Azure, GCP)
2020-03-24 23:49:37 -07:00
Dalton Hubble
e556bc2167 Update Prometheus from v2.17.0-rc.3 to v2.17.0
* https://github.com/prometheus/prometheus/releases/tag/v2.17.0
2020-03-24 23:15:49 -07:00
Dalton Hubble
590d941f50 Switch from upstream hyperkube image to individual images
* Kubernetes plans to stop releasing the hyperkube container image
* Upstream will continue to publish `kube-apiserver`, `kube-controller-manager`,
`kube-scheduler`, and `kube-proxy` container images to `k8s.gcr.io`
* Upstream will publish Kubelet only as a binary for distros to package,
either as a DEB/RPM on traditional distros or a container image on
container-optimized operating systems
* Typhoon will package the upstream Kubelet (checksummed) and its
dependencies as a container image for use on CoreOS Container Linux,
Flatcar Linux, and Fedora CoreOS
* Update the Typhoon container image security policy to list
`quay.io/poseidon/kubelet`as an official distributed artifact

Hyperkube: https://github.com/kubernetes/kubernetes/pull/88676
Kubelet Container Image: https://github.com/poseidon/kubelet
Kubelet Quay Repo: https://quay.io/repository/poseidon/kubelet
2020-03-21 15:43:05 -07:00
Dalton Hubble
ddc1ff5348 Update Grafana from v6.6.2 to v6.7.1
* https://github.com/grafana/grafana/releases/tag/v6.7.1
2020-03-21 15:27:55 -07:00
Dalton Hubble
61557e89a6 Update Prometheus from v2.16.0 to v2.17.0-rc.3
* https://github.com/prometheus/prometheus/releases/tag/v2.17.0-rc.3
2020-03-19 22:38:05 -07:00
Dalton Hubble
c3ef21dbf5 Update etcd from v3.4.4 to v3.4.5
* https://github.com/etcd-io/etcd/releases/tag/v3.4.5
2020-03-18 20:50:41 -07:00
Dalton Hubble
2a5dddeb9d Promote Fedora CoreOS AWS and Google Cloud
* Promote Fedora CoreOS AWS to stable
* Promote Fedora CoreOS GCP to beta
2020-03-16 22:12:26 -07:00
Dalton Hubble
75fb4e5d11 Remove Container Linux Update Operator (CLUO) addon
* Stop providing example manifests for the Container Linux
Update Operator (CLUO)
* CLUO requires patches to support Kubernetes v1.16+, but the
project and push access is rather unowned
* CLUO hasn't been in active use in our clusters and won't be
relevant beyond Container Linux. Not to say folks can't patch
it and run it on their own. Examples just aren't provided here

Related: https://github.com/coreos/container-linux-update-operator/pull/197
2020-03-16 22:05:17 -07:00
Dalton Hubble
bc7902f40a Update Kubernetes from v1.17.3 to v1.17.4
* https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.17.md#v1174
2020-03-13 00:06:41 -07:00
Dalton Hubble
70bf39bb9a Update Calico from v3.12.0 to v3.13.1
* https://docs.projectcalico.org/v3.13/release-notes/
2020-03-12 23:00:38 -07:00
Dalton Hubble
4e1b8f22df Add support for Flatcar Linux on Azure
* Accept `os_image` "flatcar-stable" and "flatcar-beta" to
use Kinvolk's Flatcar Linux images from the Azure Marketplace

Note: Flatcar Linux Azure Marketplace images require terms be
accepted before use
2020-03-12 22:52:48 -07:00
Dalton Hubble
ab7913a061 Accept initial worker node labels and taints map on bare-metal
* Add `worker_node_labels` map from node name to a list of initial
node label strings
* Add `worker_node_taints` map from node name to a list of initial
node taint strings
* Unlike cloud platforms, bare-metal node labels and taints
are defined via a map from node name to list of labels/taints.
Bare-metal clusters may have heterogeneous hardware so per node
labels and taints are accepted
* Only worker node names are allowed. Workloads are not scheduled
on controller nodes so altering their labels/taints isn't suitable

```
module "mercury" {
  ...

  worker_node_labels = {
    "node2" = ["role=special"]
  }

  worker_node_taints = {
    "node2" = ["role=special:NoSchedule"]
  }
}
```

Related: https://github.com/poseidon/typhoon/issues/429
2020-03-09 00:12:02 -07:00
Dalton Hubble
7b0ea23cdc Upgrade terraform-provider-azurerm to v2.0+
* Add support for `terraform-provider-azurerm` v2.0+. Require
`terraform-provider-azurerm` v2.0+ and drop v1.x support since
the Azure provider major release is not backwards compatible
* Use Azure's new Linux VM and Linux VM Scale Set resources
* Change controller's Azure disk caching to None
* Associate subnets (in addition to NICs) with security groups
(aesthetic)
* If set, change `worker_priority` from `Low` to `Spot` (action required)

Related:

* https://www.terraform.io/docs/providers/azurerm/guides/2.0-upgrade-guide.html
2020-03-08 17:40:13 -07:00
Dalton Hubble
c4683c5bad Refresh Prometheus alerts and Grafana dashboards
* Add 2 min wait before KubeNodeUnreachable to be less
noisy on premeptible clusters
* Add a BlackboxProbeFailure alert for any failing probes
for services annotated `prometheus.io/probe: true`
2020-03-02 20:08:37 -08:00
Dalton Hubble
51cee6d5a4 Change Container Linux etcd-member to fetch with docker://
* Quay has historically generated ACI signatures for images to
facilitate rkt's notions of verification (it allowed authors to
actually sign images, though `--trust-keys-from-https` is in use
since etcd and most authors don't sign images). OCI standardization
didn't adopt verification ideas and checking signatures has fallen
out of favor.
* Fix an issue where Quay no longer seems to be generating ACI
signatures for new images (e.g. quay.io/coreos/etcd:v.3.4.4)
* Don't be alarmed by rkt `--insecure-options=image`. It refers
to disabling image signature checking (i.e. docker pull doesn't
check signatures either)
* System containers for Kubelet and bootstrap have transitioned
to the docker:// transport, so there is precedent and this brings
all the system containers on Container Linux controllers into
alignment
2020-03-02 19:57:45 -08:00
Dalton Hubble
87f9a2fc35 Add automatic worker deletion on Fedora CoreOS clouds
* On clouds where workers can scale down or be preempted
(AWS, GCP, Azure), shutdown runs delete-node.service to
remove a node a prevent NotReady nodes from lingering
* Add the delete-node.service that wasn't carried over
from Container Linux and port it to use podman
2020-02-29 20:22:03 -08:00
Dalton Hubble
6de5cf5a55 Update etcd from v3.4.3 to v3.4.4
* https://github.com/etcd-io/etcd/releases/tag/v3.4.4
2020-02-29 16:19:29 -08:00
Dalton Hubble
3250994c95 Use a route table with separate (rather than inline) routes
* Allow users to extend the route table using a data reference
and adding route resources (e.g. unusual peering setups)
* Note: Internally connecting AWS clusters can reduce cross-cloud
flexibility and inhibits blue-green cluster patterns. It is not
recommended
2020-02-25 23:21:58 -08:00
Dalton Hubble
f4d260645c Update node-exporter from v0.18.1 to v1.0.0-rc.0
* Update mdadm alert rule; node-exporter adds `state` label to
`node_md_disks` and removes `node_md_disks_active`
* https://github.com/prometheus/node_exporter/releases/tag/v1.0.0-rc.0
2020-02-25 22:29:52 -08:00
Dalton Hubble
d9219a6722 Update nginx-ingress from v0.29.0 to v0.30.0
* https://github.com/kubernetes/ingress-nginx/releases/tag/nginx-0.30.0
2020-02-25 22:11:59 -08:00
Dalton Hubble
60c7eb85ee Update nginx-ingress from v0.28.0 to v0.29.0
* https://github.com/kubernetes/ingress-nginx/releases/tag/nginx-0.29.0
2020-02-22 15:57:59 -08:00
Dalton Hubble
4c964b56a0 Update kube-state-metrics from v1.9.4 to v1.9.5
* https://github.com/kubernetes/kube-state-metrics/releases/tag/v1.9.5
2020-02-22 15:21:10 -08:00
Dalton Hubble
1fbd6835f2 Update Grafana from v6.6.1 to v6.6.2
* https://github.com/grafana/grafana/releases/tag/v6.6.2
2020-02-22 15:19:24 -08:00
Dalton Hubble
e4d977bfcd Fix worker_node_labels for initial Fedora CoreOS
* Add Terraform strip markers to consume beginning and
trailing whitespace in templated Kubelet arguments for
podman (Fedora CoreOS only)
* Fix initial `worker_node_labels` being quietly ignored
on Fedora CoreOS cloud platforms that offer the feature
* Close https://github.com/poseidon/typhoon/issues/650
2020-02-22 15:12:35 -08:00
Dalton Hubble
4a38fb5927 Update CoreDNS from v1.6.6 to v1.6.7
* https://coredns.io/2020/01/28/coredns-1.6.7-release/
2020-02-18 21:46:19 -08:00
Dalton Hubble
7ca03e5219 Update Prometheus from v1.15.2 to v1.16.0
* https://github.com/prometheus/prometheus/releases/tag/v2.16.0
2020-02-14 12:10:56 -08:00
Dalton Hubble
362b3fac5c Add guide for Typhoon with Flatcar Linux on DigitalOcean
* Add docs on manually uploading a Flatcar Linux DigitalOcean
bin image as a custom image and using a data reference
* Set status of Flatcar Linux on DigitalOcean to alpha
* IPv6 is not supported for DigitalOcean custom images
2020-02-14 12:08:58 -08:00
Dalton Hubble
32db59b9eb Update CHANGELOG sections and links 2020-02-14 12:05:51 -08:00
Dalton Hubble
008817b0aa Promote Fedora CoreOS AWS/bare-metal to beta
* Remove alpha warnings from docs headers
2020-02-13 14:25:22 -08:00
Dalton Hubble
49d3b9e6b3 Set docker log driver to json-file on Fedora CoreOS
* Fix the last minor issue for Fedora CoreOS clusters to pass CNCF's
Kubernetes conformance tests
* Kubelet supports a seldom used feature `kubectl logs --limit-bytes=N`
to trim a log stream to a desired length. Kubelet handles this in the
CRI driver. The Kubelet docker shim only supports the limit bytes
feature when Docker is configured with the default `json-file` logging
driver
* CNCF conformance tests started requiring limit-bytes be supported,
indirectly forcing the log driver choice until either the Kubelet or
the conformance tests are fixed
* Fedora CoreOS defaults Docker to use `journald` (desired). For now,
as a workaround to offer conformant clusters, the log driver can
be set back to `json-file`. RHEL CoreOS likely won't have noticed the
non-conformance since its using crio runtime
* https://github.com/kubernetes/kubernetes/issues/86367

Note: When upstream has a fix, the aim is to drop the docker config
override and use the journald default
2020-02-11 23:00:38 -08:00
Dalton Hubble
1243f395d1 Update Kubernetes from v1.17.2 to v1.17.3
* https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.17.md#v1173
2020-02-11 20:22:14 -08:00
Dalton Hubble
846f11097f Update Fedora CoreOS kernel arguments to align with upstream
* Align bare-metal kernel arguments with upstream docs
* Add missing initrd argument which can cause issues if
not present. Fix #638
* Add tty0 and ttyS0 consoles (matches Container Linux)
* Remove unused coreos.inst=yes

Related: https://docs.fedoraproject.org/en-US/fedora-coreos/bare-metal/
2020-02-11 20:11:19 -08:00
Dalton Hubble
ba84f86dc7 Add guide for Typhoon with Flatcar Linux on Google Cloud
* Add docs on manually uploading a Flatcar Linux GCE/GCP gzipped
tarball image as a Compute Engine image for use with the Typhoon
container-linux module
* Set status of Flatcar Linux on Google Cloud to alpha
2020-02-11 19:38:40 -08:00
Dalton Hubble
34c3d7cc39 Update Grafana from v6.6.0 to v6.6.1
* https://github.com/grafana/grafana/releases/tag/v6.6.1
2020-02-08 14:50:33 -08:00
Dalton Hubble
ca96a1335c Update Calico from v3.11.2 to v3.12.0
* https://docs.projectcalico.org/release-notes/#v3120
* Remove reverse packet filter override, since Calico no
longer relies on the setting
* https://github.com/coreos/fedora-coreos-tracker/issues/219
* https://github.com/projectcalico/felix/pull/2189
2020-02-06 00:43:33 -08:00
Dalton Hubble
e339fbd2b6 Update kube-state-metrics from v1.9.3 to v1.9.4
* https://github.com/kubernetes/kube-state-metrics/releases/tag/v1.9.4
2020-02-04 21:33:34 -08:00
Dalton Hubble
8cc303c9ac Add module for Fedora CoreOS on Google Cloud
* Add Typhoon Fedora CoreOS on Google Cloud as alpha
* Add docs on uploading the Fedora CoreOS GCP gzipped tarball to
Google Cloud storage to create a boot disk image
2020-02-01 15:21:40 -08:00
Dalton Hubble
b19ba16afa Update nginx-ingress from v0.27.1 to v0.28.0
* https://github.com/kubernetes/ingress-nginx/releases/tag/nginx-0.28.0
2020-01-30 18:00:23 -08:00
Dalton Hubble
d127a7345c Update Grafana from v6.5.3 to v6.6.0
* https://github.com/grafana/grafana/releases/tag/v6.6.0
2020-01-27 20:46:32 -08:00
Dalton Hubble
5643ad525f Promote Fedora CoreOS from preview to alpha in docs
* Add an announcement to the website as well
2020-01-23 08:47:18 -08:00
Dalton Hubble
d5b7ce8f27 Update kube-state-metrics from v1.9.2 to v1.9.3
* https://github.com/kubernetes/kube-state-metrics/releases/tag/v1.9.3
2020-01-23 00:03:16 -08:00
Dalton Hubble
1cda5bcd2a Update Kubernetes from v1.17.1 to v1.17.2
* https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.17.md#v1172
2020-01-21 18:27:39 -08:00
Dalton Hubble
bda73264f7 Update nginx-ingress from v0.26.1 to v0.27.1
* Change runAsUser from 33 to 101 for new alpine-based image
* https://github.com/kubernetes/ingress-nginx/releases/tag/nginx-0.27.0
* https://github.com/kubernetes/ingress-nginx/releases/tag/nginx-0.27.1
2020-01-20 15:22:16 -08:00
Dalton Hubble
dd930a2ff9 Update bare-metal Fedora CoreOS image location
* Use Fedora CoreOS production download streams (change)
* Use live PXE kernel and initramfs images
* https://getfedora.org/coreos/download/
* Update docs example to use public images (cache is still
recommended at large scale) and stable stream
2020-01-20 14:44:06 -08:00
Dalton Hubble
03ff3a9cf3 Update kube-state-metrics from v1.9.1 to v1.9.2
* https://github.com/kubernetes/kube-state-metrics/releases/tag/v1.9.2
2020-01-18 15:32:10 -08:00
Dalton Hubble
48703f9906 Update Grafana from v6.5.2 to v6.5.3
* https://github.com/grafana/grafana/releases/tag/v6.5.3
2020-01-18 15:30:39 -08:00
Dalton Hubble
7daabd28b5 Update Calico from v3.11.1 to v3.11.2
* https://docs.projectcalico.org/v3.11/release-notes/
2020-01-18 13:45:24 -08:00
Dalton Hubble
0e2fc89f78 Update kube-state-metrics from v1.9.0 to v1.9.1
* https://github.com/kubernetes/kube-state-metrics/releases/tag/v1.9.1
2020-01-11 14:15:55 -08:00
Dalton Hubble
b1f521fc4a Allow terraform-provider-google v3.x plugin versions
* Typhoon Google Cloud is compatible with `terraform-provider-google`
v3.x releases
* No v3.x specific features are used, so v2.19+ provider versions are
still allowed, to ease migrations
2020-01-11 14:07:18 -08:00
Dalton Hubble
73588cfad3 Update Prometheus from v2.15.1 to v2.15.2
* https://github.com/prometheus/prometheus/releases/tag/v2.15.2
2020-01-06 22:08:34 -08:00
Dalton Hubble
bb586b60da Reduce Prometheus addon's node-exporter tolerations
* Change node-exporter DaemonSet tolerations from tolerating
all possible NoSchedule taints to tolerating the master taint
and the not ready taint (we'd like metrics regardless)
* Users who add custom node taints must add their custom taints
to the addon node-exporter DaemonSet. As an addon, its expected
users copy and manipulate manifests out-of-band in their own
systems
2020-01-06 21:24:24 -08:00
Dalton Hubble
43e05b9131 Enable kube-proxy metrics and allow Prometheus scrapes
* Configure kube-proxy --metrics-bind-address=0.0.0.0 (default
127.0.0.1) to serve metrics on 0.0.0.0:10249
* Add firewall rules to allow Prometheus (resides on a worker) to
scrape kube-proxy service endpoints on controllers or workers
* Add a clusterIP: None service for kube-proxy endpoint discovery
2020-01-06 21:11:18 -08:00
Dalton Hubble
b2eb3e05d0 Disable Kubelet 127.0.0.1.10248 healthz endpoint
* Kubelet runs a healthz server listening on 127.0.0.1:10248
by default. Its unused by Typhoon and can be disabled
* https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/
2019-12-29 11:23:25 -08:00
Dalton Hubble
f1f4cd6fc0 Inline Container Linux kubelet.service, deprecate kubelet-wrapper
* Change kubelet.service on Container Linux nodes to ExecStart Kubelet
inline to replace the use of the host OS kubelet-wrapper script
* Express rkt run flags and volume mounts in a clear, uniform way to
make the Kubelet service easier to audit, manage, and understand
* Eliminate reliance on a Container Linux kubelet-wrapper script
* Typhoon for Fedora CoreOS developed a kubelet.service that similarly
uses an inline ExecStart (except with podman instead of rkt) and a
more minimal set of volume mounts. Adopt the volume improvements:
  * Change Kubelet /etc/kubernetes volume to read-only
  * Change Kubelet /etc/resolv.conf volume to read-only
  * Remove unneeded /var/lib/cni volume mount

Background:

* kubelet-wrapper was added in CoreOS around the time of Kubernetes v1.0
to simplify running a CoreOS-built hyperkube ACI image via rkt-fly. The
script defaults are no longer ideal (e.g. rkt's notion of trust dates
back to quay.io ACI image serving and signing, which informed the OCI
standard images we use today, though they still lack rkt's signing ideas).
* Shipping kubelet-wrapper was regretted at CoreOS, but remains in the
distro for compatibility. The script is not updated to track hyperkube
changes, but it is stable and kubelet.env overrides bridge most gaps
* Typhoon Container Linux nodes have used kubelet-wrapper to rkt/rkt-fly
run the Kubelet via the official k8s.gcr.io hyperkube image using overrides
(new image registry, new image format, restart handling, new mounts, new
entrypoint in v1.17).
* Observation: Most of what it takes to run a Kubelet container is defined
in Typhoon, not in kubelet-wrapper. The wrapper's value is now undermined
by having to workaround its dated defaults. Typhoon may be better served
defining Kubelet.service explicitly
* Typhoon for Fedora CoreOS developed a kubelet.service without the use
of a host OS kubelet-wrapper which is both clearer and eliminated some
volume mounts
2019-12-29 11:17:26 -08:00
Dalton Hubble
11565ffa8a Update Calico from v3.10.2 to v3.11.1
* https://docs.projectcalico.org/v3.11/release-notes/
2019-12-28 11:08:03 -08:00
Dalton Hubble
a4e843693f Update Prometheus from v2.15.0 to v2.15.1
* https://github.com/prometheus/prometheus/releases/tag/v2.15.1
2019-12-26 09:12:55 -05:00
Dalton Hubble
f48e43c0b1 Update Prometheus from v2.14.0 to v2.15.0
* https://github.com/prometheus/prometheus/releases/tag/v2.15.0
2019-12-24 10:52:19 -05:00
Dalton Hubble
daa8d9d9ec Update CoreDNS from v1.6.5 to v1.6.6
* https://coredns.io/2019/12/11/coredns-1.6.6-release/
2019-12-22 10:47:19 -05:00
Dalton Hubble
52d11096dc Update kube-state-metrics from v1.9.0-rc.1 to v1.9.0
* https://github.com/kubernetes/kube-state-metrics/releases/tag/v1.9.0
* https://github.com/kubernetes/kube-state-metrics/releases/tag/v1.9.0-rc.1
* https://github.com/kubernetes/kube-state-metrics/releases/tag/v1.9.0-rc.0
2019-12-20 13:53:37 -08:00
Dalton Hubble
1b9fa2e688 Update Grafana from v6.5.1 to v6.5.2
* https://github.com/grafana/grafana/releases/tag/v6.5.2
2019-12-14 15:25:48 -08:00
Dalton Hubble
f69dc2ea0f Update CHANGES and tutorial notes for release
* Update recommended Terraform and provider plugin versions
* Update the rough count of resources created per cluster
since its not been refreshed in a while (will vary based
on cluster options)
2019-12-10 23:03:39 -08:00
Dalton Hubble
c0ce04e1de Update Calico from v3.10.1 to v3.10.2
* https://docs.projectcalico.org/v3.10/release-notes/
2019-12-09 21:03:00 -08:00
Dalton Hubble
de36d99afc Update Kubernetes from v1.16.3 to v1.17.0
* https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.17.md/#v1170
2019-12-09 18:31:58 -08:00
Dalton Hubble
4fce9485c8 Reduce kube-controller-manager pod eviction timeout from 5m to 1m
* Reduce time to delete pods on unready nodes from 5m to 1m
* Present since v1.13.3, but mistakenly removed in v1.16.0 static
pod control plane migration

Related:

* https://github.com/poseidon/terraform-render-bootstrap/pull/148
* https://github.com/poseidon/terraform-render-bootstrap/pull/164
2019-12-08 22:58:31 -08:00
Dalton Hubble
178afe4a9b Reduce apiserver metrics cardinality and extraneous labels
* Stop mapping node labels to targets discovered via Kubernetes
nodes (e.g. etcd, kubelet, cadvisor). It is rarely useful to
store node labels (e.g. kubernetes.io/os=linux) on these metrics
* kube-apiserver's apiserver_request_duration_seconds_bucket metric
has a high cardinality that includes labels for the API group, verb,
scope, resource, and component for each object type, including for
each CRD. This one metric has ~10k time series in a typical cluster
(btw 10-40% of total)
* Removing the apiserver request duration outright would make latency
alerts a NoOp and break a Grafana apiserver panel. Instead, drop series
that have a "group" label. Effectively, only request durations for
core Kubernetes APIs will be kept (e.g. cardinality won't grow with
each CRD added). This reduces the metric to ~2k unique series
2019-12-08 22:48:25 -08:00
Dalton Hubble
d9c7a9e049 Add/update docs for asset_dir and kubeconfig usage
* Original tutorials favored including the platform (e.g.
google-cloud) in modules (e.g. google-cloud-yavin). Prefer
naming conventions where each module / cluster has a simple
name (e.g. yavin) since the platform is usually redundant
* Retain the example cluster naming themes per platform
2019-12-05 22:56:42 -08:00
Dalton Hubble
26674083b6 Update Grafana from v6.5.0 to v6.5.1
* https://github.com/grafana/grafana/releases/tag/v6.5.1
2019-11-28 14:11:25 -08:00
Dalton Hubble
030a4cec19 Update Grafana from v6.4.4 to v6.5.0
* https://grafana.com/docs/guides/whats-new-in-v6-5/
2019-11-25 22:45:58 -08:00
Dalton Hubble
ddea7dc452 Use new resource dashboards in Grafana deployment
* kubernetes-mixin pod resource dashboards were split into
two ConfigMap parts because they provide richer networking
details
* New dashboards have been used by the author at the global
level, but were missing in the per-cluster Grafana tracked
here
2019-11-25 22:27:11 -08:00
Dalton Hubble
525ae23305 Add node-exporter alerts and Grafana dashboard
* Add Prometheus alerts from node-exporter
* Add Grafana dashboard nodes.json, from node-exporter
* Not adding recording rules, since those are only used
by some node-exporter USE dashboards not being included
2019-11-16 13:47:20 -08:00
Dalton Hubble
19ee57dc04 Use GCP region_instance_group_manager version block format
* terraform-provider-google v2.19.0 deprecates `instance_template`
within `google_compute_region_instance_group_manager` in order to
support a scheme with multiple version blocks. Adapt our single
version to the new format to resolve deprecation warnings.
* Fixes: Warning: "instance_template": [DEPRECATED] This field
will be replaced by `version.instance_template` in 3.0.0
* Require terraform-provider-google v2.19.0+ (action required)
2019-11-13 17:41:13 -08:00
Dalton Hubble
0e4ee5efc9 Add small CPU resource requests to static pods
* Set small CPU requests on static pods kube-apiserver,
kube-controller-manager, and kube-scheduler to align with
upstream tooling and for edge cases
* Effectively, a practical case for these requests hasn't been
observed. However, a small static pod CPU request may offer
a slight benefit if a controller became overloaded and the
below mechanisms were insufficient

Existing safeguards:

* Control plane nodes are tainted to isolate them from
ordinary workloads. Even dense workloads can only compress
CPU resources on worker nodes.
* Control plane static pods use the highest priority class, so
contention favors control plane pods (over say node-exporter)
and CPU is compressible too.

See: https://github.com/poseidon/terraform-render-bootstrap/pull/161
2019-11-13 17:18:45 -08:00
Dalton Hubble
a271b9f340 Update CoreDNS from v1.6.2 to v1.6.5
* Add health `lameduck` option 5s. Before CoreDNS shuts down, it will
wait and report unhealthy for 5s to allow time for plugins to shutdown
cleanly
* Minor bug fixes over a few releases
* https://coredns.io/2019/08/31/coredns-1.6.3-release/
* https://coredns.io/2019/09/27/coredns-1.6.4-release/
* https://coredns.io/2019/11/05/coredns-1.6.5-release/
2019-11-13 16:47:44 -08:00
Dalton Hubble
cb0598e275 Adopt Terraform v0.12 templatefile function
* Update terraform-render-bootstrap module to adopt the
Terrform v0.12 templatefile function feature to replace
the use of terraform-provider-template's `template_dir`
* Require Terraform v0.12.6+ which adds `for_each`

Background:

* `template_dir` was added to `terraform-provider-template`
to add support for template directory rendering in CoreOS
Tectonic Kubernetes distribution (~2017)
* Terraform v0.12 introduced a native `templatefile` function
and v0.12.6 introduced native `for_each` support (July 2019)
that makes it possible to replace `template_dir` usage
2019-11-13 16:33:36 -08:00
Dalton Hubble
42b6df89c8 Update Prometheus from v2.14.0-rc.0 to v2.14.0
* https://github.com/prometheus/prometheus/releases/tag/v2.14.0
2019-11-13 13:41:11 -08:00
Dalton Hubble
d7061020ba Update Kubernetes from v1.16.2 to v1.16.3
* https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.16.md#v1163
2019-11-13 13:05:15 -08:00
Dalton Hubble
a8b7792338 Update Grafana from v6.4.3 to v6.4.4
* https://github.com/grafana/grafana/releases/tag/v6.4.4
2019-11-07 12:00:25 -08:00
Dalton Hubble
a3807086d4 Update Prometheus from v2.13.1 to v2.14.0-rc.0
* Happy PromCon 2019!
* https://github.com/prometheus/prometheus/releases/tag/v2.14.0-rc.0
2019-11-07 11:48:23 -08:00
Dalton Hubble
2c163503f1 Update etcd from v3.4.2 to v3.4.3
* etcd v3.4.3 builds with Go v1.12.12 instead of v1.12.9
and adds a few minor metrics fixes
* https://github.com/etcd-io/etcd/compare/v3.4.2...v3.4.3
2019-11-07 11:41:01 -08:00
Dalton Hubble
0034a15711 Update Calico from v3.10.0 to v3.10.1
* https://docs.projectcalico.org/v3.10/release-notes/
2019-11-07 11:38:32 -08:00
Dalton Hubble
4775e9d0f7 Upgrade Calico v3.9.2 to v3.10.0
* Allow advertising Kubernetes service ClusterIPs to BGPPeer
routers via a BGPConfiguration
* Improve EdgeRouter docs about routes and BGP
* https://docs.projectcalico.org/v3.10/release-notes/
* https://docs.projectcalico.org/v3.10/networking/advertise-service-ips
2019-10-27 14:13:41 -07:00
Dalton Hubble
d418045929 Switch kube-proxy from iptables mode to ipvs mode
* Kubernetes v1.11 considered kube-proxy IPVS mode GA
* Many problems were found #321
* Since then, major blockers seem to have been addressed
2019-10-27 00:37:41 -07:00
Dalton Hubble
de90cb9246 Remove kube-state-metrics addon-resizer
* addon-resizer is outdated and has been dropped from
kube-state-metrics examples. Those using it should look
to the cluster-proportional-vertical-autoscaler.
* Eliminate addon-resizer log spew
* Remove associated Role and RoleBinding
* Also fix kube-state-metrics readinessProbe port
2019-10-20 16:03:29 -07:00
Dalton Hubble
68da420adc Refresh Prometheus rules/alerts and Grafana dashboards
* Update Prometheus rules/alerts and Grafana dashboards
* Remove dashboards that were moved to node-exporter, they
may be added back later if valuable
* Remove kube-prometheus based rules/alerts (ClockSkew alert)
2019-10-19 17:43:47 -07:00
Dalton Hubble
130c97f8eb Update Prometheus from v2.13.0 to v2.13.1
* https://github.com/prometheus/prometheus/releases/tag/v2.13.1
2019-10-18 00:10:25 -07:00
Dalton Hubble
271d2f6b52 Update Grafana from v6.4.2 to v6.4.3
* https://github.com/grafana/grafana/releases/tag/v6.4.3
2019-10-18 00:08:39 -07:00
Dalton Hubble
0595915a19 Cleanup CHANGES notes 2019-10-15 23:25:45 -07:00
Dalton Hubble
e6bc5143aa Default to Calico as the CNI provider on Azure/DigitalOcean
* Change `networking` default from flannel to calico on
Azure and DigitalOcean
* AWS, bare-metal, and Google Cloud continue to default
to Calico (as they have since v1.7.5)
* Typhoon now defaults to using Calico and supporting
NetworkPolicy on all platforms
2019-10-15 23:15:40 -07:00
Dalton Hubble
e4ac1027c8 Update Grafana from v6.4.1 to v6.4.2
* https://github.com/grafana/grafana/releases/tag/v6.4.2
2019-10-15 22:58:43 -07:00
Dalton Hubble
24fc440d83 Update Kubernetes from v1.16.1 to v1.16.2
* Update Calico from v3.9.1 to v3.9.2
2019-10-15 22:42:52 -07:00
Dalton Hubble
a6702573a2 Update etcd from v3.4.1 to v3.4.2
* https://github.com/etcd-io/etcd/releases/tag/v3.4.2
2019-10-15 00:06:15 -07:00
Dalton Hubble
5b9dab6659 Introduce list of detail objects for bare-metal machines
* Define bare-metal `controllers` and `workers` as a complex type
list(object{name=string, mac=string, domain=string}) to allow
clusters with many machines to be defined more cleanly
* Remove `controller_names` list variable
* Remove `controller_macs` list variable
* Remove `controller_domains` list variable
* Remove `worker_names` list variable
* Remove `worker_macs` list variable
* Remove `worker_domains` list variable
2019-10-06 20:22:45 -07:00
Dalton Hubble
5196709fe0 Update docs, CHANGES, and mkdocs-material
* Update mkdocs-material from v4.4.2 to v4.4.3
* Update recommended Terraform provider versions
* Cleanup the changelog before release
2019-10-06 18:41:25 -07:00
Dalton Hubble
ab72f1ab2d Update Prometheus from v2.12.0 to v2.13.0
* https://github.com/prometheus/prometheus/releases/tag/v2.13.0
2019-10-06 18:22:20 -07:00
Dalton Hubble
5ef4155e08 Detect most recent Fedora CoreOS AMI in region
* Detect the most recent Fedora CoreOS AMI to allow usage
of Fedora CoreOS in supported regions (previously just
us-east-1)
* Unpin the Fedora CoreOS AMI image which was pinned to
images that had been checked. This does mean if Fedora
publishes a broken image, it will be selected
* Filter out "dev" images which have similar naming
2019-10-06 18:13:55 -07:00
Dalton Hubble
15c4b793c3 Use new Fedora CoreOS kernel/initrd/raw asset names
* Fedora CoreOS changed the kernel, initramfs, and raw
image asset download paths and names in 30.20191002.0
2019-10-06 17:31:21 -07:00
Dalton Hubble
36ed53924f Add stricter types for bare-metal modules
* Review variables available in bare-metal kubernetes modules
for Container Linux and Fedora CoreOS
* Deprecate cluster_domain_suffix variable
* Remove deprecated container_linux_oem variable
2019-10-06 17:18:50 -07:00
Dalton Hubble
19de38b30d Fix Prometheus etcd metrics scraping
* Prometheus was configured to use kubernetes discovery
of etcd targets based on nodes matching the node label
node-role.kubernetes.io/controller=true
* Kubernetes v1.16 stopped permitting node role labels
node-role.kubernetes.io/* so Typhoon renamed these labels
(no longer any association with roles) to
node.kubermetes.io/controller=true
* As a result, Prometheus didn't discover etcd targets,
etcd metrics were missing, etcd alerts were ineffective,
and the etcd Grafana dashboard was empty
* Introduced: https://github.com/poseidon/typhoon/pull/543
2019-10-03 19:07:05 -07:00
Dalton Hubble
995824fa6d Add stricter types for DigitalOcean module
* Review variables available in DigitalOcean kubernetes
module and sync with documentation
* Promote Calico for DigitalOcean and Azure beyond experimental
(its the primary mode I've used since it was introduced)
2019-10-02 21:48:24 -07:00
Dalton Hubble
1c5ed84fc2 Update Kubernetes from v1.16.0 to v1.16.1
* https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.16.md#v1161
2019-10-02 21:31:55 -07:00
Dalton Hubble
ca7d62720e Update Grafana from v6.3.6 to v6.4.1
* https://github.com/grafana/grafana/releases/tag/v6.4.1
2019-10-02 20:36:05 -07:00
Dalton Hubble
26f8d76755 Update kube-state-metrics from v1.7.2 to v1.8.0
* https://github.com/kubernetes/kube-state-metrics/releases/tag/v1.8.0
2019-10-01 20:50:33 -07:00
Dalton Hubble
fdd6882a87 Add stricter types to Azure modules
* Review variables available in Azure kubernetes and workers
modules and sync with documentation
* Fix internal workers module default type to Standard_DS1_v2
2019-09-30 22:20:20 -07:00
Dalton Hubble
f82266ac8c Add stricter types for GCP modules
* Review variables available in google-cloud kubernetes
and workers modules and in documentation
2019-09-30 22:04:35 -07:00
Dalton Hubble
7bcf2d7831 Update nginx-ingress from v0.25.1 to v0.26.1
* Add lifecycle hook to allow draining connections for
up to 5 minutes
2019-09-30 22:01:07 -07:00
Dalton Hubble
96afa6a531 Update Calico from v3.8.2 to v3.9.1
* https://docs.projectcalico.org/v3.9/release-notes/
2019-09-29 11:22:53 -07:00
Dalton Hubble
a407ff72df Add stricter types for AWS modules and update docs
* Review variables available in AWS kubernetes and workers
modules and documentation
* Switching between spot and on-demand has worked since
Terraform v0.12
* Generally, there are too many knobs. Less useful ones
should be de-emphasized or removed
* Remove `cluster_domain_suffix` documentation
2019-09-29 11:19:38 -07:00
Dalton Hubble
f453c54956 Update Grafana from v6.3.5 to v6.3.6
* https://github.com/grafana/grafana/releases/tag/v6.3.6
2019-09-28 15:13:46 -07:00
Dalton Hubble
3e34fb075b Update etcd from v3.4.0 to v3.4.1
* https://github.com/etcd-io/etcd/releases/tag/v3.4.1
2019-09-28 15:09:57 -07:00
Dalton Hubble
9bfb1c5faf Update docs and variable types for worker node_labels
* Document worker pools `node_labels` variable to set the
initial node labels for a homogeneous set of workers
* Document `worker_node_labels` convenience variable to
set the initial node labels for default worker nodes
2019-09-28 15:05:12 -07:00
Dalton Hubble
8703f2c3c5 Fix missing comma separator on bare-metal and DO
* Introduced in bare-metal and DigitalOcean in #544
while addressing possible ordering race, but after
the v1.16 upgrade validation
2019-09-23 11:05:26 -07:00
Dalton Hubble
078f084220 Update CHANGES and docs for v1.16.0 release 2019-09-22 17:37:23 -07:00
Dalton Hubble
9da3725738 Update Kubernetes from v1.15.3 to v1.16.0
* Drop `node-role.kubernetes.io/master` and
`node-role.kubernetes.io/node` node labels
* Kubelet (v1.16) now rejects the node labels used
in the kubectl get nodes ROLES output
* https://github.com/kubernetes/kubernetes/issues/75457
2019-09-18 22:53:06 -07:00
Dalton Hubble
b15c60fa2f Update CHANGES for control plane static pod switch
* Remove old references to bootkube / self-hosted
2019-09-09 22:48:48 -07:00
Dalton Hubble
4a7083d94a Change Azure default controller_type and worker_type
* Change default controller_type to Standard_B2s. A B2s is cheaper
by $17/month and provides 2 vCPU, 4GB RAM (vs 1 vCPU, 3.5GB RAM)
* Change default worker_type to Standard_DS1_v2. F1 was the previous
generation. The DS1_v2 is newer, similar cost, more memory, and still
supports Low Priority mode, if desired
2019-09-09 22:34:28 -07:00
Dalton Hubble
c20683067d Update etcd from v3.3.15 to v3.4.0
* https://github.com/etcd-io/etcd/releases/tag/v3.4.0
2019-09-08 15:32:49 -07:00
Dalton Hubble
dc436b8fe9 Update Grafana from v6.3.4 to v6.3.5
* https://github.com/grafana/grafana/releases/tag/v6.3.5
2019-09-07 14:21:59 -07:00
Dalton Hubble
b74f470701 Recommend updating terraform-provider-ct from v0.3.2 to v0.4.0
* v0.4.0 adds a "strict" mode we'll start using in future and
also adds support for Fedora CoreOS
* https://github.com/poseidon/terraform-provider-ct/releases/tag/v0.4.0
2019-08-31 16:07:22 -07:00
Dalton Hubble
45bc52d156 Update Grafana from v6.3.3 to v6.3.4
* https://github.com/grafana/grafana/releases/tag/v6.3.4
2019-08-31 15:59:13 -07:00
Dalton Hubble
4d5f962d76 Update CoreDNS from v1.5.0 to v1.6.2
* https://coredns.io/2019/06/26/coredns-1.5.1-release/
* https://coredns.io/2019/07/03/coredns-1.5.2-release/
* https://coredns.io/2019/07/28/coredns-1.6.0-release/
* https://coredns.io/2019/08/02/coredns-1.6.1-release/
* https://coredns.io/2019/08/13/coredns-1.6.2-release/
2019-08-31 15:57:42 -07:00
Dalton Hubble
c42139beaa Update etcd from v3.3.14 to v3.3.15
* No functional changes, just changes to vendoring tools
(go modules -> glide). Still, update to v3.3.15 anyway
* https://github.com/etcd-io/etcd/compare/v3.3.14...v3.3.15
2019-08-19 15:05:21 -07:00
Dalton Hubble
35c2763ab0 Update Kubernetes from v1.15.2 to v1.15.3
* https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.15.md/#v1153
2019-08-19 14:49:24 -07:00
Dalton Hubble
8f412e2f09 Update etcd from v3.3.13 to v3.3.14
* https://github.com/etcd-io/etcd/releases/tag/v3.3.14
2019-08-18 21:05:06 -07:00
Dalton Hubble
4ef2eb7e6b Update Prometheus from v2.11.2 to v2.12.0
* https://github.com/prometheus/prometheus/releases/tag/v2.12.0
2019-08-18 20:59:44 -07:00
Dalton Hubble
99990e3cbb Use stable IDs for etcd, CoreDNS, and Ngnix dashboards
* Use unique dashboard ID so that multiple replicas of Grafana
serve dashboards with uniform paths
* Fix issue where refreshing a dashboard served by one replica
could show a 404 unless the request went to the same replica
2019-08-18 12:45:49 -07:00
Dalton Hubble
3c3708d58e Update Calico from v3.8.1 to v3.8.2
* https://docs.projectcalico.org/v3.8/release-notes/
2019-08-16 15:38:23 -07:00
Dalton Hubble
0c45cd0f06 Update Grafana from v6.3.2 to v6.3.3
* https://github.com/grafana/grafana/releases/tag/v6.3.3
2019-08-16 14:40:47 -07:00
Dalton Hubble
976452825e Update Prometheus from v2.11.0 to v2.11.2
* https://github.com/prometheus/prometheus/releases/tag/v2.11.2
2019-08-14 21:26:46 -07:00
Dalton Hubble
7bc5633c38 Update nginx-ingress from v0.25.0 to v0.25.1
* https://github.com/kubernetes/ingress-nginx/releases/tag/nginx-0.25.1
2019-08-14 21:26:46 -07:00
Dalton Hubble
6db11d5908 Enable AWS root block device encryption by default
* terraform-provider-aws v2.23.0 allows AWS root block devices
to enable encryption by default.
* Require updating terraform-provider-aws to v2.23.0 or higher
* Enable root EBS device encryption by default for controller
instances and worker instances in auto-scaling groups

For comparison:

* Google Cloud persistent disks have been encrypted by
default for years
* Azure managed disk encryption is not ready yet (#486)
2019-08-07 21:13:44 -07:00
Dalton Hubble
eaea4d37a2 Update Grafana from v6.2.5 to v6.3.2
* https://github.com/grafana/grafana/releases/tag/v6.3.2
* https://github.com/grafana/grafana/releases/tag/v6.3.1
* https://github.com/grafana/grafana/releases/tag/v6.3.0
2019-08-07 20:01:18 -07:00
Dalton Hubble
457ad18daa Update kube-state-metrics from v1.7.1 to v1.7.2
* Add a separate liveness and readiness probe
* https://github.com/kubernetes/kube-state-metrics/releases/tag/v1.7.2
2019-08-07 20:00:24 -07:00
Dalton Hubble
f79568c02a Add CHANGES section for v1.15.2 release 2019-08-06 09:01:22 -07:00
Dalton Hubble
10d4d9e565 Add Grafana dashboards for CoreDNS and Nginx Ingress Controller
* Add a CoreDNS dashboard originally based on an upstream dashboard,
but now customized according to preferences
* Add an Nginx Ingress Controller based on an upstream dashboard,
but customized according to preferences
2019-08-05 22:49:19 -07:00
Dalton Hubble
2227f2cc62 Update Kubernetes from v1.15.1 to v1.15.2
* https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.15.md#v1152
2019-08-05 08:48:57 -07:00
Dalton Hubble
dcd6733649 Update Calico from v3.8.0 to v3.8.1
* https://docs.projectcalico.org/v3.8/release-notes/
2019-07-27 15:31:13 -07:00
Dalton Hubble
b9ccfedfe5 Update CHANGES for v1.15.1 release 2019-07-21 11:58:56 -07:00
Dalton Hubble
68d8717924 Refresh Prometheus rules/alerts and Grafana dashboards
* Refresh rules, alerts, and dashboards from upstreams
2019-07-21 11:29:34 -07:00
Dalton Hubble
c8df349e55 Fix to add all Azure controller nodes to address pool
* Add all Azure controllers to the apiserver load balancer
backend address pool
* Previously, kube-apiserver availability relied on the 0th
controller being up. Multi-controller was just providing etcd
data redundancy
2019-07-21 10:38:17 -07:00
Dalton Hubble
e0be091acc Update kube-state-metrics from v1.7.0 to v1.7.1
* https://github.com/kubernetes/kube-state-metrics/releases/tag/v1.7.1
2019-07-20 20:17:08 -07:00
Dalton Hubble
e0c7676a15 Update Kubernetes from v1.15.0 to v1.15.1
* https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.15.md#downloads-for-v1151
2019-07-19 01:21:08 -07:00
Dalton Hubble
6cd3e65267 Update kube-state-metrics from v1.7.0-rc.1 to v1.7.0
* https://github.com/kubernetes/kube-state-metrics/releases/tag/v1.7.0
* Add storageclasses and verticalpodautoscalers to ClusterRole
2019-07-19 00:14:47 -07:00
Dalton Hubble
dfa6bcfecf Relax terraform-provider-ct version constraint
* Allow updating terraform-provider-ct to any release
beyond v0.3.2, but below v1.0. This relaxes the prior
constraint that allowed only v0.3.y provider versions
2019-07-16 22:07:37 -07:00
Dalton Hubble
70f5cfd33e Update kube-state-metrics from v1.6.0 to v1.7.0-rc.1
* https://github.com/kubernetes/kube-state-metrics/releases/tag/v1.7.0-rc.1
* https://github.com/kubernetes/kube-state-metrics/releases/tag/v1.7.0-rc.0
2019-07-13 13:13:57 -07:00
Dalton Hubble
9e91d7f011 Upgrade Calico from v3.7.4 to v3.8.0
* Enable CNI bandwidth plugin for traffic shaping
* https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/network-plugins/#support-traffic-shaping
2019-07-11 21:01:41 -07:00
Dalton Hubble
eaf59bd33f Update Prometheus from v2.11.0-rc.0 to v2.11.0
* https://github.com/prometheus/prometheus/releases/tag/v2.11.0
2019-07-09 21:33:24 -07:00
Dalton Hubble
40640f3697 Upgrade nginx-ingress from v0.24.1 to v0.25.0
* Support networking.k8s.io/v1beta1 apiVersion
* Update RBAC cluster-role for networking.k8s.io/v1beta1
* https://github.com/kubernetes/ingress-nginx/releases/tag/nginx-0.25.0
2019-07-08 22:04:50 -07:00
Dalton Hubble
28ab746068 Update Prometheus from v2.10.0 to v2.11.0-rc.0
* https://github.com/prometheus/prometheus/releases/tag/v2.11.0-rc.0
2019-07-08 21:32:50 -07:00
Dalton Hubble
69d064bfdf Run kube-apiserver with lower privilege user (nobody)
* Run kube-apiserver as a non-root user (nobody). User
no longer needs to bind low number ports.
* On most platforms, the kube-apiserver load balancer listens
on 6443 and fronts controllers with kube-apiserver pods using
port 6443. Google Cloud TCP proxy load balancers cannot listen
on 6443. However, GCP's load balancer can be made to listen on
443, while kube-apiserver uses 6443 across all platforms.
2019-07-08 20:52:00 -07:00
Dalton Hubble
7a69bae75e Raise GCP network deletion timeout from 4m to 6m
* Fix a GCP errata item https://github.com/poseidon/typhoon/wiki/Errata
* Removal of a Google Cloud cluster often required 2 runs of
`terraform apply` because network resource deletes timeout
after 4m. Raise the network deletion timeout to 6m to
ensure apply only needs to be run once to remove a cluster
2019-07-06 13:15:33 -07:00
Dalton Hubble
3fcb04f68c Improve apiserver backend service zone spanning
* google_compute_backend_services use nested blocks to define
backends (instance groups heterogeneous controllers)
* Use Terraform v0.12.x dynamic blocks so the apiserver backend
service can refer to (up to zone-many) controller instance groups
* Previously, with Terraform v0.11.x, the apiserver backend service
had to list a fixed set of backends to span controller nodes across
zones in multi-controller setups. 3 backends were used because each
GCP region offered at least 3 zones. Single-controller clusters had
the cosmetic ugliness of unused instance groups
* Allow controllers to span more than 3 zones if avilable in a
region (e.g. currently only us-central1, with 4 zones)

Related:

* https://www.terraform.io/docs/providers/google/r/compute_backend_service.html
* https://www.terraform.io/docs/configuration/expressions.html#dynamic-blocks
2019-07-05 19:46:26 -07:00
Dalton Hubble
8d373b5850 Update Calico from v3.7.3 to v3.7.4
* https://docs.projectcalico.org/v3.7/release-notes/
2019-07-02 20:18:02 -07:00
Dalton Hubble
9a395dbf88 Update Grafana from v6.2.4 to v6.2.5
* https://github.com/grafana/grafana/releases/tag/v6.2.5
2019-06-29 13:21:42 -07:00