Commit Graph

1052 Commits

Author SHA1 Message Date
Dalton Hubble
e9c8520359 Add experimental Cilium CNI provider
* Accept experimental CNI `networking` mode "cilium"
* Run Cilium v1.8.0-rc4 with overlay vxlan tunnels and a
minimal set of features. We're interested in:
  * IPAM: Divide pod_cidr into /24 subnets per node
  * CNI networking pod-to-pod, pod-to-external
  * BPF masquerade
  * NetworkPolicy as defined by Kubernetes (no L7 Policy)
* Continue using kube-proxy with Cilium probe mode
* Firewall changes:
  * Require UDP 8472 for vxlan (Linux kernel default) between nodes
  * Optional ICMP echo(8) between nodes for host reachability
    (health)
  * Optional TCP 4240 between nodes for endpoint reachability (health)

Known Issues:

* Containers with `hostPort` don't listen on all host addresses,
these workloads must use `hostNetwork` for now
https://github.com/cilium/cilium/issues/12116
* Erroneous warning on Fedora CoreOS
https://github.com/cilium/cilium/issues/10256

Note: This is experimental. It is not listed in docs and may be
changed or removed without a deprecation notice

Related:

* https://github.com/poseidon/terraform-render-bootstrap/pull/192
* https://github.com/cilium/cilium/issues/12217
2020-06-21 20:41:53 -07:00
Dalton Hubble
37f00a3882 Reduce Calcio MTU on Fedora CoreOS Azure
* Change the Calico VXLAN interface for MTU from 1450 to 1410
* VXLAN on Azure should support MTU 1450. However, there is
history where performance measures have shown that 1410 is
needed to have expected performance. Flatcar Linux has the
same MTU 1410 override and note
* FCOS 31.20200323.3.2 was known to perform fine with 1450, but
now in 31.20200517.3.0 the right value seems to be 1410
2020-06-19 00:24:56 -07:00
Dalton Hubble
4cfafeaa07 Fix Kubelet starting before hostname set on FCOS AWS
* Fedora CoreOS `kubelet.service` can start before the hostname
is set. Kubelet reads the hostname to determine the node name to
register. If the hostname was read as localhost, Kubelet will
continue trying to register as localhost (problem)
* This race manifests as a node that appears NotReady, the Kubelet
is trying to register as localhost, while the host itself (by then)
has an AWS provided hostname. Restarting kubelet.service is a
manual fix so Kubelet re-reads the hostname
* This race could only be shown on AWS, not on Google Cloud or
Azure despite attempts. Bare-metal and DigitalOcean differ and
use hostname-override (e.g. afterburn) so they're not affected
* Wait for nodes to have a non-localhost hostname in the oneshot
that awaits /etc/resolve.conf. Typhoon has no valid cases for a
node hostname being localhost (not even single-node clusters)

Related Openshift: https://github.com/openshift/machine-config-operator/pull/1813
Close https://github.com/poseidon/typhoon/issues/765
2020-06-19 00:19:54 -07:00
Dalton Hubble
90e23f5822 Rename controller node label and NoSchedule taint
* Remove node label `node.kubernetes.io/master` from controller nodes
* Use `node.kubernetes.io/controller` (present since v1.9.5,
[#160](https://github.com/poseidon/typhoon/pull/160)) to node select controllers
* Rename controller NoSchedule taint from `node-role.kubernetes.io/master` to
`node-role.kubernetes.io/controller`
* Tolerate the new taint name for workloads that may run on controller nodes
and stop tolerating `node-role.kubernetes.io/master` taint
2020-06-19 00:12:13 -07:00
Dalton Hubble
6234147948 Update recommended Terraform provider versions
* Sync Terraform provider plugin versions with those
used internally
2020-06-18 01:03:37 -07:00
Dalton Hubble
c25c59058c Update Kubernetes from v1.18.3 to v1.18.4
* https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.18.md#v1184
2020-06-17 19:53:19 -07:00
Dalton Hubble
bc9b808d44 Update nginx-ingress from v0.32.0 to v0.33.0
* https://github.com/kubernetes/ingress-nginx/releases/tag/controller-0.33.0
2020-06-16 18:44:40 -07:00
Dalton Hubble
4b0203fdb2 Fix typo in DigitalOcean docs title 2020-06-16 18:33:56 -07:00
Dalton Hubble
331566e1f7 Update mkdocs packages for website 2020-06-16 18:20:19 -07:00
Dalton Hubble
04520e447c Update node-exporter from v1.0.0 to v1.0.1
* https://github.com/prometheus/node_exporter/releases/tag/v1.0.1
2020-06-16 17:57:09 -07:00
Dalton Hubble
413585681b Remove unused Kubelet lock-file and exit-on-lock-contention
* Kubelet `--lock-file` and `--exit-on-lock-contention` date
back to usage of bootkube and at one point running Kubelet
in a "self-hosted" style whereby an on-host Kubelet (rkt)
started pods, but then a Kubelet DaemonSet was scheduled
and able to take over (hence self-hosted). `lock-file` and
`exit-on-lock-contention` flags supported this pivot. The
pattern has been out of favor (in bootkube too) for years
because of dueling Kubelet complexity
* Typhoon runs Kubelet as a container via an on-host systemd
unit using podman (Fedora CoreOS) or rkt (Flatcar Linux). In
fact, Typhoon no longer uses bootkube or control plane pivot
(let alone Kubelet pivot) and uses static pods since v1.16.0
* https://github.com/poseidon/typhoon/pull/536
2020-06-12 00:06:41 -07:00
Dalton Hubble
96711d7f17 Remove unused Kubelet cert / key Terraform state
* Generated Kubelet TLS certificate and key are not longer
used or distributed to machines since Kubelet TLS bootstrap
is used instead. Remove the certificate and key from state
2020-06-11 21:24:36 -07:00
Dalton Hubble
c9059d3fe9 Update Prometheus from v2.19.0-rc.0 to v2.19.0
* https://github.com/prometheus/prometheus/releases/tag/v2.19.0
2020-06-09 23:05:03 -07:00
Dalton Hubble
a287920169 Use strict mode for Container Linux Configs
* Enable terraform-provider-ct `strict` mode for parsing
Container Linux Configs and snippets
* Fix Container Linux Config systemd unit syntax `enable`
(old) to `enabled`
* Align with Fedora CoreOS which uses strict mode already
2020-06-09 23:00:36 -07:00
Dalton Hubble
8dc170b9d9 Update security disclosure contact email
* Use security@psdn.io across github.com/poseidon projects
2020-06-08 12:37:09 -07:00
Dalton Hubble
aed1a5f33d Fix Fedora CoreOS docs for selecting a stream
* Fedora CoreOS image `os_stream` stable, testing, and next
have been configurable since v1.18.3
* Remove mention of outdated `os_image` variable
2020-06-08 12:25:57 -07:00
Dalton Hubble
31d02b0221 Update Prometheus from v2.18.1 to v2.19.0-rc.0
* https://github.com/prometheus/prometheus/releases/tag/v2.19.0-rc.0
2020-06-05 00:16:45 -07:00
Dalton Hubble
8f875f80f5 Update Grafana from v7.0.1 to v7.0.3
* https://github.com/grafana/grafana/releases/tag/v7.0.2
* https://github.com/grafana/grafana/releases/tag/v7.0.3
2020-06-03 12:31:58 -07:00
Dalton Hubble
16c0b9152b Update kube-state-metrics from v1.9.6 to v1.9.7
* https://github.com/kubernetes/kube-state-metrics/releases/tag/v1.9.7
2020-06-03 11:35:10 -07:00
Dalton Hubble
99dbce67a3 Tweak minor style elements of issue templates 2020-05-31 16:19:33 -07:00
Dalton Hubble
20bfd69780 Change Kubelet container image publishing
* Build Kubelet container images internally and publish
to Quay and Dockerhub (new) as an alternative in case of
registry outage or breach
* Use our infra to provide single and multi-arch (default)
Kublet images for possible future use
* Docs: Show how to use alternative Kubelet images via
snippets and a systemd dropin (builds on #737)

Changes:

* Update docs with changes to Kubelet image building
* If you prefer to trust images built by Quay/Dockerhub,
automated image builds are still available with unique
tags (albeit with some limitations):
  * Quay automated builds are tagged `build-{short_sha}`
  (limit: only amd64)
  * Dockerhub automated builts are tagged `build-{tag}`
  and `build-master` (limit: only amd64, no shas)

Links:

* Kubelet: https://github.com/poseidon/kubelet
* Docs: https://typhoon.psdn.io/topics/security/#container-images
* Registries:
  * quay.io/poseidon/kubelet
  * docker.io/psdn/kubelet
2020-05-30 23:34:23 -07:00
Dalton Hubble
ba44408b76 Update Calico from v3.14.0 to v3.14.1
* https://docs.projectcalico.org/v3.14/release-notes/
2020-05-30 22:08:37 -07:00
Dalton Hubble
455175d9e6 Update the fallback issue template
* Even "blank" issues need to fill out the fallback
template
2020-05-28 00:06:59 -07:00
Dalton Hubble
d45804b1f6
Update Github issue template to use drop-downs (#747)
* Create a stricter bug report template
* Highlight topics that are not accepted in issues: operation, support, debugging, advice, or Kubernetes concepts
* Add a section to strongly suggest bug reports link a PR or describe a solution. This may be able to weed out topics that aren't focused bug reports
2020-05-27 23:49:25 -07:00
Dalton Hubble
907a96916f Update mkdocs-material from v5.2.0 to v5.2.2
* https://github.com/squidfunk/mkdocs-material/releases/tag/5.2.2
2020-05-27 21:49:40 -07:00
Dalton Hubble
187bb17d39 Update Grafana from v7.0.0 to v7.0.1
* https://github.com/grafana/grafana/releases/tag/v7.0.1
2020-05-27 21:35:24 -07:00
Dalton Hubble
abc31c3711 Update node-exporter from v1.0.0-rc.1 to v1.0.0
* https://github.com/prometheus/node_exporter/releases/tag/v1.0.0
2020-05-27 21:33:03 -07:00
Dalton Hubble
283e14f3e0 Update recommended Terraform provider versions
* Sync Terraform provider plugin versions to those actively
used internally
* Fix terraform fmt
2020-05-22 01:12:53 -07:00
Dalton Hubble
e72f916c8d Update etcd from v3.4.8 to v3.4.9
* https://github.com/etcd-io/etcd/blob/master/CHANGELOG-3.4.md#v349-2020-05-20
2020-05-22 00:52:20 -07:00
Dalton Hubble
c52f9f8d08 Upgrade docs packages and refresh content
* Promote DigitalOcean from alpha to beta for Fedora
CoreOS and Flatcar Linux
* Upgrade mkdocs-material and PyPI packages for docs
* Replace docs mentions of Container Linux with Flatcar
Linux and move docs/cl to docs/flatcar-linux
* Deprecate CoreOS Container Linux support. Its still
usable for some time, but start removing docs
2020-05-20 23:31:26 -07:00
Dalton Hubble
ecae6679ff Update Kubernetes from v1.18.2 to v1.18.3
* https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.18.md
2020-05-20 20:37:39 -07:00
Dalton Hubble
4760543356 Set Kubelet image via kubelet.service KUBELET_IMAGE
* Write the systemd kubelet.service to use `KUBELET_IMAGE`
as the Kubelet. This provides a nice way to use systemd
dropins to temporarily override the image (e.g. during a
registry outage)

Note: Only Typhoon Kubelet images and registries are supported.
2020-05-19 22:39:53 -07:00
Dalton Hubble
09eb208b4e Fix Fedora CoreOS on GCP proposing controller recreate
* With Fedora CoreOS image stream support (#727), the latest
resolved image will change over the lifecycle of a cluster.
* Fix issue where an image diff proposed replacing a Fedora
CoreOS controller on GCP, introduced in #727 (unreleased)
* Also ignore image diffs to the GCP managed instance group
of workers. This aligns with worker AMI diffs being ignored
on AWS and similar on Azure, since workers update themselves.

Background:

* Controller nodes should strictly not be recreated by Terraform,
they are stateful (etcd) and should not be replaced
* Across cloud platforms, OS image diffs are ignored since both
Flatcar Linux and Fedora CoreOS nodes update themselves. For
workers, user-data or disk size diffs (where relevant) are allowed
to recreate workers templates/configs since these are considered
to be user-initiated declarations that a reprovision should be done
2020-05-19 21:41:51 -07:00
Dalton Hubble
8d024d22ad Update etcd from v3.4.7 to v3.4.8
* https://github.com/etcd-io/etcd/blob/master/CHANGELOG-3.4.md#v348-2020-05-18
2020-05-18 23:50:46 -07:00
Dalton Hubble
3bdddc452c Update Grafana from v7.0.0-beta2 to v7.0.0
* https://grafana.com/docs/grafana/latest/guides/whats-new-in-v7-0/
2020-05-18 23:42:32 -07:00
Dalton Hubble
ff4187a1fb Use new Azure subnet to set address_prefixes list
* Update Azure subnet `address_prefix` to `azure_prefixes` list
* Fix warning that `address_prefix` is deprecated
* Require `terraform-provider-azurerm` v2.8.0+ (action required)

Rel: https://github.com/terraform-providers/terraform-provider-azurerm/pull/6493
2020-05-18 23:35:47 -07:00
Dalton Hubble
2578be1f96 Rollback Grafana to v7.0.0-beta3, v7.0.0 image is missing
* Grafana hasn't published the v7.0.0 image yet
2020-05-16 12:32:10 -07:00
Dalton Hubble
90edcd3d77 Update node-exporter from v1.0.0-rc.0 to v1.0.0-rc.1
* https://github.com/prometheus/node_exporter/releases/tag/v1.0.0-rc.1
2020-05-15 18:03:19 -07:00
Dalton Hubble
a927c7c790 Update kube-state-metrics from v1.9.5 to v1.9.6
* https://github.com/kubernetes/kube-state-metrics/releases/tag/v1.9.6
2020-05-15 17:42:24 -07:00
Dalton Hubble
d952576d2f Update Grafana from v7.0.0-beta3 to v7.0.0
* https://github.com/grafana/grafana/releases/tag/7.0.0
2020-05-15 17:38:59 -07:00
Dalton Hubble
70e389f37f Restore use of Flatcar Linux Azure Marketplace image
* Switch Flatcar Linux Azure to use the Marketplace image
from Kinvolk (offer `flatcar-container-linux-free`)
* Accepting Azure Marketplace terms is still neccessary,
update docs to show accepting the free offer rather than
BYOL

* Upstream Flatcar: https://github.com/flatcar-linux/Flatcar/issues/82
* Typhoon: https://github.com/poseidon/typhoon/issues/703
2020-05-13 22:50:24 -07:00
Dalton Hubble
a18bd0a707 Highlight SELinux enforcing mode in features 2020-05-13 21:57:38 -07:00
Dalton Hubble
01905b00bc Support Fedora CoreOS OS image streams on AWS
* Add `os_stream` variable to set the stream to stable (default),
testing, or next
* Remove unused os_image variable on Fedora CoreOS AWS
2020-05-13 21:45:12 -07:00
Dalton Hubble
f4194cd57a Update Grafana from v7.0.0-beta2 to v7.0.0-beta.3
* https://github.com/grafana/grafana/releases/tag/v7.0.0-beta3
2020-05-09 17:50:40 -07:00
Dalton Hubble
a2db4fa8c4 Update Calico from v3.13.3 to v3.14.0
* https://docs.projectcalico.org/v3.14/release-notes/
2020-05-09 16:05:30 -07:00
Dalton Hubble
358854e712 Fix Calico install-cni crash loop on Pod restarts
* Set a consistent MCS level/range for Calico install-cni
* Note: Rebooting a node was a workaround, because Kubelet
relabels /etc/kubernetes(/cni/net.d)

Background:

* On SELinux enforcing systems, the Calico CNI install-cni
container ran with default SELinux context and a random MCS
pair. install-cni places CNI configs by first creating a
temporary file and then moving them into place, which means
the file MCS categories depend on the containers SELinux
context.
* calico-node Pod restarts creates a new install-cni container
with a different MCS pair that cannot access the earlier
written file (it places configs every time), causing the
init container to error and calico-node to crash loop
* https://github.com/projectcalico/cni-plugin/issues/874

```
mv: inter-device move failed: '/calico.conf.tmp' to
'/host/etc/cni/net.d/10-calico.conflist'; unable to remove target:
Permission denied
Failed to mv files. This may be caused by selinux configuration on
the
host, or something else.
```

Note, this isn't a host SELinux configuration issue.

Related:

* https://github.com/poseidon/terraform-render-bootstrap/pull/186
2020-05-09 16:01:44 -07:00
Dalton Hubble
b5dabcea31 Use Fedora CoreOS image streams on Google Cloud
* Add `os_stream` variable to set a Fedora CoreOS stream
to `stable` (default), `testing`, or `next`
* Deprecate `os_image` variable. Remove docs about uploading
Fedora CoreOS images manually, this is no longer needed
* https://docs.fedoraproject.org/en-US/fedora-coreos/update-streams/

Rel: https://github.com/coreos/fedora-coreos-docs/pull/70
2020-05-08 01:23:12 -07:00
Dalton Hubble
3f0a5d2715 Update Grafana from v7.0.0-beta1 to v7.0.0-beta2
* https://github.com/grafana/grafana/releases/tag/v7.0.0-beta2
2020-05-07 23:04:44 -07:00
Dalton Hubble
33173c0206 Update Prometheus from v2.18.0 to v2.18.1
* https://github.com/prometheus/prometheus/releases/tag/v2.18.1
2020-05-07 22:59:11 -07:00
Dalton Hubble
70f30d9c07 Update Prometheus from v2.18.0-rc.1 to v2.18.0
* https://github.com/prometheus/prometheus/releases/tag/v2.18.0
2020-05-05 22:31:11 -07:00