Commit Graph

186 Commits

Author SHA1 Message Date
Dalton Hubble
ecae6679ff Update Kubernetes from v1.18.2 to v1.18.3
* https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.18.md
2020-05-20 20:37:39 -07:00
Dalton Hubble
4760543356 Set Kubelet image via kubelet.service KUBELET_IMAGE
* Write the systemd kubelet.service to use `KUBELET_IMAGE`
as the Kubelet. This provides a nice way to use systemd
dropins to temporarily override the image (e.g. during a
registry outage)

Note: Only Typhoon Kubelet images and registries are supported.
2020-05-19 22:39:53 -07:00
Dalton Hubble
8d024d22ad Update etcd from v3.4.7 to v3.4.8
* https://github.com/etcd-io/etcd/blob/master/CHANGELOG-3.4.md#v348-2020-05-18
2020-05-18 23:50:46 -07:00
Dalton Hubble
ff4187a1fb Use new Azure subnet to set address_prefixes list
* Update Azure subnet `address_prefix` to `azure_prefixes` list
* Fix warning that `address_prefix` is deprecated
* Require `terraform-provider-azurerm` v2.8.0+ (action required)

Rel: https://github.com/terraform-providers/terraform-provider-azurerm/pull/6493
2020-05-18 23:35:47 -07:00
Dalton Hubble
70e389f37f Restore use of Flatcar Linux Azure Marketplace image
* Switch Flatcar Linux Azure to use the Marketplace image
from Kinvolk (offer `flatcar-container-linux-free`)
* Accepting Azure Marketplace terms is still neccessary,
update docs to show accepting the free offer rather than
BYOL

* Upstream Flatcar: https://github.com/flatcar-linux/Flatcar/issues/82
* Typhoon: https://github.com/poseidon/typhoon/issues/703
2020-05-13 22:50:24 -07:00
Dalton Hubble
a18bd0a707 Highlight SELinux enforcing mode in features 2020-05-13 21:57:38 -07:00
Dalton Hubble
a2db4fa8c4 Update Calico from v3.13.3 to v3.14.0
* https://docs.projectcalico.org/v3.14/release-notes/
2020-05-09 16:05:30 -07:00
Dalton Hubble
358854e712 Fix Calico install-cni crash loop on Pod restarts
* Set a consistent MCS level/range for Calico install-cni
* Note: Rebooting a node was a workaround, because Kubelet
relabels /etc/kubernetes(/cni/net.d)

Background:

* On SELinux enforcing systems, the Calico CNI install-cni
container ran with default SELinux context and a random MCS
pair. install-cni places CNI configs by first creating a
temporary file and then moving them into place, which means
the file MCS categories depend on the containers SELinux
context.
* calico-node Pod restarts creates a new install-cni container
with a different MCS pair that cannot access the earlier
written file (it places configs every time), causing the
init container to error and calico-node to crash loop
* https://github.com/projectcalico/cni-plugin/issues/874

```
mv: inter-device move failed: '/calico.conf.tmp' to
'/host/etc/cni/net.d/10-calico.conflist'; unable to remove target:
Permission denied
Failed to mv files. This may be caused by selinux configuration on
the
host, or something else.
```

Note, this isn't a host SELinux configuration issue.

Related:

* https://github.com/poseidon/terraform-render-bootstrap/pull/186
2020-05-09 16:01:44 -07:00
Dalton Hubble
fd044ee117 Enable Kubelet TLS bootstrap and NodeRestriction
* Enable bootstrap token authentication on kube-apiserver
* Generate the bootstrap.kubernetes.io/token Secret that
may be used as a bootstrap token
* Generate a bootstrap kubeconfig (with a bootstrap token)
to be securely distributed to nodes. Each Kubelet will use
the bootstrap kubeconfig to authenticate to kube-apiserver
as `system:bootstrappers` and send a node-unique CSR for
kube-controller-manager to automatically approve to issue
a Kubelet certificate and kubeconfig (expires in 72 hours)
* Add ClusterRoleBinding for bootstrap token subjects
(`system:bootstrappers`) to have the `system:node-bootstrapper`
ClusterRole
* Add ClusterRoleBinding for bootstrap token subjects
(`system:bootstrappers`) to have the csr nodeclient ClusterRole
* Add ClusterRoleBinding for bootstrap token subjects
(`system:bootstrappers`) to have the csr selfnodeclient ClusterRole
* Enable NodeRestriction admission controller to limit the
scope of Node or Pod objects a Kubelet can modify to those of
the node itself
* Ability for a Kubelet to delete its Node object is retained
as preemptible nodes or those in auto-scaling instance groups
need to be able to remove themselves on shutdown. This need
continues to have precedence over any risk of a node deleting
itself maliciously

Security notes:

1. Issued Kubelet certificates authenticate as user `system:node:NAME`
and group `system:nodes` and are limited in their authorization
to perform API operations by Node authorization and NodeRestriction
admission. Previously, a Kubelet's authorization was broader. This
is the primary security motivation.

2. The bootstrap kubeconfig credential has the same sensitivity
as the previous generated TLS client-certificate kubeconfig.
It must be distributed securely to nodes. Its compromise still
allows an attacker to obtain a Kubelet kubeconfig

3. Bootstrapping Kubelet kubeconfig's with a limited lifetime offers
a slight security improvement.
  * An attacker who obtains the kubeconfig can likely obtain the
  bootstrap kubeconfig as well, to obtain the ability to renew
  their access
  * A compromised bootstrap kubeconfig could plausibly be handled
  by replacing the bootstrap token Secret, distributing the token
  to new nodes, and expiration. Whereas a compromised TLS-client
  certificate kubeconfig can't be revoked (no CRL). However,
  replacing a bootstrap token can be impractical in real cluster
  environments, so the limited lifetime is mostly a theoretical
  benefit.
  * Cluster CSR objects are visible via kubectl which is nice

4. Bootstrapping node-unique Kubelet kubeconfigs means Kubelet
clients have more identity information, which can improve the
utility of audits and future features

Rel: https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet-tls-bootstrapping/
Rel: https://github.com/poseidon/terraform-render-bootstrap/pull/185
2020-04-28 19:35:33 -07:00
Dalton Hubble
38a6bddd06 Update Calico from v3.13.1 to v3.13.3
* https://docs.projectcalico.org/v3.13/release-notes/
2020-04-23 23:58:02 -07:00
Dalton Hubble
d8966afdda Remove extraneous sudo from layout asset unpacking 2020-04-22 20:28:01 -07:00
Dalton Hubble
feac94605a Fix bootstrap mount to use shared volume SELinux label
* Race: During initial bootstrap, static control plane pods
could hang with Permission denied to bootstrap secrets. A
manual fix involved restarting Kubelet, which relabeled mounts
The race had no effect on subsequent reboots.
* bootstrap.service runs podman with a private unshared mount
of /etc/kubernetes/bootstrap-secrets which uses an SELinux MCS
label with a category pair. However, bootstrap-secrets should
be shared as its mounted by Docker pods kube-apiserver,
kube-scheduler, and kube-controller-manager. Restarting Kubelet
was a manual fix because Kubelet relabels all /etc/kubernetes
* Fix bootstrap Pod to use the shared volume label, which leaves
bootstrap-secrets files with SELinux level s0 without MCS
* Also allow failed bootstrap.service to be re-applied. This was
missing on bare-metal and AWS
2020-04-19 16:31:32 -07:00
Dalton Hubble
2b1b918b43 Revert Flatcar Linux Azure to manual upload images
* Initial support for Flatcar Linux on Azure used the Flatcar
Linux Azure Marketplace images (e.g. `flatcar-stable`) in
https://github.com/poseidon/typhoon/pull/664
* Flatcar Linux Azure Marketplace images have some unresolved
items https://github.com/poseidon/typhoon/issues/703
* Until the Marketplace items are resolved, revert to requiring
Flatcar Linux's images be manually uploaded (like GCP and
DigitalOcean)
2020-04-18 15:40:57 -07:00
Dalton Hubble
bf22222f7d Remove temporary workaround for v1.18.0 apply issue
* In v1.18.0, kubectl apply would fail to apply manifests if any
single manifest was unable to validate. For example, if a CRD and
CR were defined in the same directory, apply would fail since the
CR would be invalid as the CRD wouldn't exist
* Typhoon temporary workaround was to separate CNI CRD manifests
and explicitly apply them first. No longer needed in v1.18.1+
* Kubernetes v1.18.1 restored the prior behavior where kubectl apply
applies as many valid manifests as it can. In the example above, the
CRD would be applied and the CR could be applied if the kubectl apply
was re-run (allowing for apply loops).
* Upstream fix: https://github.com/kubernetes/kubernetes/pull/89864
2020-04-16 23:49:55 -07:00
Dalton Hubble
671eacb86e Update Kubernetes from v1.18.1 to v1.18.2
* https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.18.md#changelog-since-v1181
2020-04-16 23:40:52 -07:00
Dalton Hubble
e2d4af43be Fix Fedora CoreOS Azure MTU with Calico
* With Calico VXLAN on Fedora CoreOS the 1450 MTU should
be used
2020-04-12 23:20:04 -07:00
Dalton Hubble
5c4a3f73d5 Add support for Fedora CoreOS on Azure
* Add `azure/fedora-coreos/kubernetes` module
2020-04-12 16:35:49 -07:00
Dalton Hubble
76ab4c4c2a Change container-linux module preference to Flatcar Linux
* No change to Fedora CoreOS modules
* For Container Linx AWS and Azure, change the `os_image` default
from coreos-stable to flatcar-stable
* For Container Linux GCP and DigitalOcean, change `os_image` to
be required since users should upload a Flatcar Linux image and
set the variable
* For Container Linux bare-metal, recommend users change the
`os_channel` to Flatcar Linux. No actual module change.
2020-04-11 14:52:30 -07:00
Dalton Hubble
73af2f3b7c Update Kubernetes from v1.18.0 to v1.18.1
* https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.18.md#v1181
2020-04-08 19:41:48 -07:00
Dalton Hubble
17ea547723 Update etcd from v3.4.5 to v3.4.7
* https://github.com/etcd-io/etcd/releases/tag/v3.4.7
* https://github.com/etcd-io/etcd/releases/tag/v3.4.6
2020-04-06 21:09:25 -07:00
Dalton Hubble
3c1be7b0e0 Fix terraform fmt 2020-03-31 21:42:51 -07:00
Dalton Hubble
bbbaf949f9 Fix UDP outbound and clock sync timeouts on Azure workers
* Add "lb" outbound rule for worker TCP _and_ UDP traffic
* Fix Azure worker nodes clock synchronization being inactive
due to timeouts reaching the CoreOS / Flatcar NTP pool
* Fix Azure worker nodes not providing outbount UDP connectivity

Background:

Azure provides VMs outbound connectivity either by having a public
IP or via an SNAT masquerade feature bundled with their virtual
load balancing abstraction (in contrast with, say, a NAT gateway).

Azure worker nodes have only a private IP, but are associated with
the cluster load balancer's backend pool and ingress frontend IP.
Outbound traffic uses SNAT with this frontend IP. A subtle detail
with Azure SNAT seems to be that since both inbound lb_rule's are
TCP only, outbound UDP traffic isn't SNAT'd (highlights the reasons
Azure shouldn't have conflated inbound load balancing with outbound
SNAT concepts). However, adding a separate outbound rule and
disabling outbound SNAT on our ingress lb_rule's we can tell Azure
to continue load balancing as before, and support outbound SNAT for
worker traffic of both the TCP and UDP protocol.

Fixes clock synchronization timeouts:

```
systemd-timesyncd[786]: Timed out waiting for reply from
45.79.36.123:123 (3.flatcar.pool.ntp.org)
```

Azure controller nodes have their own public IP, so controllers (and
etcd) nodes have not had clock synchronization or outbound UDP issues
2020-03-31 21:00:16 -07:00
Dalton Hubble
135c6182b8 Update flannel from v0.11.0 to v0.12.0
* https://github.com/coreos/flannel/releases/tag/v0.12.0
2020-03-31 18:31:59 -07:00
Dalton Hubble
c53dc66d4a Rename Container Linux snippets variable for consistency
* Rename controller_clc_snippets to controller_snippets (cloud platforms)
* Rename worker_clc_snippets to worker_snippets (cloud platforms)
* Rename clc_snippets to snippets (bare-metal)
2020-03-31 18:25:51 -07:00
Dalton Hubble
9960972726 Fix bootstrap regression when networking="flannel"
* Fix bootstrap error for missing `manifests-networking/crd*yaml`
when `networking = "flannel"`
* Cleanup manifest-networking directory left during bootstrap
* Regressed in v1.18.0 changes for Calico https://github.com/poseidon/typhoon/pull/675
2020-03-31 18:21:59 -07:00
Dalton Hubble
bac5acb3bd Change default kube-system DaemonSet tolerations
* Change kube-proxy, flannel, and calico-node DaemonSet
tolerations to tolerate `node.kubernetes.io/not-ready`
and `node-role.kubernetes.io/master` (i.e. controllers)
explicitly, rather than tolerating all taints
* kube-system DaemonSets will no longer tolerate custom
node taints by default. Instead, custom node taints must
be enumerated to opt-in to scheduling/executing the
kube-system DaemonSets
* Consider setting the daemonset_tolerations variable
of terraform-render-bootstrap at a later date

Background: Tolerating all taints ruled out use-cases
where certain nodes might legitimately need to keep
kube-proxy or CNI networking disabled
Related: https://github.com/poseidon/terraform-render-bootstrap/pull/179
2020-03-31 01:00:45 -07:00
Dalton Hubble
fc686c8fc7 Fix delete-node.service kubectl service exec's
* Fix delete-node service that runs on worker (cloud-only)
shutdown to delete a Kubernetes node. Regressed in #669
(unreleased)
* Use rkt `--exec` to invoke kubectl binary in the kubelet
image
* Use podman `--entrypoint` to invoke the kubectl binary in
the kubelet image
2020-03-28 12:35:23 -07:00
Dalton Hubble
d25f23e675 Update docs from Kubernetes v1.17.4 to v1.18.0 2020-03-25 20:28:30 -07:00
Dalton Hubble
f100a90d28 Update Kubernetes from v1.17.4 to v1.18.0
* https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.18.md
2020-03-25 17:51:50 -07:00
Dalton Hubble
590d941f50 Switch from upstream hyperkube image to individual images
* Kubernetes plans to stop releasing the hyperkube container image
* Upstream will continue to publish `kube-apiserver`, `kube-controller-manager`,
`kube-scheduler`, and `kube-proxy` container images to `k8s.gcr.io`
* Upstream will publish Kubelet only as a binary for distros to package,
either as a DEB/RPM on traditional distros or a container image on
container-optimized operating systems
* Typhoon will package the upstream Kubelet (checksummed) and its
dependencies as a container image for use on CoreOS Container Linux,
Flatcar Linux, and Fedora CoreOS
* Update the Typhoon container image security policy to list
`quay.io/poseidon/kubelet`as an official distributed artifact

Hyperkube: https://github.com/kubernetes/kubernetes/pull/88676
Kubelet Container Image: https://github.com/poseidon/kubelet
Kubelet Quay Repo: https://quay.io/repository/poseidon/kubelet
2020-03-21 15:43:05 -07:00
Dalton Hubble
c3ef21dbf5 Update etcd from v3.4.4 to v3.4.5
* https://github.com/etcd-io/etcd/releases/tag/v3.4.5
2020-03-18 20:50:41 -07:00
Dalton Hubble
bc7902f40a Update Kubernetes from v1.17.3 to v1.17.4
* https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.17.md#v1174
2020-03-13 00:06:41 -07:00
Dalton Hubble
70bf39bb9a Update Calico from v3.12.0 to v3.13.1
* https://docs.projectcalico.org/v3.13/release-notes/
2020-03-12 23:00:38 -07:00
Dalton Hubble
4e1b8f22df Add support for Flatcar Linux on Azure
* Accept `os_image` "flatcar-stable" and "flatcar-beta" to
use Kinvolk's Flatcar Linux images from the Azure Marketplace

Note: Flatcar Linux Azure Marketplace images require terms be
accepted before use
2020-03-12 22:52:48 -07:00
Dalton Hubble
7b0ea23cdc Upgrade terraform-provider-azurerm to v2.0+
* Add support for `terraform-provider-azurerm` v2.0+. Require
`terraform-provider-azurerm` v2.0+ and drop v1.x support since
the Azure provider major release is not backwards compatible
* Use Azure's new Linux VM and Linux VM Scale Set resources
* Change controller's Azure disk caching to None
* Associate subnets (in addition to NICs) with security groups
(aesthetic)
* If set, change `worker_priority` from `Low` to `Spot` (action required)

Related:

* https://www.terraform.io/docs/providers/azurerm/guides/2.0-upgrade-guide.html
2020-03-08 17:40:13 -07:00
Dalton Hubble
51cee6d5a4 Change Container Linux etcd-member to fetch with docker://
* Quay has historically generated ACI signatures for images to
facilitate rkt's notions of verification (it allowed authors to
actually sign images, though `--trust-keys-from-https` is in use
since etcd and most authors don't sign images). OCI standardization
didn't adopt verification ideas and checking signatures has fallen
out of favor.
* Fix an issue where Quay no longer seems to be generating ACI
signatures for new images (e.g. quay.io/coreos/etcd:v.3.4.4)
* Don't be alarmed by rkt `--insecure-options=image`. It refers
to disabling image signature checking (i.e. docker pull doesn't
check signatures either)
* System containers for Kubelet and bootstrap have transitioned
to the docker:// transport, so there is precedent and this brings
all the system containers on Container Linux controllers into
alignment
2020-03-02 19:57:45 -08:00
Dalton Hubble
6de5cf5a55 Update etcd from v3.4.3 to v3.4.4
* https://github.com/etcd-io/etcd/releases/tag/v3.4.4
2020-02-29 16:19:29 -08:00
Dalton Hubble
4a38fb5927 Update CoreDNS from v1.6.6 to v1.6.7
* https://coredns.io/2020/01/28/coredns-1.6.7-release/
2020-02-18 21:46:19 -08:00
Suraj Deshmukh
c4e64a9d1b
Change Kubelet /var/lib/calico mount to read-only (#643)
* Kubelet only requires read access to /var/lib/calico

Signed-off-by: Suraj Deshmukh <surajd.service@gmail.com>
2020-02-18 21:40:58 -08:00
Dalton Hubble
1243f395d1 Update Kubernetes from v1.17.2 to v1.17.3
* https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.17.md#v1173
2020-02-11 20:22:14 -08:00
Dalton Hubble
ca96a1335c Update Calico from v3.11.2 to v3.12.0
* https://docs.projectcalico.org/release-notes/#v3120
* Remove reverse packet filter override, since Calico no
longer relies on the setting
* https://github.com/coreos/fedora-coreos-tracker/issues/219
* https://github.com/projectcalico/felix/pull/2189
2020-02-06 00:43:33 -08:00
Dalton Hubble
1cda5bcd2a Update Kubernetes from v1.17.1 to v1.17.2
* https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.17.md#v1172
2020-01-21 18:27:39 -08:00
Dalton Hubble
7daabd28b5 Update Calico from v3.11.1 to v3.11.2
* https://docs.projectcalico.org/v3.11/release-notes/
2020-01-18 13:45:24 -08:00
Dalton Hubble
b642e3b41b Update Kubernetes from v1.17.0 to v1.17.1
* https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.17.md#v1171
2020-01-14 20:21:36 -08:00
Dalton Hubble
43e05b9131 Enable kube-proxy metrics and allow Prometheus scrapes
* Configure kube-proxy --metrics-bind-address=0.0.0.0 (default
127.0.0.1) to serve metrics on 0.0.0.0:10249
* Add firewall rules to allow Prometheus (resides on a worker) to
scrape kube-proxy service endpoints on controllers or workers
* Add a clusterIP: None service for kube-proxy endpoint discovery
2020-01-06 21:11:18 -08:00
Dalton Hubble
b2eb3e05d0 Disable Kubelet 127.0.0.1.10248 healthz endpoint
* Kubelet runs a healthz server listening on 127.0.0.1:10248
by default. Its unused by Typhoon and can be disabled
* https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/
2019-12-29 11:23:25 -08:00
Dalton Hubble
f1f4cd6fc0 Inline Container Linux kubelet.service, deprecate kubelet-wrapper
* Change kubelet.service on Container Linux nodes to ExecStart Kubelet
inline to replace the use of the host OS kubelet-wrapper script
* Express rkt run flags and volume mounts in a clear, uniform way to
make the Kubelet service easier to audit, manage, and understand
* Eliminate reliance on a Container Linux kubelet-wrapper script
* Typhoon for Fedora CoreOS developed a kubelet.service that similarly
uses an inline ExecStart (except with podman instead of rkt) and a
more minimal set of volume mounts. Adopt the volume improvements:
  * Change Kubelet /etc/kubernetes volume to read-only
  * Change Kubelet /etc/resolv.conf volume to read-only
  * Remove unneeded /var/lib/cni volume mount

Background:

* kubelet-wrapper was added in CoreOS around the time of Kubernetes v1.0
to simplify running a CoreOS-built hyperkube ACI image via rkt-fly. The
script defaults are no longer ideal (e.g. rkt's notion of trust dates
back to quay.io ACI image serving and signing, which informed the OCI
standard images we use today, though they still lack rkt's signing ideas).
* Shipping kubelet-wrapper was regretted at CoreOS, but remains in the
distro for compatibility. The script is not updated to track hyperkube
changes, but it is stable and kubelet.env overrides bridge most gaps
* Typhoon Container Linux nodes have used kubelet-wrapper to rkt/rkt-fly
run the Kubelet via the official k8s.gcr.io hyperkube image using overrides
(new image registry, new image format, restart handling, new mounts, new
entrypoint in v1.17).
* Observation: Most of what it takes to run a Kubelet container is defined
in Typhoon, not in kubelet-wrapper. The wrapper's value is now undermined
by having to workaround its dated defaults. Typhoon may be better served
defining Kubelet.service explicitly
* Typhoon for Fedora CoreOS developed a kubelet.service without the use
of a host OS kubelet-wrapper which is both clearer and eliminated some
volume mounts
2019-12-29 11:17:26 -08:00
Dalton Hubble
50db3d0231 Rename CLC files and favor Terraform list index syntax
* Rename Container Linux Config (CLC) files to *.yaml to align
with Fedora CoreOS Config (FCC) files and for syntax highlighting
* Replace common uses of Terraform `element` (which wraps around)
with `list[index]` syntax to surface index errors
2019-12-28 12:14:01 -08:00
Dalton Hubble
11565ffa8a Update Calico from v3.10.2 to v3.11.1
* https://docs.projectcalico.org/v3.11/release-notes/
2019-12-28 11:08:03 -08:00
Dalton Hubble
daa8d9d9ec Update CoreDNS from v1.6.5 to v1.6.6
* https://coredns.io/2019/12/11/coredns-1.6.6-release/
2019-12-22 10:47:19 -05:00