* Change node-exporter DaemonSet tolerations from tolerating
all possible NoSchedule taints to tolerating the master taint
and the not ready taint (we'd like metrics regardless)
* Users who add custom node taints must add their custom taints
to the addon node-exporter DaemonSet. As an addon, its expected
users copy and manipulate manifests out-of-band in their own
systems
* Configure kube-proxy --metrics-bind-address=0.0.0.0 (default
127.0.0.1) to serve metrics on 0.0.0.0:10249
* Add firewall rules to allow Prometheus (resides on a worker) to
scrape kube-proxy service endpoints on controllers or workers
* Add a clusterIP: None service for kube-proxy endpoint discovery
* Change kubelet.service on Container Linux nodes to ExecStart Kubelet
inline to replace the use of the host OS kubelet-wrapper script
* Express rkt run flags and volume mounts in a clear, uniform way to
make the Kubelet service easier to audit, manage, and understand
* Eliminate reliance on a Container Linux kubelet-wrapper script
* Typhoon for Fedora CoreOS developed a kubelet.service that similarly
uses an inline ExecStart (except with podman instead of rkt) and a
more minimal set of volume mounts. Adopt the volume improvements:
* Change Kubelet /etc/kubernetes volume to read-only
* Change Kubelet /etc/resolv.conf volume to read-only
* Remove unneeded /var/lib/cni volume mount
Background:
* kubelet-wrapper was added in CoreOS around the time of Kubernetes v1.0
to simplify running a CoreOS-built hyperkube ACI image via rkt-fly. The
script defaults are no longer ideal (e.g. rkt's notion of trust dates
back to quay.io ACI image serving and signing, which informed the OCI
standard images we use today, though they still lack rkt's signing ideas).
* Shipping kubelet-wrapper was regretted at CoreOS, but remains in the
distro for compatibility. The script is not updated to track hyperkube
changes, but it is stable and kubelet.env overrides bridge most gaps
* Typhoon Container Linux nodes have used kubelet-wrapper to rkt/rkt-fly
run the Kubelet via the official k8s.gcr.io hyperkube image using overrides
(new image registry, new image format, restart handling, new mounts, new
entrypoint in v1.17).
* Observation: Most of what it takes to run a Kubelet container is defined
in Typhoon, not in kubelet-wrapper. The wrapper's value is now undermined
by having to workaround its dated defaults. Typhoon may be better served
defining Kubelet.service explicitly
* Typhoon for Fedora CoreOS developed a kubelet.service without the use
of a host OS kubelet-wrapper which is both clearer and eliminated some
volume mounts
* Update recommended Terraform and provider plugin versions
* Update the rough count of resources created per cluster
since its not been refreshed in a while (will vary based
on cluster options)
* Stop mapping node labels to targets discovered via Kubernetes
nodes (e.g. etcd, kubelet, cadvisor). It is rarely useful to
store node labels (e.g. kubernetes.io/os=linux) on these metrics
* kube-apiserver's apiserver_request_duration_seconds_bucket metric
has a high cardinality that includes labels for the API group, verb,
scope, resource, and component for each object type, including for
each CRD. This one metric has ~10k time series in a typical cluster
(btw 10-40% of total)
* Removing the apiserver request duration outright would make latency
alerts a NoOp and break a Grafana apiserver panel. Instead, drop series
that have a "group" label. Effectively, only request durations for
core Kubernetes APIs will be kept (e.g. cardinality won't grow with
each CRD added). This reduces the metric to ~2k unique series
* Original tutorials favored including the platform (e.g.
google-cloud) in modules (e.g. google-cloud-yavin). Prefer
naming conventions where each module / cluster has a simple
name (e.g. yavin) since the platform is usually redundant
* Retain the example cluster naming themes per platform
* kubernetes-mixin pod resource dashboards were split into
two ConfigMap parts because they provide richer networking
details
* New dashboards have been used by the author at the global
level, but were missing in the per-cluster Grafana tracked
here
* Add Prometheus alerts from node-exporter
* Add Grafana dashboard nodes.json, from node-exporter
* Not adding recording rules, since those are only used
by some node-exporter USE dashboards not being included
* terraform-provider-google v2.19.0 deprecates `instance_template`
within `google_compute_region_instance_group_manager` in order to
support a scheme with multiple version blocks. Adapt our single
version to the new format to resolve deprecation warnings.
* Fixes: Warning: "instance_template": [DEPRECATED] This field
will be replaced by `version.instance_template` in 3.0.0
* Require terraform-provider-google v2.19.0+ (action required)
* Set small CPU requests on static pods kube-apiserver,
kube-controller-manager, and kube-scheduler to align with
upstream tooling and for edge cases
* Effectively, a practical case for these requests hasn't been
observed. However, a small static pod CPU request may offer
a slight benefit if a controller became overloaded and the
below mechanisms were insufficient
Existing safeguards:
* Control plane nodes are tainted to isolate them from
ordinary workloads. Even dense workloads can only compress
CPU resources on worker nodes.
* Control plane static pods use the highest priority class, so
contention favors control plane pods (over say node-exporter)
and CPU is compressible too.
See: https://github.com/poseidon/terraform-render-bootstrap/pull/161
* Update terraform-render-bootstrap module to adopt the
Terrform v0.12 templatefile function feature to replace
the use of terraform-provider-template's `template_dir`
* Require Terraform v0.12.6+ which adds `for_each`
Background:
* `template_dir` was added to `terraform-provider-template`
to add support for template directory rendering in CoreOS
Tectonic Kubernetes distribution (~2017)
* Terraform v0.12 introduced a native `templatefile` function
and v0.12.6 introduced native `for_each` support (July 2019)
that makes it possible to replace `template_dir` usage
* addon-resizer is outdated and has been dropped from
kube-state-metrics examples. Those using it should look
to the cluster-proportional-vertical-autoscaler.
* Eliminate addon-resizer log spew
* Remove associated Role and RoleBinding
* Also fix kube-state-metrics readinessProbe port
* Update Prometheus rules/alerts and Grafana dashboards
* Remove dashboards that were moved to node-exporter, they
may be added back later if valuable
* Remove kube-prometheus based rules/alerts (ClockSkew alert)
* Change `networking` default from flannel to calico on
Azure and DigitalOcean
* AWS, bare-metal, and Google Cloud continue to default
to Calico (as they have since v1.7.5)
* Typhoon now defaults to using Calico and supporting
NetworkPolicy on all platforms
* Define bare-metal `controllers` and `workers` as a complex type
list(object{name=string, mac=string, domain=string}) to allow
clusters with many machines to be defined more cleanly
* Remove `controller_names` list variable
* Remove `controller_macs` list variable
* Remove `controller_domains` list variable
* Remove `worker_names` list variable
* Remove `worker_macs` list variable
* Remove `worker_domains` list variable
* Detect the most recent Fedora CoreOS AMI to allow usage
of Fedora CoreOS in supported regions (previously just
us-east-1)
* Unpin the Fedora CoreOS AMI image which was pinned to
images that had been checked. This does mean if Fedora
publishes a broken image, it will be selected
* Filter out "dev" images which have similar naming
* Review variables available in bare-metal kubernetes modules
for Container Linux and Fedora CoreOS
* Deprecate cluster_domain_suffix variable
* Remove deprecated container_linux_oem variable
* Prometheus was configured to use kubernetes discovery
of etcd targets based on nodes matching the node label
node-role.kubernetes.io/controller=true
* Kubernetes v1.16 stopped permitting node role labels
node-role.kubernetes.io/* so Typhoon renamed these labels
(no longer any association with roles) to
node.kubermetes.io/controller=true
* As a result, Prometheus didn't discover etcd targets,
etcd metrics were missing, etcd alerts were ineffective,
and the etcd Grafana dashboard was empty
* Introduced: https://github.com/poseidon/typhoon/pull/543
* Review variables available in DigitalOcean kubernetes
module and sync with documentation
* Promote Calico for DigitalOcean and Azure beyond experimental
(its the primary mode I've used since it was introduced)
* Review variables available in Azure kubernetes and workers
modules and sync with documentation
* Fix internal workers module default type to Standard_DS1_v2
* Review variables available in AWS kubernetes and workers
modules and documentation
* Switching between spot and on-demand has worked since
Terraform v0.12
* Generally, there are too many knobs. Less useful ones
should be de-emphasized or removed
* Remove `cluster_domain_suffix` documentation
* Document worker pools `node_labels` variable to set the
initial node labels for a homogeneous set of workers
* Document `worker_node_labels` convenience variable to
set the initial node labels for default worker nodes
* Drop `node-role.kubernetes.io/master` and
`node-role.kubernetes.io/node` node labels
* Kubelet (v1.16) now rejects the node labels used
in the kubectl get nodes ROLES output
* https://github.com/kubernetes/kubernetes/issues/75457
* Change default controller_type to Standard_B2s. A B2s is cheaper
by $17/month and provides 2 vCPU, 4GB RAM (vs 1 vCPU, 3.5GB RAM)
* Change default worker_type to Standard_DS1_v2. F1 was the previous
generation. The DS1_v2 is newer, similar cost, more memory, and still
supports Low Priority mode, if desired
* Use unique dashboard ID so that multiple replicas of Grafana
serve dashboards with uniform paths
* Fix issue where refreshing a dashboard served by one replica
could show a 404 unless the request went to the same replica
* terraform-provider-aws v2.23.0 allows AWS root block devices
to enable encryption by default.
* Require updating terraform-provider-aws to v2.23.0 or higher
* Enable root EBS device encryption by default for controller
instances and worker instances in auto-scaling groups
For comparison:
* Google Cloud persistent disks have been encrypted by
default for years
* Azure managed disk encryption is not ready yet (#486)
* Add a CoreDNS dashboard originally based on an upstream dashboard,
but now customized according to preferences
* Add an Nginx Ingress Controller based on an upstream dashboard,
but customized according to preferences
* Add all Azure controllers to the apiserver load balancer
backend address pool
* Previously, kube-apiserver availability relied on the 0th
controller being up. Multi-controller was just providing etcd
data redundancy
* Allow updating terraform-provider-ct to any release
beyond v0.3.2, but below v1.0. This relaxes the prior
constraint that allowed only v0.3.y provider versions
* Run kube-apiserver as a non-root user (nobody). User
no longer needs to bind low number ports.
* On most platforms, the kube-apiserver load balancer listens
on 6443 and fronts controllers with kube-apiserver pods using
port 6443. Google Cloud TCP proxy load balancers cannot listen
on 6443. However, GCP's load balancer can be made to listen on
443, while kube-apiserver uses 6443 across all platforms.
* Fix a GCP errata item https://github.com/poseidon/typhoon/wiki/Errata
* Removal of a Google Cloud cluster often required 2 runs of
`terraform apply` because network resource deletes timeout
after 4m. Raise the network deletion timeout to 6m to
ensure apply only needs to be run once to remove a cluster
* google_compute_backend_services use nested blocks to define
backends (instance groups heterogeneous controllers)
* Use Terraform v0.12.x dynamic blocks so the apiserver backend
service can refer to (up to zone-many) controller instance groups
* Previously, with Terraform v0.11.x, the apiserver backend service
had to list a fixed set of backends to span controller nodes across
zones in multi-controller setups. 3 backends were used because each
GCP region offered at least 3 zones. Single-controller clusters had
the cosmetic ugliness of unused instance groups
* Allow controllers to span more than 3 zones if avilable in a
region (e.g. currently only us-central1, with 4 zones)
Related:
* https://www.terraform.io/docs/providers/google/r/compute_backend_service.html
* https://www.terraform.io/docs/configuration/expressions.html#dynamic-blocks
* Delay changing README example. Its prominent display
on github.com may lead to new users copying it, even
though it corresponds to an "in between releases" state
and v1.14.4 doesn't exist yet
* Leave docs tutorials the same, they can reflect master
* Provide Terraform v0.11 to v0.12 migration guide. Show an
in-place strategy and a move resources strategy
* Describe in-place modifying an existing cluster and providers,
using the Terraform helper to edit syntax, and checking the
plan produces a zero diff
* Describe replacing existing clusters by creating a new config
directory for use with Terraform v0.12 only and moving resources
one by one
* Provide some limited advise on migrating non-Typhoon resources
* Replace v0.11 bracket type hints with Terraform v0.12 list expressions
* Use expression syntax instead of interpolated strings, where suggested
* Update bare-metal tutorial
* Define `clc_snippets` type constraint map(list(string))
* Define Terraform and plugin version requirements in versions.tf
* Require matchbox ~> 0.3.0 to support Terraform v0.12
* Require ct ~> 0.3.2 to support Terraform v0.12
* Azure only allows `eviction_policy` to be set for Low priority VMs.
Supporting Low priority VMs meant when Regular VMs were used, each
`terraform apply` rolled workers, to set eviction_policy to null.
* Terraform v0.12 nullable variables fix the issue and plan does not
produce a diff
* Replace v0.11 bracket type hints with Terraform v0.12 list expressions
* Use expression syntax instead of interpolated strings, where suggested
* Update Azure tutorial and worker pools documentation
* Define Terraform and plugin version requirements in versions.tf
* Require azurerm ~> 1.27 to support Terraform v0.12
* Require ct ~> 0.3.2 to support Terraform v0.12
* Replace v0.11 bracket type hints with Terraform v0.12 list expressions
* Use expression syntax instead of interpolated strings, where suggested
* Update Google Cloud tutorial and worker pools documentation
* Define Terraform and plugin version requirements in versions.tf
* Require google ~> 2.5 to support Terraform v0.12
* Require ct ~> 0.3.2 to support Terraform v0.12
* Replace v0.11 bracket type hints with Terraform v0.12 list expressions
* Use expression syntax instead of interpolated strings, where suggested
* Update AWS tutorial and worker pools documentation
* Define Terraform and plugin version requirements in versions.tf
* Require aws ~> 2.7 to support Terraform v0.12
* Require ct ~> 0.3.2 to support Terraform v0.12
* Replace v0.11 bracket type hints with Terraform v0.12 list expressions
* Use expression syntax instead of interpolated strings, where suggested
* Update DigitalOcean tutorial documentation
* Define Terraform and plugin version requirements in versions.tf
* Require digitalocean ~> v1.3 to support Terraform v0.12
* Require ct ~> v0.3.2 to support Terraform v0.12
* This change affects users who use worker pools on AWS, GCP, or
Azure with a Container Linux derivative
* Rename worker pool modules' `count` variable to `worker_count`,
because `count` will be a reserved variable name in Terraform v0.12
* Fix to remove a trailing slash that was erroneously introduced
in the scripting that updated from v1.14.1 to v1.14.2
* Workaround before this fix was to re-run `terraform init`
* Introduce "calico" as a `networking` option on Azure and DigitalOcean
using Calico's new VXLAN support (similar to flannel). Flannel remains
the default on these platforms for now.
* Historically, DigitalOcean and Azure only allowed Flannel as the
CNI provider, since those platforms don't support IPIP traffic that
was previously required for Calico.
* Looking forward, its desireable for Calico to become the default
across Typhoon clusters, since it provides NetworkPolicy and a
consistent experience
* No changes to AWS, GCP, or bare-metal where Calico remains the
default CNI provider. On these platforms, IPIP mode will always
be used, since its available and more performant than vxlan
* DigitalOcean clusters must secure copy a kubeconfig to
worker nodes, but Terraform could decide to try copying
before firewall rules have been added to allow SSH access.
* Add an explicit dependency on adding firewall rules first
* Change flannel port from the kernel default 8472 to the
IANA assigned VXLAN port 4789
* Update firewall rules or security groups for VXLAN
* Why now? Calico now offers its own VXLAN backend so
standardizing on the IANA port will simplify config
* https://github.com/coreos/flannel/blob/master/Documentation/backends.md#vxlan
* This reverts commit 6e5d66cf66
* kube-state-metrics v1.6.0-rc.0 fires KubeDeploymentReplicasMismatch
alerts where its own Deployment doesn't have replicas available,
(kube_deployment_status_replicas_available) even though all replicas
are available according to kubectl inspection
* This problem was present even with the CSR ClusterRole fix
(https://github.com/kubernetes/kube-state-metrics/pull/717)