Adjust Google Cloud worker health checks to use kube-proxy healthz
* Change the worker managed instance group to health check nodes via an HTTP probe of the kube-proxy port 10256 `/healthz` endpoint (condensed sketch below; full diff follows)
* Advantages: kube-proxy is a lower-value target than the Kubelet (in case there are bugs in firewall rules); it's more representative than health checking the Kubelet alone (the Kubelet must run AND the kube-proxy DaemonSet must be healthy); and the same endpoint is already used by kube-proxy's liveness probes (better discoverability via kubectl or alerts on crashlooping pods)
* Another motivation is that GKE clusters also use kube-proxy port 10256 checks to assess node health
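Condensed from the Terraform diff below, the worker health check becomes a plain HTTP probe of kube-proxy's healthz endpoint:

```hcl
# Health check for worker nodes, used by managed instance group autohealing.
resource "google_compute_health_check" "worker" {
  name        = "${var.name}-worker-health"
  description = "Health check for worker node"

  timeout_sec        = 20
  check_interval_sec = 30 # probe kube-proxy every 30s

  healthy_threshold   = 1
  unhealthy_threshold = 6 # ~3 minutes of failed probes before replacement

  http_health_check {
    port         = "10256" # kube-proxy
    request_path = "/healthz"
  }
}
```

Autohealing then replaces any instance that fails this check, after a 300s initial delay so new nodes have time to bootstrap.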
parent 760b4cd5ee
commit e87d5aabc3
CHANGES.md (16 changed lines)
```diff
@@ -21,7 +21,7 @@ Notable changes between versions.
 
 ### Flatcar Linux
 
-* Migrate Flatcar Linux from Ignition spec v2.3.0 to v3.3.0 (**action required**)
+* Migrate Flatcar Linux from Ignition spec v2.3.0 to v3.3.0 ([#1196](https://github.com/poseidon/typhoon/pull/1196)) (**action required**)
   * Flatcar Linux 3185.0.0+ [supports](https://flatcar-linux.org/docs/latest/provisioning/ignition/specification/#ignition-v3) Ignition v3.x specs (which are rendered from Butane Configs, like Fedora CoreOS)
   * `poseidon/ct` v0.11.0 [supports](https://github.com/poseidon/terraform-provider-ct/pull/131) the `flatcar` Butane Config variant
   * Require poseidon v0.11+ and Flatcar Linux 3185.0.0+
@@ -35,8 +35,8 @@ version: 1.0.0
 
 ### AWS
 
-* [Refresh](https://docs.aws.amazon.com/autoscaling/ec2/userguide/asg-instance-refresh.html) instances in autoscaling group when launch configuration changes
-  * Changes to worker launch configurations start an autoscaling group instance refresh to replace instances
+* [Refresh](https://docs.aws.amazon.com/autoscaling/ec2/userguide/asg-instance-refresh.html) instances in autoscaling group when launch configuration changes ([#1208](https://github.com/poseidon/typhoon/pull/1208)) (**important**)
+  * Worker launch configuration changes start an autoscaling group instance refresh to replace instances
   * Instance refresh creates surge instances, waits for a warm-up period, then deletes old instances
   * Changing `worker_type`, `disk_*`, `worker_price`, `worker_target_groups`, or Butane `worker_snippets` on existing worker nodes will replace instances
   * New AMIs or changing `os_stream` will be ignored, to allow Fedora CoreOS or Flatcar Linux to keep themselves updated
@@ -47,16 +47,16 @@ version: 1.0.0
 ### Google
 
 * [Roll](https://cloud.google.com/compute/docs/instance-groups/rolling-out-updates-to-managed-instance-groups) instance template changes to worker managed instance groups ([#1207](https://github.com/poseidon/typhoon/pull/1207)) (**important**)
-  * Changes to worker instance templates roll out by gradually replacing instances
+  * Worker instance template changes roll out by gradually replacing instances
   * Automatic rollouts create surge instances, wait for Kubelet health checks, then delete old instances (0 unavailable instances)
-  * Changing `worker_type`, `disk_size`, `preemptible`, or Butane `worker_snippets` on existing worker nodes will replace instances
+  * Changing `worker_type`, `disk_size`, `worker_preemptible`, or Butane `worker_snippets` on existing worker nodes will replace instances
   * New AMIs or changing `os_stream` will be ignored, to allow Fedora CoreOS or Flatcar Linux to keep themselves updated
   * Previously, new instance templates were made in the same way, but not applied to instances unless manually replaced
 * Add health checks to worker managed instance groups (i.e. "autohealing") ([#1207](https://github.com/poseidon/typhoon/pull/1207))
-  * Use SSL health checks to probe the Kubelet every 30s
+  * Use health checks to probe kube-proxy every 30s
   * Replace worker nodes that fail the health check 6 times (3min)
-* Name `kube-apiserver` and `kubelet` health checks consistently ([#1207](https://github.com/poseidon/typhoon/pull/1207))
-  * Use name `${cluster_name}-apiserver-health` and `${cluster_name}-kubelet-health`
+* Name `kube-apiserver` and `worker` health checks consistently ([#1207](https://github.com/poseidon/typhoon/pull/1207))
+  * Use name `${cluster_name}-apiserver-health` and `${cluster_name}-worker-health`
 * Rename managed instance group from `${cluster_name}-worker-group` to `${cluster_name}-worker` ([#1207](https://github.com/poseidon/typhoon/pull/1207))
 * Fix bug provisioning clusters with multiple controller nodes ([#1195](https://github.com/poseidon/typhoon/pull/1195))
 
```
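For comparison, the AWS instance refresh referenced above ([#1208](https://github.com/poseidon/typhoon/pull/1208)) is the autoscaling group's built-in rolling replacement. A minimal sketch; the warm-up and healthy-percentage values are illustrative assumptions, not necessarily Typhoon's exact settings:

```hcl
resource "aws_autoscaling_group" "workers" {
  # ... name, capacity, launch configuration, and target groups omitted

  # Replace instances gradually whenever the launch configuration changes.
  instance_refresh {
    strategy = "Rolling"
    preferences {
      instance_warmup        = 120 # assumed warm-up period (seconds)
      min_healthy_percentage = 90  # assumed floor on in-service capacity
    }
  }
}
```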
```diff
@@ -196,13 +196,13 @@ resource "google_compute_firewall" "allow-ingress" {
   target_tags = ["${var.cluster_name}-worker"]
 }
 
-resource "google_compute_firewall" "google-kubelet-health-checks" {
-  name    = "${var.cluster_name}-kubelet-health"
+resource "google_compute_firewall" "google-worker-health-checks" {
+  name    = "${var.cluster_name}-worker-health"
   network = google_compute_network.network.name
 
   allow {
     protocol = "tcp"
-    ports    = [10250]
+    ports    = [10256]
   }
 
   # https://cloud.google.com/compute/docs/instance-groups/autohealing-instances-in-migs
```

```diff
@@ -36,14 +36,14 @@ resource "google_compute_region_instance_group_manager" "workers" {
 
   auto_healing_policies {
     health_check      = google_compute_health_check.worker.id
-    initial_delay_sec = 120
+    initial_delay_sec = 300
   }
 }
 
-# Health check for worker Kubelet
+# Health check for worker node
 resource "google_compute_health_check" "worker" {
-  name        = "${var.name}-kubelet-health"
-  description = "Health check for worker Kubelet"
+  name        = "${var.name}-worker-health"
+  description = "Health check for worker node"
 
   timeout_sec        = 20
   check_interval_sec = 30
@@ -51,8 +51,9 @@ resource "google_compute_health_check" "worker" {
   healthy_threshold   = 1
   unhealthy_threshold = 6
 
-  ssl_health_check {
-    port = "10250"
+  http_health_check {
+    port         = "10256"
+    request_path = "/healthz"
   }
 }
```
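The gradual template rollout described in the changelog ("surge instances ... 0 unavailable instances") corresponds to the managed instance group's `update_policy`, which sits alongside the `auto_healing_policies` block above but falls outside this diff's context. A sketch with assumed values consistent with that description:

```hcl
resource "google_compute_region_instance_group_manager" "workers" {
  # ... name, region, target size, and instance template version omitted

  # Proactively roll new templates: surge, wait for health, delete old.
  update_policy {
    type                  = "PROACTIVE"
    minimal_action        = "REPLACE"
    max_surge_fixed       = 3 # assumed; regional MIGs need at least one per zone
    max_unavailable_fixed = 0 # matches "0 unavailable instances"
  }
}
```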
The commit applies the same changes to the second Google Cloud module (Typhoon keeps the Fedora CoreOS and Flatcar Linux variants of these files in lockstep):

```diff
@@ -196,13 +196,13 @@ resource "google_compute_firewall" "allow-ingress" {
   target_tags = ["${var.cluster_name}-worker"]
 }
 
-resource "google_compute_firewall" "google-kubelet-health-checks" {
-  name    = "${var.cluster_name}-kubelet-health"
+resource "google_compute_firewall" "google-worker-health-checks" {
+  name    = "${var.cluster_name}-worker-health"
   network = google_compute_network.network.name
 
   allow {
     protocol = "tcp"
-    ports    = [10250]
+    ports    = [10256]
   }
 
   # https://cloud.google.com/compute/docs/instance-groups/autohealing-instances-in-migs
```

```diff
@@ -36,14 +36,14 @@ resource "google_compute_region_instance_group_manager" "workers" {
 
   auto_healing_policies {
     health_check      = google_compute_health_check.worker.id
-    initial_delay_sec = 120
+    initial_delay_sec = 300
   }
 }
 
-# Health check for worker Kubelet
+# Health check for worker node
 resource "google_compute_health_check" "worker" {
-  name        = "${var.name}-kubelet-health"
-  description = "Health check for worker Kubelet"
+  name        = "${var.name}-worker-health"
+  description = "Health check for worker node"
 
   timeout_sec        = 20
   check_interval_sec = 30
@@ -51,8 +51,9 @@ resource "google_compute_health_check" "worker" {
   healthy_threshold   = 1
   unhealthy_threshold = 6
 
-  ssl_health_check {
-    port = "10250"
+  http_health_check {
+    port         = "10256"
+    request_path = "/healthz"
   }
 }
```
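For autohealing probes to reach port 10256, the firewall rule must also admit Google's health-check probe ranges. The `source_ranges` argument falls outside the diff context above; the ranges below come from the GCP autohealing documentation linked in the diff:

```hcl
resource "google_compute_firewall" "google-worker-health-checks" {
  name    = "${var.cluster_name}-worker-health"
  network = google_compute_network.network.name

  allow {
    protocol = "tcp"
    ports    = [10256] # kube-proxy healthz
  }

  # Google Cloud health check probe sources
  # https://cloud.google.com/compute/docs/instance-groups/autohealing-instances-in-migs
  source_ranges = [
    "130.211.0.0/22",
    "35.191.0.0/16",
  ]

  target_tags = ["${var.cluster_name}-worker"]
}
```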