typhoon

mirror of https://github.com/puppetmaster/typhoon.git synced 2025-10-04 00:54:38 +02:00

Author	SHA1	Message	Date
Dalton Hubble	28d0891729	Annotate nginx-ingress addon for Prometheus auto-discovery * Add Google Cloud firewall rule to allow worker to worker access to health and metrics	2018-05-19 13:13:14 -07:00
William Zhang	2ae126bf68	Fix README link to tutorial	2018-05-19 13:10:22 -07:00
Dalton Hubble	c2b719dc75	Configure Prometheus to scrape Kubelets directly * Use Kubelet bearer token authn/authz to scrape metrics * Drop RBAC permission from nodes/proxy to nodes/metrics * Stop proxying kubelet scrapes through the apiserver, since this required higher privilege (nodes/proxy) and can add load to the apiserver on large clusters	2018-05-14 23:06:50 -07:00
Dalton Hubble	37981f9fb1	Allow bearer token authn/authz to the Kubelet * Require Webhook authorization to the Kubelet * Switch apiserver X509 client cert org to systems:masters to grant the apiserver admin and satisfy the authorization requirement. kubectl commands like logs or exec that have the apiserver make requests of a kubelet continue to work as before * https://kubernetes.io/docs/admin/kubelet-authentication-authorization/ * https://github.com/poseidon/typhoon/issues/215	2018-05-13 23:20:42 -07:00
Dalton Hubble	f2ee75ac98	Require Terraform v0.11.x, drop v0.10.x support * Raise minimum Terraform version to v0.11.0 * Terraform v0.11.x has been supported since Typhoon v1.9.2 and Terraform v0.10.x was last released in Nov 2017. I'd like to stop worrying about v0.10.x and remove migration docs as a later followup * Migration docs docs/topics/maintenance.md#terraform-v011x	2018-05-10 02:20:46 -07:00
Dalton Hubble	8b8e364915	Update etcd from v3.3.4 to v3.3.5 * https://github.com/coreos/etcd/releases/tag/v3.3.5	2018-05-10 02:12:53 -07:00
Dalton Hubble	9d4cbb38f6	Rerun terraform fmt	2018-05-01 21:41:22 -07:00
Dalton Hubble	e889430926	Update kube-dns from v1.14.9 to v1.14.10 * https://github.com/kubernetes/kubernetes/pull/62676	2018-04-28 00:43:09 -07:00
Dalton Hubble	32ddfa94e1	Update Kubernetes from v1.10.1 to v1.10.2 * https://github.com/kubernetes/kubernetes/releases/tag/v1.10.2	2018-04-28 00:27:00 -07:00
Dalton Hubble	681450aa0d	Update etcd from v3.3.3 to v3.3.4 * https://github.com/coreos/etcd/releases/tag/v3.3.4	2018-04-27 23:57:26 -07:00
Dalton Hubble	567e18f015	Fix conflict between Calico and NetworkManager * Observed frequent kube-scheduler and controller-manager restarts with Calico as the CNI provider. Root cause was unclear since control plane was functional and tests of pod to pod network connectivity passed * Root cause: Calico sets up cali* and tunl* network interfaces for containers on hosts. NetworkManager tries to manage these interfaces. It periodically disconnected veth pairs. Logs did not surface this issue since its not an error per-se, just Calico and NetworkManager dueling for control. Kubernetes correctly restarted pods failing health checks and ensured 2 replicas were running so the control plane functioned mostly normally. Pod to pod connecitivity was only affected occassionally. Pain to debug. * Solution: Configure NetworkManager to ignore the Calico ifaces per Calico's recommendation. Cloud-init writes files after NetworkManager starts, so a restart is required on first boot. On subsequent boots, the file is present so no restart is needed	2018-04-25 21:45:58 -07:00
Dalton Hubble	0a7fab56e2	Load ip_vs kernel module on boot as workaround * (containerized) kube-proxy warns that it is unable to load the ip_vs kernel module despite having the correct mounts. Atomic uses an xz compressed module and modprobe in the container was not compiled with compression support * Workaround issue for now by always loading ip_vs on-host * https://github.com/kubernetes/kubernetes/issues/60	2018-04-25 21:45:58 -07:00
Dalton Hubble	d784b0fca6	Switch to quay.io/poseidon tagged system containers	2018-04-25 18:15:18 -07:00
Dalton Hubble	7198b9016c	Update Calico from v3.0.4 to v3.1.1 for Atomic	2018-04-21 18:46:56 -07:00
Dalton Hubble	9b88d4bbfd	Use bootkube system container on fedora-atomic * Use the upstream bootkube image packaged with the required metadata to be usable as a system container under systemd * Run bootkube with runc so no host level components use Docker any more. Docker is still the runtime * Remove bootkube script and old systemd unit	2018-04-21 18:46:56 -07:00
Dalton Hubble	3dde4ba8ba	Mount host's /etc/os-release in kubelet system containers * Fix `kubectl describe node` to reflect the host's operating system	2018-04-21 18:46:56 -07:00
Dalton Hubble	e148552220	Enable kubelet allocatable enforcement and QoS cgroup hierarchy * Change kubelet system image to use --cgroups-per-qos=true (default) instead of false * Change kubelet system image to use --enforce-node-allocatable=pods instead of an empty string	2018-04-21 18:46:56 -07:00
Dalton Hubble	d8d1468f03	Update kubelet system container image to mount /etc/hosts * Fix kubelet port-forward on Google Cloud / Fedora Atomic * Mount the host's /etc/hosts in kubelet system containers * Problem: kubelet runc system containers on Atomic were not mounting the host's /etc/hosts, like rkt-fly does on Container Linux. `kubectl port-forward` calls socat with localhost. DNS servers on AWS, DO, and in many bare-metal environments resolve localhost to the caller as a convenience. Google Cloud notably does not nor is it required to do so and this surfaced the missing /etc/hosts in runc kubelet namespaces.	2018-04-21 18:46:56 -07:00
Dalton Hubble	2b74aba564	Add Google Cloud fedora-atomic module * Network load balancer for ingress doesn't work yet because Compute Engine packages are missing * port-forward / socat is broken	2018-04-21 18:46:56 -07:00
Dalton Hubble	a54f76db2a	Update Calico from v3.0.4 to v3.1.1 * https://github.com/projectcalico/calico/releases/tag/v3.1.1 * https://github.com/projectcalico/calico/releases/tag/v3.1.0	2018-04-21 18:30:36 -07:00
Dalton Hubble	ad2e4311d1	Switch GCP network lb to global TCP proxy lb * Allow multi-controller clusters on Google Cloud * GCP regional network load balancers have a long open bug in which requests originating from a backend instance are routed to the instance itself, regardless of whether the health check passes or not. As a result, only the 0th controller node registers. We've recommended just using single master GCP clusters for a while * https://issuetracker.google.com/issues/67366622 * Workaround issue by switching to a GCP TCP Proxy load balancer. TCP proxy lb routes traffic to a backend service (global) of instance group backends. In our case, spread controllers across 3 zones (all regions have 3+ zones) and organize them in 3 zonal unmanaged instance groups that serve as backends. Allows multi-controller cluster creation * GCP network load balancers only allowed legacy HTTP health checks so kubelet 10255 was checked as an approximation of controller health. Replace with TCP apiserver health checks to detect unhealth or unresponsive apiservers. * Drawbacks: GCP provision time increases, tailed logs now timeout (similar tradeoff in AWS), controllers only span 3 zones instead of the exact number in the region * Workaround in Typhoon has been known and posted for 5 months, but there still appears to be no better alternative. Its probably time to support multi-master and accept the downsides	2018-04-18 00:09:06 -07:00
Dalton Hubble	77c0a4cf2e	Update Kubernetes from v1.10.0 to v1.10.1 * Use kubernetes-incubator/bootkube v0.12.0	2018-04-12 20:57:31 -07:00
Dalton Hubble	5035d56db2	Refactor GCP to remove controller internal module * Remove the controller internal module to align with other platforms and since its not a supported use case	2018-04-12 19:41:51 -07:00
Dalton Hubble	9bb3de5327	Skip creating unused dirs on worker nodes	2018-04-11 22:23:51 -07:00
Dalton Hubble	c8eabc2af4	Fix GCP controller_type and worker_type vars	2018-04-11 22:19:58 -07:00
Dalton Hubble	6b08bde479	Use k8s.gcr.io instead of gcr.io/google_containers * Kubernetes recommends using the alias to fetch images from the nearest GCR regional mirror, to abstract the use of GCR, and to drop names containing 'google' * https://groups.google.com/forum/#!msg/kubernetes-dev/ytjk_rNrTa0/3EFUHvovCAAJ	2018-04-08 12:57:52 -07:00
Dalton Hubble	f4b2396718	Return Prometheus deployment to be a worker workload * Expose etcd metrics to workers so Prometheus can run on a worker, rather than a controller * Drop temporary firewall rules allowing Prometheus to run on a controller and scrape targes * Related to https://github.com/poseidon/typhoon/pull/175	2018-04-08 12:20:00 -07:00
Dalton Hubble	18dbaf74ce	Update kube-dns from v1.14.8 to v1.14.9 * https://github.com/kubernetes/kubernetes/pull/61908	2018-04-04 21:00:23 -07:00
Dalton Hubble	ce001e9d56	Update etcd from v3.3.2 to v3.3.3 * https://github.com/coreos/etcd/releases/tag/v3.3.3	2018-04-04 20:32:24 -07:00
Dalton Hubble	d770393dbc	Add etcd metrics, Prometheus scrapes, and Grafana dash * Use etcd v3.3 --listen-metrics-urls to expose only metrics data via http://0.0.0.0:2381 on controllers * Add Prometheus discovery for etcd peers on controller nodes * Temporarily drop two noisy Prometheus alerts	2018-04-03 20:31:00 -07:00
Dalton Hubble	1cc043d1eb	Update Kubernetes from v1.9.6 to v1.10.0	2018-03-30 22:14:07 -07:00
Dalton Hubble	de4d90750e	Use consistent naming of remote provision steps	2018-03-26 00:29:57 -07:00
Dalton Hubble	7acd4931f6	Remove redundant kubeconfig copy on AWS and GCP * AWS and Google Cloud make use of auto-scaling groups and managed instance groups, respectively. As such, the kubeconfig is already held in cloud user-data * Controller instances are provisioned with a kubeconfig from user-data. Its redundant to use a Terraform remote file copy step for the kubeconfig.	2018-03-26 00:01:47 -07:00
Dalton Hubble	fdb543e834	Add optional controller_type and worker_type vars on GCP * Remove optional machine_type variable on Google Cloud * Use controller_type and worker_type instead	2018-03-25 22:11:18 -07:00
Dalton Hubble	8d3d4220fd	Add disk_size variable on Google Cloud	2018-03-25 22:04:14 -07:00
Dalton Hubble	e43cf9f608	Organize and cleanup variable descriptions	2018-03-25 21:44:43 -07:00
Dalton Hubble	a04ef3919a	Update Kubernetes from v1.9.5 to v1.9.6	2018-03-21 20:29:52 -07:00
Dalton Hubble	758c09fa5c	Update Kubernetes from v1.9.4 to v1.9.5	2018-03-19 00:25:44 -07:00
Dalton Hubble	7f7bc960a6	Set default Google Cloud os_image to coreos-stable	2018-03-19 00:08:26 -07:00
Dalton Hubble	f3730b2bfa	Add Container Linux Config snippets feature * Introduce the ability to support Container Linux Config "snippets" for controllers and workers on cloud platforms. This allows end-users to customize hosts by providing Container Linux configs that are additively merged into the base configs defined by Typhoon. Config snippets are validated, merged, and show any errors during `terraform plan` * Example uses include adding systemd units, network configs, mounts, files, raid arrays, or other disk provisioning features provided by Container Linux Configs (using Ignition low-level) * Requires terraform-provider-ct v0.2.1 plugin	2018-03-18 18:28:18 -07:00
Dalton Hubble	88aa9a46e5	Add /var/lib/calico volume mount to Calico DaemonSet	2018-03-18 16:40:38 -07:00
Dalton Hubble	efa90d8b44	Add a new key=value label to controller nodes * Add a node-role.kubernetes.io/controller="true" node label to controllers so Prometheus service discovery can filter to services that only run on controllers (i.e. masters) * Leave node-role.kubernetes.io/master="" untouched as its a Kubernetes convention	2018-03-18 16:39:10 -07:00
Dalton Hubble	21f2cef12f	Improve changelog, README, and index page	2018-03-12 20:58:02 -07:00
Dalton Hubble	931e311786	Update Kubernetes from v1.9.3 to v1.9.4	2018-03-12 18:07:50 -07:00
Dalton Hubble	2592a0aad4	Allow Google accelerators (i.e. GPUs) on workers	2018-03-11 17:21:24 -07:00
Dalton Hubble	9fb1e1a0e2	Update etcd from v3.3.1 to v3.3.2 * https://github.com/coreos/etcd/releases/tag/v3.3.2	2018-03-10 13:44:35 -08:00
Dalton Hubble	c112ee3829	Rename cluster_name to name in internal module * Ensure consistency between AWS and GCP platforms	2018-03-03 17:52:01 -08:00
Dalton Hubble	45b556c08f	Fix overly strict firewall for GCP "worker pools" * Fix issue where worker firewall rules didn't apply to additional workers attached to a GCP cluster using the new "worker pools" feature (unreleased, #148). Solves host connection timeouts and pods not being scheduled to attached worker pools. * Add `name` field to GCP internal worker module to represent the unique name of of the worker pool * Use `cluster_name` field of GCP internal worker module for passing the name of the cluster to which workers should be attached	2018-03-03 17:40:17 -08:00
Dalton Hubble	da6aafe816	Revert "Add module version requirements to internal workers modules" * This reverts commit `cce4537487`. * Provider passing to child modules is complex and the behavior changed between Terraform v0.10 and v0.11. We're continuing to allow both versions so this change should be reverted. For the time being, those using our internal Terraform modules will have to be aware of the minimum version for AWS and GCP providers, there is no good way to do enforcement.	2018-03-03 16:56:34 -08:00
Dalton Hubble	cce4537487	Add module version requirements to internal workers modules	2018-03-03 14:39:25 -08:00

1 2 3

127 Commits