typhoon

mirror of https://github.com/puppetmaster/typhoon.git synced 2025-10-04 14:54:36 +02:00

Author	SHA1	Message	Date
Dalton Hubble	2e4bf4d7ae	Add Fedora Atomic announcement and improve docs	2018-04-26 08:18:39 -07:00
Dalton Hubble	b6a51d0b68	Add architecture docs on operating systems	2018-04-25 22:59:48 -07:00
Dalton Hubble	567e18f015	Fix conflict between Calico and NetworkManager * Observed frequent kube-scheduler and controller-manager restarts with Calico as the CNI provider. Root cause was unclear since control plane was functional and tests of pod to pod network connectivity passed * Root cause: Calico sets up cali* and tunl* network interfaces for containers on hosts. NetworkManager tries to manage these interfaces. It periodically disconnected veth pairs. Logs did not surface this issue since it's not an error per se, just Calico and NetworkManager dueling for control. Kubernetes correctly restarted pods failing health checks and ensured 2 replicas were running so the control plane functioned mostly normally. Pod to pod connectivity was only affected occasionally. Pain to debug. * Solution: Configure NetworkManager to ignore the Calico ifaces per Calico's recommendation. Cloud-init writes files after NetworkManager starts, so a restart is required on first boot. On subsequent boots, the file is present so no restart is needed	2018-04-25 21:45:58 -07:00
Dalton Hubble	0a7fab56e2	Load ip_vs kernel module on boot as workaround * (containerized) kube-proxy warns that it is unable to load the ip_vs kernel module despite having the correct mounts. Atomic uses an xz compressed module and modprobe in the container was not compiled with compression support * Work around the issue for now by always loading ip_vs on-host * https://github.com/kubernetes/kubernetes/issues/60	2018-04-25 21:45:58 -07:00
Dalton Hubble	d784b0fca6	Switch to quay.io/poseidon tagged system containers	2018-04-25 18:15:18 -07:00
Dalton Hubble	cd913986df	Write documentation for Fedora Atomic	2018-04-24 01:10:27 -07:00
Dalton Hubble	af54efec28	Organize docs by operating system	2018-04-23 19:55:28 -07:00
Dalton Hubble	7198b9016c	Update Calico from v3.0.4 to v3.1.1 for Atomic	2018-04-21 18:46:56 -07:00
Dalton Hubble	f36c890234	Fix ostree repo to be called fedora-atomic on bare-metal * atomic host updates were fetching updates from the repo cache fedora-atomic-27, instead of from upstream	2018-04-21 18:46:56 -07:00
Dalton Hubble	233ec6dcb0	Update Fedora Atomic AMI to version 27.122 * http://www.projectatomic.io/blog/2018/04/fedora-atomic-20-apr-18/ * Atomic publishes nightly AMIs which sometimes don't boot or have issues. Until there is a source of reliable AMIs, pin the best known working AMI * Rel 66a66f0d18544591ffdbf8fae9df790113c93d72	2018-04-21 18:46:56 -07:00
Dalton Hubble	3f2978821b	Add atomic_assets_endpoint var for fedora-atomic bare-metal	2018-04-21 18:46:56 -07:00
Dalton Hubble	9b88d4bbfd	Use bootkube system container on fedora-atomic * Use the upstream bootkube image packaged with the required metadata to be usable as a system container under systemd * Run bootkube with runc so no host level components use Docker any more. Docker is still the runtime * Remove bootkube script and old systemd unit	2018-04-21 18:46:56 -07:00
Dalton Hubble	3dde4ba8ba	Mount host's /etc/os-release in kubelet system containers * Fix `kubectl describe node` to reflect the host's operating system	2018-04-21 18:46:56 -07:00
Dalton Hubble	e148552220	Enable kubelet allocatable enforcement and QoS cgroup hierarchy * Change kubelet system image to use --cgroups-per-qos=true (default) instead of false * Change kubelet system image to use --enforce-node-allocatable=pods instead of an empty string	2018-04-21 18:46:56 -07:00
Dalton Hubble	d8d1468f03	Update kubelet system container image to mount /etc/hosts * Fix kubelet port-forward on Google Cloud / Fedora Atomic * Mount the host's /etc/hosts in kubelet system containers * Problem: kubelet runc system containers on Atomic were not mounting the host's /etc/hosts, like rkt-fly does on Container Linux. `kubectl port-forward` calls socat with localhost. DNS servers on AWS, DO, and in many bare-metal environments resolve localhost to the caller as a convenience. Google Cloud notably does not, nor is it required to do so, and this surfaced the missing /etc/hosts in runc kubelet namespaces.	2018-04-21 18:46:56 -07:00
Dalton Hubble	2b74aba564	Add Google Cloud fedora-atomic module * Network load balancer for ingress doesn't work yet because Compute Engine packages are missing * port-forward / socat is broken	2018-04-21 18:46:56 -07:00
Dalton Hubble	24d230505a	Add cloud-metadata.service on AWS fedora-atomic	2018-04-21 18:46:56 -07:00
Dalton Hubble	cf22e70b46	Name ostree remote repo fedora-atomic across platforms	2018-04-21 18:46:56 -07:00
Dalton Hubble	b3cf9508b6	Update Fedora Atomic modules to Kubernetes v1.10.1	2018-04-21 18:46:56 -07:00
Dalton Hubble	5212684472	Temporarily pin Fedora Atomic AMI * Atomic has published AMI images that shut down immediately after being powered on	2018-04-21 18:46:56 -07:00
Dalton Hubble	f990473cde	Update control plane manifests and add etcd metrics * Enable etcd v3.3 metrics to expose metrics for scraping by Prometheus * Use k8s.gcr.io instead of gcr.io/google_containers * Add flexvolume plugin mount to controller manager * Update kube-dns from v1.14.8 to v1.14.9	2018-04-21 18:46:56 -07:00
Dalton Hubble	8523a086e2	Fix kubelet system container to mount CNI plugins * Mount /opt/cni/bin in kubelet system container so CNI plugin binaries can be found. Before, flannel worked because the kubelet falls back to flannel plugin baked into the hyperkube (undesired) * Move the CNI bin install location later, since /opt changes may be lost between ostree rebases	2018-04-21 18:46:56 -07:00
Dalton Hubble	19bc5aea9e	Use kubelet system container on fedora-atomic * Use the upstream hyperkube image packaged with the required metadata to be usable as a system container under systemd * Fix port-forward since socat is included	2018-04-21 18:46:56 -07:00
Dalton Hubble	8d7cfc1a45	Use etcd system container on fedora-atomic * Use the upstream etcd image packaged with the required metadata to be usable as a system container (runc) under systemd	2018-04-21 18:46:56 -07:00
Dalton Hubble	9969c357da	Change AWS Fedora module to fedora-atomic	2018-04-21 18:46:56 -07:00
Dalton Hubble	4e43b2ff48	Change DO Fedora module to fedora-atomic	2018-04-21 18:46:56 -07:00
Dalton Hubble	ddc75e99ac	Add bare-metal Fedora Atomic module * Several known hacks and broken areas * Download v1.10 Kubelet from release tarball * Install flannel CNI binaries to /opt/cni * Switch SELinux to Permissive * Disable firewalld service * port-forward won't work, socat missing	2018-04-21 18:46:56 -07:00
Dalton Hubble	b80a2eb8a0	Sync fedora-cloud modules with Container Linux * Update manifests for Kubernetes v1.10.0 * Update etcd from v3.3.2 to v3.3.3 * Add disk_type optional variable on AWS * Remove redundant kubeconfig copy on AWS * Distribute etcd secrets only to controllers * Organize module variables and ssh steps	2018-04-21 18:46:56 -07:00
Dalton Hubble	3610da8b71	Add fedora-cloud module for AWS	2018-04-21 18:46:56 -07:00
Dalton Hubble	485586e5d8	Add fedora-cloud module for Digital Ocean	2018-04-21 18:46:56 -07:00
Dalton Hubble	a54f76db2a	Update Calico from v3.0.4 to v3.1.1 * https://github.com/projectcalico/calico/releases/tag/v3.1.1 * https://github.com/projectcalico/calico/releases/tag/v3.1.0	2018-04-21 18:30:36 -07:00
Dalton Hubble	e0d9e9979c	Update nginx-ingress from 0.12.0 to 0.13.0 * https://github.com/kubernetes/ingress-nginx/releases/tag/nginx-0.13.0	2018-04-18 21:12:09 -07:00
Dalton Hubble	ad2e4311d1	Switch GCP network lb to global TCP proxy lb * Allow multi-controller clusters on Google Cloud * GCP regional network load balancers have a long-open bug in which requests originating from a backend instance are routed to the instance itself, regardless of whether the health check passes or not. As a result, only the 0th controller node registers. We've recommended just using single master GCP clusters for a while * https://issuetracker.google.com/issues/67366622 * Work around the issue by switching to a GCP TCP Proxy load balancer. TCP proxy lb routes traffic to a backend service (global) of instance group backends. In our case, spread controllers across 3 zones (all regions have 3+ zones) and organize them in 3 zonal unmanaged instance groups that serve as backends. Allows multi-controller cluster creation * GCP network load balancers only allowed legacy HTTP health checks so kubelet 10255 was checked as an approximation of controller health. Replace with TCP apiserver health checks to detect unhealthy or unresponsive apiservers. * Drawbacks: GCP provision time increases, tailed logs now time out (similar tradeoff in AWS), controllers only span 3 zones instead of the exact number in the region * Workaround in Typhoon has been known and posted for 5 months, but there still appears to be no better alternative. It's probably time to support multi-master and accept the downsides	2018-04-18 00:09:06 -07:00
@luke	490b628e2d	Use relative image links to appear in GitHub markdown	2018-04-17 23:40:58 -07:00
Dalton Hubble	23a8156bdf	Fix a few typos in comments	2018-04-15 17:21:49 -07:00
Dalton Hubble	9789881243	Update kube-state-metrics from v1.3.0 to v1.3.1 * https://github.com/kubernetes/kube-state-metrics/releases/tag/v1.3.1	2018-04-15 17:10:02 -07:00
Dalton Hubble	77c0a4cf2e	Update Kubernetes from v1.10.0 to v1.10.1 * Use kubernetes-incubator/bootkube v0.12.0 v1.10.1	2018-04-12 20:57:31 -07:00
Dalton Hubble	5035d56db2	Refactor GCP to remove controller internal module * Remove the controller internal module to align with other platforms and since it's not a supported use case	2018-04-12 19:41:51 -07:00
Dalton Hubble	9bb3de5327	Skip creating unused dirs on worker nodes	2018-04-11 22:23:51 -07:00
Dalton Hubble	c8eabc2af4	Fix GCP controller_type and worker_type vars	2018-04-11 22:19:58 -07:00
Matt Dorn	2eaf858c5c	Update example BGPPeer manifest Previous example may have been outdated. It resulted in `error: unable to recognize "example.yaml": no matches for /, Kind=bgpPeer` . See https://docs.projectcalico.org/v3.0/reference/calicoctl/resources/bgppeer.	2018-04-09 23:23:18 -05:00
Dalton Hubble	b8656fd74b	Clarify bare-metal SSH instructions	2018-04-08 14:11:05 -07:00
Dalton Hubble	d276fffcda	Fix bare-metal multiple apply/ssh on Terraform v0.11.4+ * Terraform v0.11.4 introduced changes to remote-exec that mean Typhoon bare-metal clusters require multiple runs of terraform apply to ssh and bootstrap. * Bare-metal installs PXE boot a live instance to install to disk and then reboot from disk as controllers/workers. Terraform remote-exec has no way to "know" to wait until the reboot has occurred to kick off Kubernetes bootstrap. Previously Typhoon created a "debug" user during this install phase to allow an admin to SSH, but remote-exec would hang, trying to connect as user "core". Terraform v0.11.4 changes this behavior so remote-exec fails and a user must re-run terraform apply until succeeding. * A new way to "trick" remote-exec into waiting for the reboot into the disk install is to run SSH on a non-standard port during the disk install. This retains the ability for an admin to SSH during install (most distros don't have this) and fixes the issue so only a single run of terraform apply is needed. * https://github.com/hashicorp/terraform/pull/17359#issuecomment-376415464	2018-04-08 13:32:31 -07:00
Dalton Hubble	6b08bde479	Use k8s.gcr.io instead of gcr.io/google_containers * Kubernetes recommends using the alias to fetch images from the nearest GCR regional mirror, to abstract the use of GCR, and to drop names containing 'google' * https://groups.google.com/forum/#!msg/kubernetes-dev/ytjk_rNrTa0/3EFUHvovCAAJ	2018-04-08 12:57:52 -07:00
Dalton Hubble	f4b2396718	Return Prometheus deployment to be a worker workload * Expose etcd metrics to workers so Prometheus can run on a worker, rather than a controller * Drop temporary firewall rules allowing Prometheus to run on a controller and scrape targets * Related to https://github.com/poseidon/typhoon/pull/175	2018-04-08 12:20:00 -07:00
Dalton Hubble	b76126db93	Update docs builder and material theme	2018-04-08 00:00:03 -07:00
Dalton Hubble	7186aa46da	Update kube-state-metrics from v1.2.0 to v1.3.0 * https://github.com/kubernetes/kube-state-metrics/pull/412 * https://github.com/kubernetes/kube-state-metrics/pull/413	2018-04-04 21:04:13 -07:00
Dalton Hubble	18dbaf74ce	Update kube-dns from v1.14.8 to v1.14.9 * https://github.com/kubernetes/kubernetes/pull/61908	2018-04-04 21:00:23 -07:00
Dalton Hubble	ce001e9d56	Update etcd from v3.3.2 to v3.3.3 * https://github.com/coreos/etcd/releases/tag/v3.3.3	2018-04-04 20:32:24 -07:00
Dalton Hubble	d770393dbc	Add etcd metrics, Prometheus scrapes, and Grafana dash * Use etcd v3.3 --listen-metrics-urls to expose only metrics data via http://0.0.0.0:2381 on controllers * Add Prometheus discovery for etcd peers on controller nodes * Temporarily drop two noisy Prometheus alerts	2018-04-03 20:31:00 -07:00
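
Note on commit 567e18f015: the fix there configures NetworkManager to ignore Calico's cali* and tunl* interfaces. A minimal sketch of what such a drop-in could contain is given below. The file path, the function name, and the choice to render the text from Python are illustrative assumptions and are not taken from the Typhoon repository; only the `[keyfile]` / `unmanaged-devices` setting follows Calico's published recommendation.

```python
# Sketch only: renders a NetworkManager drop-in that leaves Calico
# interfaces unmanaged. Path and function name are hypothetical.

CALICO_CONF_PATH = "/etc/NetworkManager/conf.d/calico.conf"  # assumed location


def render_unmanaged_conf(patterns):
    """Return keyfile text telling NetworkManager to ignore matching interfaces."""
    devices = ";".join(f"interface-name:{p}" for p in patterns)
    return f"[keyfile]\nunmanaged-devices={devices}\n"


# Interface patterns named in the commit message.
conf_text = render_unmanaged_conf(["cali*", "tunl*"])
print(f"# would be written to {CALICO_CONF_PATH}")
print(conf_text)
```

As the commit message notes, a file like this written by cloud-init lands after NetworkManager has started, so NetworkManager needs a restart on first boot for it to take effect.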

496 Commits