Commit Graph

131 Commits

Author SHA1 Message Date
Dalton Hubble cbe646fba6 Label namespaces to ease writing Network Policies 2018-06-09 11:45:11 -07:00
Dalton Hubble c166b2ba33 Update prometheus from v2.2.1 to v2.3.0 2018-06-09 11:43:10 -07:00
Dalton Hubble 79260c48f6 Update Kubernetes from v1.10.3 to v1.10.4 2018-06-06 23:23:11 -07:00
Dalton Hubble 589c3569b7 Update etcd from v3.3.5 to v3.3.6
* https://github.com/coreos/etcd/releases/tag/v3.3.6
2018-06-06 23:19:30 -07:00
Dalton Hubble d32e6797ae Annotate Grafana so Prometheus scrapes metrics 2018-05-30 22:37:47 -07:00
Dalton Hubble 32a9a83190 Add Prometheus liveness and readiness probes 2018-05-30 22:34:07 -07:00
Dalton Hubble 6e968cd152 Update Calico from v3.1.2 to v3.1.3
* https://github.com/projectcalico/calico/releases/tag/v3.1.3
* https://github.com/projectcalico/cni-plugin/releases/tag/v3.1.3
2018-05-30 21:32:12 -07:00
Dalton Hubble 4ea1fde9c5 Update Kubernetes from v1.10.2 to v1.10.3
* https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.10.md#v1103
* Update Calico from v3.1.1 to v3.1.2
2018-05-21 21:38:43 -07:00
Dalton Hubble 1e2eec6487 Update Fedora Atomic from 27 to 28 on DigitalOcean
* Fedora Atomic 27 images disappeared from DigitalOcean and
forced this early update (there are known bugs)
2018-05-21 21:30:23 -07:00
Dalton Hubble 28d0891729 Annotate nginx-ingress addon for Prometheus auto-discovery
* Add Google Cloud firewall rule to allow worker to worker access
to health and metrics
2018-05-19 13:13:14 -07:00
Dalton Hubble 714419342e Update nginx-ingress from 0.14.0 to 0.15.0
* https://github.com/kubernetes/ingress-nginx/releases/tag/nginx-0.15.0
2018-05-17 21:42:55 -07:00
Dalton Hubble 3701c0b1fe Update Grafana from v5.1.2 to v5.1.3
* https://github.com/grafana/grafana/releases/tag/v5.1.3
2018-05-17 21:36:09 -07:00
Dalton Hubble 0c3557e68e Allow Flatcar Linux os_channel on bare-metal
* Choose the Container Linux derivative Flatcar Linux on
bare-metal by setting os_channel to flatcar-stable, flatcar-beta
or flatcar-alpha
* As with Container Linux from Red Hat, the version (os_version)
must correspond to the channel being used
* Thank you to @dongsupark from Kinvolk
2018-05-17 20:09:36 -07:00
Dalton Hubble adc6c6866d Rename container_linux_ bare-metal variables
* Allow for Container Linux derivatives
* Replace container_linux_channel variable with `os_channel`
* Replace `container_linux_version` variable with `os_version`
* Please change values `stable`, `beta`, or `alpha` to `coreos-stable`,
`coreos-beta`, `coreos-alpha` (action required!)
2018-05-16 22:40:39 -07:00
Dalton Hubble 9ac7b0655f Add bare-metal network_ip_autodetection_method variable for multi-NIC
* Allow setting the Calico host IPv4 address autodetection method
* Use Calico's default "first-found" method to support single NIC
and bonded NIC nodes
* Allow methods like `can-reach=IP` or `interface=REGEX` for multi
NIC nodes
* https://docs.projectcalico.org/v3.1/reference/node/configuration#ip-autodetection-methods
2018-05-15 23:27:34 -07:00
Dalton Hubble c2b719dc75 Configure Prometheus to scrape Kubelets directly
* Use Kubelet bearer token authn/authz to scrape metrics
* Drop RBAC permission from nodes/proxy to nodes/metrics
* Stop proxying kubelet scrapes through the apiserver, since
this required higher privilege (nodes/proxy) and can add
load to the apiserver on large clusters
2018-05-14 23:06:50 -07:00
Dalton Hubble 37981f9fb1 Allow bearer token authn/authz to the Kubelet
* Require Webhook authorization to the Kubelet
* Switch apiserver X509 client cert org to systems:masters
to grant the apiserver admin and satisfy the authorization
requirement. kubectl commands like logs or exec that have
the apiserver make requests of a kubelet continue to work
as before
* https://kubernetes.io/docs/admin/kubelet-authentication-authorization/
* https://github.com/poseidon/typhoon/issues/215
2018-05-13 23:20:42 -07:00
Dalton Hubble 5eb11f5104 Allow Flatcar Linux os_image on AWS, rename os_channel
* Replace os_channel variable with os_image to align naming
across clouds. Users who set this option to stable, beta, or
alpha should now set os_image to coreos-stable, coreos-beta,
or coreos-alpha.
* Default os_image to coreos-stable. This continues to use
the most recent image from the stable channel as always.
* Allow Container Linux derivative Flatcar Linux by setting
os_image to `flatcar-stable`, `flatcar-beta`, `flatcar-alpha`
2018-05-12 11:41:58 -07:00
Dalton Hubble f2ee75ac98 Require Terraform v0.11.x, drop v0.10.x support
* Raise minimum Terraform version to v0.11.0
* Terraform v0.11.x has been supported since Typhoon v1.9.2
and Terraform v0.10.x was last released in Nov 2017. I'd like
to stop worrying about v0.10.x and remove migration docs as
a later followup
* Migration docs docs/topics/maintenance.md#terraform-v011x
2018-05-10 02:20:46 -07:00
Dalton Hubble 8b8e364915 Update etcd from v3.3.4 to v3.3.5
* https://github.com/coreos/etcd/releases/tag/v3.3.5
2018-05-10 02:12:53 -07:00
Dalton Hubble fb88113523 Disable default Google Analytics in Grafana addon
* Its come to my attention Grafana reports analytics data
by default. Typhoon's philosophy requires user permission
for data collection so the addon should have this disabled
* http://docs.grafana.org/installation/configuration/#analytics
2018-05-10 01:18:47 -07:00
Dalton Hubble 1854f5c104 Update Grafana from v5.1.1 to v5.1.2
* https://github.com/grafana/grafana/releases/tag/v5.1.2
2018-05-10 01:09:08 -07:00
Dalton Hubble 726b58b697 Update Grafana from v5.0.4 to v5.1.1
* https://github.com/grafana/grafana/releases/tag/v5.1.1
* https://github.com/grafana/grafana/releases/tag/v5.1.0
2018-05-07 22:05:19 -07:00
Dalton Hubble a54e3c0da1 Fix Prometheus data dir to /var/lib/prometheus
* A data volume (emptyDir) is mounted to /var/lib/prometheus
* Users could swap emptyDir for any desired volume if data
persistence is desired. Prometheus previously defaulted to
keeping its data in ./data relative to /prometheus. Override
this behavior to store data in /var/lib/prometheus
2018-05-01 22:05:27 -07:00
Dalton Hubble cc29530ba0 Allow preemptible workers on AWS via spot instances
* Add `worker_price` to allow worker spot instances. Defaults
to empty string for the worker autoscaling group to use regular
on-demand instances.
* Add `spot_price` to internal `workers` module for spot worker
pools
* Note: Unlike GCP `preemptible` workers, spot instances require
you to pick a bid price.
2018-04-29 13:31:17 -07:00
Dalton Hubble 385584b712 Add changelog notes for release 2018-04-29 12:04:44 -07:00
Dalton Hubble 32ddfa94e1 Update Kubernetes from v1.10.1 to v1.10.2
* https://github.com/kubernetes/kubernetes/releases/tag/v1.10.2
2018-04-28 00:27:00 -07:00
Dalton Hubble 681450aa0d Update etcd from v3.3.3 to v3.3.4
* https://github.com/coreos/etcd/releases/tag/v3.3.4
2018-04-27 23:57:26 -07:00
Dalton Hubble fafa028052 Add Typhoon for Fedora Atomic to changelog 2018-04-27 23:55:59 -07:00
Dalton Hubble a54f76db2a Update Calico from v3.0.4 to v3.1.1
* https://github.com/projectcalico/calico/releases/tag/v3.1.1
* https://github.com/projectcalico/calico/releases/tag/v3.1.0
2018-04-21 18:30:36 -07:00
Dalton Hubble e0d9e9979c Update nginx-ingress from 0.12.0 to 0.13.0
* https://github.com/kubernetes/ingress-nginx/releases/tag/nginx-0.13.0
2018-04-18 21:12:09 -07:00
Dalton Hubble ad2e4311d1 Switch GCP network lb to global TCP proxy lb
* Allow multi-controller clusters on Google Cloud
* GCP regional network load balancers have a long open
bug in which requests originating from a backend instance
are routed to the instance itself, regardless of whether
the health check passes or not. As a result, only the 0th
controller node registers. We've recommended just using
single master GCP clusters for a while
* https://issuetracker.google.com/issues/67366622
* Workaround issue by switching to a GCP TCP Proxy load
balancer. TCP proxy lb routes traffic to a backend service
(global) of instance group backends. In our case, spread
controllers across 3 zones (all regions have 3+ zones) and
organize them in 3 zonal unmanaged instance groups that
serve as backends. Allows multi-controller cluster creation
* GCP network load balancers only allowed legacy HTTP health
checks so kubelet 10255 was checked as an approximation of
controller health. Replace with TCP apiserver health checks
to detect unhealth or unresponsive apiservers.
* Drawbacks: GCP provision time increases, tailed logs now
timeout (similar tradeoff in AWS), controllers only span 3
zones instead of the exact number in the region
* Workaround in Typhoon has been known and posted for 5 months,
but there still appears to be no better alternative. Its
probably time to support multi-master and accept the downsides
2018-04-18 00:09:06 -07:00
Dalton Hubble 9789881243 Update kube-state-metrics from v1.3.0 to v1.3.1
* https://github.com/kubernetes/kube-state-metrics/releases/tag/v1.3.1
2018-04-15 17:10:02 -07:00
Dalton Hubble 77c0a4cf2e Update Kubernetes from v1.10.0 to v1.10.1
* Use kubernetes-incubator/bootkube v0.12.0
2018-04-12 20:57:31 -07:00
Dalton Hubble 5035d56db2 Refactor GCP to remove controller internal module
* Remove the controller internal module to align with
other platforms and since its not a supported use case
2018-04-12 19:41:51 -07:00
Dalton Hubble d276fffcda Fix bare-metal multiple apply/ssh on Terraform v0.11.4+
* Terraform v0.11.4 introduced changes to remote-exec
that mean Typhoon bare-metal clusters require multiple
runs of terraform apply to ssh and bootstrap.
* Bare-metal installs PXE boot a live instance to install
to disk and then reboot from disk as controllers/workers.
Terraform remote-exec has no way to "know" to wait until
the reboot has occurred to kickoff Kubernetes bootstrap.
Previously Typhoon created a "debug" user during this
install phase to allow an admin to SSH, but remote-exec
would hang, trying to connect as user "core". Terraform
v0.11.4 changes this behavior so remote-exec fails and
a user must re-run terraform apply until succeeding.
* A new way to "trick" remote-exec into waiting for the
reboot into the disk install is to run SSH on a non-standard
port during the disk install. This retains the ability
for an admin to SSH during install (most distros don't have
this) and fixes the issue so only a single run of terraform
apply is needed.
* https://github.com/hashicorp/terraform/pull/17359#issuecomment-376415464
2018-04-08 13:32:31 -07:00
Dalton Hubble 6b08bde479 Use k8s.gcr.io instead of gcr.io/google_containers
* Kubernetes recommends using the alias to fetch images
from the nearest GCR regional mirror, to abstract the use
of GCR, and to drop names containing 'google'
* https://groups.google.com/forum/#!msg/kubernetes-dev/ytjk_rNrTa0/3EFUHvovCAAJ
2018-04-08 12:57:52 -07:00
Dalton Hubble 7186aa46da Update kube-state-metrics from v1.2.0 to v1.3.0
* https://github.com/kubernetes/kube-state-metrics/pull/412
* https://github.com/kubernetes/kube-state-metrics/pull/413
2018-04-04 21:04:13 -07:00
Dalton Hubble 18dbaf74ce Update kube-dns from v1.14.8 to v1.14.9
* https://github.com/kubernetes/kubernetes/pull/61908
2018-04-04 21:00:23 -07:00
Dalton Hubble ce001e9d56 Update etcd from v3.3.2 to v3.3.3
* https://github.com/coreos/etcd/releases/tag/v3.3.3
2018-04-04 20:32:24 -07:00
Dalton Hubble d770393dbc Add etcd metrics, Prometheus scrapes, and Grafana dash
* Use etcd v3.3 --listen-metrics-urls to expose only metrics
data via http://0.0.0.0:2381 on controllers
* Add Prometheus discovery for etcd peers on controller nodes
* Temporarily drop two noisy Prometheus alerts
2018-04-03 20:31:00 -07:00
Dalton Hubble 642f7ec22f Update CHANGES.md with Kubernetes link 2018-03-30 23:12:38 -07:00
Dalton Hubble f8e9bfb1c0 Add disk_type variable for EBS volume type on AWS
* Change EBS volume type from `standard` ("prior generation)
 to `gp2`. Prometheus alerts are tuned for SSDs
* Other platforms have fast enough disks by default
2018-03-29 22:51:54 -07:00
Dalton Hubble b1e41dcb99 addons: Update from Grafana v4.6.3 to v5.0.4
This reverts commit c59a9c66b1.
2018-03-28 19:45:19 -07:00
Dalton Hubble cfd603bea2 Ensure etcd secrets are only distributed to controller hosts
* Previously, etcd secrets were erroneously distributed to worker
nodes (permissions 500, ownership etc:etcd).
2018-03-25 23:46:44 -07:00
Dalton Hubble fdb543e834 Add optional controller_type and worker_type vars on GCP
* Remove optional machine_type variable on Google Cloud
* Use controller_type and worker_type instead
2018-03-25 22:11:18 -07:00
Dalton Hubble 8d3d4220fd Add disk_size variable on Google Cloud 2018-03-25 22:04:14 -07:00
Dalton Hubble ba9daf439e Remove unmaintained pxe-worker internal module 2018-03-25 21:57:52 -07:00
Dalton Hubble 38adb14bd2 Remove optional variable networking on Digital Ocean
* Calico isn't viable on Digital Ocean because their firewalls
do not support IP-IP protocol. Its not viable to run a cluster
without firewalls just to use Calico.
* Remove the caveat note. Don't allow users to shoot themselves
in the foot
2018-03-25 21:48:51 -07:00
Dalton Hubble da2be86e8c Add v1.9.6 heading to CHANGES.md 2018-03-22 22:01:29 -07:00