Compare commits


86 Commits

Author SHA1 Message Date
f00ecde854 Rollback nginx-ingress on GCE to 0.9.0-beta.17
* https://github.com/kubernetes/ingress-nginx/issues/1788
2017-12-02 14:06:22 -08:00
d85300f947 Clarify only Terraform v0.10.x should be used
* It is not safe to update to Terraform v0.11.x yet
* https://github.com/hashicorp/terraform/issues/16824
2017-12-02 01:31:39 -08:00
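A minimal sketch of such a pin, assuming it lives in the root module's `terraform` block:

```hcl
# Restrict Terraform itself to the 0.10.x series
terraform {
  # "~> 0.10.0" allows 0.10.4, 0.10.8, etc., but not 0.11
  required_version = "~> 0.10.0"
}
```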
65f006e6cc addons: Sync prometheus alerts to upstream
* https://github.com/coreos/prometheus-operator/pull/774
2017-12-01 23:24:08 -08:00
8d3817e0ae addons: Update nginx-ingress to 0.9.0-beta.19
* https://github.com/kubernetes/ingress-nginx/releases/tag/nginx-0.9.0-beta.19
2017-12-01 22:32:33 -08:00
5f5eec1175 Update bootkube and terraform-render-bootkube to v0.9.0 2017-12-01 22:27:48 -08:00
5308fde3d3 Add Kubernetes certification badge 2017-11-29 19:26:49 -08:00
9ab61d7bf5 Add Typhoon images with and without text
* Serve images from the poseidon GCS bucket, rather than dghubble
2017-11-29 01:01:01 -08:00
6483f613c5 Update Kubernetes from v1.8.3 to v1.8.4 2017-11-28 21:52:11 -08:00
56c6bf431a Update terraform-render-bootkube for Kubernetes v1.8.4
* Update hyperkube from v1.8.3 to v1.8.4
* Remove flock from bootstrap-apiserver and kube-apiserver
* Remove unused critical-pod annotations in manifests
* Use service accounts for kube-proxy and pod-checkpointer
* Update Calico from v2.6.1 to v2.6.3
* Update flannel from v0.9.0 to v0.9.1
* Remove Calico termination grace period to prevent Calico
from getting stuck for extended periods
* https://github.com/poseidon/terraform-render-bootkube/pull/29
2017-11-28 21:42:26 -08:00
63ab117205 addons: Add prometheus rules for DaemonSets
* https://github.com/coreos/prometheus-operator/pull/755
2017-11-16 23:51:21 -08:00
1cd262e712 addons: Fix prometheus K8SApiServerLatency alert rule
* https://github.com/coreos/prometheus-operator/issues/751
2017-11-16 23:37:15 -08:00
32bdda1b6c addons: Update Grafana from v4.6.1 to v4.6.2
* https://github.com/grafana/grafana/releases/tag/v4.6.2
2017-11-16 23:34:36 -08:00
07d257aa7b Add initrd kernel argument needed by UEFI clients
* https://github.com/coreos/bugs/issues/1239
2017-11-16 23:19:51 -08:00
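A sketch of the fix, assuming a matchbox profile shaped like the bare-metal module's (the profile name and asset paths are illustrative):

```hcl
# Pass an explicit initrd= kernel argument so UEFI clients can locate the initrd
resource "matchbox_profile" "worker" {
  name   = "worker"
  kernel = "/assets/coreos/coreos_production_pxe.vmlinuz"
  initrd = ["/assets/coreos/coreos_production_pxe_image.cpio.gz"]

  args = [
    "initrd=coreos_production_pxe_image.cpio.gz",
    "coreos.first_boot=yes",
  ]
}
```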
fd96067125 Fix docs link for security issue reporting 2017-11-10 21:38:41 -08:00
9d16f5c78a Update min Google plugin and remove target pool workaround
* With google provider 1.2, target pool instances can use self_link
and zone/name formats without causing a diff on each plan
* Original workaround: 77fc14db71
2017-11-10 21:15:19 -08:00
159443bae7 addons: Add better alerting rules to Prometheus manifests
* Adapt the coreos/prometheus-operator alerting rules for Typhoon,
https://github.com/coreos/prometheus-operator/tree/master/contrib/kube-prometheus/manifests
* Add controller manager and scheduler shim services to let
prometheus discover them via service endpoints
* Fix several alert rules to use service endpoint discovery
* A few rules still don't do much, but they default to green
2017-11-10 20:57:47 -08:00
119dc859d3 addons: Update nginx-ingress to 0.9.0-beta.17
* https://github.com/kubernetes/ingress-nginx/releases/tag/nginx-0.9.0-beta.17
2017-11-10 20:16:40 -08:00
5f6b0728c5 Update bootkube and terraform-render-bootkube to v0.8.2 2017-11-10 20:01:37 -08:00
d774c51297 Update Kubernetes from v1.8.2 to v1.8.3 2017-11-08 23:34:19 -08:00
f6a8fb363e Remove deprecated kubelet --require-kubeconfig flag
* https://github.com/kubernetes/kubernetes/pull/40050
2017-11-08 23:34:19 -08:00
f570af9418 addons: Update from Prometheus v1.8.2 to v2.0.0 2017-11-08 22:48:23 -08:00
4ec6732b98 Output the Google network name and self_link
* Allow users to add custom firewall rules for unique cases
2017-11-08 00:19:49 -08:00
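A sketch of the intended use, reusing the README's module name; the `network_name` output and the rule itself are illustrative, not the module's documented interface:

```hcl
# Attach a custom firewall rule to the cluster's network
resource "google_compute_firewall" "allow-custom" {
  name    = "allow-custom"
  network = "${module.google-cloud-yavin.network_name}"

  allow {
    protocol = "tcp"
    ports    = ["8080"]
  }

  source_ranges = ["10.0.0.0/8"]
}
```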
ea1efb536a Remove old firewall rule for bootstrap self-hosted etcd 2017-11-08 00:15:20 -08:00
451fd86470 Improve internal firewall rules on Google Cloud
* Whitelist internal traffic between controllers and workers
* Switch to tag-based firewall policies rather than source IP
2017-11-08 00:15:06 -08:00
b1b611b22c Add docs to use one controller on Google Cloud 2017-11-07 19:51:03 -08:00
eabf00fbf1 Add missing controller dependency before bootkube start
* Require the controller module to be completed before starting
to remote exec bootkube start, otherwise it's possible the controller
nodes were created, but not the network load balancer
2017-11-07 19:12:05 -08:00
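A sketch of the ordering technique: a module cannot be listed in `depends_on` in this era of Terraform, so interpolating one of its outputs (a hypothetical `controllers_dns` here) forces the controller module to complete first:

```hcl
resource "null_resource" "bootkube-start" {
  # Referencing the module output creates the implicit dependency
  connection {
    type = "ssh"
    host = "${module.controllers.controllers_dns}"
    user = "core"
  }

  provisioner "remote-exec" {
    inline = ["sudo systemctl start bootkube"]
  }
}
```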
8eaa72c1ca addons: Update nginx-ingress to 0.9.0-beta.16
* Image registry changed from gcr.io to quay.io
* https://github.com/kubernetes/ingress-nginx/releases/tag/nginx-0.9.0-beta.16
2017-11-06 23:15:15 -08:00
58cf82da56 Promote AWS platform from alpha to beta 2017-11-06 21:38:24 -08:00
ccc832f468 Add firewall rule to allow apiserver to proxy other controller kubelets
* Prometheus proxies through the apiserver to scrape kubelets
* In multi-controller setups, an apiserver must be able to scrape
kubelets (10250) on other controllers
2017-11-06 01:03:53 -08:00
90f8d62204 Add firewall rules to allow prometheus to reach node-exporter
* node_exporter service endpoints run on hostNetwork port 9100
* Re-evaluate after https://github.com/kubernetes-incubator/bootkube/pull/711
2017-11-06 01:03:53 -08:00
af5c413abf Focus controller ELB on load balancing apiservers
* ELB distributing load across controllers is no longer the mechanism
used to SSH to instances to distribute secrets
* Focus the ELB on load balancing across apiservers and change the HTTP
health check to an SSL:443 check
2017-11-06 01:03:53 -08:00
168c487484 Remove mention of self-hosted etcd, it's deprecated 2017-11-06 01:03:53 -08:00
805dd772a8 Run etcd cluster on-host, across controllers on AWS
* Change controllers ASG to heterogeneous EC2 instances
* Create DNS records for each controller's private IP for etcd
* Change etcd to run on-host, across controllers (etcd-member.service)
* Reduce time to bootstrap a cluster
* Deprecate self-hosted-etcd on the AWS platform
2017-11-06 01:03:53 -08:00
c6ec6596d8 Minor cleanup for zones, docs, and outputs
* Spread across all zones, regardless of UP/DOWN state
* Remove unused outputs of private IPs
2017-11-06 00:56:26 -08:00
47a9989927 Fix null_resource ordering constraints
* Ensure etcd TLS assets and kubeconfig are copied before
any attempt is made to run bootkube start
2017-11-06 00:55:44 -08:00
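A sketch of the constraint, with hypothetical resource and variable names; `depends_on` guarantees the copy completes before `bootkube start` is attempted:

```hcl
resource "null_resource" "copy-secrets" {
  connection {
    type = "ssh"
    host = "${var.controller_dns}"
    user = "core"
  }

  provisioner "file" {
    source      = "${var.asset_dir}/tls"
    destination = "$HOME/tls"
  }
}

resource "null_resource" "bootkube-start" {
  depends_on = ["null_resource.copy-secrets"]

  connection {
    type = "ssh"
    host = "${var.controller_dns}"
    user = "core"
  }

  provisioner "remote-exec" {
    inline = ["sudo systemctl start bootkube"]
  }
}
```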
10b977d54a addons: Set kube-state-metrics to have clusterIP None
* kube-state-metrics service exists to facilitate prometheus discovery
2017-11-05 17:54:09 -08:00
b7a268fc45 addons: Add prometheus alertmanager flag
* Pass -alertmanager.url to work with a user's in-cluster
alertmanager deployment, if any
2017-11-05 15:50:46 -08:00
279f36effd addons: Add grafana 4.6.1 and extend prometheus docs 2017-11-05 15:23:56 -08:00
77fc14db71 Workaround target pool issue by listing instances as zone/name
* Instances can be listed by zone/name or self_link URL, but the
provider desires they be in zone/name form, which causes a diff
* https://github.com/terraform-providers/terraform-provider-google/issues/46
2017-11-05 14:07:05 -08:00
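A sketch of the workaround's shape (pool and instance names illustrative): members are listed in the `zone/name` form the provider stores internally, so plans show no diff:

```hcl
resource "google_compute_target_pool" "controllers" {
  name = "yavin-controller-pool"

  # "zone/name" form matches the provider's internal representation
  instances = [
    "us-central1-a/yavin-controller-0",
    "us-central1-b/yavin-controller-1",
  ]
}
```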
2b0296d671 Create controller instances across zones in the region
* Change controller instances to automatically span zones in a region
* Remove the `zone` required variable
2017-11-05 13:24:32 -08:00
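A sketch of the zone spreading, assuming the `google_compute_zones` data source; instance fields are trimmed to illustrative values:

```hcl
data "google_compute_zones" "all" {
  region = "${var.region}"
}

resource "google_compute_instance" "controllers" {
  count = "${var.controller_count}"
  name  = "controller-${count.index}"

  # Round-robin controllers across the region's zones
  zone         = "${element(data.google_compute_zones.all.names, count.index)}"
  machine_type = "n1-standard-1"

  boot_disk {
    initialize_params {
      image = "${var.os_image}"
    }
  }

  network_interface {
    network = "default"
  }
}
```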
7b38271212 Run etcd cluster on-host, across controllers on Google Cloud
* Change controllers from a managed group to individual instances
* Create discrete DNS records to each controller's private IP for etcd
* Change etcd to run on-host, across controllers (etcd-member.service)
* Reduce time to bootstrap a cluster
* Deprecate self-hosted-etcd on the Google Cloud platform
2017-11-05 11:03:35 -08:00
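A sketch of the discrete etcd records (variable and resource names are assumptions):

```hcl
# One A record per controller, resolving to its private IP
resource "google_dns_record_set" "etcds" {
  count = "${var.controller_count}"

  managed_zone = "${var.dns_zone_name}"
  name         = "${format("%s-etcd%d.%s.", var.cluster_name, count.index, var.dns_zone)}"
  type         = "A"
  ttl          = 300

  rrdatas = ["${element(google_compute_instance.controllers.*.network_interface.0.address, count.index)}"]
}
```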
ae07a21e3d addons: Omit static resource requests/limits for kube-state-metrics
* Allow the addon-resizer to dynamically set resource values
* https://github.com/kubernetes/kube-state-metrics/pull/285
2017-11-04 14:41:04 -07:00
0ab1ae3210 addons: Fix typo in kube-state-metrics strategy 2017-11-04 14:39:56 -07:00
67e3d2b86e docs: GCE network bandwidth is excellent, even between zones
* Remove performance note that the GCE vs AWS network performance
is not an equal comparison. On both platforms, workers now span the
(availability) zones of a region.
* Testing host-to-host and pod-to-pod network bandwidth between nodes
(now located in different zones) showed no reduction in bandwidth
2017-11-04 14:08:20 -07:00
a48dd9ebd8 Require google provider version ~> 1.1
* Require google provider plugin 1.1 or higher which includes fix:
https://github.com/terraform-providers/terraform-provider-google/issues/574
* Remove workaround which statically set the persistent disk name
* Original reasons for workaround in a97df839 or GH #34
2017-11-04 12:59:19 -07:00
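A sketch of the constraint in a provider block:

```hcl
# Require the google provider plugin at 1.1 or newer (minor updates allowed)
provider "google" {
  version = "~> 1.1"
  region  = "us-central1"
}
```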
26a291aef4 Remove controller_preemptible option on Google Cloud
* Controller preemption is not safe or covered in documentation. Delete
the option; the variable is a holdover from old experiments
* Note, worker_preemptible is still a great feature that's supported
2017-11-04 12:59:19 -07:00
251a14519f Fix typo in internal template variable name
* ssh_authorized_keys should be ssh_authorized_key to match the user-facing
variable, which only allows a single SSH authorized key
2017-11-04 12:59:19 -07:00
6300383b43 Change worker managed instance group to span zones in region
* Change Google Cloud module to require the `region` variable
* Workers are created in random zones within the given region
* Tolerate Google Cloud zone failures or capacity issues
* If workers are preempted (if enabled), replacement instances can
be drawn from any zone in the region, which should avoid scheduling
issues that were possible before if a single zone aggressively
preempts instances (presumably due to Google Cloud capacity)
2017-11-04 12:59:19 -07:00
e32885c9cd addons: Update prometheus from v1.8.0 to v1.8.2
* https://github.com/prometheus/prometheus/releases/tag/v1.8.2
2017-11-04 11:00:39 -07:00
fe8afdbee9 Update Typhoon logo and favicon 2017-11-04 01:20:17 -07:00
878f5a3647 Bump bootkube and terraform-render-bootkube to v0.8.1
* Use the v0.8.1 tagged terraform-render-bootkube module
* Use the v0.8.1 quay.io/coreos/bootkube image to bootstrap
2017-10-28 12:50:37 -07:00
34ec7e9862 Relax pessimistic constraints on 1.0+ providers
* A constraint of ~> 1.0 means users can use 1.0.1, 1.1, but not 2.0 (see the sketch below)
* https://www.terraform.io/docs/configuration/terraform.html
2017-10-25 23:27:28 -07:00
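To illustrate the semantics (the provider names and the pre-1.0 pin are examples, not the repo's actual constraints):

```hcl
# "~> 1.0" permits 1.0.1 and 1.1, but excludes 2.0
provider "null" {
  version = "~> 1.0"
}

# Pre-1.0 plugins stay pinned to an exact release
provider "matchbox" {
  version = "0.1.2" # hypothetical pin
}
```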
f6c6e85f84 Require minimum Terraform and plugin versions
* Bump minimum Terraform version to v0.10.4
* Allow minor version updates for 1.0+ plugins
* Fix versions for plugins which are pre-1.0
2017-10-25 23:00:31 -07:00
8582e19077 Expand Nginx Ingress liveness and readiness probes
* Remove dnsPolicy: ClusterFirst
* https://github.com/kubernetes/ingress-nginx/pull/1584
2017-10-25 22:29:20 -07:00
3727c40c6c Update Nginx Ingress defaultbackend from 1.0 to 1.4
* https://github.com/kubernetes/ingress-nginx/pull/1568
2017-10-25 22:16:23 -07:00
b608f9c615 addons: Use service endpoints to scrape node-exporter 2017-10-24 22:59:00 -07:00
ec1dbb853c addons: Include kube-state-metrics exporter manifests 2017-10-24 22:59:00 -07:00
d046d45769 addons: Include Prometheus and node-exporter manifests 2017-10-24 22:58:59 -07:00
a73f57fe4e Update CLUO from v0.4.0 to v0.4.1 2017-10-24 22:14:03 -07:00
60bc8957c9 Update Kubernetes from v1.8.1 to v1.8.2
* Kubernetes v1.8.2 fixes a memory leak in the v1.8.1 apiserver
* Switch to using the `gcr.io/google_containers/hyperkube` image for the
on-host kubelet and shutdown drains
* Update terraform-render-bootkube manifests generation
  * Update flannel from v0.8.0 to v0.9.0
  * Add `hairpinMode` to flannel CNI config
  * Add `--no-negcache` to kube-dns dnsmasq
2017-10-24 21:44:26 -07:00
8b78c65483 Update Google Cloud Kubernetes from v1.7.7 to v1.8.1 2017-10-20 16:09:11 -07:00
f86c00288f Add missing update-agent RBAC role to get pods
* Drain now gets pods, deletes pods, and waits for deletion
2017-10-20 01:21:46 -07:00
a57b3cf973 Update CLUO addon to v0.4.0 and RBAC ClusterRole 2017-10-20 00:40:17 -07:00
10c5487ad7 Add docs corrections for versions and log output 2017-10-20 00:39:17 -07:00
e4c479554c Update AWS, DO, BM Kubernetes from v1.7.7 to v1.8.1
* Update from bootkube v0.7.0 to v0.8.0
* Leave Google Cloud update to a followup commit
2017-10-19 21:10:04 -07:00
be113e77b4 Fix links and add Calico BGP peering notes 2017-10-17 19:10:18 -07:00
911c53e4ae Add Ubiquiti EdgeRouter documentation 2017-10-17 18:51:40 -07:00
bfa8dfc75d Conditionally set networkd content on bare-metal
* Without this change, if a cluster doesn't set the controller
or worker networkd lists, an error "element() may not be used
with an empty list" occurs.
* controller_networkds and worker_networkds are intended to be
optional and temporary, not required at all (see the sketch below)
2017-10-17 18:47:12 -07:00
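A sketch of the guard, with hypothetical template wiring. Both branches of a conditional are evaluated in this Terraform version, so the possibly-empty list is padded with `concat` before `element` is called:

```hcl
data "template_file" "controller-configs" {
  count    = "${var.controller_count}"
  template = "${file("${path.module}/cl/controller.yaml.tmpl")}"

  vars {
    # Empty list -> empty string; otherwise the per-node networkd config
    networkd_content = "${length(var.controller_networkds) == 0 ? "" : element(concat(var.controller_networkds, list("")), count.index)}"
  }
}
```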
43dc44623f Fix the terraform fmt of configs 2017-10-16 01:32:25 -07:00
734bc1d32a Add performance benchmark for flannel with bonded NICs 2017-10-16 01:12:13 -07:00
41e632280f Remove unused storage section ala PXE-only Matchbox templating 2017-10-16 00:42:20 -07:00
fc22f04dd6 Add temporary variables for multi-nic testing
* Accept ordered lists of controller and worker networkd configs
* Do not rely on these variables. They will be replaced with a
cleaner mechanism at a future date
2017-10-16 00:39:58 -07:00
377e14c80b Fix ingress addon docs recursive apply command 2017-10-16 00:29:04 -07:00
9ec8ec4afc Secure copy etcd TLS credentials to controllers only
* Controllers receive etcd TLS credentials
* Controllers and workers receive a kubeconfig
2017-10-14 20:48:02 -07:00
5c1ed37ff5 Add SSH key to user "debug" during disk-install phase
* Avoid adding an SSH authorized key for user "core" during the disk
install, so that terraform apply cannot SSH until post-install
2017-10-14 20:37:42 -07:00
e765fb310d Allow setting custom PXE boot kernel_args on bare-metal 2017-10-14 19:39:10 -07:00
7b5ffd0085 Add Container Linux reboot-coordinator RBAC
* Add a reboot-coordinator namespace for CLUO components
* Define an RBAC ClusterRole for update-operator and update-agent
* Replace the older style where CLUO ran in kube-system with
admin privilege
2017-10-14 19:35:06 -07:00
123439c2a4 Remove or compress docs image assets 2017-10-14 19:12:22 -07:00
11453bac91 Update heapster addon from v1.4.0 to v1.4.3
* Use normal name and phase labels
2017-10-14 19:07:37 -07:00
dd0c61d1d9 Update Nginx Ingress controller addon to 0.9.0-beta.15 2017-10-14 18:30:58 -07:00
5c87529011 Demote Google Cloud from stable to beta
* See #34 postmortem and action items for context on
when stable status will be restored
2017-10-11 19:32:04 -07:00
a97df839ea google-cloud: Set disk.device_name to match API default
* Terraform provider "google" plugin releases leave the disk
device_name as "" by default. Recently the API has started to
set a default name "persistent-disk-0". Plan and apply show
all instance groups need to be recreated to "fix" the name
* Impact: Controller and worker instance groups are deleted
and recreated, deleting data on controllers and bringing
down clusters
* Fix: Explicitly set the device_name to persistent-disk-0 so
that terraform finds no diff needs to be applied.
* https://github.com/poseidon/typhoon/issues/34
* https://github.com/terraform-providers/terraform-provider-google/issues/574
2017-10-11 18:04:39 -07:00
a5290dac32 Update docs to show Digital Ocean with on-host etcd 2017-10-09 23:47:32 -07:00
308c7dfb6e digital-ocean: Run etcd cluster on-host, across controllers
* Run etcd peers with TLS across controller nodes
* Deprecate self-hosted-etcd on the Digital Ocean platform
* Distribute etcd TLS certificates as part of initial provisioning
* Check the status of etcd by running `systemctl status etcd-member`
2017-10-09 22:43:23 -07:00
da63c89d71 Remove mention of ct plugin in bare-metal docs 2017-10-08 23:37:41 -07:00
62d7ccfff3 Add docs on provision time and network performance 2017-10-04 00:05:43 -07:00
105 changed files with 2642 additions and 532 deletions


@@ -2,6 +2,70 @@
Notable changes between versions.
## Latest
## v1.8.4
* Kubernetes v1.8.4
* Calico related bug fixes
* Update Calico from v2.6.1 to v2.6.3
* Update flannel from v0.9.0 to v0.9.1
* Service accounts for kube-proxy and pod-checkpointer
* Use kubernetes-incubator/bootkube v0.9.0
## v1.8.3
* Kubernetes v1.8.3
* Run etcd on-host, across controllers
* Promote AWS platform to beta
* Use kubernetes-incubator/bootkube v0.8.2
#### Google Cloud
* Add required variable `region` (e.g. "us-central1")
* Reduce time to bootstrap a cluster
* Change etcd to run on-host, across controllers (etcd-member.service)
* Change controller instances to automatically span zones in the region
* Change worker managed instance group to automatically span zones in the region
* Improve internal firewall rules and use tag-based firewall policies
* Remove support for self-hosted etcd
* Remove the `zone` required variable
* Remove the `controller_preemptible` optional variable
#### AWS
* Promote AWS platform to beta
* Reduce time to bootstrap a cluster
* Change etcd to run on-host, across controllers (etcd-member.service)
* Fix firewall rules for multi-controller kubelet scraping and node-exporter
* Remove support for self-hosted etcd
#### Addons
* Add Prometheus 2.0 addon with alerting rules
* Add Grafana dashboard for observing metrics
## v1.8.2
* Kubernetes v1.8.2
* Fixes a memory leak in the v1.8.1 apiserver ([kubernetes#53485](https://github.com/kubernetes/kubernetes/issues/53485))
* Switch to using the `gcr.io/google_containers/hyperkube` image
* Update flannel from v0.8.0 to v0.9.0
* Add `hairpinMode` to flannel CNI config
* Add `--no-negcache` to kube-dns dnsmasq
* Use kubernetes-incubator/bootkube v0.8.1
## v1.8.1
* Kubernetes v1.8.1
* Use kubernetes-incubator/bootkube v0.8.0
#### Digital Ocean
* Run etcd cluster across controller nodes (etcd-member.service)
* Remove support for self-hosted etcd
* Reduce time to bootstrap a cluster
## v1.7.7
* Kubernetes v1.7.7


@@ -1,4 +1,4 @@
# Typhoon [![IRC](https://img.shields.io/badge/freenode-%23typhoon-0099ef.svg)]() <img align="right" src="https://storage.googleapis.com/dghubble/spin.png">
# Typhoon [![IRC](https://img.shields.io/badge/freenode-%23typhoon-0099ef.svg)]() <img align="right" src="https://storage.googleapis.com/poseidon/typhoon-logo.png">
Typhoon is a minimal and free Kubernetes distribution.
@@ -9,9 +9,9 @@ Typhoon is a minimal and free Kubernetes distribution.
Typhoon distributes upstream Kubernetes, architectural conventions, and cluster addons, much like a GNU/Linux distribution provides the Linux kernel and userspace components.
## Features
## Features <a href="https://www.cncf.io/certification/software-conformance/"><img align="right" src="https://storage.googleapis.com/poseidon/certified-kubernetes.png"></a>
* Kubernetes v1.7.7 (upstream, via [kubernetes-incubator/bootkube](https://github.com/kubernetes-incubator/bootkube))
* Kubernetes v1.8.4 (upstream, via [kubernetes-incubator/bootkube](https://github.com/kubernetes-incubator/bootkube))
* Single or multi-master, workloads isolated on workers, [Calico](https://www.projectcalico.org/) or [flannel](https://github.com/coreos/flannel) networking
* On-cluster etcd with TLS, [RBAC](https://kubernetes.io/docs/admin/authorization/rbac/)-enabled, [network policy](https://kubernetes.io/docs/concepts/services-networking/network-policies/)
* Ready for Ingress, Dashboards, Metrics, and other optional [addons](https://typhoon.psdn.io/addons/overview/)
@@ -22,10 +22,10 @@ Typhoon provides a Terraform Module for each supported operating system and plat
| Platform | Operating System | Terraform Module | Status |
|---------------|------------------|------------------|--------|
| AWS | Container Linux | [aws/container-linux/kubernetes](aws/container-linux/kubernetes) | alpha |
| Bare-Metal | Container Linux | [bare-metal/container-linux/kubernetes](bare-metal/container-linux/kubernetes) | production |
| AWS | Container Linux | [aws/container-linux/kubernetes](aws/container-linux/kubernetes) | beta |
| Bare-Metal | Container Linux | [bare-metal/container-linux/kubernetes](bare-metal/container-linux/kubernetes) | stable |
| Digital Ocean | Container Linux | [digital-ocean/container-linux/kubernetes](digital-ocean/container-linux/kubernetes) | beta |
| Google Cloud | Container Linux | [google-cloud/container-linux/kubernetes](google-cloud/container-linux/kubernetes) | production |
| Google Cloud | Container Linux | [google-cloud/container-linux/kubernetes](google-cloud/container-linux/kubernetes) | beta |
## Usage
@@ -46,7 +46,7 @@ module "google-cloud-yavin" {
source = "git::https://github.com/poseidon/typhoon//google-cloud/container-linux/kubernetes"
# Google Cloud
zone = "us-central1-c"
region = "us-central1"
dns_zone = "example.com"
dns_zone_name = "example-zone"
os_image = "coreos-stable-1465-6-0-v20170817"
@@ -72,15 +72,15 @@ $ terraform apply
Apply complete! Resources: 37 added, 0 changed, 0 destroyed.
```
In 5-10 minutes (varies by platform), the cluster will be ready. This Google Cloud example creates a `yavin.example.com` DNS record to resolve to a network load balancer across controller nodes.
In 4-8 minutes (varies by platform), the cluster will be ready. This Google Cloud example creates a `yavin.example.com` DNS record to resolve to a network load balancer across controller nodes.
```sh
$ export KUBECONFIG=/home/user/.secrets/clusters/yavin/auth/kubeconfig
$ kubectl get nodes
NAME STATUS AGE VERSION
yavin-controller-1682.c.example-com.internal Ready 6m v1.7.7+coreos.0
yavin-worker-jrbf.c.example-com.internal Ready 5m v1.7.7+coreos.0
yavin-worker-mzdm.c.example-com.internal Ready 5m v1.7.7+coreos.0
yavin-controller-0.c.example-com.internal Ready 6m v1.8.4
yavin-worker-jrbf.c.example-com.internal Ready 5m v1.8.4
yavin-worker-mzdm.c.example-com.internal Ready 5m v1.8.4
```
List the pods.
@@ -91,13 +91,10 @@ NAMESPACE NAME READY STATUS RESTART
kube-system calico-node-1cs8z 2/2 Running 0 6m
kube-system calico-node-d1l5b 2/2 Running 0 6m
kube-system calico-node-sp9ps 2/2 Running 0 6m
kube-system etcd-operator-3329263108-f443m 1/1 Running 1 6m
kube-system kube-apiserver-zppls 1/1 Running 0 6m
kube-system kube-controller-manager-3271970485-gh9kt 1/1 Running 0 6m
kube-system kube-controller-manager-3271970485-h90v8 1/1 Running 1 6m
kube-system kube-dns-1187388186-zj5dl 3/3 Running 0 6m
kube-system kube-etcd-0000 1/1 Running 0 5m
kube-system kube-etcd-network-checkpointer-crznb 1/1 Running 0 6m
kube-system kube-proxy-117v6 1/1 Running 0 6m
kube-system kube-proxy-9886n 1/1 Running 0 6m
kube-system kube-proxy-njn47 1/1 Running 0 6m


@@ -0,0 +1,12 @@
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
name: reboot-coordinator
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: reboot-coordinator
subjects:
- kind: ServiceAccount
namespace: reboot-coordinator
name: default


@@ -0,0 +1,45 @@
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
name: reboot-coordinator
rules:
- apiGroups:
- ""
resources:
- nodes
verbs:
- get
- list
- watch
- update
- apiGroups:
- ""
resources:
- configmaps
verbs:
- create
- get
- update
- list
- watch
- apiGroups:
- ""
resources:
- events
verbs:
- create
- watch
- apiGroups:
- ""
resources:
- pods
verbs:
- get
- list
- delete
- apiGroups:
- "extensions"
resources:
- daemonsets
verbs:
- get


@@ -0,0 +1,4 @@
apiVersion: v1
kind: Namespace
metadata:
name: reboot-coordinator


@@ -2,7 +2,7 @@ apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
name: container-linux-update-agent
namespace: kube-system
namespace: reboot-coordinator
spec:
updateStrategy:
type: RollingUpdate
@@ -15,7 +15,7 @@ spec:
spec:
containers:
- name: update-agent
image: quay.io/coreos/container-linux-update-operator:v0.3.1
image: quay.io/coreos/container-linux-update-operator:v0.4.1
command:
- "/bin/update-agent"
volumeMounts:


@@ -2,7 +2,7 @@ apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: container-linux-update-operator
namespace: kube-system
namespace: reboot-coordinator
spec:
replicas: 1
template:
@@ -12,12 +12,15 @@ spec:
spec:
containers:
- name: update-operator
image: quay.io/coreos/container-linux-update-operator:v0.3.1
image: quay.io/coreos/container-linux-update-operator:v0.4.1
command:
- "/bin/update-operator"
- "--analytics=false"
env:
- name: POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
tolerations:
- key: node-role.kubernetes.io/master
operator: Exists
effect: NoSchedule


@@ -0,0 +1,46 @@
apiVersion: apps/v1beta2
kind: Deployment
metadata:
name: grafana
namespace: monitoring
spec:
replicas: 1
strategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 1
selector:
matchLabels:
name: grafana
phase: prod
template:
metadata:
labels:
name: grafana
phase: prod
spec:
containers:
- name: grafana
image: grafana/grafana:4.6.2
env:
- name: GF_SERVER_HTTP_PORT
value: "8080"
- name: GF_AUTH_BASIC_ENABLED
value: "false"
- name: GF_AUTH_ANONYMOUS_ENABLED
value: "true"
- name: GF_AUTH_ANONYMOUS_ORG_ROLE
value: Admin
ports:
- name: http
containerPort: 8080
resources:
requests:
memory: 100Mi
cpu: 100m
limits:
memory: 200Mi
cpu: 200m
volumes:
- name: grafana-storage
emptyDir: {}


@@ -0,0 +1,15 @@
apiVersion: v1
kind: Service
metadata:
name: grafana
namespace: monitoring
spec:
type: ClusterIP
selector:
name: grafana
phase: prod
ports:
- name: http
protocol: TCP
port: 80
targetPort: 8080


@@ -3,27 +3,23 @@ kind: Deployment
metadata:
name: heapster
namespace: kube-system
labels:
k8s-app: heapster
kubernetes.io/cluster-service: "true"
version: v1.4.0
spec:
replicas: 1
selector:
matchLabels:
k8s-app: heapster
version: v1.4.0
name: heapster
phase: prod
template:
metadata:
labels:
k8s-app: heapster
version: v1.4.0
name: heapster
phase: prod
annotations:
scheduler.alpha.kubernetes.io/critical-pod: ''
spec:
containers:
- name: heapster
image: gcr.io/google_containers/heapster-amd64:v1.4.0
image: gcr.io/google_containers/heapster-amd64:v1.4.3
command:
- /heapster
- --source=kubernetes.summary_api:''
@@ -42,7 +38,7 @@ spec:
- --extra-cpu=0.5m
- --memory=140Mi
- --extra-memory=4Mi
- --deployment=heapster-v1.4.0
- --deployment=heapster
- --container=heapster
- --poll-period=300000
env:


@@ -3,13 +3,10 @@ kind: Service
metadata:
name: heapster
namespace: kube-system
labels:
kubernetes.io/cluster-service: "true"
kubernetes.io/name: "Heapster"
spec:
type: ClusterIP
selector:
k8s-app: heapster
name: heapster
ports:
- port: 80
targetPort: 8082


@@ -16,7 +16,7 @@ spec:
# Any image is permissible as long as:
# 1. It serves a 404 page at /
# 2. It serves 200 on a /healthz endpoint
image: gcr.io/google_containers/defaultbackend:1.0
image: gcr.io/google_containers/defaultbackend:1.4
ports:
- containerPort: 8080
resources:


@@ -16,9 +16,10 @@ spec:
spec:
nodeSelector:
node-role.kubernetes.io/node: ""
hostNetwork: true
containers:
- name: nginx-ingress-controller
image: gcr.io/google_containers/nginx-ingress-controller:0.9.0-beta.11
image: quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.9.0-beta.19
args:
- /nginx-ingress-controller
- --default-backend-service=$(POD_NAMESPACE)/default-backend
@@ -43,19 +44,24 @@ spec:
- name: health
containerPort: 10254
hostPort: 10254
readinessProbe:
httpGet:
path: /healthz
port: 10254
scheme: HTTP
livenessProbe:
initialDelaySeconds: 11
timeoutSeconds: 1
failureThreshold: 3
httpGet:
path: /healthz
port: 10254
scheme: HTTP
hostNetwork: true
dnsPolicy: ClusterFirst
initialDelaySeconds: 10
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
readinessProbe:
failureThreshold: 3
httpGet:
path: /healthz
port: 10254
scheme: HTTP
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
restartPolicy: Always
terminationGracePeriodSeconds: 60


@@ -16,9 +16,10 @@ spec:
spec:
nodeSelector:
node-role.kubernetes.io/node: ""
hostNetwork: true
containers:
- name: nginx-ingress-controller
image: gcr.io/google_containers/nginx-ingress-controller:0.9.0-beta.11
image: quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.9.0-beta.19
args:
- /nginx-ingress-controller
- --default-backend-service=$(POD_NAMESPACE)/default-backend
@@ -43,19 +44,24 @@ spec:
- name: health
containerPort: 10254
hostPort: 10254
readinessProbe:
httpGet:
path: /healthz
port: 10254
scheme: HTTP
livenessProbe:
initialDelaySeconds: 10
timeoutSeconds: 1
failureThreshold: 3
httpGet:
path: /healthz
port: 10254
scheme: HTTP
hostNetwork: true
dnsPolicy: ClusterFirst
initialDelaySeconds: 10
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
readinessProbe:
failureThreshold: 3
httpGet:
path: /healthz
port: 10254
scheme: HTTP
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
restartPolicy: Always
terminationGracePeriodSeconds: 60


@@ -16,7 +16,7 @@ spec:
# Any image is permissible as long as:
# 1. It serves a 404 page at /
# 2. It serves 200 on a /healthz endpoint
image: gcr.io/google_containers/defaultbackend:1.0
image: gcr.io/google_containers/defaultbackend:1.4
ports:
- containerPort: 8080
resources:


@@ -16,7 +16,7 @@ spec:
# Any image is permissible as long as:
# 1. It serves a 404 page at /
# 2. It serves 200 on a /healthz endpoint
image: gcr.io/google_containers/defaultbackend:1.0
image: gcr.io/google_containers/defaultbackend:1.4
ports:
- containerPort: 8080
resources:


@@ -16,9 +16,10 @@ spec:
spec:
nodeSelector:
node-role.kubernetes.io/node: ""
hostNetwork: true
containers:
- name: nginx-ingress-controller
image: gcr.io/google_containers/nginx-ingress-controller:0.9.0-beta.11
image: quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.9.0-beta.17
args:
- /nginx-ingress-controller
- --default-backend-service=$(POD_NAMESPACE)/default-backend
@@ -43,19 +44,24 @@ spec:
- name: health
containerPort: 10254
hostPort: 10254
readinessProbe:
httpGet:
path: /healthz
port: 10254
scheme: HTTP
livenessProbe:
initialDelaySeconds: 10
timeoutSeconds: 1
failureThreshold: 3
httpGet:
path: /healthz
port: 10254
scheme: HTTP
hostNetwork: true
dnsPolicy: ClusterFirst
initialDelaySeconds: 10
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
readinessProbe:
failureThreshold: 3
httpGet:
path: /healthz
port: 10254
scheme: HTTP
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
restartPolicy: Always
terminationGracePeriodSeconds: 60


@@ -0,0 +1,226 @@
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-config
namespace: monitoring
data:
prometheus.yaml: |-
# Global config
global:
scrape_interval: 15s
# AlertManager
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
# Scrape configs for running Prometheus on a Kubernetes cluster.
# This uses separate scrape configs for cluster components (i.e. API server, node)
# and services to allow each to use different authentication configs.
#
# Kubernetes labels will be added as Prometheus labels on metrics via the
# `labelmap` relabeling action.
scrape_configs:
# Scrape config for API servers.
#
# Kubernetes exposes API servers as endpoints to the default/kubernetes
# service so this uses `endpoints` role and uses relabelling to only keep
# the endpoints associated with the default/kubernetes service using the
# default named port `https`. This works for single API server deployments as
# well as HA API server deployments.
- job_name: 'kubernetes-apiservers'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
# Using endpoints to discover kube-apiserver targets finds the pod IP
# (host IP since the apiserver uses host network) which is not used in
# the server certificate.
insecure_skip_verify: true
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
# Keep only the default/kubernetes service endpoints for the https port. This
# will add targets for each API server which Kubernetes adds an endpoint to
# the default/kubernetes service.
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: default;kubernetes;https
# Scrape config for node (i.e. kubelet) /metrics (e.g. 'kubelet_'). Explore
# metrics from a node by scraping kubelet (127.0.0.1:10255/metrics).
#
# Rather than connecting directly to the node, the scrape is proxied through the
# Kubernetes apiserver. This means it will work if Prometheus is running out of
# cluster, or can't connect to nodes for some other reason (e.g. because of
# firewalling).
- job_name: 'kubernetes-nodes'
kubernetes_sd_configs:
- role: node
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- target_label: __address__
replacement: kubernetes.default.svc:443
- source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __metrics_path__
replacement: /api/v1/nodes/${1}/proxy/metrics
# Scrape config for Kubelet cAdvisor. Explore metrics from a node by
# scraping kubelet (127.0.0.1:10255/metrics/cadvisor).
#
# This is required for Kubernetes 1.7.3 and later, where cAdvisor metrics
# (those whose names begin with 'container_') have been removed from the
# Kubelet metrics endpoint. This job scrapes the cAdvisor endpoint to
# retrieve those metrics.
#
# Rather than connecting directly to the node, the scrape is proxied through the
# Kubernetes apiserver. This means it will work if Prometheus is running out of
# cluster, or can't connect to nodes for some other reason (e.g. because of
# firewalling).
- job_name: 'kubernetes-cadvisor'
kubernetes_sd_configs:
- role: node
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- target_label: __address__
replacement: kubernetes.default.svc:443
- source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __metrics_path__
replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
# Scrape config for service endpoints.
#
# The relabeling allows the actual service scrape endpoint to be configured
# via the following annotations:
#
# * `prometheus.io/scrape`: Only scrape services that have a value of `true`
# * `prometheus.io/scheme`: If the metrics endpoint is secured then you will need
# to set this to `https` & most likely set the `tls_config` of the scrape config.
# * `prometheus.io/path`: If the metrics path is not `/metrics` override this.
# * `prometheus.io/port`: If the metrics are exposed on a different port to the
# service then set this appropriately.
- job_name: 'kubernetes-service-endpoints'
kubernetes_sd_configs:
- role: endpoints
relabel_configs:
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
action: replace
target_label: __scheme__
regex: (https?)
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
action: replace
target_label: __address__
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
- action: labelmap
regex: __meta_kubernetes_service_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_service_name]
action: replace
target_label: kubernetes_name
# Example scrape config for probing services via the Blackbox Exporter.
#
# The relabeling allows the actual service scrape endpoint to be configured
# via the following annotations:
#
# * `prometheus.io/probe`: Only probe services that have a value of `true`
- job_name: 'kubernetes-services'
metrics_path: /probe
params:
module: [http_2xx]
kubernetes_sd_configs:
- role: service
relabel_configs:
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
action: keep
regex: true
- source_labels: [__address__]
target_label: __param_target
- target_label: __address__
replacement: blackbox
- source_labels: [__param_target]
target_label: instance
- action: labelmap
regex: __meta_kubernetes_service_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_service_name]
target_label: kubernetes_name
# Example scrape config for pods
#
# The relabeling allows the actual pod scrape endpoint to be configured via the
# following annotations:
#
# * `prometheus.io/scrape`: Only scrape pods that have a value of `true`
# * `prometheus.io/path`: If the metrics path is not `/metrics` override this.
# * `prometheus.io/port`: Scrape the pod on the indicated port instead of the
# pod's declared ports (default is a port-free target if none are declared).
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
# Rule files
rule_files:
- "/etc/prometheus/rules/*.rules"
- "/etc/prometheus/rules/*.yaml"
- "/etc/prometheus/rules/*.yml"


@@ -0,0 +1,43 @@
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: prometheus
namespace: monitoring
spec:
replicas: 1
strategy:
rollingUpdate:
maxUnavailable: 1
template:
metadata:
labels:
name: prometheus
phase: prod
spec:
containers:
- name: prometheus
image: quay.io/prometheus/prometheus:v2.0.0
args:
- '--config.file=/etc/prometheus/prometheus.yaml'
ports:
- name: web
containerPort: 9090
volumeMounts:
- name: config
mountPath: /etc/prometheus
- name: rules
mountPath: /etc/prometheus/rules
- name: data
mountPath: /var/lib/prometheus
dnsPolicy: ClusterFirst
restartPolicy: Always
terminationGracePeriodSeconds: 30
volumes:
- name: config
configMap:
name: prometheus-config
- name: rules
configMap:
name: prometheus-rules
- name: data
emptyDir: {}


@@ -0,0 +1,18 @@
apiVersion: v1
kind: Service
metadata:
name: kube-controller-manager
namespace: kube-system
annotations:
prometheus.io/scrape: 'true'
spec:
type: ClusterIP
# service is created to allow prometheus to scrape endpoints
clusterIP: None
selector:
k8s-app: kube-controller-manager
ports:
- name: metrics
protocol: TCP
port: 10252
targetPort: 10252


@@ -0,0 +1,18 @@
apiVersion: v1
kind: Service
metadata:
name: kube-scheduler
namespace: kube-system
annotations:
prometheus.io/scrape: 'true'
spec:
type: ClusterIP
# service is created to allow prometheus to scrape endpoints
clusterIP: None
selector:
k8s-app: kube-scheduler
ports:
- name: metrics
protocol: TCP
port: 10251
targetPort: 10251


@@ -0,0 +1,12 @@
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: kube-state-metrics
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: kube-state-metrics
subjects:
- kind: ServiceAccount
name: kube-state-metrics
namespace: monitoring


@@ -0,0 +1,31 @@
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: kube-state-metrics
rules:
- apiGroups: [""]
resources:
- nodes
- pods
- services
- resourcequotas
- replicationcontrollers
- limitranges
- persistentvolumeclaims
- namespaces
verbs: ["list", "watch"]
- apiGroups: ["extensions"]
resources:
- daemonsets
- deployments
- replicasets
verbs: ["list", "watch"]
- apiGroups: ["apps"]
resources:
- statefulsets
verbs: ["list", "watch"]
- apiGroups: ["batch"]
resources:
- cronjobs
- jobs
verbs: ["list", "watch"]


@@ -0,0 +1,61 @@
apiVersion: apps/v1beta2
kind: Deployment
metadata:
name: kube-state-metrics
namespace: monitoring
spec:
replicas: 1
strategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 1
selector:
matchLabels:
name: kube-state-metrics
phase: prod
template:
metadata:
labels:
name: kube-state-metrics
phase: prod
spec:
serviceAccountName: kube-state-metrics
containers:
- name: kube-state-metrics
image: quay.io/coreos/kube-state-metrics:v1.1.0
ports:
- name: metrics
containerPort: 8080
readinessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 5
timeoutSeconds: 5
- name: addon-resizer
image: gcr.io/google_containers/addon-resizer:1.0
resources:
limits:
cpu: 100m
memory: 30Mi
requests:
cpu: 100m
memory: 30Mi
env:
- name: MY_POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: MY_POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
command:
- /pod_nanny
- --container=kube-state-metrics
- --cpu=100m
- --extra-cpu=1m
- --memory=100Mi
- --extra-memory=2Mi
- --threshold=5
- --deployment=kube-state-metrics


@@ -0,0 +1,13 @@
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: kube-state-metrics
namespace: monitoring
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: kube-state-metrics-resizer
subjects:
- kind: ServiceAccount
name: kube-state-metrics
namespace: monitoring


@@ -0,0 +1,15 @@
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: kube-state-metrics-resizer
namespace: monitoring
rules:
- apiGroups: [""]
resources:
- pods
verbs: ["get"]
- apiGroups: ["extensions"]
resources:
- deployments
resourceNames: ["kube-state-metrics"]
verbs: ["get", "update"]


@@ -0,0 +1,5 @@
apiVersion: v1
kind: ServiceAccount
metadata:
name: kube-state-metrics
namespace: monitoring


@@ -0,0 +1,19 @@
apiVersion: v1
kind: Service
metadata:
name: kube-state-metrics
namespace: monitoring
annotations:
prometheus.io/scrape: 'true'
spec:
type: ClusterIP
# service is created to allow prometheus to scrape endpoints
clusterIP: None
selector:
name: kube-state-metrics
phase: prod
ports:
- name: metrics
protocol: TCP
port: 80
targetPort: 8080


@@ -0,0 +1,57 @@
apiVersion: apps/v1beta2
kind: DaemonSet
metadata:
name: node-exporter
namespace: monitoring
spec:
updateStrategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 1
selector:
matchLabels:
name: node-exporter
phase: prod
template:
metadata:
labels:
name: node-exporter
phase: prod
spec:
hostNetwork: true
hostPID: true
containers:
- name: node-exporter
image: quay.io/prometheus/node-exporter:v0.15.0
args:
- "--path.procfs=/host/proc"
- "--path.sysfs=/host/sys"
ports:
- name: metrics
containerPort: 9100
hostPort: 9100
resources:
requests:
memory: 30Mi
cpu: 100m
limits:
memory: 50Mi
cpu: 200m
volumeMounts:
- name: proc
mountPath: /host/proc
readOnly: true
- name: sys
mountPath: /host/sys
readOnly: true
tolerations:
- key: node-role.kubernetes.io/master
operator: Exists
effect: NoSchedule
volumes:
- name: proc
hostPath:
path: /proc
- name: sys
hostPath:
path: /sys


@@ -0,0 +1,19 @@
apiVersion: v1
kind: Service
metadata:
name: node-exporter
namespace: monitoring
annotations:
prometheus.io/scrape: 'true'
spec:
type: ClusterIP
# service is created to allow prometheus to scrape endpoints
clusterIP: None
selector:
name: node-exporter
phase: prod
ports:
- name: metrics
protocol: TCP
port: 80
targetPort: 9100


@@ -0,0 +1,4 @@
apiVersion: v1
kind: Namespace
metadata:
name: monitoring


@@ -0,0 +1,12 @@
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
name: prometheus
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: prometheus
subjects:
- kind: ServiceAccount
name: default
namespace: monitoring


@@ -0,0 +1,15 @@
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
name: prometheus
rules:
- apiGroups: [""]
resources:
- nodes
- nodes/proxy
- services
- endpoints
- pods
verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
verbs: ["get"]


@@ -0,0 +1,546 @@
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-rules
namespace: monitoring
data:
# Rules adapted from those provided by coreos/prometheus-operator and SoundCloud
alertmanager.rules.yaml: |+
groups:
- name: alertmanager.rules
rules:
- alert: AlertmanagerConfigInconsistent
expr: count_values("config_hash", alertmanager_config_hash) BY (service) / ON(service)
GROUP_LEFT() label_replace(prometheus_operator_alertmanager_spec_replicas, "service",
"alertmanager-$1", "alertmanager", "(.*)") != 1
for: 5m
labels:
severity: critical
annotations:
description: The configuration of the instances of the Alertmanager cluster
`{{$labels.service}}` is out of sync.
- alert: AlertmanagerDownOrMissing
expr: label_replace(prometheus_operator_alertmanager_spec_replicas, "job", "alertmanager-$1",
"alertmanager", "(.*)") / ON(job) GROUP_RIGHT() sum(up) BY (job) != 1
for: 5m
labels:
severity: warning
annotations:
description: An unexpected number of Alertmanagers are scraped or Alertmanagers
disappeared from discovery.
- alert: AlertmanagerFailedReload
expr: alertmanager_config_last_reload_successful == 0
for: 10m
labels:
severity: warning
annotations:
description: Reloading Alertmanager's configuration has failed for {{ $labels.namespace
}}/{{ $labels.pod}}.
etcd3.rules.yaml: |+
groups:
- name: ./etcd3.rules
rules:
- alert: InsufficientMembers
expr: count(up{job="etcd"} == 0) > (count(up{job="etcd"}) / 2 - 1)
for: 3m
labels:
severity: critical
annotations:
description: If one more etcd member goes down the cluster will be unavailable
summary: etcd cluster insufficient members
- alert: NoLeader
expr: etcd_server_has_leader{job="etcd"} == 0
for: 1m
labels:
severity: critical
annotations:
description: etcd member {{ $labels.instance }} has no leader
summary: etcd member has no leader
- alert: HighNumberOfLeaderChanges
expr: increase(etcd_server_leader_changes_seen_total{job="etcd"}[1h]) > 3
labels:
severity: warning
annotations:
description: etcd instance {{ $labels.instance }} has seen {{ $value }} leader
changes within the last hour
summary: a high number of leader changes within the etcd cluster are happening
- alert: HighNumberOfFailedGRPCRequests
expr: sum(rate(etcd_grpc_requests_failed_total{job="etcd"}[5m])) BY (grpc_method)
/ sum(rate(etcd_grpc_total{job="etcd"}[5m])) BY (grpc_method) > 0.01
for: 10m
labels:
severity: warning
annotations:
description: '{{ $value }}% of requests for {{ $labels.grpc_method }} failed
on etcd instance {{ $labels.instance }}'
summary: a high number of gRPC requests are failing
- alert: HighNumberOfFailedGRPCRequests
expr: sum(rate(etcd_grpc_requests_failed_total{job="etcd"}[5m])) BY (grpc_method)
/ sum(rate(etcd_grpc_total{job="etcd"}[5m])) BY (grpc_method) > 0.05
for: 5m
labels:
severity: critical
annotations:
description: '{{ $value }}% of requests for {{ $labels.grpc_method }} failed
on etcd instance {{ $labels.instance }}'
summary: a high number of gRPC requests are failing
- alert: GRPCRequestsSlow
expr: histogram_quantile(0.99, rate(etcd_grpc_unary_requests_duration_seconds_bucket[5m]))
> 0.15
for: 10m
labels:
severity: critical
annotations:
description: on etcd instance {{ $labels.instance }} gRPC requests to {{ $labels.grpc_method
}} are slow
summary: slow gRPC requests
- alert: HighNumberOfFailedHTTPRequests
expr: sum(rate(etcd_http_failed_total{job="etcd"}[5m])) BY (method) / sum(rate(etcd_http_received_total{job="etcd"}[5m]))
BY (method) > 0.01
for: 10m
labels:
severity: warning
annotations:
description: '{{ $value }}% of requests for {{ $labels.method }} failed on etcd
instance {{ $labels.instance }}'
summary: a high number of HTTP requests are failing
- alert: HighNumberOfFailedHTTPRequests
expr: sum(rate(etcd_http_failed_total{job="etcd"}[5m])) BY (method) / sum(rate(etcd_http_received_total{job="etcd"}[5m]))
BY (method) > 0.05
for: 5m
labels:
severity: critical
annotations:
description: '{{ $value }}% of requests for {{ $labels.method }} failed on etcd
instance {{ $labels.instance }}'
summary: a high number of HTTP requests are failing
- alert: HTTPRequestsSlow
expr: histogram_quantile(0.99, rate(etcd_http_successful_duration_seconds_bucket[5m]))
> 0.15
for: 10m
labels:
severity: warning
annotations:
description: on etcd instance {{ $labels.instance }} HTTP requests to {{ $labels.method
}} are slow
summary: slow HTTP requests
- alert: EtcdMemberCommunicationSlow
expr: histogram_quantile(0.99, rate(etcd_network_member_round_trip_time_seconds_bucket[5m]))
> 0.15
for: 10m
labels:
severity: warning
annotations:
description: etcd instance {{ $labels.instance }} member communication with
{{ $labels.To }} is slow
summary: etcd member communication is slow
- alert: HighNumberOfFailedProposals
expr: increase(etcd_server_proposals_failed_total{job="etcd"}[1h]) > 5
labels:
severity: warning
annotations:
description: etcd instance {{ $labels.instance }} has seen {{ $value }} proposal
failures within the last hour
summary: a high number of proposals within the etcd cluster are failing
- alert: HighFsyncDurations
expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))
> 0.5
for: 10m
labels:
severity: warning
annotations:
description: etcd instance {{ $labels.instance }} fsync durations are high
summary: high fsync durations
- alert: HighCommitDurations
expr: histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m]))
> 0.25
for: 10m
labels:
severity: warning
annotations:
description: etcd instance {{ $labels.instance }} commit durations are high
summary: high commit durations
general.rules.yaml: |+
groups:
- name: general.rules
rules:
- alert: TargetDown
expr: 100 * (count(up == 0) BY (job) / count(up) BY (job)) > 10
for: 10m
labels:
severity: warning
annotations:
description: '{{ $value }}% of {{ $labels.job }} targets are down.'
summary: Targets are down
- record: fd_utilization
expr: process_open_fds / process_max_fds
- alert: FdExhaustionClose
expr: predict_linear(fd_utilization[1h], 3600 * 4) > 1
for: 10m
labels:
severity: warning
annotations:
description: '{{ $labels.job }}: {{ $labels.namespace }}/{{ $labels.pod }} instance
will exhaust its file/socket descriptors within the next 4 hours'
summary: file descriptors soon exhausted
- alert: FdExhaustionClose
expr: predict_linear(fd_utilization[10m], 3600) > 1
for: 10m
labels:
severity: critical
annotations:
description: '{{ $labels.job }}: {{ $labels.namespace }}/{{ $labels.pod }} instance
will exhaust its file/socket descriptors within the next hour'
summary: file descriptors soon exhausted
kube-controller-manager.rules.yaml: |+
groups:
- name: kube-controller-manager.rules
rules:
- alert: K8SControllerManagerDown
expr: absent(up{kubernetes_name="kube-controller-manager"} == 1)
for: 5m
labels:
severity: critical
annotations:
description: There is no running K8S controller manager. Deployments and replication
controllers are not making progress.
summary: Controller manager is down
kube-scheduler.rules.yaml: |+
groups:
- name: kube-scheduler.rules
rules:
- record: cluster:scheduler_e2e_scheduling_latency_seconds:quantile
expr: histogram_quantile(0.99, sum(scheduler_e2e_scheduling_latency_microseconds_bucket)
BY (le, cluster)) / 1e+06
labels:
quantile: "0.99"
- record: cluster:scheduler_e2e_scheduling_latency_seconds:quantile
expr: histogram_quantile(0.9, sum(scheduler_e2e_scheduling_latency_microseconds_bucket)
BY (le, cluster)) / 1e+06
labels:
quantile: "0.9"
- record: cluster:scheduler_e2e_scheduling_latency_seconds:quantile
expr: histogram_quantile(0.5, sum(scheduler_e2e_scheduling_latency_microseconds_bucket)
BY (le, cluster)) / 1e+06
labels:
quantile: "0.5"
- record: cluster:scheduler_scheduling_algorithm_latency_seconds:quantile
expr: histogram_quantile(0.99, sum(scheduler_scheduling_algorithm_latency_microseconds_bucket)
BY (le, cluster)) / 1e+06
labels:
quantile: "0.99"
- record: cluster:scheduler_scheduling_algorithm_latency_seconds:quantile
expr: histogram_quantile(0.9, sum(scheduler_scheduling_algorithm_latency_microseconds_bucket)
BY (le, cluster)) / 1e+06
labels:
quantile: "0.9"
- record: cluster:scheduler_scheduling_algorithm_latency_seconds:quantile
expr: histogram_quantile(0.5, sum(scheduler_scheduling_algorithm_latency_microseconds_bucket)
BY (le, cluster)) / 1e+06
labels:
quantile: "0.5"
- record: cluster:scheduler_binding_latency_seconds:quantile
expr: histogram_quantile(0.99, sum(scheduler_binding_latency_microseconds_bucket)
BY (le, cluster)) / 1e+06
labels:
quantile: "0.99"
- record: cluster:scheduler_binding_latency_seconds:quantile
expr: histogram_quantile(0.9, sum(scheduler_binding_latency_microseconds_bucket)
BY (le, cluster)) / 1e+06
labels:
quantile: "0.9"
- record: cluster:scheduler_binding_latency_seconds:quantile
expr: histogram_quantile(0.5, sum(scheduler_binding_latency_microseconds_bucket)
BY (le, cluster)) / 1e+06
labels:
quantile: "0.5"
- alert: K8SSchedulerDown
expr: absent(up{kubernetes_name="kube-scheduler"} == 1)
for: 5m
labels:
severity: critical
annotations:
description: There is no running K8S scheduler. New pods are not being assigned
to nodes.
summary: Scheduler is down
kube-state-metrics.rules.yaml: |+
groups:
- name: kube-state-metrics.rules
rules:
- alert: DeploymentGenerationMismatch
expr: kube_deployment_status_observed_generation != kube_deployment_metadata_generation
for: 15m
labels:
severity: warning
annotations:
description: Observed deployment generation does not match expected one for
deployment {{$labels.namespaces}}/{{$labels.deployment}}
- alert: DeploymentReplicasNotUpdated
expr: ((kube_deployment_status_replicas_updated != kube_deployment_spec_replicas)
or (kube_deployment_status_replicas_available != kube_deployment_spec_replicas))
unless (kube_deployment_spec_paused == 1)
for: 15m
labels:
severity: warning
annotations:
description: Replicas are not updated and available for deployment {{$labels.namespaces}}/{{$labels.deployment}}
- alert: DaemonSetRolloutStuck
expr: kube_daemonset_status_current_number_ready / kube_daemonset_status_desired_number_scheduled
* 100 < 100
for: 15m
labels:
severity: warning
annotations:
description: Only {{$value}}% of desired pods scheduled and ready for daemon
set {{$labels.namespaces}}/{{$labels.daemonset}}
- alert: K8SDaemonSetsNotScheduled
expr: kube_daemonset_status_desired_number_scheduled - kube_daemonset_status_current_number_scheduled
> 0
for: 10m
labels:
severity: warning
annotations:
description: A number of daemonsets are not scheduled.
summary: Daemonsets are not scheduled correctly
- alert: DaemonSetsMissScheduled
expr: kube_daemonset_status_number_misscheduled > 0
for: 10m
labels:
severity: warning
annotations:
description: A number of daemonsets are running where they are not supposed
to run.
summary: Daemonsets are not scheduled correctly
- alert: PodFrequentlyRestarting
expr: increase(kube_pod_container_status_restarts[1h]) > 5
for: 10m
labels:
severity: warning
annotations:
description: Pod {{$labels.namespaces}}/{{$labels.pod}} was restarted {{$value}}
times within the last hour
kubelet.rules.yaml: |+
groups:
- name: kubelet.rules
rules:
- alert: K8SNodeNotReady
expr: kube_node_status_condition{condition="Ready",status="true"} == 0
for: 1h
labels:
severity: warning
annotations:
description: The Kubelet on {{ $labels.node }} has not checked in with the API,
or has set itself to NotReady, for more than an hour
summary: Node status is NotReady
- alert: K8SManyNodesNotReady
expr: count(kube_node_status_condition{condition="Ready",status="true"} == 0)
> 1 and (count(kube_node_status_condition{condition="Ready",status="true"} ==
0) / count(kube_node_status_condition{condition="Ready",status="true"})) > 0.2
for: 1m
labels:
severity: critical
annotations:
description: '{{ $value }}% of Kubernetes nodes are not ready'
- alert: K8SKubeletDown
expr: count(up{job="kubernetes-nodes"} == 0) / count(up{job="kubernetes-nodes"}) * 100 > 3
for: 1h
labels:
severity: warning
annotations:
description: Prometheus failed to scrape {{ $value }}% of kubelets.
- alert: K8SKubeletDown
expr: (absent(up{job="kubernetes-nodes"} == 1) or count(up{job="kubernetes-nodes"} == 0) / count(up{job="kubernetes-nodes"}))
* 100 > 1
for: 1h
labels:
severity: critical
annotations:
description: Prometheus failed to scrape {{ $value }}% of kubelets, or all Kubelets
have disappeared from service discovery.
summary: Many Kubelets cannot be scraped
- alert: K8SKubeletTooManyPods
expr: kubelet_running_pod_count > 100
for: 10m
labels:
severity: warning
annotations:
description: Kubelet {{$labels.instance}} is running {{$value}} pods, close
to the limit of 110
summary: Kubelet is close to pod limit
kubernetes.rules.yaml: |+
groups:
- name: kubernetes.rules
rules:
- record: pod_name:container_memory_usage_bytes:sum
expr: sum(container_memory_usage_bytes{container_name!="POD",pod_name!=""}) BY
(pod_name)
- record: pod_name:container_spec_cpu_shares:sum
expr: sum(container_spec_cpu_shares{container_name!="POD",pod_name!=""}) BY (pod_name)
- record: pod_name:container_cpu_usage:sum
expr: sum(rate(container_cpu_usage_seconds_total{container_name!="POD",pod_name!=""}[5m]))
BY (pod_name)
- record: pod_name:container_fs_usage_bytes:sum
expr: sum(container_fs_usage_bytes{container_name!="POD",pod_name!=""}) BY (pod_name)
- record: namespace:container_memory_usage_bytes:sum
expr: sum(container_memory_usage_bytes{container_name!=""}) BY (namespace)
- record: namespace:container_spec_cpu_shares:sum
expr: sum(container_spec_cpu_shares{container_name!=""}) BY (namespace)
- record: namespace:container_cpu_usage:sum
expr: sum(rate(container_cpu_usage_seconds_total{container_name!="POD"}[5m]))
BY (namespace)
- record: cluster:memory_usage:ratio
expr: sum(container_memory_usage_bytes{container_name!="POD",pod_name!=""}) BY
(cluster) / sum(machine_memory_bytes) BY (cluster)
- record: cluster:container_spec_cpu_shares:ratio
expr: sum(container_spec_cpu_shares{container_name!="POD",pod_name!=""}) / 1000
/ sum(machine_cpu_cores)
- record: cluster:container_cpu_usage:ratio
expr: sum(rate(container_cpu_usage_seconds_total{container_name!="POD",pod_name!=""}[5m]))
/ sum(machine_cpu_cores)
- record: apiserver_latency_seconds:quantile
expr: histogram_quantile(0.99, rate(apiserver_request_latencies_bucket[5m])) /
1e+06
labels:
quantile: "0.99"
- record: apiserver_latency_seconds:quantile
expr: histogram_quantile(0.9, rate(apiserver_request_latencies_bucket[5m])) /
1e+06
labels:
quantile: "0.9"
- record: apiserver_latency_seconds:quantile
expr: histogram_quantile(0.5, rate(apiserver_request_latencies_bucket[5m])) /
1e+06
labels:
quantile: "0.5"
- alert: APIServerLatencyHigh
expr: apiserver_latency_seconds:quantile{quantile="0.99",subresource!="log",verb!~"^(?:WATCH|WATCHLIST|PROXY|CONNECT)$"}
> 1
for: 10m
labels:
severity: warning
annotations:
description: the API server has a 99th percentile latency of {{ $value }} seconds
for {{$labels.verb}} {{$labels.resource}}
- alert: APIServerLatencyHigh
expr: apiserver_latency_seconds:quantile{quantile="0.99",subresource!="log",verb!~"^(?:WATCH|WATCHLIST|PROXY|CONNECT)$"}
> 4
for: 10m
labels:
severity: critical
annotations:
description: the API server has a 99th percentile latency of {{ $value }} seconds
for {{$labels.verb}} {{$labels.resource}}
- alert: APIServerErrorsHigh
expr: rate(apiserver_request_count{code=~"^(?:5..)$"}[5m]) / rate(apiserver_request_count[5m])
* 100 > 2
for: 10m
labels:
severity: warning
annotations:
description: API server returns errors for {{ $value }}% of requests
- alert: APIServerErrorsHigh
expr: rate(apiserver_request_count{code=~"^(?:5..)$"}[5m]) / rate(apiserver_request_count[5m])
* 100 > 5
for: 10m
labels:
severity: critical
annotations:
description: API server returns errors for {{ $value }}% of requests
- alert: K8SApiserverDown
expr: absent(up{job="kubernetes-apiservers"} == 1)
for: 20m
labels:
severity: critical
annotations:
description: No API servers are reachable or all have disappeared from service
discovery
node.rules.yaml: |+
groups:
- name: node.rules
rules:
- record: instance:node_cpu:rate:sum
expr: sum(rate(node_cpu{mode!="idle",mode!="iowait",mode!~"^(?:guest.*)$"}[3m]))
BY (instance)
- record: instance:node_filesystem_usage:sum
expr: sum((node_filesystem_size{mountpoint="/"} - node_filesystem_free{mountpoint="/"}))
BY (instance)
- record: instance:node_network_receive_bytes:rate:sum
expr: sum(rate(node_network_receive_bytes[3m])) BY (instance)
- record: instance:node_network_transmit_bytes:rate:sum
expr: sum(rate(node_network_transmit_bytes[3m])) BY (instance)
- record: instance:node_cpu:ratio
expr: sum(rate(node_cpu{mode!="idle"}[5m])) WITHOUT (cpu, mode) / ON(instance)
GROUP_LEFT() count(sum(node_cpu) BY (instance, cpu)) BY (instance)
- record: cluster:node_cpu:sum_rate5m
expr: sum(rate(node_cpu{mode!="idle"}[5m]))
- record: cluster:node_cpu:ratio
expr: cluster:node_cpu:sum_rate5m / count(sum(node_cpu) BY (instance, cpu))
- alert: NodeExporterDown
expr: absent(up{kubernetes_name="node-exporter"} == 1)
for: 10m
labels:
severity: warning
annotations:
description: Prometheus could not scrape a node-exporter for more than 10m,
or node-exporters have disappeared from discovery
- alert: NodeDiskRunningFull
expr: predict_linear(node_filesystem_free[6h], 3600 * 24) < 0
for: 30m
labels:
severity: warning
annotations:
description: device {{$labels.device}} on node {{$labels.instance}} is running
full within the next 24 hours (mounted at {{$labels.mountpoint}})
- alert: NodeDiskRunningFull
expr: predict_linear(node_filesystem_free[30m], 3600 * 2) < 0
for: 10m
labels:
severity: critical
annotations:
description: device {{$labels.device}} on node {{$labels.instance}} is running
full within the next 2 hours (mounted at {{$labels.mountpoint}})
prometheus.rules.yaml: |+
groups:
- name: prometheus.rules
rules:
- alert: PrometheusConfigReloadFailed
expr: prometheus_config_last_reload_successful == 0
for: 10m
labels:
severity: warning
annotations:
description: Reloading Prometheus' configuration has failed for {{$labels.namespace}}/{{$labels.pod}}
- alert: PrometheusNotificationQueueRunningFull
expr: predict_linear(prometheus_notifications_queue_length[5m], 60 * 30) > prometheus_notifications_queue_capacity
for: 10m
labels:
severity: warning
annotations:
description: Prometheus' alert notification queue is running full for {{$labels.namespace}}/{{
$labels.pod}}
- alert: PrometheusErrorSendingAlerts
expr: rate(prometheus_notifications_errors_total[5m]) / rate(prometheus_notifications_sent_total[5m])
> 0.01
for: 10m
labels:
severity: warning
annotations:
description: Errors while sending alerts from Prometheus {{$labels.namespace}}/{{
$labels.pod}} to Alertmanager {{$labels.Alertmanager}}
- alert: PrometheusErrorSendingAlerts
expr: rate(prometheus_notifications_errors_total[5m]) / rate(prometheus_notifications_sent_total[5m])
> 0.03
for: 10m
labels:
severity: critical
annotations:
description: Errors while sending alerts from Prometheus {{$labels.namespace}}/{{
$labels.pod}} to Alertmanager {{$labels.Alertmanager}}
- alert: PrometheusNotConnectedToAlertmanagers
expr: prometheus_notifications_alertmanagers_discovered < 1
for: 10m
labels:
severity: warning
annotations:
description: Prometheus {{ $labels.namespace }}/{{ $labels.pod}} is not connected
to any Alertmanagers

View File

@ -0,0 +1,15 @@
apiVersion: v1
kind: Service
metadata:
name: prometheus
namespace: monitoring
spec:
type: ClusterIP
selector:
name: prometheus
phase: prod
ports:
- name: web
protocol: TCP
port: 80
targetPort: 9090

View File

@ -1,4 +1,4 @@
# Typhoon
# Typhoon <img align="right" src="https://storage.googleapis.com/poseidon/typhoon-logo.png">
Typhoon is a minimal and free Kubernetes distribution.
@ -9,9 +9,9 @@ Typhoon is a minimal and free Kubernetes distribution.
Typhoon distributes upstream Kubernetes, architectural conventions, and cluster addons, much like a GNU/Linux distribution provides the Linux kernel and userspace components.
## Features
## Features <a href="https://www.cncf.io/certification/software-conformance/"><img align="right" src="https://storage.googleapis.com/poseidon/certified-kubernetes.png"></a>
* Kubernetes v1.7.7 (upstream, via [kubernetes-incubator/bootkube](https://github.com/kubernetes-incubator/bootkube))
* Kubernetes v1.8.4 (upstream, via [kubernetes-incubator/bootkube](https://github.com/kubernetes-incubator/bootkube))
* Single or multi-master, workloads isolated on workers, [Calico](https://www.projectcalico.org/) or [flannel](https://github.com/coreos/flannel) networking
* On-cluster etcd with TLS, [RBAC](https://kubernetes.io/docs/admin/authorization/rbac/)-enabled, [network policy](https://kubernetes.io/docs/concepts/services-networking/network-policies/)
* Ready for Ingress, Dashboards, Metrics, and other optional [addons](https://typhoon.psdn.io/addons/overview/)

View File

@ -1,14 +1,13 @@
# Self-hosted Kubernetes assets (kubeconfig, manifests)
module "bootkube" {
source = "git::https://github.com/poseidon/terraform-render-bootkube.git?ref=v0.7.0"
source = "git::https://github.com/poseidon/terraform-render-bootkube.git?ref=v0.9.0"
cluster_name = "${var.cluster_name}"
api_servers = ["${format("%s.%s", var.cluster_name, var.dns_zone)}"]
etcd_servers = ["http://127.0.0.1:2379"]
asset_dir = "${var.asset_dir}"
networking = "${var.networking}"
network_mtu = "${var.network_mtu}"
pod_cidr = "${var.pod_cidr}"
service_cidr = "${var.service_cidr}"
experimental_self_hosted_etcd = "true"
cluster_name = "${var.cluster_name}"
api_servers = ["${format("%s.%s", var.cluster_name, var.dns_zone)}"]
etcd_servers = ["${aws_route53_record.etcds.*.fqdn}"]
asset_dir = "${var.asset_dir}"
networking = "${var.networking}"
network_mtu = "${var.network_mtu}"
pod_cidr = "${var.pod_cidr}"
service_cidr = "${var.service_cidr}"
}

View File

@ -1,6 +1,29 @@
---
systemd:
units:
- name: etcd-member.service
enable: true
dropins:
- name: 40-etcd-cluster.conf
contents: |
[Service]
Environment="ETCD_IMAGE_TAG=v3.2.0"
Environment="ETCD_NAME=${etcd_name}"
Environment="ETCD_ADVERTISE_CLIENT_URLS=https://${etcd_domain}:2379"
Environment="ETCD_INITIAL_ADVERTISE_PEER_URLS=https://${etcd_domain}:2380"
Environment="ETCD_LISTEN_CLIENT_URLS=https://0.0.0.0:2379"
Environment="ETCD_LISTEN_PEER_URLS=https://0.0.0.0:2380"
Environment="ETCD_INITIAL_CLUSTER=${etcd_initial_cluster}"
Environment="ETCD_STRICT_RECONFIG_CHECK=true"
Environment="ETCD_SSL_DIR=/etc/ssl/etcd"
Environment="ETCD_TRUSTED_CA_FILE=/etc/ssl/certs/etcd/server-ca.crt"
Environment="ETCD_CERT_FILE=/etc/ssl/certs/etcd/server.crt"
Environment="ETCD_KEY_FILE=/etc/ssl/certs/etcd/server.key"
Environment="ETCD_CLIENT_CERT_AUTH=true"
Environment="ETCD_PEER_TRUSTED_CA_FILE=/etc/ssl/certs/etcd/peer-ca.crt"
Environment="ETCD_PEER_CERT_FILE=/etc/ssl/certs/etcd/peer.crt"
Environment="ETCD_PEER_KEY_FILE=/etc/ssl/certs/etcd/peer.key"
Environment="ETCD_PEER_CLIENT_CERT_AUTH=true"
- name: docker.service
enable: true
- name: locksmithd.service
@ -34,7 +57,8 @@ systemd:
--volume opt-cni-bin,kind=host,source=/opt/cni/bin \
--mount volume=opt-cni-bin,target=/opt/cni/bin \
--volume var-log,kind=host,source=/var/log \
--mount volume=var-log,target=/var/log"
--mount volume=var-log,target=/var/log \
--insecure-options=image"
ExecStartPre=/bin/mkdir -p /opt/cni/bin
ExecStartPre=/bin/mkdir -p /etc/kubernetes/manifests
ExecStartPre=/bin/mkdir -p /etc/kubernetes/cni/net.d
@ -56,8 +80,7 @@ systemd:
--network-plugin=cni \
--node-labels=node-role.kubernetes.io/master \
--pod-manifest-path=/etc/kubernetes/manifests \
--register-with-taints=node-role.kubernetes.io/master=:NoSchedule \
--require-kubeconfig
--register-with-taints=node-role.kubernetes.io/master=:NoSchedule
ExecStop=-/usr/bin/rkt stop --uuid-file=/var/cache/kubelet-pod.uuid
Restart=always
RestartSec=10
@ -104,8 +127,8 @@ storage:
mode: 0644
contents:
inline: |
KUBELET_IMAGE_URL=quay.io/coreos/hyperkube
KUBELET_IMAGE_TAG=v1.7.7_coreos.0
KUBELET_IMAGE_URL=docker://gcr.io/google_containers/hyperkube
KUBELET_IMAGE_TAG=v1.8.4
- path: /etc/sysctl.d/max-user-watches.conf
filesystem: root
contents:
@ -128,7 +151,7 @@ storage:
[ -d /opt/bootkube/assets/experimental/manifests ] && mv /opt/bootkube/assets/experimental/manifests/* /opt/bootkube/assets/manifests && rm -r /opt/bootkube/assets/experimental/manifests
[ -d /opt/bootkube/assets/experimental/bootstrap-manifests ] && mv /opt/bootkube/assets/experimental/bootstrap-manifests/* /opt/bootkube/assets/bootstrap-manifests && rm -r /opt/bootkube/assets/experimental/bootstrap-manifests
BOOTKUBE_ACI="$${BOOTKUBE_ACI:-quay.io/coreos/bootkube}"
BOOTKUBE_VERSION="$${BOOTKUBE_VERSION:-v0.7.0}"
BOOTKUBE_VERSION="$${BOOTKUBE_VERSION:-v0.9.0}"
BOOTKUBE_ASSETS="$${BOOTKUBE_ASSETS:-/opt/bootkube/assets}"
exec /usr/bin/rkt run \
--trust-keys-from-https \

View File

@ -34,7 +34,8 @@ systemd:
--volume opt-cni-bin,kind=host,source=/opt/cni/bin \
--mount volume=opt-cni-bin,target=/opt/cni/bin \
--volume var-log,kind=host,source=/var/log \
--mount volume=var-log,target=/var/log"
--mount volume=var-log,target=/var/log \
--insecure-options=image"
ExecStartPre=/bin/mkdir -p /opt/cni/bin
ExecStartPre=/bin/mkdir -p /etc/kubernetes/manifests
ExecStartPre=/bin/mkdir -p /etc/kubernetes/cni/net.d
@ -55,8 +56,7 @@ systemd:
--lock-file=/var/run/lock/kubelet.lock \
--network-plugin=cni \
--node-labels=node-role.kubernetes.io/node \
--pod-manifest-path=/etc/kubernetes/manifests \
--require-kubeconfig
--pod-manifest-path=/etc/kubernetes/manifests
ExecStop=-/usr/bin/rkt stop --uuid-file=/var/cache/kubelet-pod.uuid
Restart=always
RestartSec=5
@ -102,8 +102,8 @@ storage:
mode: 0644
contents:
inline: |
KUBELET_IMAGE_URL=quay.io/coreos/hyperkube
KUBELET_IMAGE_TAG=v1.7.7_coreos.0
KUBELET_IMAGE_URL=docker://gcr.io/google_containers/hyperkube
KUBELET_IMAGE_TAG=v1.8.4
- path: /etc/sysctl.d/max-user-watches.conf
filesystem: root
contents:
@ -120,7 +120,8 @@ storage:
--trust-keys-from-https \
--volume config,kind=host,source=/etc/kubernetes \
--mount volume=config,target=/etc/kubernetes \
quay.io/coreos/hyperkube:v1.7.7_coreos.0 \
--insecure-options=image \
docker://gcr.io/google_containers/hyperkube:v1.8.4 \
--net=host \
--dns=host \
--exec=/kubectl -- --kubeconfig=/etc/kubernetes/kubeconfig delete node $(hostname)

View File

@ -1,39 +1,30 @@
# Controllers AutoScaling Group
resource "aws_autoscaling_group" "controllers" {
name = "${var.cluster_name}-controller"
load_balancers = ["${aws_elb.controllers.id}"]
# Discrete DNS records for each controller's private IPv4 for etcd usage
resource "aws_route53_record" "etcds" {
count = "${var.controller_count}"
# count
desired_capacity = "${var.controller_count}"
min_size = "${var.controller_count}"
max_size = "${var.controller_count}"
# DNS Zone where record should be created
zone_id = "${var.dns_zone_id}"
# network
vpc_zone_identifier = ["${aws_subnet.public.*.id}"]
name = "${format("%s-etcd%d.%s.", var.cluster_name, count.index, var.dns_zone)}"
type = "A"
ttl = 300
# template
launch_configuration = "${aws_launch_configuration.controller.name}"
lifecycle {
# override the default destroy and replace update behavior
create_before_destroy = true
ignore_changes = ["image_id"]
}
tags = [{
key = "Name"
value = "${var.cluster_name}-controller"
propagate_at_launch = true
}]
# private IPv4 address for etcd
records = ["${element(aws_instance.controllers.*.private_ip, count.index)}"]
}
# Controller template
resource "aws_launch_configuration" "controller" {
name_prefix = "${var.cluster_name}-controller-template-"
image_id = "${data.aws_ami.coreos.image_id}"
# Controller instances
resource "aws_instance" "controllers" {
count = "${var.controller_count}"
tags = {
Name = "${var.cluster_name}-controller-${count.index}"
}
instance_type = "${var.controller_type}"
user_data = "${data.ct_config.controller_ign.rendered}"
ami = "${data.aws_ami.coreos.image_id}"
user_data = "${element(data.ct_config.controller_ign.*.rendered, count.index)}"
# storage
root_block_device {
@ -43,21 +34,25 @@ resource "aws_launch_configuration" "controller" {
# network
associate_public_ip_address = true
security_groups = ["${aws_security_group.controller.id}"]
lifecycle {
// Override the default destroy and replace update behavior
create_before_destroy = true
}
subnet_id = "${element(aws_subnet.public.*.id, count.index)}"
vpc_security_group_ids = ["${aws_security_group.controller.id}"]
}
# Controller Container Linux Config
data "template_file" "controller_config" {
count = "${var.controller_count}"
template = "${file("${path.module}/cl/controller.yaml.tmpl")}"
vars = {
# Cannot use cyclic dependencies on controllers or their DNS records
etcd_name = "etcd${count.index}"
etcd_domain = "${var.cluster_name}-etcd${count.index}.${var.dns_zone}"
# etcd0=https://cluster-etcd0.example.com,etcd1=https://cluster-etcd1.example.com,...
etcd_initial_cluster = "${join(",", formatlist("%s=https://%s:2380", null_resource.repeat.*.triggers.name, null_resource.repeat.*.triggers.domain))}"
k8s_dns_service_ip = "${cidrhost(var.service_cidr, 10)}"
k8s_etcd_service_ip = "${cidrhost(var.service_cidr, 15)}"
ssh_authorized_key = "${var.ssh_authorized_key}"
kubeconfig_ca_cert = "${module.bootkube.ca_cert}"
kubeconfig_kubelet_cert = "${module.bootkube.kubelet_cert}"
@ -66,8 +61,20 @@ data "template_file" "controller_config" {
}
}
# Horrible hack to generate a Terraform list of a desired length without dependencies.
# Ideal ${repeat("etcd", 3) -> ["etcd", "etcd", "etcd"]}
resource null_resource "repeat" {
count = "${var.controller_count}"
triggers {
name = "etcd${count.index}"
domain = "${var.cluster_name}-etcd${count.index}.${var.dns_zone}"
}
}
data "ct_config" "controller_ign" {
content = "${data.template_file.controller_config.rendered}"
count = "${var.controller_count}"
content = "${element(data.template_file.controller_config.*.rendered, count.index)}"
pretty_print = false
}
@ -122,16 +129,6 @@ resource "aws_security_group_rule" "controller-etcd" {
self = true
}
resource "aws_security_group_rule" "controller-bootstrap-etcd" {
security_group_id = "${aws_security_group.controller.id}"
type = "ingress"
protocol = "tcp"
from_port = 12379
to_port = 12380
self = true
}
resource "aws_security_group_rule" "controller-flannel" {
security_group_id = "${aws_security_group.controller.id}"
@ -152,6 +149,26 @@ resource "aws_security_group_rule" "controller-flannel-self" {
self = true
}
resource "aws_security_group_rule" "controller-node-exporter" {
security_group_id = "${aws_security_group.controller.id}"
type = "ingress"
protocol = "tcp"
from_port = 9100
to_port = 9100
source_security_group_id = "${aws_security_group.worker.id}"
}
resource "aws_security_group_rule" "controller-kubelet-self" {
security_group_id = "${aws_security_group.controller.id}"
type = "ingress"
protocol = "tcp"
from_port = 10250
to_port = 10250
self = true
}
resource "aws_security_group_rule" "controller-kubelet-read" {
security_group_id = "${aws_security_group.controller.id}"

View File

@ -1,5 +1,5 @@
# Controller Network Load Balancer DNS Record
resource "aws_route53_record" "controllers" {
# kube-apiserver Network Load Balancer DNS Record
resource "aws_route53_record" "apiserver" {
zone_id = "${var.dns_zone_id}"
name = "${format("%s.%s.", var.cluster_name, var.dns_zone)}"
@ -7,25 +7,18 @@ resource "aws_route53_record" "controllers" {
# AWS recommends their special "alias" records for ELBs
alias {
name = "${aws_elb.controllers.dns_name}"
zone_id = "${aws_elb.controllers.zone_id}"
name = "${aws_elb.apiserver.dns_name}"
zone_id = "${aws_elb.apiserver.zone_id}"
evaluate_target_health = true
}
}
# Controller Network Load Balancer
resource "aws_elb" "controllers" {
name = "${var.cluster_name}-controllers"
resource "aws_elb" "apiserver" {
name = "${var.cluster_name}-apiserver"
subnets = ["${aws_subnet.public.*.id}"]
security_groups = ["${aws_security_group.controller.id}"]
listener {
lb_port = 22
lb_protocol = "tcp"
instance_port = 22
instance_protocol = "tcp"
}
listener {
lb_port = 443
lb_protocol = "tcp"
@ -33,16 +26,18 @@ resource "aws_elb" "controllers" {
instance_protocol = "tcp"
}
instances = ["${aws_instance.controllers.*.id}"]
# Kubelet HTTP health check
health_check {
target = "HTTP:10255/healthz"
target = "SSL:443"
healthy_threshold = 2
unhealthy_threshold = 4
timeout = 5
interval = 6
}
idle_timeout = 1800
idle_timeout = 3600
connection_draining = true
connection_draining_timeout = 300
}

View File

@ -1,5 +1,4 @@
output "ingress_dns_name" {
value = "${aws_elb.ingress.dns_name}"
value = "${aws_elb.ingress.dns_name}"
description = "DNS name of the ELB for distributing traffic to Ingress controllers"
}

View File

@ -0,0 +1,25 @@
# Terraform version and plugin versions
terraform {
required_version = ">= 0.10.4"
}
provider "aws" {
version = "~> 1.0"
}
provider "local" {
version = "~> 1.0"
}
provider "null" {
version = "~> 1.0"
}
provider "template" {
version = "~> 1.0"
}
provider "tls" {
version = "~> 1.0"
}

View File

@ -1,12 +1,79 @@
# Secure copy etcd TLS assets and kubeconfig to controllers. Activates kubelet.service
resource "null_resource" "copy-secrets" {
count = "${var.controller_count}"
connection {
type = "ssh"
host = "${element(aws_instance.controllers.*.public_ip, count.index)}"
user = "core"
timeout = "15m"
}
provisioner "file" {
content = "${module.bootkube.kubeconfig}"
destination = "$HOME/kubeconfig"
}
provisioner "file" {
content = "${module.bootkube.etcd_ca_cert}"
destination = "$HOME/etcd-client-ca.crt"
}
provisioner "file" {
content = "${module.bootkube.etcd_client_cert}"
destination = "$HOME/etcd-client.crt"
}
provisioner "file" {
content = "${module.bootkube.etcd_client_key}"
destination = "$HOME/etcd-client.key"
}
provisioner "file" {
content = "${module.bootkube.etcd_server_cert}"
destination = "$HOME/etcd-server.crt"
}
provisioner "file" {
content = "${module.bootkube.etcd_server_key}"
destination = "$HOME/etcd-server.key"
}
provisioner "file" {
content = "${module.bootkube.etcd_peer_cert}"
destination = "$HOME/etcd-peer.crt"
}
provisioner "file" {
content = "${module.bootkube.etcd_peer_key}"
destination = "$HOME/etcd-peer.key"
}
provisioner "remote-exec" {
inline = [
"sudo mkdir -p /etc/ssl/etcd/etcd",
"sudo mv etcd-client* /etc/ssl/etcd/",
"sudo cp /etc/ssl/etcd/etcd-client-ca.crt /etc/ssl/etcd/etcd/server-ca.crt",
"sudo mv etcd-server.crt /etc/ssl/etcd/etcd/server.crt",
"sudo mv etcd-server.key /etc/ssl/etcd/etcd/server.key",
"sudo cp /etc/ssl/etcd/etcd-client-ca.crt /etc/ssl/etcd/etcd/peer-ca.crt",
"sudo mv etcd-peer.crt /etc/ssl/etcd/etcd/peer.crt",
"sudo mv etcd-peer.key /etc/ssl/etcd/etcd/peer.key",
"sudo chown -R etcd:etcd /etc/ssl/etcd",
"sudo chmod -R 500 /etc/ssl/etcd",
"sudo mv /home/core/kubeconfig /etc/kubernetes/kubeconfig",
]
}
}
# Secure copy bootkube assets to ONE controller and start bootkube to perform
# one-time self-hosted cluster bootstrapping.
resource "null_resource" "bootkube-start" {
depends_on = ["module.bootkube", "aws_autoscaling_group.controllers"]
depends_on = ["module.bootkube", "null_resource.copy-secrets", "aws_route53_record.apiserver"]
# TODO: SSH to a controller's IP instead of waiting on DNS resolution
connection {
type = "ssh"
host = "${aws_route53_record.controllers.fqdn}"
host = "${aws_instance.controllers.0.public_ip}"
user = "core"
timeout = "15m"
}

View File

@ -88,7 +88,7 @@ variable "pod_cidr" {
variable "service_cidr" {
description = <<EOD
CIDR IPv4 range to assign Kubernetes services.
The 1st IP will be reserved for kube_apiserver, the 10th IP will be reserved for kube-dns, the 15th IP will be reserved for self-hosted etcd, and the 200th IP will be reserved for bootstrap self-hosted etcd.
The 1st IP will be reserved for kube_apiserver, the 10th IP will be reserved for kube-dns.
EOD
type = "string"

View File

@ -142,6 +142,16 @@ resource "aws_security_group_rule" "worker-flannel-self" {
self = true
}
resource "aws_security_group_rule" "worker-node-exporter" {
security_group_id = "${aws_security_group.worker.id}"
type = "ingress"
protocol = "tcp"
from_port = 9100
to_port = 9100
self = true
}
resource "aws_security_group_rule" "worker-kubelet" {
security_group_id = "${aws_security_group.worker.id}"

View File

@ -1,4 +1,4 @@
# Typhoon
# Typhoon <img align="right" src="https://storage.googleapis.com/poseidon/typhoon-logo.png">
Typhoon is a minimal and free Kubernetes distribution.
@ -9,9 +9,9 @@ Typhoon is a minimal and free Kubernetes distribution.
Typhoon distributes upstream Kubernetes, architectural conventions, and cluster addons, much like a GNU/Linux distribution provides the Linux kernel and userspace components.
## Features
## Features <a href="https://www.cncf.io/certification/software-conformance/"><img align="right" src="https://storage.googleapis.com/poseidon/certified-kubernetes.png"></a>
* Kubernetes v1.7.7 (upstream, via [kubernetes-incubator/bootkube](https://github.com/kubernetes-incubator/bootkube))
* Kubernetes v1.8.4 (upstream, via [kubernetes-incubator/bootkube](https://github.com/kubernetes-incubator/bootkube))
* Single or multi-master, workloads isolated on workers, [Calico](https://www.projectcalico.org/) or [flannel](https://github.com/coreos/flannel) networking
* On-cluster etcd with TLS, [RBAC](https://kubernetes.io/docs/admin/authorization/rbac/)-enabled, [network policy](https://kubernetes.io/docs/concepts/services-networking/network-policies/)
* Ready for Ingress, Dashboards, Metrics, and other optional [addons](https://typhoon.psdn.io/addons/overview/)

View File

@ -1,13 +1,13 @@
# Self-hosted Kubernetes assets (kubeconfig, manifests)
module "bootkube" {
source = "git::https://github.com/poseidon/terraform-render-bootkube.git?ref=v0.7.0"
source = "git::https://github.com/poseidon/terraform-render-bootkube.git?ref=v0.9.0"
cluster_name = "${var.cluster_name}"
api_servers = ["${var.k8s_domain_name}"]
etcd_servers = ["${var.controller_domains}"]
asset_dir = "${var.asset_dir}"
networking = "${var.networking}"
network_mtu = "${var.network_mtu}"
pod_cidr = "${var.pod_cidr}"
service_cidr = "${var.service_cidr}"
cluster_name = "${var.cluster_name}"
api_servers = ["${var.k8s_domain_name}"]
etcd_servers = ["${var.controller_domains}"]
asset_dir = "${var.asset_dir}"
networking = "${var.networking}"
network_mtu = "${var.network_mtu}"
pod_cidr = "${var.pod_cidr}"
service_cidr = "${var.service_cidr}"
}

View File

@ -32,6 +32,11 @@ storage:
systemctl reboot
passwd:
users:
- name: core
# Avoid using standard name "core" so terraform apply cannot SSH until post-install.
- name: debug
create:
groups:
- sudo
- docker
ssh_authorized_keys:
- {{.ssh_authorized_key}}

View File

@ -65,7 +65,8 @@ systemd:
--volume opt-cni-bin,kind=host,source=/opt/cni/bin \
--mount volume=opt-cni-bin,target=/opt/cni/bin \
--volume var-log,kind=host,source=/var/log \
--mount volume=var-log,target=/var/log"
--mount volume=var-log,target=/var/log \
--insecure-options=image"
ExecStartPre=/bin/mkdir -p /opt/cni/bin
ExecStartPre=/bin/mkdir -p /etc/kubernetes/manifests
ExecStartPre=/bin/mkdir -p /etc/kubernetes/cni/net.d
@ -88,8 +89,7 @@ systemd:
--network-plugin=cni \
--node-labels=node-role.kubernetes.io/master \
--pod-manifest-path=/etc/kubernetes/manifests \
--register-with-taints=node-role.kubernetes.io/master=:NoSchedule \
--require-kubeconfig
--register-with-taints=node-role.kubernetes.io/master=:NoSchedule
ExecStop=-/usr/bin/rkt stop --uuid-file=/var/cache/kubelet-pod.uuid
Restart=always
RestartSec=10
@ -107,30 +107,14 @@ systemd:
ExecStart=/opt/bootkube/bootkube-start
ExecStartPost=/bin/touch /opt/bootkube/init_bootkube.done
storage:
{{ if index . "pxe" }}
disks:
- device: /dev/sda
wipe_table: true
partitions:
- label: ROOT
filesystems:
- name: root
mount:
device: "/dev/sda1"
format: "ext4"
create:
force: true
options:
- "-LROOT"
{{end}}
files:
- path: /etc/kubernetes/kubelet.env
filesystem: root
mode: 0644
contents:
inline: |
KUBELET_IMAGE_URL=quay.io/coreos/hyperkube
KUBELET_IMAGE_TAG=v1.7.7_coreos.0
KUBELET_IMAGE_URL=docker://gcr.io/google_containers/hyperkube
KUBELET_IMAGE_TAG=v1.8.4
- path: /etc/hostname
filesystem: root
mode: 0644
@ -159,7 +143,7 @@ storage:
[ -d /opt/bootkube/assets/experimental/manifests ] && mv /opt/bootkube/assets/experimental/manifests/* /opt/bootkube/assets/manifests && rm -r /opt/bootkube/assets/experimental/manifests
[ -d /opt/bootkube/assets/experimental/bootstrap-manifests ] && mv /opt/bootkube/assets/experimental/bootstrap-manifests/* /opt/bootkube/assets/bootstrap-manifests && rm -r /opt/bootkube/assets/experimental/bootstrap-manifests
BOOTKUBE_ACI="$${BOOTKUBE_ACI:-quay.io/coreos/bootkube}"
BOOTKUBE_VERSION="$${BOOTKUBE_VERSION:-v0.7.0}"
BOOTKUBE_VERSION="$${BOOTKUBE_VERSION:-v0.9.0}"
BOOTKUBE_ASSETS="$${BOOTKUBE_ASSETS:-/opt/bootkube/assets}"
exec /usr/bin/rkt run \
--trust-keys-from-https \
@ -172,6 +156,8 @@ storage:
--net=host \
--dns=host \
--exec=/bootkube -- start --asset-dir=/assets "$@"
networkd:
${networkd_content}
passwd:
users:
- name: core

View File

@ -42,7 +42,8 @@ systemd:
--volume opt-cni-bin,kind=host,source=/opt/cni/bin \
--mount volume=opt-cni-bin,target=/opt/cni/bin \
--volume var-log,kind=host,source=/var/log \
--mount volume=var-log,target=/var/log"
--mount volume=var-log,target=/var/log \
--insecure-options=image"
ExecStartPre=/bin/mkdir -p /opt/cni/bin
ExecStartPre=/bin/mkdir -p /etc/kubernetes/manifests
ExecStartPre=/bin/mkdir -p /etc/kubernetes/cni/net.d
@ -64,8 +65,7 @@ systemd:
--lock-file=/var/run/lock/kubelet.lock \
--network-plugin=cni \
--node-labels=node-role.kubernetes.io/node \
--pod-manifest-path=/etc/kubernetes/manifests \
--require-kubeconfig
--pod-manifest-path=/etc/kubernetes/manifests
ExecStop=-/usr/bin/rkt stop --uuid-file=/var/cache/kubelet-pod.uuid
Restart=always
RestartSec=5
@ -73,30 +73,14 @@ systemd:
WantedBy=multi-user.target
storage:
{{ if index . "pxe" }}
disks:
- device: /dev/sda
wipe_table: true
partitions:
- label: ROOT
filesystems:
- name: root
mount:
device: "/dev/sda1"
format: "ext4"
create:
force: true
options:
- "-LROOT"
{{end}}
files:
- path: /etc/kubernetes/kubelet.env
filesystem: root
mode: 0644
contents:
inline: |
KUBELET_IMAGE_URL=quay.io/coreos/hyperkube
KUBELET_IMAGE_TAG=v1.7.7_coreos.0
KUBELET_IMAGE_URL=docker://gcr.io/google_containers/hyperkube
KUBELET_IMAGE_TAG=v1.8.4
- path: /etc/hostname
filesystem: root
mode: 0644
@ -108,6 +92,8 @@ storage:
contents:
inline: |
fs.inotify.max_user_watches=16184
networkd:
${networkd_content}
passwd:
users:
- name: core

View File

@ -8,10 +8,12 @@ resource "matchbox_profile" "container-linux-install" {
]
args = [
"initrd=coreos_production_pxe_image.cpio.gz",
"coreos.config.url=${var.matchbox_http_endpoint}/ignition?uuid=$${uuid}&mac=$${mac:hexhyp}",
"coreos.first_boot=yes",
"console=tty0",
"console=ttyS0",
"${var.kernel_args}",
]
container_linux_config = "${data.template_file.container-linux-install-config.rendered}"
@ -43,10 +45,12 @@ resource "matchbox_profile" "cached-container-linux-install" {
]
args = [
"initrd=coreos_production_pxe_image.cpio.gz",
"coreos.config.url=${var.matchbox_http_endpoint}/ignition?uuid=$${uuid}&mac=$${mac:hexhyp}",
"coreos.first_boot=yes",
"console=tty0",
"console=ttyS0",
"${var.kernel_args}",
]
container_linux_config = "${data.template_file.cached-container-linux-install-config.rendered}"
@ -85,6 +89,9 @@ data "template_file" "controller-configs" {
etcd_initial_cluster = "${join(",", formatlist("%s=https://%s:2380", var.controller_names, var.controller_domains))}"
k8s_dns_service_ip = "${module.bootkube.kube_dns_service_ip}"
ssh_authorized_key = "${var.ssh_authorized_key}"
# Terraform evaluates both sides regardless and element cannot be used on 0 length lists
networkd_content = "${length(var.controller_networkds) == 0 ? "" : element(concat(var.controller_networkds, list("")), count.index)}"
}
}
@ -104,5 +111,8 @@ data "template_file" "worker-configs" {
domain_name = "${element(var.worker_domains, count.index)}"
k8s_dns_service_ip = "${module.bootkube.kube_dns_service_ip}"
ssh_authorized_key = "${var.ssh_authorized_key}"
# Terraform evaluates both sides regardless and element cannot be used on 0 length lists
networkd_content = "${length(var.worker_networkds) == 0 ? "" : element(concat(var.worker_networkds, list("")), count.index)}"
}
}

View File

@ -0,0 +1,21 @@
# Terraform version and plugin versions
terraform {
required_version = ">= 0.10.4"
}
provider "local" {
version = "~> 1.0"
}
provider "null" {
version = "~> 1.0"
}
provider "template" {
version = "~> 1.0"
}
provider "tls" {
version = "~> 1.0"
}

View File

@ -1,10 +1,10 @@
# Secure copy etcd TLS assets and kubeconfig to all nodes. Activates kubelet.service
resource "null_resource" "copy-secrets" {
count = "${length(var.controller_names) + length(var.worker_names)}"
# Secure copy etcd TLS assets and kubeconfig to controllers. Activates kubelet.service
resource "null_resource" "copy-etcd-secrets" {
count = "${length(var.controller_names)}"
connection {
type = "ssh"
host = "${element(concat(var.controller_domains, var.worker_domains), count.index)}"
host = "${element(var.controller_domains, count.index)}"
user = "core"
timeout = "60m"
}
@ -66,19 +66,42 @@ resource "null_resource" "copy-secrets" {
}
}
# Secure copy kubeconfig to all workers. Activates kubelet.service
resource "null_resource" "copy-kubeconfig" {
count = "${length(var.worker_names)}"
connection {
type = "ssh"
host = "${element(var.worker_domains, count.index)}"
user = "core"
timeout = "60m"
}
provisioner "file" {
content = "${module.bootkube.kubeconfig}"
destination = "$HOME/kubeconfig"
}
provisioner "remote-exec" {
inline = [
"sudo mv /home/core/kubeconfig /etc/kubernetes/kubeconfig",
]
}
}
# Secure copy bootkube assets to ONE controller and start bootkube to perform
# one-time self-hosted cluster bootstrapping.
resource "null_resource" "bootkube-start" {
# Without depends_on, this remote-exec may start before the kubeconfig copy.
# Terraform only does one task at a time, so it would try to bootstrap
# while no Kubelets are running.
depends_on = ["null_resource.copy-secrets"]
depends_on = ["null_resource.copy-etcd-secrets", "null_resource.copy-kubeconfig"]
connection {
type = "ssh"
host = "${element(var.controller_domains, 0)}"
user = "core"
timeout = "60m"
timeout = "30m"
}
provisioner "file" {

View File

@ -83,7 +83,7 @@ variable "pod_cidr" {
variable "service_cidr" {
description = <<EOD
CIDR IP range to assign Kubernetes services.
The 1st IP will be reserved for kube_apiserver, the 10th IP will be reserved for kube-dns, the 15th IP will be reserved for self-hosted etcd, and the 200th IP will be reserved for bootstrap self-hosted etcd.
The 1st IP will be reserved for kube_apiserver, the 10th IP will be reserved for kube-dns.
EOD
type = "string"
@ -109,3 +109,23 @@ variable "container_linux_oem" {
default = ""
description = "Specify an OEM image id to use as base for the installation (e.g. ami, vmware_raw, xen) or leave blank for the default image"
}
variable "kernel_args" {
description = "Additional kernel arguments to provide at PXE boot."
type = "list"
default = []
}
# unofficial, undocumented, unsupported, temporary
variable "controller_networkds" {
type = "list"
description = "Controller Container Linux config networkd section"
default = []
}
variable "worker_networkds" {
type = "list"
description = "Worker Container Linux config networkd section"
default = []
}

View File

@ -42,7 +42,8 @@ systemd:
--volume opt-cni-bin,kind=host,source=/opt/cni/bin \
--mount volume=opt-cni-bin,target=/opt/cni/bin \
--volume var-log,kind=host,source=/var/log \
--mount volume=var-log,target=/var/log"
--mount volume=var-log,target=/var/log \
--insecure-options=image"
ExecStartPre=/bin/mkdir -p /opt/cni/bin
ExecStartPre=/bin/mkdir -p /etc/kubernetes/manifests
ExecStartPre=/bin/mkdir -p /etc/kubernetes/cni/net.d
@ -64,8 +65,7 @@ systemd:
--lock-file=/var/run/lock/kubelet.lock \
--network-plugin=cni \
--node-labels=node-role.kubernetes.io/node \
--pod-manifest-path=/etc/kubernetes/manifests \
--require-kubeconfig
--pod-manifest-path=/etc/kubernetes/manifests
ExecStop=-/usr/bin/rkt stop --uuid-file=/var/cache/kubelet-pod.uuid
Restart=always
RestartSec=5
@ -95,8 +95,8 @@ storage:
mode: 0644
contents:
inline: |
KUBELET_IMAGE_URL=quay.io/coreos/hyperkube
KUBELET_IMAGE_TAG=v1.7.7_coreos.0
KUBELET_IMAGE_URL=docker://gcr.io/google_containers/hyperkube
KUBELET_IMAGE_TAG=v1.8.4
- path: /etc/hostname
filesystem: root
mode: 0644

View File

@ -8,12 +8,12 @@ resource "matchbox_profile" "bootkube-worker-pxe" {
]
args = [
"root=/dev/sda1",
"initrd=coreos_production_pxe_image.cpio.gz",
"coreos.config.url=${var.matchbox_http_endpoint}/ignition?uuid=$${uuid}&mac=$${mac:hexhyp}",
"coreos.first_boot=yes",
"console=tty0",
"console=ttyS0",
"kvm-intel.nested=1",
"${var.kernel_args}",
]
container_linux_config = "${file("${path.module}/cl/bootkube-worker.yaml.tmpl")}"

View File

@ -53,3 +53,14 @@ variable "kube_dns_service_ip" {
type = "string"
default = "10.3.0.10"
}
# optional
variable "kernel_args" {
description = "Additional kernel arguments to provide at PXE boot."
type = "list"
default = [
"root=/dev/sda1",
]
}

View File

@ -1,4 +1,4 @@
# Typhoon
# Typhoon <img align="right" src="https://storage.googleapis.com/poseidon/typhoon-logo.png">
Typhoon is a minimal and free Kubernetes distribution.
@ -9,9 +9,9 @@ Typhoon is a minimal and free Kubernetes distribution.
Typhoon distributes upstream Kubernetes, architectural conventions, and cluster addons, much like a GNU/Linux distribution provides the Linux kernel and userspace components.
## Features
## Features <a href="https://www.cncf.io/certification/software-conformance/"><img align="right" src="https://storage.googleapis.com/poseidon/certified-kubernetes.png"></a>
* Kubernetes v1.7.7 (upstream, via [kubernetes-incubator/bootkube](https://github.com/kubernetes-incubator/bootkube))
* Kubernetes v1.8.4 (upstream, via [kubernetes-incubator/bootkube](https://github.com/kubernetes-incubator/bootkube))
* Single or multi-master, workloads isolated on workers, [Calico](https://www.projectcalico.org/) or [flannel](https://github.com/coreos/flannel) networking
* On-cluster etcd with TLS, [RBAC](https://kubernetes.io/docs/admin/authorization/rbac/)-enabled, [network policy](https://kubernetes.io/docs/concepts/services-networking/network-policies/)
* Ready for Ingress, Dashboards, Metrics, and other optional [addons](https://typhoon.psdn.io/addons/overview/)

View File

@ -1,14 +1,13 @@
# Self-hosted Kubernetes assets (kubeconfig, manifests)
module "bootkube" {
source = "git::https://github.com/poseidon/terraform-render-bootkube.git?ref=v0.7.0"
source = "git::https://github.com/poseidon/terraform-render-bootkube.git?ref=v0.9.0"
cluster_name = "${var.cluster_name}"
api_servers = ["${format("%s.%s", var.cluster_name, var.dns_zone)}"]
etcd_servers = ["http://127.0.0.1:2379"]
asset_dir = "${var.asset_dir}"
networking = "${var.networking}"
network_mtu = 1440
pod_cidr = "${var.pod_cidr}"
service_cidr = "${var.service_cidr}"
experimental_self_hosted_etcd = "true"
cluster_name = "${var.cluster_name}"
api_servers = ["${format("%s.%s", var.cluster_name, var.dns_zone)}"]
etcd_servers = "${digitalocean_record.etcds.*.fqdn}"
asset_dir = "${var.asset_dir}"
networking = "${var.networking}"
network_mtu = 1440
pod_cidr = "${var.pod_cidr}"
service_cidr = "${var.service_cidr}"
}

View File

@ -1,6 +1,29 @@
---
systemd:
units:
- name: etcd-member.service
enable: true
dropins:
- name: 40-etcd-cluster.conf
contents: |
[Service]
Environment="ETCD_IMAGE_TAG=v3.2.0"
Environment="ETCD_NAME=${etcd_name}"
Environment="ETCD_ADVERTISE_CLIENT_URLS=https://${etcd_domain}:2379"
Environment="ETCD_INITIAL_ADVERTISE_PEER_URLS=https://${etcd_domain}:2380"
Environment="ETCD_LISTEN_CLIENT_URLS=https://0.0.0.0:2379"
Environment="ETCD_LISTEN_PEER_URLS=https://0.0.0.0:2380"
Environment="ETCD_INITIAL_CLUSTER=${etcd_initial_cluster}"
Environment="ETCD_STRICT_RECONFIG_CHECK=true"
Environment="ETCD_SSL_DIR=/etc/ssl/etcd"
Environment="ETCD_TRUSTED_CA_FILE=/etc/ssl/certs/etcd/server-ca.crt"
Environment="ETCD_CERT_FILE=/etc/ssl/certs/etcd/server.crt"
Environment="ETCD_KEY_FILE=/etc/ssl/certs/etcd/server.key"
Environment="ETCD_CLIENT_CERT_AUTH=true"
Environment="ETCD_PEER_TRUSTED_CA_FILE=/etc/ssl/certs/etcd/peer-ca.crt"
Environment="ETCD_PEER_CERT_FILE=/etc/ssl/certs/etcd/peer.crt"
Environment="ETCD_PEER_KEY_FILE=/etc/ssl/certs/etcd/peer.key"
Environment="ETCD_PEER_CLIENT_CERT_AUTH=true"
- name: docker.service
enable: true
- name: locksmithd.service
@ -45,7 +68,8 @@ systemd:
--volume opt-cni-bin,kind=host,source=/opt/cni/bin \
--mount volume=opt-cni-bin,target=/opt/cni/bin \
--volume var-log,kind=host,source=/var/log \
--mount volume=var-log,target=/var/log"
--mount volume=var-log,target=/var/log \
--insecure-options=image"
ExecStartPre=/bin/mkdir -p /opt/cni/bin
ExecStartPre=/bin/mkdir -p /etc/kubernetes/manifests
ExecStartPre=/bin/mkdir -p /etc/kubernetes/cni/net.d
@ -68,8 +92,7 @@ systemd:
--network-plugin=cni \
--node-labels=node-role.kubernetes.io/master \
--pod-manifest-path=/etc/kubernetes/manifests \
--register-with-taints=node-role.kubernetes.io/master=:NoSchedule \
--require-kubeconfig
--register-with-taints=node-role.kubernetes.io/master=:NoSchedule
ExecStop=-/usr/bin/rkt stop --uuid-file=/var/cache/kubelet-pod.uuid
Restart=always
RestartSec=10
@ -95,8 +118,8 @@ storage:
mode: 0644
contents:
inline: |
KUBELET_IMAGE_URL=quay.io/coreos/hyperkube
KUBELET_IMAGE_TAG=v1.7.7_coreos.0
KUBELET_IMAGE_URL=docker://gcr.io/google_containers/hyperkube
KUBELET_IMAGE_TAG=v1.8.4
- path: /etc/sysctl.d/max-user-watches.conf
filesystem: root
contents:
@ -119,7 +142,7 @@ storage:
[ -d /opt/bootkube/assets/experimental/manifests ] && mv /opt/bootkube/assets/experimental/manifests/* /opt/bootkube/assets/manifests && rm -r /opt/bootkube/assets/experimental/manifests
[ -d /opt/bootkube/assets/experimental/bootstrap-manifests ] && mv /opt/bootkube/assets/experimental/bootstrap-manifests/* /opt/bootkube/assets/bootstrap-manifests && rm -r /opt/bootkube/assets/experimental/bootstrap-manifests
BOOTKUBE_ACI="$${BOOTKUBE_ACI:-quay.io/coreos/bootkube}"
BOOTKUBE_VERSION="$${BOOTKUBE_VERSION:-v0.7.0}"
BOOTKUBE_VERSION="$${BOOTKUBE_VERSION:-v0.9.0}"
BOOTKUBE_ASSETS="$${BOOTKUBE_ASSETS:-/opt/bootkube/assets}"
exec /usr/bin/rkt run \
--trust-keys-from-https \

View File

@ -45,7 +45,8 @@ systemd:
--volume opt-cni-bin,kind=host,source=/opt/cni/bin \
--mount volume=opt-cni-bin,target=/opt/cni/bin \
--volume var-log,kind=host,source=/var/log \
--mount volume=var-log,target=/var/log"
--mount volume=var-log,target=/var/log \
--insecure-options=image"
ExecStartPre=/bin/mkdir -p /opt/cni/bin
ExecStartPre=/bin/mkdir -p /etc/kubernetes/manifests
ExecStartPre=/bin/mkdir -p /etc/kubernetes/cni/net.d
@ -67,8 +68,7 @@ systemd:
--lock-file=/var/run/lock/kubelet.lock \
--network-plugin=cni \
--node-labels=node-role.kubernetes.io/node \
--pod-manifest-path=/etc/kubernetes/manifests \
--require-kubeconfig
--pod-manifest-path=/etc/kubernetes/manifests
ExecStop=-/usr/bin/rkt stop --uuid-file=/var/cache/kubelet-pod.uuid
Restart=always
RestartSec=5
@ -93,8 +93,8 @@ storage:
mode: 0644
contents:
inline: |
KUBELET_IMAGE_URL=quay.io/coreos/hyperkube
KUBELET_IMAGE_TAG=v1.7.7_coreos.0
KUBELET_IMAGE_URL=docker://gcr.io/google_containers/hyperkube
KUBELET_IMAGE_TAG=v1.8.4
- path: /etc/sysctl.d/max-user-watches.conf
filesystem: root
contents:
@ -111,7 +111,8 @@ storage:
--trust-keys-from-https \
--volume config,kind=host,source=/etc/kubernetes \
--mount volume=config,target=/etc/kubernetes \
quay.io/coreos/hyperkube:v1.7.7_coreos.0 \
--insecure-options=image \
docker://gcr.io/google_containers/hyperkube:v1.8.4 \
--net=host \
--dns=host \
--exec=/kubectl -- --kubeconfig=/etc/kubernetes/kubeconfig delete node $(hostname)

View File

@ -14,6 +14,22 @@ resource "digitalocean_record" "controllers" {
value = "${element(digitalocean_droplet.controllers.*.ipv4_address, count.index)}"
}
# Discrete DNS records for each controller's private IPv4 for etcd usage
resource "digitalocean_record" "etcds" {
count = "${var.controller_count}"
# DNS zone where record should be created
domain = "${var.dns_zone}"
# DNS record (will be prepended to domain)
name = "${var.cluster_name}-etcd${count.index}"
type = "A"
ttl = 300
# private IPv4 address for etcd
value = "${element(digitalocean_droplet.controllers.*.ipv4_address_private, count.index)}"
}
# Controller droplet instances
resource "digitalocean_droplet" "controllers" {
count = "${var.controller_count}"
@ -28,7 +44,7 @@ resource "digitalocean_droplet" "controllers" {
ipv6 = true
private_networking = true
user_data = "${data.ct_config.controller_ign.rendered}"
user_data = "${element(data.ct_config.controller_ign.*.rendered, count.index)}"
ssh_keys = "${var.ssh_fingerprints}"
tags = [
@ -43,15 +59,34 @@ resource "digitalocean_tag" "controllers" {
# Controller Container Linux Config
data "template_file" "controller_config" {
count = "${var.controller_count}"
template = "${file("${path.module}/cl/controller.yaml.tmpl")}"
vars = {
k8s_dns_service_ip = "${cidrhost(var.service_cidr, 10)}"
k8s_etcd_service_ip = "${cidrhost(var.service_cidr, 15)}"
# Cannot use cyclic dependencies on controllers or their DNS records
etcd_name = "etcd${count.index}"
etcd_domain = "${var.cluster_name}-etcd${count.index}.${var.dns_zone}"
# etcd0=https://cluster-etcd0.example.com,etcd1=https://cluster-etcd1.example.com,...
etcd_initial_cluster = "${join(",", formatlist("%s=https://%s:2380", null_resource.repeat.*.triggers.name, null_resource.repeat.*.triggers.domain))}"
k8s_dns_service_ip = "${cidrhost(var.service_cidr, 10)}"
}
}
# Horrible hack to generate a Terraform list of a desired length without dependencies.
# Ideal ${repeat("etcd", 3) -> ["etcd", "etcd", "etcd"]}
resource null_resource "repeat" {
count = "${var.controller_count}"
triggers {
name = "etcd${count.index}"
domain = "${var.cluster_name}-etcd${count.index}.${var.dns_zone}"
}
}
data "ct_config" "controller_ign" {
content = "${data.template_file.controller_config.rendered}"
count = "${var.controller_count}"
content = "${element(data.template_file.controller_config.*.rendered, count.index)}"
pretty_print = false
}

View File

@ -0,0 +1,25 @@
# Terraform version and plugin versions
terraform {
required_version = ">= 0.10.4"
}
provider "digitalocean" {
version = "0.1.2"
}
provider "local" {
version = "~> 1.0"
}
provider "null" {
version = "~> 1.0"
}
provider "template" {
version = "~> 1.0"
}
provider "tls" {
version = "~> 1.0"
}

View File

@ -14,8 +14,53 @@ resource "null_resource" "copy-secrets" {
destination = "$HOME/kubeconfig"
}
provisioner "file" {
content = "${module.bootkube.etcd_ca_cert}"
destination = "$HOME/etcd-client-ca.crt"
}
provisioner "file" {
content = "${module.bootkube.etcd_client_cert}"
destination = "$HOME/etcd-client.crt"
}
provisioner "file" {
content = "${module.bootkube.etcd_client_key}"
destination = "$HOME/etcd-client.key"
}
provisioner "file" {
content = "${module.bootkube.etcd_server_cert}"
destination = "$HOME/etcd-server.crt"
}
provisioner "file" {
content = "${module.bootkube.etcd_server_key}"
destination = "$HOME/etcd-server.key"
}
provisioner "file" {
content = "${module.bootkube.etcd_peer_cert}"
destination = "$HOME/etcd-peer.crt"
}
provisioner "file" {
content = "${module.bootkube.etcd_peer_key}"
destination = "$HOME/etcd-peer.key"
}
provisioner "remote-exec" {
inline = [
"sudo mkdir -p /etc/ssl/etcd/etcd",
"sudo mv etcd-client* /etc/ssl/etcd/",
"sudo cp /etc/ssl/etcd/etcd-client-ca.crt /etc/ssl/etcd/etcd/server-ca.crt",
"sudo mv etcd-server.crt /etc/ssl/etcd/etcd/server.crt",
"sudo mv etcd-server.key /etc/ssl/etcd/etcd/server.key",
"sudo cp /etc/ssl/etcd/etcd-client-ca.crt /etc/ssl/etcd/etcd/peer-ca.crt",
"sudo mv etcd-peer.crt /etc/ssl/etcd/etcd/peer.crt",
"sudo mv etcd-peer.key /etc/ssl/etcd/etcd/peer.key",
"sudo chown -R etcd:etcd /etc/ssl/etcd",
"sudo chmod -R 500 /etc/ssl/etcd",
"sudo mv /home/core/kubeconfig /etc/kubernetes/kubeconfig",
]
}
@ -24,7 +69,7 @@ resource "null_resource" "copy-secrets" {
# Secure copy bootkube assets to ONE controller and start bootkube to perform
# one-time self-hosted cluster bootstrapping.
resource "null_resource" "bootkube-start" {
depends_on = ["module.bootkube", "null_resource.copy-secrets", "digitalocean_droplet.controllers"]
depends_on = ["module.bootkube", "null_resource.copy-secrets"]
connection {
type = "ssh"

View File

@ -70,7 +70,7 @@ variable "pod_cidr" {
variable "service_cidr" {
description = <<EOD
CIDR IP range to assign Kubernetes services.
The 1st IP will be reserved for kube_apiserver, the 10th IP will be reserved for kube-dns, the 15th IP will be reserved for self-hosted etcd, and the 200th IP will be reserved for bootstrap self-hosted etcd.
The 1st IP will be reserved for kube_apiserver, the 10th IP will be reserved for kube-dns.
EOD
type = "string"

View File

@ -9,7 +9,7 @@ On AWS, an elastic load balancer distributes traffic across worker nodes (i.e. a
Create the Ingress controller deployment, service, RBAC roles, RBAC bindings, default backend, and namespace.
```
kubectl apply -R addons/nginx-ingress/aws
kubectl apply -R -f addons/nginx-ingress/aws
```
For each application, add a DNS CNAME resolving to the ELB's DNS record.
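For instance, such a record could be managed with Terraform (a sketch; the record name is hypothetical and the zone id matches the example zone used in the AWS tutorial below):
```
resource "aws_route53_record" "some-application" {
  # Hypothetical app record pointing at the Typhoon ingress ELB
  zone_id = "Z3PAABBCFAKEC0"
  name    = "app.aws.example.com"
  type    = "CNAME"
  ttl     = 300
  records = ["${module.aws-tempest.ingress_dns_name}"]
}
```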
@ -42,7 +42,7 @@ On Digital Ocean, a DNS A record (e.g. `nemo-workers.example.com`) resolves to e
Create the Ingress controller daemonset, service, RBAC roles, RBAC bindings, default backend, and namespace.
```
kubectl apply -R addons/nginx-ingress/digital-ocean
kubectl apply -R -f addons/nginx-ingress/digital-ocean
```
For each application, add a CNAME record resolving to the worker(s) DNS record. Use the Typhoon module's output `workers_dns` to find the worker DNS value. For example, you might use Terraform to manage a Google Cloud DNS record:
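The referenced example was trimmed from this hunk; a minimal sketch, assuming a hypothetical managed zone and app/worker names, might look like:
```
resource "google_dns_record_set" "some-application" {
  # Managed zone, record name, and target are hypothetical
  managed_zone = "zone-for-clusters"
  name         = "app.example.com."
  type         = "CNAME"
  ttl          = 300
  rrdatas      = ["nemo-workers.example.com."]
}
```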
@ -69,7 +69,7 @@ On Google Cloud, a network load balancer distributes traffic across worker nodes
Create the Ingress controller deployment, service, RBAC roles, RBAC bindings, default backend, and namespace.
```
kubectl apply -R addons/nginx-ingress/google-cloud
kubectl apply -R -f addons/nginx-ingress/google-cloud
```
For each application, add a DNS record resolving to the network load balancer's IPv4 address.

View File

@ -2,10 +2,10 @@
Every Typhoon cluster is verified to work well with several post-install addons.
* [CLUO](cluo.md) (Container Linux only)
* Nginx [Ingress Controller](ingress.md)
* [Heapster](heapster.md)
* [Prometheus](prometheus.md)
* [Grafana](prometheus.md#grafana)
* Kubernetes [Dashboard](dashboard.md)
* [CLUO](cluo.md) (Container Linux only)
* Prometheus
* Grafana

docs/addons/prometheus.md Normal file
View File

@ -0,0 +1,67 @@
# Prometheus
Prometheus collects metrics (e.g. `node_memory_usage_bytes`) from *targets* by scraping their HTTP metrics endpoints. Targets are organized into *jobs*, defined in the Prometheus config. Targets may expose counter, gauge, histogram, or summary metrics.
Here's a simple config from the Prometheus [tutorial](https://prometheus.io/docs/introduction/getting_started/).
```
global:
scrape_interval: 15s
scrape_configs:
- job_name: 'prometheus'
scrape_interval: 5s
static_configs:
- targets: ['localhost:9090']
```
On Kubernetes clusters, Prometheus is run as a Deployment, configured with a ConfigMap, and accessed via a Service or Ingress.
```
kubectl apply -f addons/prometheus -R
```
The ConfigMap configures Prometheus to target apiserver endpoints, node metrics, cAdvisor metrics, and exporters. By default, data is kept in an `emptyDir` so it is persisted until the pod is rescheduled.
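As a sketch of how such targeting works (this is not the shipped ConfigMap; the job name and details are illustrative), a job that discovers kubelets through the Kubernetes API looks roughly like:
```
scrape_configs:
  - job_name: 'kubernetes-nodes'
    scheme: https
    # discover node targets from the Kubernetes API
    kubernetes_sd_configs:
      - role: node
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
```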
### Exporters
Exporters expose metrics for 3rd-party systems that don't natively expose Prometheus metrics.
* [node_exporter](https://github.com/prometheus/node_exporter) - DaemonSet that exposes a machine's hardware and OS metrics
* [kube-state-metrics](https://github.com/kubernetes/kube-state-metrics) - Deployment that exposes Kubernetes object metrics
* [blackbox_exporter](https://github.com/prometheus/blackbox_exporter) - Scrapes HTTP, HTTPS, DNS, TCP, or ICMP endpoints and exposes availability as metrics
### Queries and Alerts
Prometheus provides a simplistic UI for querying metrics and viewing alerts. Use `kubectl` to authenticate to the apiserver and create a local port-forward to the Prometheus pod.
```
kubectl get pods -n monitoring
kubectl port-forward prometheus-POD-ID 9090 -n monitoring
```
Visit [127.0.0.1:9090](http://127.0.0.1:9090) to query [expressions](http://127.0.0.1:9090/graph), view [targets](http://127.0.0.1:9090/targets), or check [alerts](http://127.0.0.1:9090/alerts).
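For example, expressions drawn from the bundled rules can be pasted into the graph view:
```
# pods per kubelet, as used by the K8SKubeletTooManyPods alert
kubelet_running_pod_count

# 99th percentile apiserver request latency in seconds
histogram_quantile(0.99, rate(apiserver_request_latencies_bucket[5m])) / 1e+06
```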
![Prometheus Graph](/img/prometheus-graph.png)
<br/>
![Prometheus Targets](/img/prometheus-targets.png)
<br/>
![Prometheus Alerts](/img/prometheus-alerts.png)
## Grafana
Grafana can be used to build dashboards and rich visualizations that use Prometheus as the datasource. Create the grafana deployment and service.
```
kubectl apply -f addons/grafana -R
```
Use `kubectl` to authenticate to the apiserver and create a local port-forward to the Grafana pod.
```
kubectl port-forward grafana-POD-ID 8080 -n monitoring
```
Visit [127.0.0.1:8080](http://127.0.0.1:8080), add the prometheus data-source (http://prometheus.monitoring.svc.cluster.local), and import your desired dashboard (e.g. 315).
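The data source can also be added through Grafana's HTTP API rather than the UI (a sketch, assuming Grafana's default `admin:admin` credentials and the port-forward above):
```
curl -u admin:admin -H "Content-Type: application/json" \
  -X POST http://127.0.0.1:8080/api/datasources \
  -d '{"name":"prometheus","type":"prometheus","url":"http://prometheus.monitoring.svc.cluster.local","access":"proxy"}'
```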
![Grafana Dashboard](/img/grafana-dashboard.png)

View File

@ -1,30 +1,24 @@
# AWS
In this tutorial, we'll create a Kubernetes v1.7.7 cluster on AWS.
In this tutorial, we'll create a Kubernetes v1.8.4 cluster on AWS.
We'll declare a Kubernetes cluster in Terraform using the Typhoon Terraform module. On apply, a VPC, gateway, subnets, auto-scaling groups of controllers and workers, network load balancers for controllers and workers, and security groups will be created.
Controllers and workers are provisioned to run a `kubelet`. A one-time [bootkube](https://github.com/kubernetes-incubator/bootkube) bootstrap schedules `etcd`, `apiserver`, `scheduler`, `controller-manager`, and `kube-dns` on controllers and runs `kube-proxy` and `flannel` or `calico` on each node. A generated `kubeconfig` provides `kubectl` access to the cluster.
!!! warning "Alpha"
Typhoon Kubernetes clusters on AWS are marked as "alpha".
!!! warning "Disabled"
Clusters do not use EC2 instances with elevated IAM roles. Kubernetes AWS integrations are not enabled.
Controllers and workers are provisioned to run a `kubelet`. A one-time [bootkube](https://github.com/kubernetes-incubator/bootkube) bootstrap schedules an `apiserver`, `scheduler`, `controller-manager`, and `kube-dns` on controllers and runs `kube-proxy` and `calico` or `flannel` on each node. A generated `kubeconfig` provides `kubectl` access to the cluster.
## Requirements
* AWS Account and IAM credentials
* AWS Route53 DNS Zone (registered Domain Name or delegated subdomain)
* Terraform v0.10.1+ and [terraform-provider-ct](https://github.com/coreos/terraform-provider-ct) installed locally
* Terraform v0.10.x and [terraform-provider-ct](https://github.com/coreos/terraform-provider-ct) installed locally
## Terraform Setup
Install [Terraform](https://www.terraform.io/downloads.html) v0.10.1 on your system.
Install [Terraform](https://www.terraform.io/downloads.html) v0.10.x on your system.
```sh
$ terraform version
Terraform v0.10.1
Terraform v0.10.7
```
Add the [terraform-provider-ct](https://github.com/coreos/terraform-provider-ct) plugin binary for your system.
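Terraform v0.10.x discovers third-party plugins listed in `~/.terraformrc`; a sketch, assuming the binary was saved to `/usr/local/bin/terraform-provider-ct`:
```
providers {
  ct = "/usr/local/bin/terraform-provider-ct"
}
```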
@ -87,7 +81,7 @@ module "aws-tempest" {
dns_zone = "aws.example.com"
dns_zone_id = "Z3PAABBCFAKEC0"
controller_count = 1
controller_type = "t2.small"
controller_type = "t2.medium"
worker_count = 2
worker_type = "t2.small"
ssh_authorized_key = "ssh-rsa AAAAB3Nz..."
@ -125,7 +119,7 @@ Get or update Terraform modules.
$ terraform get # downloads missing modules
$ terraform get --update # updates all modules
Get: git::https://github.com/poseidon/typhoon (update)
Get: git::https://github.com/poseidon/bootkube-terraform.git?ref=v0.7.0 (update)
Get: git::https://github.com/poseidon/bootkube-terraform.git?ref=v0.9.0 (update)
```
Plan the resources to be created.
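The plan output was elided from this hunk; it summarizes the pending additions along these lines (a sketch; the count matches the apply output below):
```sh
$ terraform plan
Plan: 98 to add, 0 to change, 0 to destroy.
```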
@ -140,17 +134,14 @@ Apply the changes to create the cluster.
```sh
$ terraform apply
...
module.aws-tempest.null_resource.bootkube-start: Still creating... (10m50s elapsed)
module.aws-tempest.null_resource.bootkube-start: Still creating... (11m0s elapsed)
module.aws-tempest.null_resource.bootkube-start: Still creating... (4m50s elapsed)
module.aws-tempest.null_resource.bootkube-start: Still creating... (5m0s elapsed)
module.aws-tempest.null_resource.bootkube-start: Creation complete after 11m8s (ID: 3961816482286168143)
Apply complete! Resources: 98 added, 0 changed, 0 destroyed.
```
In 10-20 minutes, the Kubernetes cluster will be ready.
!!! bug ""
Typhoon clusters on AWS take much longer to create than clusters on other platforms. This is related to DNS resolution time to the ASG, which will be resolved in a future version that uses static controllers.
In 4-8 minutes, the Kubernetes cluster will be ready.
## Verify
@ -160,9 +151,9 @@ In 10-20 minutes, the Kubernetes cluster will be ready.
$ KUBECONFIG=/home/user/.secrets/clusters/tempest/auth/kubeconfig
$ kubectl get nodes
NAME STATUS AGE VERSION
ip-10-0-12-221 Ready 34m v1.7.7+coreos.0
ip-10-0-19-112 Ready 34m v1.7.7+coreos.0
ip-10-0-4-22 Ready 34m v1.7.7+coreos.0
ip-10-0-12-221 Ready 34m v1.8.4
ip-10-0-19-112 Ready 34m v1.8.4
ip-10-0-4-22 Ready 34m v1.8.4
```
List the pods.
@ -173,13 +164,10 @@ NAMESPACE NAME READY STATUS RESTART
kube-system calico-node-1m5bf 2/2 Running 0 34m
kube-system calico-node-7jmr1 2/2 Running 0 34m
kube-system calico-node-bknc8 2/2 Running 0 34m
kube-system etcd-operator-2287495111-br512 1/1 Running 1 34m
kube-system kube-apiserver-4mjbk 1/1 Running 0 34m
kube-system kube-controller-manager-3597210155-j2jbt 1/1 Running 1 34m
kube-system kube-controller-manager-3597210155-j7g7x 1/1 Running 0 34m
kube-system kube-dns-1187388186-wx1lg 3/3 Running 0 34m
kube-system kube-etcd-0000 1/1 Running 0 32m
kube-system kube-etcd-network-checkpointer-dt5pt 1/1 Running 0 34m
kube-system kube-proxy-14wxv 1/1 Running 0 34m
kube-system kube-proxy-9vxh2 1/1 Running 0 34m
kube-system kube-proxy-sbbsh 1/1 Running 0 34m


@ -1,10 +1,10 @@
# Bare-Metal
In this tutorial, we'll network boot and provision a Kubernetes v1.7.7 cluster on bare-metal.
In this tutorial, we'll network boot and provision a Kubernetes v1.8.4 cluster on bare-metal.
First, we'll deploy a [Matchbox](https://github.com/coreos/matchbox) service and setup a network boot environment. Then, we'll declare a Kubernetes cluster in Terraform using the Typhoon Terraform module and power on machines. On PXE boot, machines will install Container Linux to disk, reboot into the disk install, and provision themselves as Kubernetes controllers or workers.
Controllers are provisioned as etcd peers and run `etcd-member` (etcd3) and `kubelet`. Workers are provisioned to run a `kubelet`. A one-time [bootkube](https://github.com/kubernetes-incubator/bootkube) bootstrap schedules an `apiserver`, `scheduler`, `controller-manager`, and `kube-dns` on controllers and runs `kube-proxy` and `flannel` or `calico` on each node. A generated `kubeconfig` provides `kubectl` access to the cluster.
Controllers are provisioned as etcd peers and run `etcd-member` (etcd3) and `kubelet`. Workers are provisioned to run a `kubelet`. A one-time [bootkube](https://github.com/kubernetes-incubator/bootkube) bootstrap schedules an `apiserver`, `scheduler`, `controller-manager`, and `kube-dns` on controllers and runs `kube-proxy` and `calico` or `flannel` on each node. A generated `kubeconfig` provides `kubectl` access to the cluster.
## Requirements
@ -12,7 +12,7 @@ Controllers are provisioned as etcd peers and run `etcd-member` (etcd3) and `kubelet`.
* PXE-enabled [network boot](https://coreos.com/matchbox/docs/latest/network-setup.html) environment
* Matchbox v0.6+ deployment with API enabled
* Matchbox credentials `client.crt`, `client.key`, `ca.crt`
* Terraform v0.9.2+ and [terraform-provider-matchbox](https://github.com/coreos/terraform-provider-matchbox) installed locally
* Terraform v0.10.x and [terraform-provider-matchbox](https://github.com/coreos/terraform-provider-matchbox) installed locally
## Machines
@ -94,7 +94,7 @@ For networks already supporting iPXE clients, you can add a `default.ipxe` config
chain http://matchbox.foo:8080/boot.ipxe
```
For networks with Ubiquiti Routers, you can [configure the router](TODO) itself to chainload machines to iPXE and Matchbox.
For networks with Ubiquiti Routers, you can [configure the router](/topics/hardware.md#ubiquiti) itself to chainload machines to iPXE and Matchbox.
For a small lab, you may wish to check out the [quay.io/coreos/dnsmasq](https://quay.io/repository/coreos/dnsmasq) container image and [copy-paste examples](https://github.com/coreos/matchbox/blob/master/Documentation/network-setup.md#coreosdnsmasq).
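As a sketch, such a DHCP/TFTP/iPXE environment can run as a single container (addresses and flags are illustrative; adapt them to your network):
```sh
sudo docker run --rm --cap-add=NET_ADMIN --net=host quay.io/coreos/dnsmasq \
  -d -q \
  --dhcp-range=192.168.1.3,192.168.1.254 \
  --enable-tftp --tftp-root=/var/lib/tftpboot \
  --dhcp-userclass=set:ipxe,iPXE \
  --dhcp-boot=tag:#ipxe,undionly.kpxe \
  --dhcp-boot=tag:ipxe,http://matchbox.example.com:8080/boot.ipxe
```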
@ -109,11 +109,11 @@ Read about the [many ways](https://coreos.com/matchbox/docs/latest/network-setup.html)
## Terraform Setup
Install [Terraform](https://www.terraform.io/downloads.html) v0.9.2+ on your system.
Install [Terraform](https://www.terraform.io/downloads.html) v0.10.x on your system.
```sh
$ terraform version
Terraform v0.10.1
Terraform v0.10.7
```
Add the [terraform-provider-matchbox](https://github.com/coreos/terraform-provider-matchbox) plugin binary for your system.
@ -124,19 +124,10 @@ tar xzf terraform-provider-matchbox-v0.2.2-linux-amd64.tar.gz
sudo mv terraform-provider-matchbox-v0.2.2-linux-amd64/terraform-provider-matchbox /usr/local/bin/
```
Add the [terraform-provider-ct](https://github.com/coreos/terraform-provider-ct) plugin binary for your system.
```sh
wget https://github.com/coreos/terraform-provider-ct/releases/download/v0.2.0/terraform-provider-ct-v0.2.0-linux-amd64.tar.gz
tar xzf terraform-provider-ct-v0.2.0-linux-amd64.tar.gz
sudo mv terraform-provider-ct-v0.2.0-linux-amd64/terraform-provider-ct /usr/local/bin/
```
Add the plugins to your `~/.terraformrc`.
Add the plugin to your `~/.terraformrc`.
```
providers {
ct = "/usr/local/bin/terraform-provider-ct"
matchbox = "/usr/local/bin/terraform-provider-matchbox"
}
```
@ -171,7 +162,7 @@ module "bare-metal-mercury" {
# install
matchbox_http_endpoint = "http://matchbox.example.com"
container_linux_channel = "stable"
container_linux_version = "1465.6.0"
container_linux_version = "1520.6.0"
ssh_authorized_key = "ssh-rsa AAAAB3Nz..."
# cluster
@ -228,7 +219,7 @@ Get or update Terraform modules.
$ terraform get # downloads missing modules
$ terraform get --update # updates all modules
Get: git::https://github.com/poseidon/typhoon (update)
Get: git::https://github.com/poseidon/bootkube-terraform.git?ref=v0.7.0 (update)
Get: git::https://github.com/poseidon/bootkube-terraform.git?ref=v0.9.0 (update)
```
Plan the resources to be created.
@ -241,20 +232,15 @@ Plan: 55 to add, 0 to change, 0 to destroy.
Apply the changes. Terraform will generate bootkube assets to `asset_dir` and create Matchbox profiles (e.g. controller, worker) and matching rules via the Matchbox API.
```sh
module.bare-metal-mercury.null_resource.copy-secrets.0: Provisioning with 'file'...
module.bare-metal-mercury.null_resource.copy-secrets.2: Provisioning with 'file'...
module.bare-metal-mercury.null_resource.copy-secrets.1: Provisioning with 'file'...
module.bare-metal-mercury.null_resource.copy-secrets.0: Still creating... (10s elapsed)
module.bare-metal-mercury.null_resource.copy-secrets.2: Still creating... (10s elapsed)
module.bare-metal-mercury.null_resource.copy-secrets.1: Still creating... (10s elapsed)
module.bare-metal-mercury.null_resource.copy-kubeconfig.0: Provisioning with 'file'...
module.bare-metal-mercury.null_resource.copy-etcd-secrets.0: Provisioning with 'file'...
module.bare-metal-mercury.null_resource.copy-kubeconfig.0: Still creating... (10s elapsed)
module.bare-metal-mercury.null_resource.copy-etcd-secrets.0: Still creating... (10s elapsed)
...
```
Apply will then loop until it can successfully copy credentials to each machine and start the one-time Kubernetes bootstrap service. Proceed to the next step while this loops.
!!! note ""
You may see `terraform apply` fail to `copy-secrets` if it connects before the disk install has completed. Run `terraform apply` again until it reconciles successfully.
### Power
Power on each machine with the boot device set to `pxe` for the next boot only.
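For machines with a BMC, this can be scripted with `ipmitool` (host and credentials are placeholders):
```sh
ipmitool -H node1.example.com -U USER -P PASS chassis bootdev pxe
ipmitool -H node1.example.com -U USER -P PASS power on
```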
@ -304,9 +290,9 @@ bootkube[5]: Tearing down temporary bootstrap control plane...
$ KUBECONFIG=/home/user/.secrets/clusters/mercury/auth/kubeconfig
$ kubectl get nodes
NAME STATUS AGE VERSION
node1.example.com Ready 11m v1.7.7+coreos.0
node2.example.com Ready 11m v1.7.7+coreos.0
node3.example.com Ready 11m v1.7.7+coreos.0
node1.example.com Ready 11m v1.8.4
node2.example.com Ready 11m v1.8.4
node3.example.com Ready 11m v1.8.4
```
List the pods.
@ -369,4 +355,5 @@ Learn about [version pinning](concepts.md#versioning), maintenance, and [addons]
| network_mtu | CNI interface MTU (calico-only) | 1480 | - |
| pod_cidr | CIDR range to assign to Kubernetes pods | "10.2.0.0/16" | "10.22.0.0/16" |
| service_cidr | CIDR range to assign to Kubernetes services | "10.3.0.0/16" | "10.3.0.0/24" |
| kernel_args | Additional kernel args to provide at PXE boot | [] | "kvm-intel.nested=1" |


@ -60,7 +60,7 @@ Modules are updated regularly, set the version to a [release tag](https://github
```tf
...
source = "git:https://github.com/poseidon/typhoon//google-cloud/container-linux/kubernetes?ref=v1.7.7"
source = "git:https://github.com/poseidon/typhoon//google-cloud/container-linux/kubernetes?ref=hash"
```
Module versioning ensures `terraform get --update` only fetches the desired version, so plan and apply don't change cluster resources, unless the version is altered.
@ -69,14 +69,15 @@ Module versioning ensures `terraform get --update` only fetches the desired version
Maintain Terraform configs for "live" infrastructure in a versioned repository. Seek to organize configs to reflect resources that should be managed together in a `terraform apply` invocation.
You may choose to organize resources all together, by team, by project, or some other scheme. Here's an example that manages three clusters together:
You may choose to organize resources all together, by team, by project, or some other scheme. Here's an example that manages four clusters together:
```sh
.git/
infra/
└── terraform
└── clusters
├── bare-metal-tungsten.tf
├── aws-tempest.tf
├── bare-metal-mercury.tf
├── google-cloud-yavin.tf
├── digital-ocean-nemo.tf
├── providers.tf


@ -1,24 +1,24 @@
# Digital Ocean
In this tutorial, we'll create a Kubernetes v1.7.7 cluster on Digital Ocean.
In this tutorial, we'll create a Kubernetes v1.8.4 cluster on Digital Ocean.
We'll declare a Kubernetes cluster in Terraform using the Typhoon Terraform module. On apply, firewall rules, DNS records, tags, and droplets for Kubernetes controllers and workers will be created.
Controllers and workers are provisioned to run a `kubelet`. A one-time [bootkube](https://github.com/kubernetes-incubator/bootkube) bootstrap schedules `etcd`, `apiserver`, `scheduler`, `controller-manager`, and `kube-dns` on controllers and runs `kube-proxy` and `flannel` on each node. A generated `kubeconfig` provides `kubectl` access to the cluster.
Controllers and workers are provisioned to run a `kubelet`. A one-time [bootkube](https://github.com/kubernetes-incubator/bootkube) bootstrap schedules an `apiserver`, `scheduler`, `controller-manager`, and `kube-dns` on controllers and runs `kube-proxy` and `flannel` on each node. A generated `kubeconfig` provides `kubectl` access to the cluster.
## Requirements
* Digital Ocean Account and Token
* Digital Ocean Domain (registered Domain Name or delegated subdomain)
* Terraform v0.10.1+ and [terraform-provider-ct](https://github.com/coreos/terraform-provider-ct) installed locally
* Terraform v0.10.x and [terraform-provider-ct](https://github.com/coreos/terraform-provider-ct) installed locally
## Terraform Setup
Install [Terraform](https://www.terraform.io/downloads.html) v0.10.1+ on your system.
Install [Terraform](https://www.terraform.io/downloads.html) v0.10.x on your system.
```sh
$ terraform version
Terraform v0.10.1
Terraform v0.10.7
```
Add the [terraform-provider-ct](https://github.com/coreos/terraform-provider-ct) plugin binary for your system.
@ -114,7 +114,7 @@ Get or update Terraform modules.
$ terraform get # downloads missing modules
$ terraform get --update # updates all modules
Get: git::https://github.com/poseidon/typhoon (update)
Get: git::https://github.com/poseidon/bootkube-terraform.git?ref=v0.7.0 (update)
Get: git::https://github.com/poseidon/bootkube-terraform.git?ref=v0.9.0 (update)
```
Plan the resources to be created.
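A sketch of the expected output (the count mirrors the apply summary below):
```sh
$ terraform plan
Plan: 54 to add, 0 to change, 0 to destroy.
```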
@ -137,7 +137,7 @@ module.digital-ocean-nemo.null_resource.bootkube-start: Creation complete (ID: 7
Apply complete! Resources: 54 added, 0 changed, 0 destroyed.
```
In 5-10 minutes, the Kubernetes cluster will be ready.
In 3-6 minutes, the Kubernetes cluster will be ready.
## Verify
@ -147,22 +147,19 @@ In 5-10 minutes, the Kubernetes cluster will be ready.
$ KUBECONFIG=/home/user/.secrets/clusters/nemo/auth/kubeconfig
$ kubectl get nodes
NAME STATUS AGE VERSION
10.132.110.130 Ready 10m v1.7.7+coreos.0
10.132.115.81 Ready 10m v1.7.7+coreos.0
10.132.124.107 Ready 10m v1.7.7+coreos.0
10.132.110.130 Ready 10m v1.8.4
10.132.115.81 Ready 10m v1.8.4
10.132.124.107 Ready 10m v1.8.4
```
List the pods.
```
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system etcd-operator-3329263108-sgsbl 1/1 Running 1 11m
kube-system kube-apiserver-n10qr 1/1 Running 0 11m
kube-system kube-controller-manager-3271970485-37gtw 1/1 Running 1 11m
kube-system kube-controller-manager-3271970485-p52t5 1/1 Running 0 11m
kube-system kube-dns-1187388186-ld1j7 3/3 Running 0 11m
kube-system kube-etcd-0000 1/1 Running 0 9m
kube-system kube-etcd-network-checkpointer-n9xsk 1/1 Running 0 11m
kube-system kube-flannel-1cq1v 2/2 Running 0 11m
kube-system kube-flannel-hq9t0 2/2 Running 1 11m
kube-system kube-flannel-v0g9w 2/2 Running 0 11m


@ -6,15 +6,6 @@ Typhoon provides a Terraform Module for each supported operating system and platform.
Formats rise and evolve. Typhoon may choose to adapt the format over time (with lots of forewarning). However, the authors have built several Kubernetes "distros" before and learned from mistakes; Terraform modules are the right format for now.
## Self-hosted etcd
Typhoon clusters on cloud providers run etcd as "self-hosted" pods, managed by the [etcd-operator](https://github.com/coreos/etcd-operator). By contrast, Typhoon bare-metal runs an etcd peer as a systemd `etcd-member.service` on each controller (i.e. on-host).
In practice, self-hosted etcd has proven to be *ok*, but not ideal. Running the apiserver's etcd atop Kubernetes itself is inherently complex, but works suitably in most cases. It can be opaque to debug when complex edge cases or upstream Kubernetes bugs arise.
!!! note ""
Typhoon clusters and their defaults power the maintainers' clusters. The edge cases are sufficiently rare that self-hosted etcd is not a pressing issue, but cloud clusters may switch back to on-host etcd in the future.
## Operating Systems
Only Container Linux is supported currently. This is just due to operational familiarity, not intentional exclusion. It's important that another operating system be added, to reduce the risk of making narrowly-scoped design decisions.
@ -27,7 +18,7 @@ Ask questions on the IRC #typhoon channel on [freenode.net](http://freenode.net/).
## Security Issues
If you find security issues, please see [security disclosures](/topics/security).
If you find security issues, please see [security disclosures](/topics/security.md#disclosures).
## Maintainers


@ -1,24 +1,24 @@
# Google Cloud
In this tutorial, we'll create a Kubernetes v1.7.7 cluster on Google Compute Engine (not GKE).
In this tutorial, we'll create a Kubernetes v1.8.4 cluster on Google Compute Engine (not GKE).
We'll declare a Kubernetes cluster in Terraform using the Typhoon Terraform module. On apply, a network, firewall rules, managed instance groups of Kubernetes controllers and workers, network load balancers for controllers and workers, and health checks will be created.
Controllers and workers are provisioned to run a `kubelet`. A one-time [bootkube](https://github.com/kubernetes-incubator/bootkube) bootstrap schedules `etcd`, `apiserver`, `scheduler`, `controller-manager`, and `kube-dns` on controllers and runs `kube-proxy` and `flannel` on each node. A generated `kubeconfig` provides `kubectl` access to the cluster.
Controllers and workers are provisioned to run a `kubelet`. A one-time [bootkube](https://github.com/kubernetes-incubator/bootkube) bootstrap schedules an `apiserver`, `scheduler`, `controller-manager`, and `kube-dns` on controllers and runs `kube-proxy` and `calico` or `flannel` on each node. A generated `kubeconfig` provides `kubectl` access to the cluster.
## Requirements
* Google Cloud Account and Service Account
* Google Cloud DNS Zone (registered Domain Name or delegated subdomain)
* Terraform v0.9.2+ and [terraform-provider-ct](https://github.com/coreos/terraform-provider-ct) installed locally
* Terraform v0.10.x and [terraform-provider-ct](https://github.com/coreos/terraform-provider-ct) installed locally
## Terraform Setup
Install [Terraform](https://www.terraform.io/downloads.html) v0.9.2+ on your system.
Install [Terraform](https://www.terraform.io/downloads.html) v0.10.x on your system.
```sh
$ terraform version
Terraform v0.10.1
Terraform v0.10.7
```
Add the [terraform-provider-ct](https://github.com/coreos/terraform-provider-ct) plugin binary for your system.
@ -77,10 +77,10 @@ module "google-cloud-yavin" {
source = "git::https://github.com/poseidon/typhoon//google-cloud/container-linux/kubernetes"
# Google Cloud
zone = "us-central1-c"
region = "us-central1"
dns_zone = "example.com"
dns_zone_name = "example-zone"
os_image = "coreos-stable-1465-6-0-v20170817"
os_image = "coreos-stable-1520-6-0-v20171012"
cluster_name = "yavin"
controller_count = 1
@ -120,7 +120,7 @@ Get or update Terraform modules.
$ terraform get # downloads missing modules
$ terraform get --update # updates all modules
Get: git::https://github.com/poseidon/typhoon (update)
Get: git::https://github.com/poseidon/bootkube-terraform.git?ref=v0.7.0 (update)
Get: git::https://github.com/poseidon/bootkube-terraform.git?ref=v0.9.0 (update)
```
Plan the resources to be created.
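A sketch of the expected output (the count mirrors the apply summary below):
```sh
$ terraform plan
Plan: 64 to add, 0 to change, 0 to destroy.
```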
@ -137,14 +137,14 @@ $ terraform apply
module.google-cloud-yavin.null_resource.bootkube-start: Still creating... (10s elapsed)
...
module.google-cloud-yavin.null_resource.bootkube-start: Still creating... (8m30s elapsed)
module.google-cloud-yavin.null_resource.bootkube-start: Still creating... (8m40s elapsed)
module.google-cloud-yavin.null_resource.bootkube-start: Still creating... (5m30s elapsed)
module.google-cloud-yavin.null_resource.bootkube-start: Still creating... (5m40s elapsed)
module.google-cloud-yavin.null_resource.bootkube-start: Creation complete (ID: 5768638456220583358)
Apply complete! Resources: 64 added, 0 changed, 0 destroyed.
```
In 5-10 minutes, the Kubernetes cluster will be ready.
In 4-8 minutes, the Kubernetes cluster will be ready.
## Verify
@ -154,9 +154,9 @@ In 5-10 minutes, the Kubernetes cluster will be ready.
$ KUBECONFIG=/home/user/.secrets/clusters/yavin/auth/kubeconfig
$ kubectl get nodes
NAME STATUS AGE VERSION
yavin-controller-1682.c.example-com.internal Ready 6m v1.7.7+coreos.0
yavin-worker-jrbf.c.example-com.internal Ready 5m v1.7.7+coreos.0
yavin-worker-mzdm.c.example-com.internal Ready 5m v1.7.7+coreos.0
yavin-controller-0.c.example-com.internal Ready 6m v1.8.4
yavin-worker-jrbf.c.example-com.internal Ready 5m v1.8.4
yavin-worker-mzdm.c.example-com.internal Ready 5m v1.8.4
```
List the pods.
@ -167,13 +167,10 @@ NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system calico-node-1cs8z 2/2 Running 0 6m
kube-system calico-node-d1l5b 2/2 Running 0 6m
kube-system calico-node-sp9ps 2/2 Running 0 6m
kube-system etcd-operator-3329263108-f443m 1/1 Running 1 6m
kube-system kube-apiserver-zppls 1/1 Running 0 6m
kube-system kube-controller-manager-3271970485-gh9kt 1/1 Running 0 6m
kube-system kube-controller-manager-3271970485-h90v8 1/1 Running 1 6m
kube-system kube-dns-1187388186-zj5dl 3/3 Running 0 6m
kube-system kube-etcd-0000 1/1 Running 0 5m
kube-system kube-etcd-network-checkpointer-crznb 1/1 Running 0 6m
kube-system kube-proxy-117v6 1/1 Running 0 6m
kube-system kube-proxy-9886n 1/1 Running 0 6m
kube-system kube-proxy-njn47 1/1 Running 0 6m
@ -196,14 +193,14 @@ Learn about [version pinning](concepts.md#versioning), maintenance, and [addons]
| Name | Description | Example |
|:-----|:------------|:--------|
| cluster_name | Unique cluster name (prepended to dns_zone) | "yavin" |
| zone | Google Cloud zone | "us-central1-f" |
| region | Google Cloud region | "us-central1" |
| dns_zone | Google Cloud DNS zone | "google-cloud.example.com" |
| dns_zone_name | Google Cloud DNS zone name | "example-zone" |
| ssh_authorized_key | SSH public key for ~/.ssh_authorized_keys | "ssh-rsa AAAAB3NZ..." |
| os_image | OS image for compute instances | "coreos-stable-1465-6-0-v20170817" |
| asset_dir | Path to a directory where generated assets should be placed (contains secrets) | "/home/user/.secrets/clusters/yavin" |
Check the list of valid [zones](https://cloud.google.com/compute/docs/regions-zones/regions-zones) and list Container Linux [images](https://cloud.google.com/compute/docs/images) with `gcloud compute images list | grep coreos`.
Check the list of valid [regions](https://cloud.google.com/compute/docs/regions-zones/regions-zones) and list Container Linux [images](https://cloud.google.com/compute/docs/images) with `gcloud compute images list | grep coreos`.
#### DNS Zone
@ -236,7 +233,10 @@ resource "google_dns_managed_zone" "zone-for-clusters" {
Check the list of valid [machine types](https://cloud.google.com/compute/docs/machine-types).
!!! warning
Set controller_count to 1. A bug in Google Cloud network load balancer health checking prevents multiple controllers from bootstrapping. There are workarounds, but they all involve tradeoffs we're uncomfortable recommending. See [#54](https://github.com/poseidon/typhoon/issues/54).
#### Preemption
Add `worker_premeptible = "true"` to allow worker nodes to be [preempted](https://cloud.google.com/compute/docs/instances/preemptible) at random, but pay [significantly](https://cloud.google.com/compute/pricing) less. Clusters tolerate stopping instances fairly well (reschedules pods, but cannot drain) and preemption provides a nice reward for running fault-tolerant cluster systems.
Add `worker_preemptible = "true"` to allow worker nodes to be [preempted](https://cloud.google.com/compute/docs/instances/preemptible) at random, but pay [significantly](https://cloud.google.com/compute/pricing) less. Clusters tolerate stopping instances fairly well (reschedules pods, but cannot drain) and preemption provides a nice reward for running fault-tolerant cluster systems.
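As a sketch, the variable is set in the cluster's module block (other arguments omitted):
```tf
module "google-cloud-yavin" {
  ...
  worker_preemptible = "true"
}
```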

@ -1 +0,0 @@
<?xml version="1.0" ?><svg enable-background="new 0 0 100 100" version="1.1" viewBox="0 0 100 100" xml:space="preserve" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink"><g id="Merged"><g><path d="M91.532,51.789c0.988-1.976,0.187-4.379-1.789-5.367L65.272,34.186c-2.099-2.028-4.599-3.642-7.369-4.712l22.729-7.576 c-0.698-2.096-2.964-3.229-5.059-2.53l-25.924,8.641c-3.048,0.048-5.946,0.719-8.575,1.891L51.789,8.468 c-1.976-0.988-4.379-0.187-5.367,1.789L34.186,34.728c-2.028,2.099-3.642,4.599-4.712,7.369l-7.576-22.729 c-2.096,0.698-3.229,2.964-2.53,5.059l8.641,25.924c0.048,3.048,0.719,5.946,1.891,8.575L8.468,48.211 c-0.988,1.976-0.187,4.379,1.789,5.367l24.471,12.236c2.099,2.028,4.599,3.642,7.369,4.712l-22.729,7.576 c0.698,2.096,2.964,3.229,5.059,2.53l25.924-8.641c3.048-0.048,5.946-0.719,8.575-1.891L48.211,91.532 c1.976,0.988,4.379,0.187,5.367-1.789l12.236-24.471c2.028-2.099,3.642-4.599,4.712-7.369l7.576,22.729 c2.096-0.698,3.229-2.964,2.53-5.06l-8.641-25.924c-0.048-3.048-0.719-5.946-1.891-8.575L91.532,51.789z M50,68 c-9.925,0-18-8.075-18-18s8.075-18,18-18s18,8.075,18,18S59.925,68,50,68z"/><path d="M50,38c-6.617,0-12,5.383-12,12s5.383,12,12,12s12-5.383,12-12S56.617,38,50,38z M50,58c-4.411,0-8-3.589-8-8s3.589-8,8-8 s8,3.589,8,8S54.411,58,50,58z"/></g></g></svg>

BIN docs/img/typhoon-logo.png (new binary file, 20 KiB)
BIN docs/img/typhoon.png (new binary file, 22 KiB)

@ -1,4 +1,4 @@
# Typhoon <img align="right" src="https://storage.googleapis.com/dghubble/spin.png">
# Typhoon <img align="right" src="https://storage.googleapis.com/poseidon/typhoon-logo.png">
Typhoon is a minimal and free Kubernetes distribution.
@ -9,9 +9,9 @@ Typhoon is a minimal and free Kubernetes distribution.
Typhoon distributes upstream Kubernetes, architectural conventions, and cluster addons, much like a GNU/Linux distribution provides the Linux kernel and userspace components.
## Features
## Features <a href="https://www.cncf.io/certification/software-conformance/"><img align="right" src="https://storage.googleapis.com/poseidon/certified-kubernetes.png"></a>
* Kubernetes v1.7.7 (upstream, via [kubernetes-incubator/bootkube](https://github.com/kubernetes-incubator/bootkube))
* Kubernetes v1.8.4 (upstream, via [kubernetes-incubator/bootkube](https://github.com/kubernetes-incubator/bootkube))
* Single or multi-master, workloads isolated on workers, [Calico](https://www.projectcalico.org/) or [flannel](https://github.com/coreos/flannel) networking
* On-cluster etcd with TLS, [RBAC](https://kubernetes.io/docs/admin/authorization/rbac/)-enabled, [network policy](https://kubernetes.io/docs/concepts/services-networking/network-policies/)
* Ready for Ingress, Dashboards, Metrics and other optional [addons](addons/overview.md)
@ -23,10 +23,10 @@ Typhoon provides a Terraform Module for each supported operating system and platform.
| Platform | Operating System | Terraform Module | Status |
|---------------|------------------|------------------|--------|
| AWS | Container Linux | [aws/container-linux/kubernetes](aws.md) | alpha |
| Bare-Metal | Container Linux | [bare-metal/container-linux/kubernetes](bare-metal.md) | production |
| AWS | Container Linux | [aws/container-linux/kubernetes](aws.md) | beta |
| Bare-Metal | Container Linux | [bare-metal/container-linux/kubernetes](bare-metal.md) | stable |
| Digital Ocean | Container Linux | [digital-ocean/container-linux/kubernetes](digital-ocean.md) | beta |
| Google Cloud | Container Linux | [google-cloud/container-linux/kubernetes](google-cloud.md) | production |
| Google Cloud | Container Linux | [google-cloud/container-linux/kubernetes](google-cloud.md) | beta |
## Usage
@ -46,7 +46,7 @@ module "google-cloud-yavin" {
source = "git::https://github.com/poseidon/typhoon//google-cloud/container-linux/kubernetes"
# Google Cloud
zone = "us-central1-c"
region = "us-central1"
dns_zone = "example.com"
dns_zone_name = "example-zone"
os_image = "coreos-stable-1465-6-0-v20170817"
@ -71,15 +71,15 @@ $ terraform apply
Apply complete! Resources: 64 added, 0 changed, 0 destroyed.
```
In 5-10 minutes (varies by platform), the cluster will be ready. This Google Cloud example creates a `yavin.example.com` DNS record to resolve to a network load balancer across controller nodes.
In 4-8 minutes (varies by platform), the cluster will be ready. This Google Cloud example creates a `yavin.example.com` DNS record to resolve to a network load balancer across controller nodes.
```
$ KUBECONFIG=/home/user/.secrets/clusters/yavin/auth/kubeconfig
$ kubectl get nodes
NAME STATUS AGE VERSION
yavin-controller-1682.c.example-com.internal Ready 6m v1.7.7+coreos.0
yavin-worker-jrbf.c.example-com.internal Ready 5m v1.7.7+coreos.0
yavin-worker-mzdm.c.example-com.internal Ready 5m v1.7.7+coreos.0
yavin-controller-0.c.example-com.internal Ready 6m v1.8.4
yavin-worker-jrbf.c.example-com.internal Ready 5m v1.8.4
yavin-worker-mzdm.c.example-com.internal Ready 5m v1.8.4
```
List the pods.
@ -90,13 +90,10 @@ NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system calico-node-1cs8z 2/2 Running 0 6m
kube-system calico-node-d1l5b 2/2 Running 0 6m
kube-system calico-node-sp9ps 2/2 Running 0 6m
kube-system etcd-operator-3329263108-f443m 1/1 Running 1 6m
kube-system kube-apiserver-zppls 1/1 Running 0 6m
kube-system kube-controller-manager-3271970485-gh9kt 1/1 Running 0 6m
kube-system kube-controller-manager-3271970485-h90v8 1/1 Running 1 6m
kube-system kube-dns-1187388186-zj5dl 3/3 Running 0 6m
kube-system kube-etcd-0000 1/1 Running 0 5m
kube-system kube-etcd-network-checkpointer-crznb 1/1 Running 0 6m
kube-system kube-proxy-117v6 1/1 Running 0 6m
kube-system kube-proxy-9886n 1/1 Running 0 6m
kube-system kube-proxy-njn47 1/1 Running 0 6m

docs/topics/hardware.md (new file, 175 lines)

@ -0,0 +1,175 @@
# Hardware
While bare-metal Kubernetes clusters have no special hardware requirements (beyond the [min reqs](/bare-metal.md#requirements)), Typhoon does ensure certain router and server hardware integrates well with Kubernetes.
## Ubiquiti
Ubiquiti EdgeRouters work well with bare-metal Kubernetes clusters. Familiarity with setting up an EdgeRouter and using its CLI is required.
### PXE
Ubiquiti EdgeRouters can provide a PXE-enabled network boot environment for client machines.
#### ISC DHCP
Add a subnet parameter to the LAN DHCP server to include an ISC DHCP config file.
```
configure
show service dhcp-server shared-network-name NAME subnet SUBNET
set service dhcp-server shared-network-name NAME subnet SUBNET subnet-parameters "include &quot;/config/scripts/ipxe.conf&quot;;"
commit-confirm
```
Switch to root (i.e. `sudo -i`) and write the ISC DHCP config `/config/scripts/ipxe.conf`. iPXE client machines will chainload to `matchbox.example.com`, while non-iPXE clients will chainload to `undionly.kpxe` (requires TFTP to be enabled).
```
allow bootp;
allow booting;
next-server ADD_ROUTER_IP_HERE;
if exists user-class and option user-class = "iPXE" {
  filename "http://matchbox.example.com/boot.ipxe";
} else {
  filename "undionly.kpxe";
}
```
### TFTP
Use `dnsmasq` as a TFTP server to serve [undionly.kpxe](http://boot.ipxe.org/undionly.kpxe).
```
sudo -i
mkdir /var/lib/tftpboot
cd /var/lib/tftpboot
curl http://boot.ipxe.org/undionly.kpxe -o undionly.kpxe
```
Add `dnsmasq` command line options to enable the TFTP file server.
```
configure
show service dns forwarding
set service dns forwarding options enable-tftp
set service dns forwarding options tftp-root=/var/lib/tftpboot
commit-confirm
```
!!! warning
After firmware upgrades, the `/var/lib/tftpboot` directory will not exist and dnsmasq will not start properly. Repeat this process following an upgrade.
### DHCP
Assign static IPs to clients with known MAC addresses. This is called a static mapping by EdgeOS. Configure the router with commands based on your machine inventory.
```
configure
show service dhcp-server shared-network
set service dhcp-server shared-network-name LAN subnet SUBNET static-mapping NAME mac-address MACADDR
set service dhcp-server shared-network-name LAN subnet SUBNET static-mapping NAME ip-address 10.0.0.20
```
### DNS
Assign DNS A records to nodes as options to `dnsmasq`.
```
configure
set service dns forwarding options host-record=node.example.com,10.0.0.20
```
Restart `dnsmasq`.
```
sudo /etc/init.d/dnsmasq restart
```
Configure queries for `*.svc.cluster.local` to be forwarded to a Kubernetes `kube-dns` service IP to allow hosts to resolve cluster-local Kubernetes names.
```
configure
show service dns forwarding
set service dns forwarding options server=/svc.cluster.local/10.3.0.10
commit-confirm
```
### Kubernetes Services
Add static routes for the Kubernetes IPv4 service range to Kubernetes node(s) so hosts can route to Kubernetes services (default: 10.3.0.0/16).
```
configure
show protocols static route
set protocols static route 10.3.0.0/16 next-hop NODE_IP
...
commit-confirm
```
### Port Forwarding
Expose the [Ingress Controller](/addons/ingress.md#bare-metal) by adding `port-forward` rules that DNAT a port on the router's WAN interface to an internal IP and port. By convention, a public Ingress controller is assigned a fixed service IP like kube-dns (e.g. 10.3.0.12).
```
configure
set port-forward wan-interface eth0
set port-forward lan-interface eth1
set port-forward auto-firewall enable
set port-forward hairpin-nat enable
set port-forward rule 1 description 'ingress http'
set port-forward rule 1 forward-to address 10.3.0.12
set port-forward rule 1 forward-to port 80
set port-forward rule 1 original-port 80
set port-forward rule 1 protocol tcp_udp
set port-forward rule 2 description 'ingress https'
set port-forward rule 2 forward-to address 10.3.0.12
set port-forward rule 2 forward-to port 443
set port-forward rule 2 original-port 443
set port-forward rule 2 protocol tcp_udp
commit-confirm
```
### Web UI
The web UI is often accessible from the LAN on ports 80/443 by default. Edit the ports to 8080 and 4443 to avoid a conflict.
```
configure
show service gui
set service gui http-port 8080
set service gui https-port 4443
commit-confirm
```
### BGP
Add the EdgeRouter as a global BGP peer for nodes in a Kubernetes cluster (requires Calico). Neighbors will exchange `podCIDR` routes and individual pods will become routable on the LAN.
Configure node(s) as BGP neighbors.
```
show protocols bgp 1
set protocols bgp 1 parameters router-id LAN_IP
set protocols bgp 1 neighbor NODE1_IP remote-as 64512
set protocols bgp 1 neighbor NODE2_IP remote-as 64512
set protocols bgp 1 neighbor NODE3_IP remote-as 64512
```
View the neighbors and exchanged routes.
```
show ip bgp neighbors
show ip route bgp
```
Be sure to register the peer by creating a Calico `bgpPeer` CRD with `kubectl apply`.
```
apiVersion: v1
kind: bgpPeer
metadata:
  peerIP: LAN_IP
  scope: global
spec:
  asNumber: 64512
```
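For example, assuming the manifest above is saved as `bgp-peer.yaml` (filename illustrative):
```
kubectl apply -f bgp-peer.yaml
```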


@ -0,0 +1,42 @@
# Performance
## Provision Time
Provisioning times vary based on the platform. Sampling the time to create (apply) and destroy clusters with 1 controller and 2 workers shows (roughly) what to expect.
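A simple way to sample these times yourself (a sketch; prompts may vary by Terraform version):
```sh
time terraform apply
time terraform destroy
```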
| Platform | Apply | Destroy |
|---------------|-------|---------|
| AWS | 6 min | 5 min |
| Bare-Metal | 10-14 min | NA |
| Digital Ocean | 3 min 30 sec | 20 sec |
| Google Cloud | 4 min | 4 min 30 sec |
Notes:
* SOA TTL and NXDOMAIN caching can have a large impact on provision time
* Platforms with auto-scaling take more time to provision (AWS, Google)
* Bare-metal POST times and network bandwidth will affect provision times
## Network Performance
Network performance varies based on the platform and CNI plugin. `iperf` was used to measure the bandwidth between different hosts and between different pods. Host-to-host shows typical bandwidth between host machines. Pod-to-pod shows the bandwidth between two `iperf` containers.
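A minimal sketch of the method, assuming `iperf` is installed on both endpoints (`SERVER_IP` is a placeholder):
```sh
# On the server endpoint (host or pod)
iperf -s
# On the client endpoint
iperf -c SERVER_IP
```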
| Platform / Plugin | Theory | Host to Host | Pod to Pod |
|----------------------------|-------:|-------------:|-------------:|
| AWS (flannel) | ? | 976 Mb/s | 900-999 Mb/s |
| AWS (calico, MTU 1480) | ? | 976 Mb/s | 100-350 Mb/s |
| AWS (calico, MTU 8991) | ? | 976 Mb/s | 900-999 Mb/s |
| Bare-Metal (flannel) | 1 Gb/s | 934 Mb/s | 903 Mb/s |
| Bare-Metal (calico) | 1 Gb/s | 941 Mb/s | 931 Mb/s |
| Bare-Metal (flannel, bond) | 3 Gb/s | 2.3 Gb/s | 1.17 Gb/s |
| Bare-Metal (calico, bond) | 3 Gb/s | 2.3 Gb/s | 1.17 Gb/s |
| Digital Ocean | ? | 938 Mb/s | 820-880 Mb/s |
| Google Cloud (flannel) | ? | 1.94 Gb/s | 1.76 Gb/s |
| Google Cloud (calico) | ? | 1.94 Gb/s | 1.81 Gb/s |
Notes:
* Calico and Flannel have comparable performance. Platform and configuration differences dominate.
* Neither CNI provider seems to be able to leverage bonded NICs (bare-metal)
* AWS and Digital Ocean network bandwidth fluctuates more than on other platforms.
* Only [certain AWS EC2 instance types](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/network_mtu.html#jumbo_frame_instances) allow jumbo frames. This is why the default MTU on AWS must be 1480.
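Where jumbo frames are available, raising the Calico MTU may recover pod-to-pod throughput. A sketch, assuming the platform module exposes a `network_mtu` variable as the bare-metal module does:
```tf
module "aws-tempest" {
  ...
  networking  = "calico"
  network_mtu = 8991
}
```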


@ -1,4 +1,4 @@
# Typhoon
# Typhoon <img align="right" src="https://storage.googleapis.com/poseidon/typhoon-logo.png">
Typhoon is a minimal and free Kubernetes distribution.
@ -9,9 +9,9 @@ Typhoon is a minimal and free Kubernetes distribution.
Typhoon distributes upstream Kubernetes, architectural conventions, and cluster addons, much like a GNU/Linux distribution provides the Linux kernel and userspace components.
## Features
## Features <a href="https://www.cncf.io/certification/software-conformance/"><img align="right" src="https://storage.googleapis.com/poseidon/certified-kubernetes.png"></a>
* Kubernetes v1.7.7 (upstream, via [kubernetes-incubator/bootkube](https://github.com/kubernetes-incubator/bootkube))
* Kubernetes v1.8.4 (upstream, via [kubernetes-incubator/bootkube](https://github.com/kubernetes-incubator/bootkube))
* Single or multi-master, workloads isolated on workers, [Calico](https://www.projectcalico.org/) or [flannel](https://github.com/coreos/flannel) networking
* On-cluster etcd with TLS, [RBAC](https://kubernetes.io/docs/admin/authorization/rbac/)-enabled, [network policy](https://kubernetes.io/docs/concepts/services-networking/network-policies/)
* Ready for Ingress, Dashboards, Metrics, and other optional [addons](https://typhoon.psdn.io/addons/overview/)


@ -1,14 +1,13 @@
# Self-hosted Kubernetes assets (kubeconfig, manifests)
module "bootkube" {
source = "git::https://github.com/poseidon/terraform-render-bootkube.git?ref=v0.7.0"
source = "git::https://github.com/poseidon/terraform-render-bootkube.git?ref=v0.9.0"
cluster_name = "${var.cluster_name}"
api_servers = ["${format("%s.%s", var.cluster_name, var.dns_zone)}"]
etcd_servers = ["http://127.0.0.1:2379"]
asset_dir = "${var.asset_dir}"
networking = "${var.networking}"
network_mtu = 1440
pod_cidr = "${var.pod_cidr}"
service_cidr = "${var.service_cidr}"
experimental_self_hosted_etcd = "true"
cluster_name = "${var.cluster_name}"
api_servers = ["${format("%s.%s", var.cluster_name, var.dns_zone)}"]
etcd_servers = "${module.controllers.etcd_fqdns}"
asset_dir = "${var.asset_dir}"
networking = "${var.networking}"
network_mtu = 1440
pod_cidr = "${var.pod_cidr}"
service_cidr = "${var.service_cidr}"
}


@ -6,12 +6,11 @@ module "controllers" {
# GCE
network = "${google_compute_network.network.name}"
count = "${var.controller_count}"
zone = "${var.zone}"
region = "${var.region}"
dns_zone = "${var.dns_zone}"
dns_zone_name = "${var.dns_zone_name}"
machine_type = "${var.machine_type}"
os_image = "${var.os_image}"
preemptible = "${var.controller_preemptible}"
# configuration
networking = "${var.networking}"
@ -29,8 +28,8 @@ module "workers" {
# GCE
network = "${google_compute_network.network.name}"
region = "${var.region}"
count = "${var.worker_count}"
zone = "${var.zone}"
machine_type = "${var.machine_type}"
os_image = "${var.os_image}"
preemptible = "${var.worker_preemptible}"


@ -1,6 +1,29 @@
---
systemd:
units:
- name: etcd-member.service
enable: true
dropins:
- name: 40-etcd-cluster.conf
contents: |
[Service]
Environment="ETCD_IMAGE_TAG=v3.2.0"
Environment="ETCD_NAME=${etcd_name}"
Environment="ETCD_ADVERTISE_CLIENT_URLS=https://${etcd_domain}:2379"
Environment="ETCD_INITIAL_ADVERTISE_PEER_URLS=https://${etcd_domain}:2380"
Environment="ETCD_LISTEN_CLIENT_URLS=https://0.0.0.0:2379"
Environment="ETCD_LISTEN_PEER_URLS=https://0.0.0.0:2380"
Environment="ETCD_INITIAL_CLUSTER=${etcd_initial_cluster}"
Environment="ETCD_STRICT_RECONFIG_CHECK=true"
Environment="ETCD_SSL_DIR=/etc/ssl/etcd"
Environment="ETCD_TRUSTED_CA_FILE=/etc/ssl/certs/etcd/server-ca.crt"
Environment="ETCD_CERT_FILE=/etc/ssl/certs/etcd/server.crt"
Environment="ETCD_KEY_FILE=/etc/ssl/certs/etcd/server.key"
Environment="ETCD_CLIENT_CERT_AUTH=true"
Environment="ETCD_PEER_TRUSTED_CA_FILE=/etc/ssl/certs/etcd/peer-ca.crt"
Environment="ETCD_PEER_CERT_FILE=/etc/ssl/certs/etcd/peer.crt"
Environment="ETCD_PEER_KEY_FILE=/etc/ssl/certs/etcd/peer.key"
Environment="ETCD_PEER_CLIENT_CERT_AUTH=true"
- name: docker.service
enable: true
- name: locksmithd.service
@ -34,7 +57,9 @@ systemd:
--volume opt-cni-bin,kind=host,source=/opt/cni/bin \
--mount volume=opt-cni-bin,target=/opt/cni/bin \
--volume var-log,kind=host,source=/var/log \
--mount volume=var-log,target=/var/log"
--mount volume=var-log,target=/var/log \
--hosts-entry=host \
--insecure-options=image"
ExecStartPre=/bin/mkdir -p /opt/cni/bin
ExecStartPre=/bin/mkdir -p /etc/kubernetes/manifests
ExecStartPre=/bin/mkdir -p /etc/kubernetes/cni/net.d
@ -56,8 +81,7 @@ systemd:
--network-plugin=cni \
--node-labels=node-role.kubernetes.io/master \
--pod-manifest-path=/etc/kubernetes/manifests \
--register-with-taints=node-role.kubernetes.io/master=:NoSchedule \
--require-kubeconfig
--register-with-taints=node-role.kubernetes.io/master=:NoSchedule
ExecStop=-/usr/bin/rkt stop --uuid-file=/var/cache/kubelet-pod.uuid
Restart=always
RestartSec=10
@ -104,8 +128,8 @@ storage:
mode: 0644
contents:
inline: |
KUBELET_IMAGE_URL=quay.io/coreos/hyperkube
KUBELET_IMAGE_TAG=v1.7.7_coreos.0
KUBELET_IMAGE_URL=docker://gcr.io/google_containers/hyperkube
KUBELET_IMAGE_TAG=v1.8.4
- path: /etc/sysctl.d/max-user-watches.conf
filesystem: root
contents:
@ -128,7 +152,7 @@ storage:
[ -d /opt/bootkube/assets/experimental/manifests ] && mv /opt/bootkube/assets/experimental/manifests/* /opt/bootkube/assets/manifests && rm -r /opt/bootkube/assets/experimental/manifests
[ -d /opt/bootkube/assets/experimental/bootstrap-manifests ] && mv /opt/bootkube/assets/experimental/bootstrap-manifests/* /opt/bootkube/assets/bootstrap-manifests && rm -r /opt/bootkube/assets/experimental/bootstrap-manifests
BOOTKUBE_ACI="$${BOOTKUBE_ACI:-quay.io/coreos/bootkube}"
BOOTKUBE_VERSION="$${BOOTKUBE_VERSION:-v0.7.0}"
BOOTKUBE_VERSION="$${BOOTKUBE_VERSION:-v0.9.0}"
BOOTKUBE_ASSETS="$${BOOTKUBE_ASSETS:-/opt/bootkube/assets}"
exec /usr/bin/rkt run \
--trust-keys-from-https \
@ -145,4 +169,4 @@ passwd:
users:
- name: core
ssh_authorized_keys:
- "${ssh_authorized_keys}"
- "${ssh_authorized_key}"


@ -1,60 +1,43 @@
# Managed Instance Group
resource "google_compute_instance_group_manager" "controllers" {
name = "${var.cluster_name}-controller-group"
description = "Compute instance group of ${var.cluster_name} controllers"
# Discrete DNS records for each controller's private IPv4 for etcd usage
resource "google_dns_record_set" "etcds" {
count = "${var.count}"
# Instance name prefix for instances in the group
base_instance_name = "${var.cluster_name}-controller"
instance_template = "${google_compute_instance_template.controller.self_link}"
update_strategy = "RESTART"
zone = "${var.zone}"
target_size = "${var.count}"
# DNS Zone name where record should be created
managed_zone = "${var.dns_zone_name}"
# Target pool instances in the group should be added into
target_pools = [
"${google_compute_target_pool.controllers.self_link}",
]
# DNS record
name = "${format("%s-etcd%d.%s.", var.cluster_name, count.index, var.dns_zone)}"
type = "A"
ttl = 300
# private IPv4 address for etcd
rrdatas = ["${element(google_compute_instance.controllers.*.network_interface.0.address, count.index)}"]
}
# Controller Container Linux Config
data "template_file" "controller_config" {
template = "${file("${path.module}/cl/controller.yaml.tmpl")}"
vars = {
k8s_dns_service_ip = "${cidrhost(var.service_cidr, 10)}"
k8s_etcd_service_ip = "${cidrhost(var.service_cidr, 15)}"
ssh_authorized_keys = "${var.ssh_authorized_key}"
kubeconfig_ca_cert = "${var.kubeconfig_ca_cert}"
kubeconfig_kubelet_cert = "${var.kubeconfig_kubelet_cert}"
kubeconfig_kubelet_key = "${var.kubeconfig_kubelet_key}"
kubeconfig_server = "${var.kubeconfig_server}"
}
# Zones in the region
data "google_compute_zones" "all" {
region = "${var.region}"
}
data "ct_config" "controller_ign" {
content = "${data.template_file.controller_config.rendered}"
pretty_print = false
}
# Controller instances
resource "google_compute_instance" "controllers" {
count = "${var.count}"
resource "google_compute_instance_template" "controller" {
name_prefix = "${var.cluster_name}-controller-"
description = "Controller Instance template"
name = "${var.cluster_name}-controller-${count.index}"
zone = "${element(data.google_compute_zones.all.names, count.index)}"
machine_type = "${var.machine_type}"
metadata {
user-data = "${data.ct_config.controller_ign.rendered}"
user-data = "${element(data.ct_config.controller_ign.*.rendered, count.index)}"
}
scheduling {
automatic_restart = "${var.preemptible ? false : true}"
preemptible = "${var.preemptible}"
}
boot_disk {
auto_delete = true
disk {
auto_delete = true
boot = true
source_image = "${var.os_image}"
disk_size_gb = "${var.disk_size}"
initialize_params {
image = "${var.os_image}"
size = "${var.disk_size}"
}
}
network_interface {
@ -65,9 +48,45 @@ resource "google_compute_instance_template" "controller" {
}
can_ip_forward = true
tags = ["${var.cluster_name}-controller"]
}
lifecycle {
# To update an Instance Template, Terraform should replace the existing resource
create_before_destroy = true
# Controller Container Linux Config
data "template_file" "controller_config" {
count = "${var.count}"
template = "${file("${path.module}/cl/controller.yaml.tmpl")}"
vars = {
# Cannot use cyclic dependencies on controllers or their DNS records
etcd_name = "etcd${count.index}"
etcd_domain = "${var.cluster_name}-etcd${count.index}.${var.dns_zone}"
# etcd0=https://cluster-etcd0.example.com,etcd1=https://cluster-etcd1.example.com,...
etcd_initial_cluster = "${join(",", formatlist("%s=https://%s:2380", null_resource.repeat.*.triggers.name, null_resource.repeat.*.triggers.domain))}"
k8s_dns_service_ip = "${cidrhost(var.service_cidr, 10)}"
ssh_authorized_key = "${var.ssh_authorized_key}"
kubeconfig_ca_cert = "${var.kubeconfig_ca_cert}"
kubeconfig_kubelet_cert = "${var.kubeconfig_kubelet_cert}"
kubeconfig_kubelet_key = "${var.kubeconfig_kubelet_key}"
kubeconfig_server = "${var.kubeconfig_server}"
}
}
# Horrible hack to generate a Terraform list of a desired length without dependencies.
# Ideal ${repeat("etcd", 3) -> ["etcd", "etcd", "etcd"]}
resource null_resource "repeat" {
count = "${var.count}"
triggers {
name = "etcd${count.index}"
domain = "${var.cluster_name}-etcd${count.index}.${var.dns_zone}"
}
}
data "ct_config" "controller_ign" {
count = "${var.count}"
content = "${element(data.template_file.controller_config.*.rendered, count.index)}"
pretty_print = false
}


@ -1,4 +1,9 @@
# Controller Network Load balancer DNS record
# Static IPv4 address for the Network Load Balancer
resource "google_compute_address" "controllers-ip" {
name = "${var.cluster_name}-controllers-ip"
}
# DNS record for the Network Load Balancer
resource "google_dns_record_set" "controllers" {
# DNS Zone name where record should be created
managed_zone = "${var.dns_zone_name}"
@ -12,12 +17,7 @@ resource "google_dns_record_set" "controllers" {
rrdatas = ["${google_compute_address.controllers-ip.address}"]
}
# Static IP for the Network Load Balancer
resource "google_compute_address" "controllers-ip" {
name = "${var.cluster_name}-controllers-ip"
}
# Network Load Balancer (i.e. forwarding rules)
# Network Load Balancer (i.e. forwarding rule)
resource "google_compute_forwarding_rule" "controller-https-rule" {
name = "${var.cluster_name}-controller-https-rule"
ip_address = "${google_compute_address.controllers-ip.address}"
@ -25,26 +25,23 @@ resource "google_compute_forwarding_rule" "controller-https-rule" {
target = "${google_compute_target_pool.controllers.self_link}"
}
resource "google_compute_forwarding_rule" "controller-ssh-rule" {
name = "${var.cluster_name}-controller-ssh-rule"
ip_address = "${google_compute_address.controllers-ip.address}"
port_range = "22"
target = "${google_compute_target_pool.controllers.self_link}"
}
# Network Load Balancer target pool of instances.
# Target pool of instances for the controller(s) Network Load Balancer
resource "google_compute_target_pool" "controllers" {
name = "${var.cluster_name}-controller-pool"
instances = [
"${google_compute_instance.controllers.*.self_link}",
]
health_checks = [
"${google_compute_http_health_check.ingress.name}",
"${google_compute_http_health_check.kubelet.name}",
]
session_affinity = "NONE"
}
# Kubelet HTTP Health Check
resource "google_compute_http_health_check" "ingress" {
resource "google_compute_http_health_check" "kubelet" {
name = "${var.cluster_name}-kubelet-health"
description = "Health check Kubelet health host port"


@ -0,0 +1,7 @@
output "etcd_fqdns" {
value = ["${null_resource.repeat.*.triggers.domain}"]
}
output "ipv4_public" {
value = ["${google_compute_instance.controllers.*.network_interface.0.access_config.0.assigned_nat_ip}"]
}


@ -30,9 +30,9 @@ variable "count" {
description = "Number of controller compute instances the instance group should manage"
}
variable "zone" {
variable "region" {
type = "string"
description = "Google zone that compute instances in the group should be created in (e.g. gcloud compute zones list)"
description = "Google Cloud region (e.g. us-central1, see `gcloud compute regions list`)."
}
variable "machine_type" {
@ -51,12 +51,6 @@ variable "disk_size" {
description = "The size of the disk in gigabytes."
}
variable "preemptible" {
type = "string"
default = "false"
description = "If enabled, Compute Engine will terminate instances randomly within 24 hours"
}
// configuration
variable "networking" {
@ -68,7 +62,7 @@ variable "networking" {
variable "service_cidr" {
description = <<EOD
CIDR IP range to assign Kubernetes services.
The 1st IP will be reserved for kube_apiserver, the 10th IP will be reserved for kube-dns, the 15th IP will be reserved for self-hosted etcd, and the 200th IP will be reserved for bootstrap self-hosted etcd.
The 1st IP will be reserved for kube_apiserver, the 10th IP will be reserved for kube-dns.
EOD
type = "string"


@ -4,18 +4,6 @@ resource "google_compute_network" "network" {
auto_create_subnetworks = true
}
resource "google_compute_firewall" "allow-ingress" {
name = "${var.cluster_name}-allow-ingress"
network = "${google_compute_network.network.name}"
allow {
protocol = "tcp"
ports = [80, 443]
}
source_ranges = ["0.0.0.0/0"]
}
resource "google_compute_firewall" "allow-ssh" {
name = "${var.cluster_name}-allow-ssh"
network = "${google_compute_network.network.name}"
@ -26,31 +14,55 @@ resource "google_compute_firewall" "allow-ssh" {
}
source_ranges = ["0.0.0.0/0"]
target_tags = ["${var.cluster_name}-controller", "${var.cluster_name}-worker"]
}
resource "google_compute_firewall" "allow-internal" {
name = "${var.cluster_name}-allow-internal"
resource "google_compute_firewall" "allow-apiserver" {
name = "${var.cluster_name}-allow-apiserver"
network = "${google_compute_network.network.name}"
allow {
protocol = "tcp"
ports = ["1-65535"]
ports = [443]
}
source_ranges = ["0.0.0.0/0"]
target_tags = ["${var.cluster_name}-controller"]
}
resource "google_compute_firewall" "allow-ingress" {
name = "${var.cluster_name}-allow-ingress"
network = "${google_compute_network.network.name}"
allow {
protocol = "udp"
ports = ["1-65535"]
protocol = "tcp"
ports = [80, 443]
}
source_ranges = ["10.0.0.0/8"]
source_ranges = ["0.0.0.0/0"]
target_tags = ["${var.cluster_name}-worker"]
}
resource "google_compute_firewall" "internal-etcd" {
name = "${var.cluster_name}-internal-etcd"
network = "${google_compute_network.network.name}"
allow {
protocol = "tcp"
ports = [2380]
}
source_tags = ["${var.cluster_name}-controller"]
target_tags = ["${var.cluster_name}-controller"]
}
# Calico BGP and IPIP
# https://docs.projectcalico.org/v2.5/reference/public-cloud/gce
resource "google_compute_firewall" "allow-calico" {
resource "google_compute_firewall" "internal-calico" {
count = "${var.networking == "calico" ? 1 : 0}"
name = "${var.cluster_name}-allow-calico"
name = "${var.cluster_name}-internal-calico"
network = "${google_compute_network.network.name}"
allow {
@ -62,5 +74,63 @@ resource "google_compute_firewall" "allow-calico" {
protocol = "ipip"
}
source_ranges = ["10.0.0.0/8"]
source_tags = ["${var.cluster_name}-controller", "${var.cluster_name}-worker"]
target_tags = ["${var.cluster_name}-controller", "${var.cluster_name}-worker"]
}
# flannel
resource "google_compute_firewall" "internal-flannel" {
count = "${var.networking == "flannel" ? 1 : 0}"
name = "${var.cluster_name}-internal-flannel"
network = "${google_compute_network.network.name}"
allow {
protocol = "udp"
ports = [8472]
}
source_tags = ["${var.cluster_name}-controller", "${var.cluster_name}-worker"]
target_tags = ["${var.cluster_name}-controller", "${var.cluster_name}-worker"]
}
# Allow prometheus (workload) to scrape node-exporter daemonset
resource "google_compute_firewall" "internal-node-exporter" {
name = "${var.cluster_name}-internal-node-exporter"
network = "${google_compute_network.network.name}"
allow {
protocol = "tcp"
ports = [9100]
}
source_tags = ["${var.cluster_name}-worker"]
target_tags = ["${var.cluster_name}-controller", "${var.cluster_name}-worker"]
}
# kubelet API to allow kubectl exec and log
resource "google_compute_firewall" "internal-kubelet" {
name = "${var.cluster_name}-internal-kubelet"
network = "${google_compute_network.network.name}"
allow {
protocol = "tcp"
ports = [10250]
}
source_tags = ["${var.cluster_name}-controller"]
target_tags = ["${var.cluster_name}-controller", "${var.cluster_name}-worker"]
}
resource "google_compute_firewall" "internal-kubelet-readonly" {
name = "${var.cluster_name}-internal-kubelet-readonly"
network = "${google_compute_network.network.name}"
allow {
protocol = "tcp"
ports = [10255]
}
source_tags = ["${var.cluster_name}-controller", "${var.cluster_name}-worker"]
target_tags = ["${var.cluster_name}-controller", "${var.cluster_name}-worker"]
}


@ -1,3 +1,15 @@
output "controllers_ipv4_public" {
value = ["${module.controllers.ipv4_public}"]
}
output "ingress_static_ip" {
value = "${module.workers.ingress_static_ip}"
}
output "network_name" {
value = "${google_compute_network.network.name}"
}
output "network_self_link" {
value = "${google_compute_network.network.self_link}"
}


@ -0,0 +1,25 @@
# Terraform version and plugin versions
terraform {
required_version = ">= 0.10.4"
}
provider "google" {
version = "~> 1.2"
}
provider "local" {
version = "~> 1.0"
}
provider "null" {
version = "~> 1.0"
}
provider "template" {
version = "~> 1.0"
}
provider "tls" {
version = "~> 1.0"
}


@ -1,12 +1,80 @@
# Secure copy etcd TLS assets and kubeconfig to controllers. Activates kubelet.service
resource "null_resource" "copy-secrets" {
depends_on = ["module.bootkube"]
count = "${var.controller_count}"
connection {
type = "ssh"
host = "${element(module.controllers.ipv4_public, count.index)}"
user = "core"
timeout = "15m"
}
provisioner "file" {
content = "${module.bootkube.kubeconfig}"
destination = "$HOME/kubeconfig"
}
provisioner "file" {
content = "${module.bootkube.etcd_ca_cert}"
destination = "$HOME/etcd-client-ca.crt"
}
provisioner "file" {
content = "${module.bootkube.etcd_client_cert}"
destination = "$HOME/etcd-client.crt"
}
provisioner "file" {
content = "${module.bootkube.etcd_client_key}"
destination = "$HOME/etcd-client.key"
}
provisioner "file" {
content = "${module.bootkube.etcd_server_cert}"
destination = "$HOME/etcd-server.crt"
}
provisioner "file" {
content = "${module.bootkube.etcd_server_key}"
destination = "$HOME/etcd-server.key"
}
provisioner "file" {
content = "${module.bootkube.etcd_peer_cert}"
destination = "$HOME/etcd-peer.crt"
}
provisioner "file" {
content = "${module.bootkube.etcd_peer_key}"
destination = "$HOME/etcd-peer.key"
}
provisioner "remote-exec" {
inline = [
"sudo mkdir -p /etc/ssl/etcd/etcd",
"sudo mv etcd-client* /etc/ssl/etcd/",
"sudo cp /etc/ssl/etcd/etcd-client-ca.crt /etc/ssl/etcd/etcd/server-ca.crt",
"sudo mv etcd-server.crt /etc/ssl/etcd/etcd/server.crt",
"sudo mv etcd-server.key /etc/ssl/etcd/etcd/server.key",
"sudo cp /etc/ssl/etcd/etcd-client-ca.crt /etc/ssl/etcd/etcd/peer-ca.crt",
"sudo mv etcd-peer.crt /etc/ssl/etcd/etcd/peer.crt",
"sudo mv etcd-peer.key /etc/ssl/etcd/etcd/peer.key",
"sudo chown -R etcd:etcd /etc/ssl/etcd",
"sudo chmod -R 500 /etc/ssl/etcd",
"sudo mv /home/core/kubeconfig /etc/kubernetes/kubeconfig",
]
}
}
# Secure copy bootkube assets to ONE controller and start bootkube to perform
# one-time self-hosted cluster bootstrapping.
resource "null_resource" "bootkube-start" {
depends_on = ["module.controllers", "module.workers", "module.bootkube"]
depends_on = ["module.controllers", "module.bootkube", "module.workers", "null_resource.copy-secrets"]
# TODO: SSH to a controller's IP instead of waiting on DNS resolution
connection {
type = "ssh"
host = "${format("%s.%s", var.cluster_name, var.dns_zone)}"
host = "${element(module.controllers.ipv4_public, 0)}"
user = "core"
timeout = "15m"
}


@ -3,9 +3,9 @@ variable "cluster_name" {
description = "Cluster name"
}
variable "zone" {
variable "region" {
type = "string"
description = "Google Cloud zone (e.g. us-central1-f, see `gcloud compute zones list`)"
description = "Google Cloud Region (e.g. us-central1, see `gcloud compute regions list`)"
}
variable "dns_zone" {
@ -46,12 +46,6 @@ variable "worker_count" {
description = "Number of workers"
}
variable "controller_preemptible" {
type = "string"
default = "false"
description = "If enabled, Compute Engine will terminate controllers randomly within 24 hours"
}
variable "worker_preemptible" {
type = "string"
default = "false"
@ -80,7 +74,7 @@ variable "pod_cidr" {
variable "service_cidr" {
description = <<EOD
CIDR IP range to assign Kubernetes services.
The 1st IP will be reserved for kube_apiserver, the 10th IP will be reserved for kube-dns, the 15th IP will be reserved for self-hosted etcd, and the 200th IP will be reserved for bootstrap self-hosted etcd.
The 1st IP will be reserved for kube_apiserver, the 10th IP will be reserved for kube-dns.
EOD
type = "string"

Some files were not shown because too many files have changed in this diff.