Compare commits

...

111 Commits

Author SHA1 Message Date
5066a25d89 Add links and clarifications in CHANGES for release 2019-03-02 11:26:12 -08:00
de251bd94f Update tutorials to prefer newer provider plugins over min version
* Minimum versions of Terraform provider plugins are enforced in
each module already. Its better to provide examples with newer
versions. Some folks don't update them
* Previously, tutorials showed the minimum viable version of each
terraform provider that might be used
2019-03-02 11:07:40 -08:00
fc277eaab6 Document the GCP DNS admin requirement for cluster provisioning
* Configure the google terraform provider to use GCP service
account credentials with compute and dns admin privileges
2019-03-02 10:54:35 -08:00
a08adc92b5 Update nginx-ingress from v0.22.0 to v0.23.0
* https://github.com/kubernetes/ingress-nginx/releases/tag/nginx-0.23.0
2019-03-01 01:18:54 -08:00
d42f42df4e Re-measure cluster provision times and document 2019-03-01 01:15:08 -08:00
4ff7fe2c29 Update Grafana dashboards from upstreams 2019-02-28 23:22:07 -08:00
f598307998 Update Kubernetes from v1.13.3 to v1.13.4
* https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.13.md#v1134
2019-02-28 22:47:43 -08:00
8ae552ebda Update documentation for use with Ubiquiti EdgeOS
* Show creation of a PXE-enabled network boot environment when
using dnsmasq as the DHCP server
* Recommend TFTP be served from /config/tftpboot since /config
is preserved between firmware upgrades
* Recommend compiling undionly.kpxe from source to enable
TLS features
* Add a note that equal-cost multi-path service IP routing
(e.g. for ingress) requires EdgeOS v2.0. Previously, it was known
that TLS handshakes couldn't be completed with packet balacing.
I've verified this is no longer the case when using the v2.0
EdgeOS firmware, ECMP works as expected.
2019-02-27 23:36:27 -08:00
daee5a9d60 Update Grafana from v6.0.0-beta3 to v6.0.0
* https://github.com/grafana/grafana/releases/tag/v6.0.0
* http://docs.grafana.org/guides/whats-new-in-v6-0/
2019-02-25 21:43:43 -08:00
73ae5d5649 Update Calico from v3.5.1 to v3.5.2
* https://docs.projectcalico.org/v3.5/releases/
2019-02-25 21:23:13 -08:00
42d7222f3d Add a readiness probe to CoreDNS
* https://github.com/poseidon/terraform-render-bootkube/pull/115
2019-02-23 13:25:23 -08:00
d10c2b4cb9 Update Grafana from v6.0.0-beta2 to v6.0.0-beta3
* Update Grafana dashboards
2019-02-23 13:03:25 -08:00
7f8572030d Upgrade to support terraform-provider-google v2.0+
* Support terraform-provider-google v1.19.0, v1.19.1, v1.20.0
and v2.0+ (and allow for future 2.x.y releases)
* Require terraform-provider-google v1.19.0 or newer. v1.19.0
introduced `network_interface` fields `network_ip` and `nat_ip`
to deprecate `address` and `assigned_nat_ip`. Those deprecated
fields are removed in terraform-provider-google v2.0
* https://github.com/terraform-providers/terraform-provider-google/releases/tag/v2.0.0
2019-02-20 02:33:32 -08:00
4294bd0292 Assign Pod Priority classes to critical cluster and node components
* Assign pod priorityClassNames to critical cluster and node
components (higher is higher priority) to inform node out-of-resource
eviction order and scheduler preemption and scheduling order
* Priority Admission Controller has been enabled since Typhoon
v1.11.1
2019-02-19 22:21:39 -08:00
ba4c5de052 Set the Google Cloud minimum CPU platform to Intel Haswell
* Intel Haswell or better is available in every zone around the world
* Neither Kubernetes nor Typhoon have a particular minimum processor
family. However, a few Google Cloud zones still default to Sandy/Ivy
bridge (scheduled to shift April 2019). Price is only based on machine
type so it is beneficial to opt for the next processor family
* Intel Haswell is a suitable minimum since it still allows plenty of
liberty in choosing any region or machine type
* Likely a slight increase to preemption probability in a few zones,
but any lower probability on Sandy/Ivy bridge is due to lower
desirability as they're phased out
* https://cloud.google.com/compute/docs/regions-zones/
2019-02-18 12:55:04 -08:00
e483c81ce9 Improve Prometheus rules and alerts and Grafana dashboards
* Collate upstream rules, alerts, and dashboards and tune for use
in Typhoon
* Previously, a well-chosen (but older) set of rules, alerts, and
dashboards were maintained to reflect metric name changes
2019-02-18 12:19:23 -08:00
6fa3b8a13f Upgrade Grafana to v6.0.0-beta2 and enable Explore UI
* Upgrade Grafana from v5.4.3 to v6.0.0-beta2
* Enable Grafana Explore UI while still using only the Viewer
role (inspect/edit without saving)
* http://docs.grafana.org/guides/whats-new-in-v6-0/
2019-02-17 13:26:42 -08:00
ac95e83249 Update mkdocs-material from v3.3.0 to v4.0.1 2019-02-16 15:55:38 -08:00
d988822741 Document and recommend terraform-provider-matchbox v0.2.3
* https://github.com/coreos/terraform-provider-matchbox/releases/tag/v0.2.3
2019-02-16 15:07:49 -08:00
170ef74eea Remove Nginx Ingress default backend
* nginx-ingress no longer requires a configured default-backend,
it will respond with its own 404 page starting in v0.21.0
* https://github.com/kubernetes/ingress-nginx/pull/3196
2019-02-16 14:18:15 -08:00
b13a651cfe Drop metrics that are unset, high cardinality, or extraneous
* https://github.com/coreos/prometheus-operator/pull/2387
* https://github.com/coreos/prometheus-operator/pull/1959
2019-02-10 23:56:11 -08:00
9c59f393a5 Add Kubernetes pod name to metrics discovered from service endpoints
* Prometheus queries from some upstreams use joins of node-exporter
and kube-state-metrics metrics by (namespace,pod). Add the Kubernetes
pod name to service endpoint metrics
* Rename the kubernetes_namespace field to namespace
* Honor labels since kube-state-metrics already include a `pod` field
that should not be overridden
2019-02-10 23:54:30 -08:00
3e4b3bfb04 Raise nginx-ingress liveness/readiness timeout
* Under heavy load, avoid timeouts causing nginx-ingress
restarts https://github.com/kubernetes/ingress-nginx/pull/3737
2019-02-09 12:53:09 -08:00
584088397c Update etcd from v3.3.11 to v3.3.12
* https://github.com/etcd-io/etcd/releases/tag/v3.3.12
2019-02-09 11:54:54 -08:00
0200058e0e Update Calico from v3.5.0 to v3.5.1
* Fix in confd https://github.com/projectcalico/confd/pull/205
2019-02-09 11:49:31 -08:00
d5537405e1 Add CHANGES note about reducing the pod eviciton timeout 2019-02-02 14:54:18 -08:00
949ce21fb2 Update Prometheus from v2.7.0 to v2.7.1
* https://github.com/prometheus/prometheus/releases/tag/v2.7.1
2019-02-02 00:13:24 -08:00
ccd96c37da Update Kubernetes from v1.13.2 to v1.13.3
* https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.13.md#v1133
2019-02-01 23:26:13 -08:00
acd539f865 Fix architecture title for DigitalOcean (#390) 2019-02-01 23:20:06 -08:00
244a1a601a Switch CoreDNS to use the forward plugin instead of proxy
* Use the forward plugin to forward to upstream resolvers, instead
of the proxy plugin. The forward plugin is reported to be a faster
alternative since it can re-use open sockets
* https://coredns.io/explugins/forward/
* https://coredns.io/plugins/proxy/
* https://github.com/kubernetes/kubernetes/issues/73254
2019-01-30 22:25:23 -08:00
d02af3d40d Update mkdocs-material from v3.2.0 to v3.3.0
* Fix minor docs typos and errors
* Allow a transient verison of the six PyPi package, the
docs build system can use the 0.12.0 (0.11.0 broke sync
tools so pinning to 0.10.0 was previously needed)
2019-01-29 23:16:57 -08:00
130daeac26 Update Prometheus from v2.6.1 to v2.7.0 2019-01-29 22:31:20 -08:00
1ab06f69d7 Update flannel from v0.10.0 to v0.11.0
* https://github.com/coreos/flannel/releases/tag/v0.11.0
2019-01-29 21:51:25 -08:00
eb08593eae Fix azure provider warning, rename a public_ip field
* azurerm_public_ip (used internally) added a field `allocation_method`
to replace the field `public_ip_address_allocation` (deprecated)
* Require terraform-provider-azurerm v1.21+
* https://github.com/terraform-providers/terraform-provider-azurerm/pull/2576
2019-01-27 17:52:35 -08:00
e9659a8539 Update Calico from v3.4.0 to v3.5.0
* https://docs.projectcalico.org/v3.5/releases/
2019-01-27 16:34:30 -08:00
6b87132aa1 Fix per platform/OS links on the docs home page
* Considering the reader of each, the Github README module links
can go to module source code and docs module links can go to the
associated tutorial docs for the platform/OS
2019-01-26 16:50:00 -08:00
f5ff003d0e Update node-exporter from v0.15.2 to v0.17.0
* node-exporter renamed multiple metrics that are reflected
in changes to Prometheus rules and Grafana dashboard expressions
2019-01-22 01:14:00 -08:00
d697dd46dc Allow kube-state-metrics PodDisruptionBudget metrics
* Update kube-state-metrics ClusterRole to allow collecting
poddisruptionbudget metrics (exported as kube_poddisruptionbudget_*)
* https://github.com/kubernetes/kube-state-metrics/pull/551
* Bump addon-resizer from v1.7 to v1.8.4
2019-01-22 01:12:32 -08:00
2f3097ebea Update nginx-ingress from v0.21.0 to v0.22.0
* https://github.com/kubernetes/ingress-nginx/releases/tag/nginx-0.22.0
2019-01-16 23:01:22 -08:00
f4d3508578 Update CoreDNS from v1.3.0 to v1.3.1
* https://coredns.io/2019/01/13/coredns-1.3.1-release/
2019-01-15 22:50:25 -08:00
67fb9602e7 Update Prometheus from v2.6.0 to v2.6.1
* https://github.com/prometheus/prometheus/releases/tag/v2.6.1
2019-01-15 21:13:40 -08:00
c8a85fabe1 Update Grafana from v5.4.2 to v5.4.3
* https://github.com/grafana/grafana/releases/tag/v5.4.3
2019-01-15 21:13:16 -08:00
7eafa59d8f Fix instance shutdown automatic worker deletion on clouds
* Fix a regression caused by lowering the Kubelet TLS client
certificate to system:nodes group (#100) since dropping
cluster-admin dropped the Kubelet's ability to delete nodes.
* On clouds where workers can scale down (manual terraform apply,
AWS spot termination, Azure low priority deletion), worker shutdown
runs the delete-node.service to remove a node to prevent NotReady
nodes from accumulating
* Allow Kubelets to delete cluster nodes via system:nodes group. Kubelets
acting with system:node and kubelet-delete ClusterRoles is still an
improvement over acting as cluster-admin
2019-01-14 23:27:48 -08:00
679079b242 Add AWS ingress_zone_id output with NLB DNS name's Route53 zone id
* DNS zones served by AWS Route53 may use AWS's special alias records
(other DNS providers would use a CNAME) to resolve the ingress NLB.
Alias records require the NLB DNS name's DNS zone id (not the cluster
`dns_zone_id`)
2019-01-13 16:45:52 -08:00
1d27dc6528 Update kube-state-metrics exporter from v1.4.0 to v1.5.0
* https://github.com/kubernetes/kube-state-metrics/releases/tag/v1.5.0
2019-01-12 14:24:57 -08:00
b74cc8afd2 Update etcd from v3.3.10 to v3.3.11
* https://github.com/etcd-io/etcd/releases/tag/v3.3.11
2019-01-12 14:17:25 -08:00
1d66ad33f7 Change AWS worker modules' default type from t2.small to t3.small
* Worker instance types weren't updated in #365
2019-01-12 00:07:48 -08:00
4d32b79c6f Update Kubernetes from v1.13.1 to v1.13.2
* https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.13.md#v1132
2019-01-12 00:00:53 -08:00
df4c0ba05d Use HTTPS liveness probes for kube-scheduler and kube-controller-manager
* Disable kube-scheduler and kube-controller-manager HTTP ports
2019-01-09 20:56:50 -08:00
bfe0c74793 Enable the certificates.k8s.io API to issue cluster certificates
* System components that require certificates signed by the cluster
CA can submit a CSR to the apiserver, have an administrator inspect
and approve it, and be issued a certificate
* Configure kube-controller-manager to sign Approved CSR's using the
cluster CA private key
* Admins are responsible for approving or denying CSRs, otherwise,
no certificate is issued. Read the Kubernetes docs carefully and
verify the entity making the request and the authorization level
* https://kubernetes.io/docs/tasks/tls/managing-tls-in-a-cluster
2019-01-06 17:33:37 -08:00
60c70797ec Use a single format of the admin kubeconfig
* Use a single admin kubeconfig for initial bootkube bootstrap
and for use by a human admin. Previously, an admin kubeconfig
without a named context was used for bootstrap and direct usage
with KUBECONFIG=path, while one with a named context was used
for `kubectl config use-context` style usage. Confusing.
* Provide the admin kubeconfig via `assets/auth/kubeconfig`,
`assets/auth/CLUSTER-config`, or output `kubeconfig-admin`
2019-01-05 14:57:18 -08:00
6795a753ea Update CoreDNS from v1.2.6 to v1.3.0
* https://coredns.io/2018/12/15/coredns-1.3.0-release/
2019-01-05 13:35:03 -08:00
b57273b6f1 Rename internal kube_dns_service_ip to cluster_dns_service_ip
* terraform-render-bootkube module deprecated kube_dns_service_ip
output in favor of cluster_dns_service_ip
* Rename k8s_dns_service_ip to cluster_dns_service_ip for
consistency too
2019-01-05 13:32:03 -08:00
812a1adb49 Use a lower-privilege Kubelet kubeconfig in system:nodes
* Kubelets can use a lower-privilege TLS client certificate with
Org system:nodes and a binding to the system:node ClusterRole
* Admin kubeconfig's continue to belong to Org system:masters to
provide cluster-admin (available in assets/auth/kubeconfig or as
a Terraform output kubeconfig-admin)
* Remove bare-metal output variable kubeconfig
2019-01-05 13:08:56 -08:00
1c6a0392ad Fix missing slash in links in the AWS tutorial 2019-01-02 23:33:02 -08:00
5263d00a6f Update mkdocs-material from v3.1.0 to v3.2.0 2019-01-02 23:31:49 -08:00
66e1365cc4 Add ServiceAccounts for kube-apiserver and kube-scheduler
* Add ServiceAccounts and ClusterRoleBindings for kube-apiserver
and kube-scheduler
* Remove the ClusterRoleBinding for the kube-system default ServiceAccount
* Rename the CA certificate CommonName for consistency with upstream
2019-01-01 20:16:14 -08:00
ea8b0d1c84 Update Prometheus addon from v2.5.0 to v2.6.0
* https://github.com/prometheus/prometheus/releases/tag/v2.6.0
2018-12-27 07:35:12 -08:00
f2f4deb8bb Change AWS default type from t2.small to t3.small
* T3 is the next generation general purpose burstable
instance type. Compared with t2.small, the t3.small is
cheaper, has 2 vCPU (instead of 1) and provides 5 Gbps
of pod-to-pod bandwidth (instead of 1 Gbps)
2018-12-18 12:38:35 -08:00
4d2f33aee6 Update changelog for v1.13.1 release 2018-12-17 14:28:27 -08:00
d42f47c49e Update terraform-provider-ct plugin from v0.2.1 to v0.3.0
* Provide migration instructions for upgrading terraform-provider-ct
in-place for v1.12.2+ clusters
* Require switching from ~/.terraformrc to the Terraform third-party
plugins directory ~/.terraform.d/plugins/
* Require Container Linux 1688.5.3 or newer
2018-12-17 14:13:50 -08:00
53e549f233 Add Flatcar Linux to the issue template 2018-12-16 10:47:59 -08:00
bcb200186d Add admin kubeconfig as a Terraform output
* May be used to write a local file
2018-12-15 22:52:28 -08:00
479d498024 Update Calico from v3.3.2 to v3.4.0
* https://docs.projectcalico.org/v3.4/releases/
2018-12-15 18:05:16 -08:00
e0c032be94 Increase GCP TCP proxy apiserver backend timeout to 5 minutes
* On GCP, kubectl port-forward connections to pods are closed
after a timeout (unlike AWS NLB's or Azure load balancers)
* Increase the GCP apiserver backend service timeout from 1 minute
to 5 minutes to be more similar to AWS/Azure LB behavior
2018-12-15 17:34:18 -08:00
b74bf11772 Update Grafana from v5.4.0 to v5.4.2
* https://github.com/grafana/grafana/releases/tag/v5.4.2
* https://github.com/grafana/grafana/releases/tag/v5.4.1
2018-12-15 12:39:03 -08:00
018c5edc25 Update Kubernetes from v1.13.0 to v1.13.1
* https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.13.md#v1131
2018-12-15 11:44:57 -08:00
8aeec0b9b5 Fix typo in descriptive firewall name (#359) 2018-12-15 11:34:32 -08:00
ff6ab571f3 Update Calico from v3.3.1 to v3.3.2
* https://docs.projectcalico.org/v3.3/releases/
2018-12-06 22:56:55 -08:00
991fb44c37 Update Grafana from v5.3.4 to v5.4.0
* https://github.com/grafana/grafana/releases/tag/v5.4.0
2018-12-06 01:33:50 -08:00
d31f444fcd Update Kubernetes from v1.12.3 to v1.13.0 2018-12-03 20:44:32 -08:00
76d993cdae Add experimental kube-router CNI provider
* Add kube-router for pod networking and NetworkPolicy
as an experiment
* Experiments are not documented or supported in any way,
and may be removed without notice. They have known issues
and aren't enabled without special options.
2018-12-03 19:52:28 -08:00
b6016d0a26 Disable Grafana login form, admin user can't be disabled
* Example manifests aim to provide a read-only dashboard visible
to any users with network access (i.e. kubectl port-forward, LAN)
* Problem: Grafana always has an admin user, even with the user
management system disabled
* Disable the login form to prevent admin login
2018-11-28 22:04:08 -08:00
eec314b52f Update CHANGES changelog for release 2018-11-28 09:23:13 -08:00
bcce02a9ce Add Kubelet /etc/iscsi and iscsiadm mounts on bare-metal
* Allow using iSCSI with Container Linux bare-metal clusters
* Warning, iSCSI isn't part of Kubernetes conformance and isn't
regularly evaluated
2018-11-28 00:28:46 -08:00
42c523e6a2 Recommend switch from ~/.terraformrc to 3rd-party plugin dir
* Switch tutorials from using ~/.terraformrc to using the 3rd-party
plugin directory so 3rd-party plugins can be pinned
* Continue to show using terraform-provider-ct v0.2.2. Updating to
a newer version is only safe once all managed clusters are v1.12.2
or higher
2018-11-28 00:03:15 -08:00
64b4c10418 Improve features and modules list docs
* Remove bullet about isolating workloads on workers, its
now common practice and new users will assume it
* List advanced features available in each module
* Fix erroneous Kubernetes version listing for Google Cloud
Fedora Atomic
2018-11-26 22:58:00 -08:00
872b11b948 Update ngninx-ingress from v0.20.0 to v0.21.0
* https://github.com/kubernetes/ingress-nginx/releases/tag/nginx-0.21.0
2018-11-26 21:57:34 -08:00
5b27d8d889 Update Kubernetes from v1.12.2 to v1.12.3
* https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.12.md/#v1123
2018-11-26 21:06:09 -08:00
840b73f9ba Update pod-checkpointer image to query Kubelet secure API
* Updates pod-checkpointer to prefer the Kubelet secure
API (before falling back to the Kubelet read-only API that
is disabled on Typhoon clusters since
https://github.com/poseidon/typhoon/pull/324)
* Previously, pod-checkpointer checkpointed an initial set
of pods during bootstrapping so recovery from power cycling
clusters was unaffected, but logs were noisy
* https://github.com/kubernetes-incubator/bootkube/pull/1027
* https://github.com/kubernetes-incubator/bootkube/pull/1025
2018-11-26 20:24:32 -08:00
915af3c6cc Fix Calico Felix reporting usage data, require opt-in
* Calico Felix has been reporting anonymous usage data about the
version and cluster size, which violates Typhoon's privacy policy
where analytics should be opt-in only
* Add a variable enable_reporting (default: false) to allow opting
in to reporting usage data to Calico (or future components)
2018-11-20 01:03:00 -08:00
c6586b69fd Use eviction policy Delete for Low priority VMSS workers
* Fix issue where Azure defaults to Deallocate eviction policy,
which required manually restarting deallocated workers
* Require terraform-provider-azurerm v1.19+ to support setting
the eviction_policy
2018-11-18 21:04:50 -08:00
ea3fc6d2a7 Update CoreDNS from v1.2.4 to v1.2.6
* https://coredns.io/2018/11/05/coredns-1.2.6-release/
2018-11-18 16:45:53 -08:00
c8c43f3991 Update Grafana from v5.3.2 to v5.3.4
* https://github.com/grafana/grafana/releases/tag/v5.3.3
* https://github.com/grafana/grafana/releases/tag/v5.3.4
2018-11-18 16:42:50 -08:00
58472438ce Update mkdocs-material theme from v3.0.6 to v3.1.0
* https://github.com/squidfunk/mkdocs-material/releases/tag/3.1.0
2018-11-18 16:08:00 -08:00
7f8e781ae4 Measure DigitalOcean network performance
* Measuring pod-to-pod bandwidth in a few regions (NYC3, FRA1,
SFO1) shows DigitalOcean has made some improvements
2018-11-11 21:08:10 -08:00
56e9a82984 Add flannel resource request and mount only /run/flannel 2018-11-11 20:35:21 -08:00
e95b856a22 Enable CoreDNS loop and loadbalance plugins
* loop sends an initial query to detect infinite forwarding
loops in configured upstream DNS servers and fast exit with
an error (its a fatal misconfiguration on the network that
will otherwise cause resolvers to consume memory/CPU until
crashing, masking the problem)
* https://github.com/coredns/coredns/tree/master/plugin/loop
* loadbalance randomizes the ordering of A, AAAA, and MX records
in responses to provide round-robin load balancing (as usual,
clients may still cache responses though)
* https://github.com/coredns/coredns/tree/master/plugin/loadbalance
2018-11-10 17:36:56 -08:00
31f48a81a8 Update docs to show flannel DaemonSet instead of kube-flannel
* No functional change, the rename is just for consistency
2018-11-10 15:16:06 -08:00
2b3f61d1bb Update Calico from v3.3.0 to v3.3.1
* Structure Calico and flannel manifests
* Rename kube-flannel mentions to just flannel
2018-11-10 13:37:12 -08:00
8fd2978c31 Update bootkube image version from v0.13.0 to v0.14.0
* https://github.com/kubernetes-incubator/bootkube/releases/tag/v0.14.0
2018-11-06 23:35:11 -08:00
7de03a1279 Fix Prometheus etcd scrape config for DigitalOcean
* Kubelet uses a node's hostname as the node name, which isn't
resolvable on DigitalOcean. On DigitalOcean, the node name was
set to the internal IP until #337 switched to instead configuring
kube-apiserver to prefer the InternalIP for communication
* Explicitly configure etcd scrapes to target each controller by
internal IP and port 2381 (replace __address__)
2018-11-06 23:02:45 -08:00
be9f7b87d6 Update Prometheus from v2.4.3 to v2.5.0
* https://github.com/prometheus/prometheus/releases/tag/v2.5.0
2018-11-06 22:16:12 -08:00
721c847943 Set kube-apiserver kubelet preferred address types
* Prefer InternalIP and ExternalIP over the node's hostname,
to match upstream behavior and kubeadm
* Previously, hostname-override was used to set node names
to internal IP's to work around some cloud providers not
resolving hostnames for instances (e.g. DO droplets)
2018-11-03 22:31:55 -07:00
78c9fdc18f Update mkdocs-material docs theme version 2018-10-28 19:45:58 -07:00
884c8b39dc Update Grafana from v5.3.1 to v5.3.2
* https://github.com/grafana/grafana/releases/tag/v5.3.2
2018-10-28 19:44:22 -07:00
0e71f7e565 Ignore controller user_data changes to allow plugin updates
* Updating the `terraform-provider-ct` plugin is known to produce
a `user_data` diff in all pre-existing clusters. Applying the
diff to pre-existing cluster destroys controller nodes
* Ignore changes to controller `user_data`. Once all managed
clusters use a release containing this change, it is possible
to update the `terraform-provider-ct` plugin (worker `user_data`
will still be modified)
* Changing the module `ref` for an existing cluster and
re-applying is still NOT supported (although this PR
would protect controllers from being destroyed)
2018-10-28 16:48:12 -07:00
8c4200d425 Add DigitalOcean AAAA DNS records on Fedora Atomic 2018-10-28 14:57:31 -07:00
5be5b261e2 Add an IPv6 address and forwarding rules on Google Cloud
* Allowing serving IPv6 applications via Kubernetes Ingress
on Typhoon Google Cloud clusters
* Add `ingress_static_ipv6` output variable for use in AAAA
DNS records
2018-10-28 14:30:58 -07:00
f034ef90ae Add DigitalOcean AAAA DNS records resolving to workers
* Improve the workers "round-robin" DNS FQDN that is created
with each cluster by adding AAAA records
* CNAME's resolving to the DigitalOcean `workers_dns` output
can be followed to find a droplet's IPv4 or IPv6 address
* The CNI portmap plugin doesn't support IPv6. Hosting IPv6
apps is possible, but requires editing the nginx-ingress
addon with `hostNetwork: true`
2018-10-27 23:09:24 -07:00
3bba1ba0dc Use new azurerm_network_interface_backend_address_pool_association
* Require terraform-provider-azurerm v1.17+
* Inline load_balancer_backend_address_pools_ids is deprecated
and scheduled for removal in the v2.0 provider
* https://github.com/terraform-providers/terraform-provider-azurerm/pull/2079
2018-10-27 22:55:05 -07:00
dbe7604b67 Add primary field to ip_configuration required by Azure
* Required by terraform-provider-azurerm v1.17+
* https://github.com/terraform-providers/terraform-provider-azurerm/pull/2035
2018-10-27 16:44:44 -07:00
9b405a19b2 Fix minor naming inconsistencies in Ignition and CLC data 2018-10-27 16:24:59 -07:00
bfa1a679eb Name AWS and DigitalOcean Ignition data sources consistently 2018-10-27 16:14:44 -07:00
f1da0731d8 Update Kubernetes from v1.12.1 to v1.12.2
* Update CoreDNS from v1.2.2 to v1.2.4
* https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.12.md#v1122
* https://coredns.io/2018/10/17/coredns-1.2.4-release/
* https://coredns.io/2018/10/16/coredns-1.2.3-release/
2018-10-27 15:47:57 -07:00
d641a058fe Update Calico from v3.2.3 to v3.3.0
* https://docs.projectcalico.org/v3.3/releases/
2018-10-23 20:30:30 -07:00
99a6d5478b Disable Kubelet read-only port 10255
* We can finally disable the Kubelet read-only port 10255!
* Journey: https://github.com/poseidon/typhoon/issues/322#issuecomment-431073073
2018-10-18 21:14:14 -07:00
bc750aec33 Configure Heapster to source metrics from Kubelet authenticated API
* Heapster can now get nodes (i.e. kubelets) from the apiserver and
source metrics from the Kubelet authenticated API (10250) instead of
the Kubelet HTTP read-only API (10255)
* https://github.com/kubernetes/heapster/blob/master/docs/source-configuration.md
* Use the heapster service account token via Kubelet bearer token
authn/authz.
* Permit Heapster to skip CA verification. The CA cert does not contain
IP SANs and cannot since nodes get random IPs that aren't known upfront.
Heapster obtains the node list from the apiserver, so the risk of
spoofing a node is limited. For the same reason, Prometheus scrapes
must skip CA verification for scraping Kubelet's provided by the apiserver.
* https://github.com/poseidon/typhoon/blob/v1.12.1/addons/prometheus/config.yaml#L68
* Create a heapster ClusterRole to work around the default Kubernetes
`system:heapster` ClusterRole lacking the proper GET `nodes/stats`
access. See https://github.com/kubernetes/heapster/issues/1936
2018-10-18 21:03:01 -07:00
d55bfd5589 Fix CoreDNS AntiAffinity spec to prefer spreading replicas
* Pods were still being scheduled at random due to a typo
2018-10-17 22:19:57 -07:00
0be4673e44 Add disk_iops variable for AWS
* Setting disk_iops is required for disk_type io1
* https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolumeTypes.html#EBSVolumeTypes
2018-10-17 22:18:54 -07:00
3b44972d78 Add links to header to CHANGES 2018-10-17 09:08:58 -07:00
162 changed files with 10611 additions and 7855 deletions

View File

@ -5,8 +5,8 @@
### Environment
* Platform: aws, azure, bare-metal, google-cloud, digital-ocean
* OS: container-linux, fedora-atomic
* Ref: Release version or Git SHA (reporting latest is **not** helpful)
* OS: container-linux, flatcar-linux, or fedora-atomic
* Release: Typhoon version or Git SHA (reporting latest is **not** helpful)
* Terraform: `terraform version` (reporting latest is **not** helpful)
* Plugins: Provider plugin versions (reporting latest is **not** helpful)

View File

@ -4,20 +4,216 @@ Notable changes between versions.
## Latest
* Kubernetes [v1.12.1](https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.12.md#v1121)
* Update etcd from v3.3.9 to [v3.3.10](https://github.com/etcd-io/etcd/blob/master/CHANGELOG-3.3.md#v3310-2018-10-10)
* Update CoreDNS from 1.1.3 to 1.2.2
* Update Calico from v3.2.1 to v3.2.3
* On multi-controller clusters, raise scheduler and controller-manager replics to equal the number of controller nodes ([#312](https://github.com/poseidon/typhoon/pull/312))
* Single-controller clusters continue to run 2 replicas as before
* Raise default CoreDNS replica count to the larger of 2 or the number of controller nodes ([#313](https://github.com/poseidon/typhoon/pull/313))
* Add AntiAffinity preferred rule to favor spreading CoreDNS pods
* Annotate Kubernetes control plane and addons to start containers with the Docker runtime's default seccomp profile ([#319](https://github.com/poseidon/typhoon/pull/319))
* Override Kubernetes default behavior that starts containers with seccomp=unconfined
## v1.13.4
* Kubernetes [v1.13.4](https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.13.md#v1134)
* Update etcd from v3.3.11 to [v3.3.12](https://github.com/etcd-io/etcd/releases/tag/v3.3.12)
* Update Calico from v3.5.0 to [v3.5.2](https://docs.projectcalico.org/v3.5/releases/)
* Assign priorityClassNames to critical cluster and node components ([#406](https://github.com/poseidon/typhoon/pull/406))
* Inform node out-of-resource eviction and scheduler preemption and ordering
* Add CoreDNS readiness probe ([#410](https://github.com/poseidon/typhoon/pull/410))
#### Bare-Metal
* Recommend updating [terraform-provider-matchbox](https://github.com/coreos/terraform-provider-matchbox) plugin from v0.2.2 to [v0.2.3](https://github.com/coreos/terraform-provider-matchbox/releases/tag/v0.2.3) ([#402](https://github.com/poseidon/typhoon/pull/402))
* Improve docs on using Ubiquiti EdgeOS with bare-metal clusters ([#413](https://github.com/poseidon/typhoon/pull/413))
#### Google Cloud
* Support `terraform-provider-google` v2.0+ ([#407](https://github.com/poseidon/typhoon/pull/407))
* Require `terraform-provider-google` v1.19+ (**action required**)
* Set the minimum CPU platform to Intel Haswell ([#405](https://github.com/poseidon/typhoon/pull/405))
* Haswell or better is available in every zone (no price change)
* A few zones still default to Sandy/Ivy Bridge (shifts in April 2019)
#### Addons
* Modernize Prometheus rules and alerts ([#404](https://github.com/poseidon/typhoon/pull/404))
* Drop extraneous metrics ([#397](https://github.com/poseidon/typhoon/pull/397))
* Add `pod` name label to metrics discovered via service endpoints
* Rename `kubernetes_namespace` label to `namespace`
* Modernize Grafana and dashboards, see [docs](https://typhoon.psdn.io/addons/grafana/) ([#403](https://github.com/poseidon/typhoon/pull/403), [#404](https://github.com/poseidon/typhoon/pull/404))
* Upgrade Grafana from v5.4.3 to [v6.0.0](https://github.com/grafana/grafana/releases/tag/v6.0.0)!
* Enable Grafana [Explore](http://docs.grafana.org/guides/whats-new-in-v6-0/#explore) UI as a Viewer (inspect/edit without saving)
* Update nginx-ingress from v0.22.0 to v0.23.0
* Raise nginx-ingress liveness/readiness timeout to 5 seconds
* Remove nginx-ingess default-backend ([#401](https://github.com/poseidon/typhoon/pull/401))
#### Fedora Atomic
* Build Kubelet [system container](https://github.com/poseidon/system-containers) with buildah. The image is an OCI format and slightly larger.
## v1.13.3
* Kubernetes [v1.13.3](https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.13.md#v1133)
* Update etcd from v3.3.10 to [v3.3.11](https://github.com/etcd-io/etcd/blob/master/CHANGELOG-3.3.md#v3311-2019-1-11)
* Update CoreDNS from v1.3.0 to [v1.3.1](https://coredns.io/2019/01/13/coredns-1.3.1-release/)
* Switch from the `proxy` plugin to the faster `forward` plugin for upsteam resolvers
* Update Calico from v3.4.0 to [v3.5.0](https://docs.projectcalico.org/v3.5/releases/)
* Update flannel from v0.10.0 to [v0.11.0](https://github.com/coreos/flannel/releases/tag/v0.11.0)
* Reduce pod eviction timeout for deleting pods on unready nodes to 1 minute
* Respond more quickly to node preemption (previously 5 minutes)
* Fix automatic worker deletion on shutdown for cloud platforms
* Lowering Kubelet privileges in [#372](https://github.com/poseidon/typhoon/pull/372) dropped a needed node deletion authorization. Scale-in due to manual terraform apply (any cloud), AWS spot termination, or Azure low priority deletion left old nodes registered, requiring manual deletion (`kubectl delete node name`)
#### AWS
* Add `ingress_zone_id` output with the NLB DNS name's Route53 zone for use in alias records ([#380](https://github.com/poseidon/typhoon/pull/380))
#### Azure
* Remove admin_password field (disabled) since it is now optional
* Fix azure provider warning, `public_ip` `allocation_method` replaces `public_ip_address_allocation`
* Require `terraform-provider-azurerm` v1.21+ (action required)
#### Addons
* Update nginx-ingress from v0.21.0 to v0.22.0
* Update Prometheus from v2.6.0 to v2.7.1
* Update kube-state-metrics from v1.4.0 to v1.5.0
* Fix ClusterRole to collect and export PodDisruptionBudget metrics ([#383](https://github.com/poseidon/typhoon/pull/383))
* Update node-exporter from v0.15.2 to v0.17.0
* Update Grafana from v5.4.2 to v5.4.3
## v1.13.2
* Kubernetes [v1.13.2](https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.13.md#v1132)
* Add ServiceAccounts for `kube-apiserver` and `kube-scheduler` ([#370](https://github.com/poseidon/typhoon/pull/370))
* Use lower-privilege TLS client certificates for Kubelets ([#372](https://github.com/poseidon/typhoon/pull/372))
* Use HTTPS liveness probes for `kube-scheduler` and `kube-controller-manager` ([#377](https://github.com/poseidon/typhoon/pull/377))
* Update CoreDNS from v1.2.6 to [v1.3.0](https://coredns.io/2018/12/15/coredns-1.3.0-release/)
* Allow the `certificates.k8s.io` API to issue certificates signed by the cluster CA ([#376](https://github.com/poseidon/typhoon/pull/376))
* Configure controller manager to sign CSRs that are manually [approved](https://kubernetes.io/docs/tasks/tls/managing-tls-in-a-cluster) by an administrator
#### AWS
* Change `controller_type` and `worker_type` default from t2.small to t3.small ([#365](https://github.com/poseidon/typhoon/pull/365))
* t3.small is cheaper, provides 2 vCPU (instead of 1), and 5 Gbps of pod-to-pod bandwidth!
#### Bare-Metal
* Remove the `kubeconfig` output variable
#### Addons
* Update Prometheus from v2.5.0 to v2.6.0
## v1.13.1
* Kubernetes [v1.13.1](https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.13.md#v1131)
* Update Calico from v3.3.2 to [v3.4.0](https://docs.projectcalico.org/v3.4/releases/) ([#362](https://github.com/poseidon/typhoon/pull/362))
* Install CNI plugins with an init container rather than a sidecar
* Improve the `calico-node` ClusterRole
* Recommend updating `terraform-provider-ct` plugin from v0.2.1 to v0.3.0 ([#363](https://github.com/poseidon/typhoon/pull/363))
* [Migration](https://typhoon.psdn.io/topics/maintenance/#upgrade-terraform-provider-ct) instructions for upgrading `terraform-provider-ct` in-place for v1.12.2+ clusters (**action required**)
* [Require](https://typhoon.psdn.io/topics/maintenance/#terraform-plugins-directory) switching from `~/.terraformrc` to the Terraform [third-party plugins](https://www.terraform.io/docs/configuration/providers.html#third-party-plugins) directory `~/.terraform.d/plugins/`
* Require Container Linux 1688.5.3 or newer
#### Google Cloud
* Increase TCP proxy apiserver backend service timeout from 1 minute to 5 minutes ([#361](https://github.com/poseidon/typhoon/pull/361))
* Align `port-forward` behavior closer to AWS/Azure (no timeout)
#### Addons
* Update Grafana from v5.4.0 to v5.4.2
## v1.13.0
* Kubernetes [v1.13.0](https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.13.md#v1130)
* Update Calico from v3.3.1 to [v3.3.2](https://docs.projectcalico.org/v3.3/releases/)
#### Addons
* Update Grafana from v5.3.4 to v5.4.0
* Disable Grafana login form, since admin user can't be disabled ([#352](https://github.com/poseidon/typhoon/pull/352))
* Example manifests aim to provide a read-only dashboard view
## v1.12.3
* Kubernetes [v1.12.3](https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.12.md#v1123)
* Add `enable_reporting` variable (default "false") to provide upstreams with usage data ([#345](https://github.com/poseidon/typhoon/pull/345))
* Change kube-apiserver `--kubelet-preferred-address-types` to InternalIP,ExternalIP,Hostname
* Update Calico from v3.3.0 to [v3.3.1](https://docs.projectcalico.org/v3.3/releases/)
* Disable Felix usage reporting by default ([#345](https://github.com/poseidon/typhoon/pull/345))
* Improve flannel manifests
* [Rename](https://github.com/poseidon/terraform-render-bootkube/commit/d045a8e6b8eccfbb9d69bb51953b5a93d23f67f7) `kube-flannel` DaemonSet to `flannel` and `kube-flannel-cfg` ConfigMap to `flannel-config`
* [Drop](https://github.com/poseidon/terraform-render-bootkube/commit/39f9afb3360ec642e5b98457c8bd07eda35b6c96) unused mounts and add a CPU resource request
* Update CoreDNS from v1.2.4 to [v1.2.6](https://coredns.io/2018/11/05/coredns-1.2.6-release/)
* Enable CoreDNS `loop` and `loadbalance` plugins ([#340](https://github.com/poseidon/typhoon/pull/340))
* Fix pod-checkpointer log noise and checkpointable pods detection ([#346](https://github.com/poseidon/typhoon/pull/346))
* Use kubernetes-incubator/bootkube v0.14.0
* [Recommend](https://typhoon.psdn.io/topics/maintenance/#terraform-plugins-directory) switching from `~/.terraformrc` to the Terraform [third-party plugins](https://www.terraform.io/docs/configuration/providers.html#third-party-plugins) directory `~/.terraform.d/plugins/`.
* Allows pinning `terraform-provider-ct` and `terraform-provider-matchbox` versions
* Improves safety of later plugin version migrations
#### Azure
* Use eviction policy `Delete` for `Low` priority virtual machine scale set workers ([#343](https://github.com/poseidon/typhoon/pull/343))
* Fix issue where Azure defaults to `Deallocate` eviction policy, which required manually restarting deallocated instances. `Delete` policy aligns Azure with AWS and GCP behavior.
* Require `terraform-provider-azurerm` v1.19+ (action required)
#### Bare-Metal
* Add Kubelet `/etc/iscsi` and `iscsadm` mounts on bare-metal for iSCSI ([#103](https://github.com/poseidon/typhoon/pull/103))
#### Addons
* Update nginx-ingress from v0.20.0 to v0.21.0
* Update Prometheus from v2.4.3 to v2.5.0
* Update Grafana from v5.3.2 to v5.3.4
## v1.12.2
* Kubernetes [v1.12.2](https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.12.md#v1122)
* Update CoreDNS from 1.2.2 to [1.2.4](https://github.com/coredns/coredns/releases/tag/v1.2.4)
* Update Calico from v3.2.3 to [v3.3.0](https://docs.projectcalico.org/v3.3/releases/)
* Disable Kubelet read-only port ([#324](https://github.com/poseidon/typhoon/pull/324))
* Fix CoreDNS AntiAffinity spec to prefer spreading replicas
* Ignore controller node user-data changes ([#335](https://github.com/poseidon/typhoon/pull/335))
* Once all managed clusters use v1.12.2, it is possible to update `terraform-provider-ct`
#### AWS
* Add `disk_iops` variable for EBS volume IOPS ([#314](https://github.com/poseidon/typhoon/pull/314))
#### Azure
* Use new `azurerm_network_interface_backend_address_pool_association` ([#332](https://github.com/poseidon/typhoon/pull/332))
* Require `terraform-provider-azurerm` v1.17+ (action required)
* Add `primary` field to `ip_configuration` needed by v1.17+ ([#331](https://github.com/poseidon/typhoon/pull/331))
#### DigitalOcean
* Add AAAA DNS records resolving to worker nodes ([#333](https://github.com/poseidon/typhoon/pull/333))
* Hosting IPv6 apps requires editing nginx-ingress with `hostNetwork: true`
#### Google Cloud
* Add an IPv6 address and IPv6 forwarding rules for load balancing IPv6 Ingress ([#334](https://github.com/poseidon/typhoon/pull/334))
* Add `ingress_static_ipv6` output variable for use in AAAA DNS records
* Allow serving IPv6 applications via Kubernetes Ingress
#### Addons
* Configure Heapster to scrape Kubelets with bearer token auth ([#323](https://github.com/poseidon/typhoon/pull/323))
* Update Grafana from v5.3.1 to v5.3.2
## v1.12.1
* Kubernetes [v1.12.1](https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.12.md#v1121)
* Update etcd from v3.3.9 to [v3.3.10](https://github.com/etcd-io/etcd/blob/master/CHANGELOG-3.3.md#v3310-2018-10-10)
* Update CoreDNS from 1.1.3 to [1.2.2](https://github.com/coredns/coredns/releases/tag/v1.2.2)
* Update Calico from v3.2.1 to [v3.2.3](https://docs.projectcalico.org/v3.2/releases/)
* Raise scheduler and controller-manager replicas to the larger of 2 or the number of controller nodes ([#312](https://github.com/poseidon/typhoon/pull/312))
* Single-controller clusters continue to run 2 replicas as before
* Raise default CoreDNS replicas to the larger of 2 or the number of controller nodes ([#313](https://github.com/poseidon/typhoon/pull/313))
* Add AntiAffinity preferred rule to favor spreading CoreDNS pods
* Annotate control plane and addon containers to use the Docker runtime seccomp profile ([#319](https://github.com/poseidon/typhoon/pull/319))
* Override Kubernetes default behavior that starts containers with `seccomp=unconfined`
#### Azure
* Remove `admin_password` field (disabled) since it is now optional
* Require `terraform-provider-azurerm` v1.16+ (action required)
#### Bare-Metal

View File

@ -11,29 +11,32 @@ Typhoon distributes upstream Kubernetes, architectural conventions, and cluster
## Features <a href="https://www.cncf.io/certification/software-conformance/"><img align="right" src="https://storage.googleapis.com/poseidon/certified-kubernetes.png"></a>
* Kubernetes v1.12.1 (upstream, via [kubernetes-incubator/bootkube](https://github.com/kubernetes-incubator/bootkube))
* Single or multi-master, workloads isolated on workers, [Calico](https://www.projectcalico.org/) or [flannel](https://github.com/coreos/flannel) networking
* Kubernetes v1.13.4 (upstream, via [kubernetes-incubator/bootkube](https://github.com/kubernetes-incubator/bootkube))
* Single or multi-master, [Calico](https://www.projectcalico.org/) or [flannel](https://github.com/coreos/flannel) networking
* On-cluster etcd with TLS, [RBAC](https://kubernetes.io/docs/admin/authorization/rbac/)-enabled, [network policy](https://kubernetes.io/docs/concepts/services-networking/network-policies/)
* Advanced features like [worker pools](https://typhoon.psdn.io/advanced/worker-pools/) and [preemption](https://typhoon.psdn.io/cl/google-cloud/#preemption) (varies by platform)
* Ready for Ingress, Prometheus, Grafana, and other optional [addons](https://typhoon.psdn.io/addons/overview/)
* Advanced features like [worker pools](https://typhoon.psdn.io/advanced/worker-pools/), [preemptible](https://typhoon.psdn.io/cl/google-cloud/#preemption) workers, and [snippets](https://typhoon.psdn.io/advanced/customization/#container-linux) customization
* Ready for Ingress, Prometheus, Grafana, CSI, or other [addons](https://typhoon.psdn.io/addons/overview/)
## Modules
Typhoon provides a Terraform Module for each supported operating system and platform.
Typhoon provides a Terraform Module for each supported operating system and platform. Container Linux is a mature and reliable choice. Also, Kinvolk's Flatcar Linux fork is selectable on AWS and bare-metal.
| Platform | Operating System | Terraform Module | Status |
|---------------|------------------|------------------|--------|
| AWS | Container Linux | [aws/container-linux/kubernetes](aws/container-linux/kubernetes) | stable |
| AWS | Fedora Atomic | [aws/fedora-atomic/kubernetes](aws/fedora-atomic/kubernetes) | alpha |
| Azure | Container Linux | [azure/container-linux/kubernetes](cl/azure.md) | alpha |
| Bare-Metal | Container Linux | [bare-metal/container-linux/kubernetes](bare-metal/container-linux/kubernetes) | stable |
| Bare-Metal | Fedora Atomic | [bare-metal/fedora-atomic/kubernetes](bare-metal/fedora-atomic/kubernetes) | alpha |
| Digital Ocean | Container Linux | [digital-ocean/container-linux/kubernetes](digital-ocean/container-linux/kubernetes) | beta |
| Digital Ocean | Fedora Atomic | [digital-ocean/fedora-atomic/kubernetes](digital-ocean/fedora-atomic/kubernetes) | alpha |
| Google Cloud | Container Linux | [google-cloud/container-linux/kubernetes](google-cloud/container-linux/kubernetes) | stable |
| Google Cloud | Fedora Atomic | [google-cloud/fedora-atomic/kubernetes](google-cloud/fedora-atomic/kubernetes) | alpha |
The AWS and bare-metal `container-linux` modules allow picking Red Hat Container Linux (formerly CoreOS Container Linux) or Kinvolk's Flatcar Linux friendly fork.
Fedora Atomic support is alpha and will evolve as Fedora Atomic is replaced by Fedora CoreOS.
| Platform | Operating System | Terraform Module | Status |
|---------------|------------------|------------------|--------|
| AWS | Fedora Atomic | [aws/fedora-atomic/kubernetes](aws/fedora-atomic/kubernetes) | alpha |
| Bare-Metal | Fedora Atomic | [bare-metal/fedora-atomic/kubernetes](bare-metal/fedora-atomic/kubernetes) | alpha |
| Digital Ocean | Fedora Atomic | [digital-ocean/fedora-atomic/kubernetes](digital-ocean/fedora-atomic/kubernetes) | alpha |
| Google Cloud | Fedora Atomic | [google-cloud/fedora-atomic/kubernetes](google-cloud/fedora-atomic/kubernetes) | alpha |
## Documentation
@ -47,7 +50,7 @@ Define a Kubernetes cluster by using the Terraform module for your chosen platfo
```tf
module "google-cloud-yavin" {
source = "git::https://github.com/poseidon/typhoon//google-cloud/container-linux/kubernetes?ref=v1.12.1"
source = "git::https://github.com/poseidon/typhoon//google-cloud/container-linux/kubernetes?ref=v1.13.4"
providers = {
google = "google.default"
@ -87,10 +90,10 @@ In 4-8 minutes (varies by platform), the cluster will be ready. This Google Clou
```sh
$ export KUBECONFIG=/home/user/.secrets/clusters/yavin/auth/kubeconfig
$ kubectl get nodes
NAME STATUS AGE VERSION
yavin-controller-0.c.example-com.internal Ready 6m v1.12.1
yavin-worker-jrbf.c.example-com.internal Ready 5m v1.12.1
yavin-worker-mzdm.c.example-com.internal Ready 5m v1.12.1
NAME ROLES STATUS AGE VERSION
yavin-controller-0.c.example-com.internal controller,master Ready 6m v1.13.4
yavin-worker-jrbf.c.example-com.internal node Ready 5m v1.13.4
yavin-worker-mzdm.c.example-com.internal node Ready 5m v1.13.4
```
List the pods.
@ -102,6 +105,7 @@ kube-system calico-node-1cs8z 2/2 Running 0
kube-system calico-node-d1l5b 2/2 Running 0 6m
kube-system calico-node-sp9ps 2/2 Running 0 6m
kube-system coredns-1187388186-zj5dl 1/1 Running 0 6m
kube-system coredns-1187388186-dkh3o 1/1 Running 0 6m
kube-system kube-apiserver-zppls 1/1 Running 0 6m
kube-system kube-controller-manager-3271970485-gh9kt 1/1 Running 0 6m
kube-system kube-controller-manager-3271970485-h90v8 1/1 Running 1 6m
@ -111,6 +115,7 @@ kube-system kube-proxy-njn47 1/1 Running 0
kube-system kube-scheduler-3895335239-5x87r 1/1 Running 0 6m
kube-system kube-scheduler-3895335239-bzrrt 1/1 Running 1 6m
kube-system pod-checkpointer-l6lrt 1/1 Running 0 6m
kube-system pod-checkpointer-l6lrt-controller-0 1/1 Running 0 6m
```
## Non-Goals

View File

@ -0,0 +1,36 @@
apiVersion: v1
kind: ConfigMap
metadata:
name: grafana-config
namespace: monitoring
data:
custom.ini: |+
[server]
http_port = 8080
[paths]
data = /var/lib/grafana
plugins = /var/lib/grafana/plugins
provisioning = /etc/grafana/provisioning
[users]
allow_sign_up = false
allow_org_create = false
# viewers can edit/inspect, but not save
viewers_can_edit = true
# Disable login form, since Grafana always creates an admin user
[auth]
disable_login_form = true
# Disable the user/pass login system
[auth.basic]
enabled = false
# Allow anonymous authentication with view-only authorization
[auth.anonymous]
enabled = true
org_role = Viewer
[analytics]
reporting_enabled = false

File diff suppressed because it is too large Load Diff

View File

@ -10,7 +10,15 @@ data:
- name: prometheus
type: prometheus
access: proxy
orgId: 1
url: http://prometheus.monitoring.svc.cluster.local
version: 1
editable: false
loki.yaml: |+
apiVersion: 1
datasources:
- name: loki
type: loki
access: proxy
url: http://loki.monitoring.svc.cluster.local
version: 1
editable: false

View File

@ -23,18 +23,10 @@ spec:
spec:
containers:
- name: grafana
image: grafana/grafana:5.3.1
image: grafana/grafana:6.0.0
env:
- name: GF_SERVER_HTTP_PORT
value: "8080"
- name: GF_AUTH_BASIC_ENABLED
value: "false"
- name: GF_AUTH_ANONYMOUS_ENABLED
value: "true"
- name: GF_AUTH_ANONYMOUS_ORG_ROLE
value: Viewer
- name: GF_ANALYTICS_REPORTING_ENABLED
value: "false"
- name: GF_PATHS_CONFIG
value: "/etc/grafana/custom.ini"
ports:
- name: http
containerPort: 8080
@ -46,19 +38,24 @@ spec:
memory: 200Mi
cpu: 200m
volumeMounts:
- name: config
mountPath: /etc/grafana
- name: datasources
mountPath: /etc/grafana/provisioning/datasources
- name: dashboard-providers
- name: providers
mountPath: /etc/grafana/provisioning/dashboards
- name: dashboards
mountPath: /var/lib/grafana/dashboards
mountPath: /etc/grafana/dashboards
volumes:
- name: config
configMap:
name: grafana-config
- name: datasources
configMap:
name: grafana-datasources
- name: dashboard-providers
- name: providers
configMap:
name: grafana-dashboard-providers
name: grafana-providers
- name: dashboards
configMap:
name: grafana-dashboards

View File

@ -1,10 +1,10 @@
apiVersion: v1
kind: ConfigMap
metadata:
name: grafana-dashboard-providers
name: grafana-providers
namespace: monitoring
data:
dashboard-providers.yaml: |+
providers.yaml: |+
apiVersion: 1
providers:
- name: 'default'
@ -12,4 +12,4 @@ data:
folder: ''
type: file
options:
path: /var/lib/grafana/dashboards
path: /etc/grafana/dashboards

View File

@ -5,7 +5,7 @@ metadata:
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: system:heapster
name: heapster
subjects:
- kind: ServiceAccount
name: heapster

View File

@ -0,0 +1,30 @@
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: heapster
rules:
- apiGroups:
- ""
resources:
- events
- namespaces
- nodes
- pods
verbs:
- get
- list
- watch
- apiGroups:
- extensions
resources:
- deployments
verbs:
- get
- list
- watch
- apiGroups:
- ""
resources:
- nodes/stats
verbs:
- get

View File

@ -23,7 +23,7 @@ spec:
image: k8s.gcr.io/heapster-amd64:v1.5.4
command:
- /heapster
- --source=kubernetes.summary_api:''
- --source=kubernetes.summary_api:''?useServiceAccount=true&kubeletHttps=true&kubeletPort=10250&insecure=true
livenessProbe:
httpGet:
path: /healthz

View File

@ -1,42 +0,0 @@
apiVersion: apps/v1
kind: Deployment
metadata:
name: default-backend
namespace: ingress
spec:
replicas: 1
selector:
matchLabels:
name: default-backend
phase: prod
template:
metadata:
labels:
name: default-backend
phase: prod
annotations:
seccomp.security.alpha.kubernetes.io/pod: 'docker/default'
spec:
containers:
- name: default-backend
# Any image is permissable as long as:
# 1. It serves a 404 page at /
# 2. It serves 200 on a /healthz endpoint
image: k8s.gcr.io/defaultbackend:1.4
ports:
- containerPort: 8080
resources:
limits:
cpu: 10m
memory: 20Mi
requests:
cpu: 10m
memory: 20Mi
livenessProbe:
httpGet:
path: /healthz
port: 8080
scheme: HTTP
initialDelaySeconds: 30
timeoutSeconds: 5
terminationGracePeriodSeconds: 60

View File

@ -1,15 +0,0 @@
apiVersion: v1
kind: Service
metadata:
name: default-backend
namespace: ingress
spec:
type: ClusterIP
selector:
name: default-backend
phase: prod
ports:
- name: http
protocol: TCP
port: 80
targetPort: 8080

View File

@ -24,10 +24,9 @@ spec:
node-role.kubernetes.io/node: ""
containers:
- name: nginx-ingress-controller
image: quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.20.0
image: quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.23.0
args:
- /nginx-ingress-controller
- --default-backend-service=$(POD_NAMESPACE)/default-backend
- --ingress-class=public
# use downward API
env:
@ -58,7 +57,7 @@ spec:
initialDelaySeconds: 10
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
timeoutSeconds: 5
readinessProbe:
failureThreshold: 3
httpGet:
@ -67,7 +66,7 @@ spec:
scheme: HTTP
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
timeoutSeconds: 5
securityContext:
capabilities:
add:

View File

@ -1,42 +0,0 @@
apiVersion: apps/v1
kind: Deployment
metadata:
name: default-backend
namespace: ingress
spec:
replicas: 1
selector:
matchLabels:
name: default-backend
phase: prod
template:
metadata:
labels:
name: default-backend
phase: prod
annotations:
seccomp.security.alpha.kubernetes.io/pod: 'docker/default'
spec:
containers:
- name: default-backend
# Any image is permissable as long as:
# 1. It serves a 404 page at /
# 2. It serves 200 on a /healthz endpoint
image: k8s.gcr.io/defaultbackend:1.4
ports:
- containerPort: 8080
resources:
limits:
cpu: 10m
memory: 20Mi
requests:
cpu: 10m
memory: 20Mi
livenessProbe:
httpGet:
path: /healthz
port: 8080
scheme: HTTP
initialDelaySeconds: 30
timeoutSeconds: 5
terminationGracePeriodSeconds: 60

View File

@ -1,15 +0,0 @@
apiVersion: v1
kind: Service
metadata:
name: default-backend
namespace: ingress
spec:
type: ClusterIP
selector:
name: default-backend
phase: prod
ports:
- name: http
protocol: TCP
port: 80
targetPort: 8080

View File

@ -24,10 +24,9 @@ spec:
node-role.kubernetes.io/node: ""
containers:
- name: nginx-ingress-controller
image: quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.20.0
image: quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.23.0
args:
- /nginx-ingress-controller
- --default-backend-service=$(POD_NAMESPACE)/default-backend
- --ingress-class=public
# use downward API
env:
@ -58,7 +57,7 @@ spec:
initialDelaySeconds: 10
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
timeoutSeconds: 5
readinessProbe:
failureThreshold: 3
httpGet:
@ -67,7 +66,7 @@ spec:
scheme: HTTP
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
timeoutSeconds: 5
securityContext:
capabilities:
add:

View File

@ -1,42 +0,0 @@
apiVersion: apps/v1
kind: Deployment
metadata:
name: default-backend
namespace: ingress
spec:
replicas: 1
selector:
matchLabels:
name: default-backend
phase: prod
template:
metadata:
labels:
name: default-backend
phase: prod
annotations:
seccomp.security.alpha.kubernetes.io/pod: 'docker/default'
spec:
containers:
- name: default-backend
# Any image is permissable as long as:
# 1. It serves a 404 page at /
# 2. It serves 200 on a /healthz endpoint
image: k8s.gcr.io/defaultbackend:1.4
ports:
- containerPort: 8080
resources:
limits:
cpu: 10m
memory: 20Mi
requests:
cpu: 10m
memory: 20Mi
livenessProbe:
httpGet:
path: /healthz
port: 8080
scheme: HTTP
initialDelaySeconds: 30
timeoutSeconds: 5
terminationGracePeriodSeconds: 60

View File

@ -1,15 +0,0 @@
apiVersion: v1
kind: Service
metadata:
name: default-backend
namespace: ingress
spec:
type: ClusterIP
selector:
name: default-backend
phase: prod
ports:
- name: http
protocol: TCP
port: 80
targetPort: 8080

View File

@ -22,10 +22,9 @@ spec:
spec:
containers:
- name: nginx-ingress-controller
image: quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.20.0
image: quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.23.0
args:
- /nginx-ingress-controller
- --default-backend-service=$(POD_NAMESPACE)/default-backend
- --ingress-class=public
# use downward API
env:
@ -53,7 +52,7 @@ spec:
periodSeconds: 10
successThreshold: 1
failureThreshold: 3
timeoutSeconds: 1
timeoutSeconds: 5
readinessProbe:
httpGet:
path: /healthz
@ -62,7 +61,7 @@ spec:
periodSeconds: 10
successThreshold: 1
failureThreshold: 3
timeoutSeconds: 1
timeoutSeconds: 5
securityContext:
capabilities:
add:

View File

@ -24,10 +24,9 @@ spec:
node-role.kubernetes.io/node: ""
containers:
- name: nginx-ingress-controller
image: quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.20.0
image: quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.23.0
args:
- /nginx-ingress-controller
- --default-backend-service=$(POD_NAMESPACE)/default-backend
- --ingress-class=public
# use downward API
env:
@ -58,7 +57,7 @@ spec:
initialDelaySeconds: 10
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
timeoutSeconds: 5
readinessProbe:
failureThreshold: 3
httpGet:
@ -67,7 +66,7 @@ spec:
scheme: HTTP
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
timeoutSeconds: 5
securityContext:
capabilities:
add:

View File

@ -1,42 +0,0 @@
apiVersion: apps/v1
kind: Deployment
metadata:
name: default-backend
namespace: ingress
spec:
replicas: 1
selector:
matchLabels:
name: default-backend
phase: prod
template:
metadata:
labels:
name: default-backend
phase: prod
annotations:
seccomp.security.alpha.kubernetes.io/pod: 'docker/default'
spec:
containers:
- name: default-backend
# Any image is permissable as long as:
# 1. It serves a 404 page at /
# 2. It serves 200 on a /healthz endpoint
image: k8s.gcr.io/defaultbackend:1.4
ports:
- containerPort: 8080
resources:
limits:
cpu: 10m
memory: 20Mi
requests:
cpu: 10m
memory: 20Mi
livenessProbe:
httpGet:
path: /healthz
port: 8080
scheme: HTTP
initialDelaySeconds: 30
timeoutSeconds: 5
terminationGracePeriodSeconds: 60

View File

@ -1,15 +0,0 @@
apiVersion: v1
kind: Service
metadata:
name: default-backend
namespace: ingress
spec:
type: ClusterIP
selector:
name: default-backend
phase: prod
ports:
- name: http
protocol: TCP
port: 80
targetPort: 8080

View File

@ -1,42 +0,0 @@
apiVersion: apps/v1
kind: Deployment
metadata:
name: default-backend
namespace: ingress
spec:
replicas: 1
selector:
matchLabels:
name: default-backend
phase: prod
template:
metadata:
labels:
name: default-backend
phase: prod
annotations:
seccomp.security.alpha.kubernetes.io/pod: 'docker/default'
spec:
containers:
- name: default-backend
# Any image is permissable as long as:
# 1. It serves a 404 page at /
# 2. It serves 200 on a /healthz endpoint
image: k8s.gcr.io/defaultbackend:1.4
ports:
- containerPort: 8080
resources:
limits:
cpu: 10m
memory: 20Mi
requests:
cpu: 10m
memory: 20Mi
livenessProbe:
httpGet:
path: /healthz
port: 8080
scheme: HTTP
initialDelaySeconds: 30
timeoutSeconds: 5
terminationGracePeriodSeconds: 60

View File

@ -1,15 +0,0 @@
apiVersion: v1
kind: Service
metadata:
name: default-backend
namespace: ingress
spec:
type: ClusterIP
selector:
name: default-backend
phase: prod
ports:
- name: http
protocol: TCP
port: 80
targetPort: 8080

View File

@ -24,10 +24,9 @@ spec:
node-role.kubernetes.io/node: ""
containers:
- name: nginx-ingress-controller
image: quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.20.0
image: quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.23.0
args:
- /nginx-ingress-controller
- --default-backend-service=$(POD_NAMESPACE)/default-backend
- --ingress-class=public
# use downward API
env:
@ -58,7 +57,7 @@ spec:
initialDelaySeconds: 10
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
timeoutSeconds: 5
readinessProbe:
failureThreshold: 3
httpGet:
@ -67,7 +66,7 @@ spec:
scheme: HTTP
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
timeoutSeconds: 5
securityContext:
capabilities:
add:

View File

@ -55,6 +55,17 @@ data:
action: replace
target_label: job
metric_relabel_configs:
- source_labels: [__name__]
action: drop
regex: etcd_(debugging|disk|request|server).*
- source_labels: [__name__]
action: drop
regex: apiserver_admission_controller_admission_latencies_seconds_.*
- source_labels: [__name__]
action: drop
regex: apiserver_admission_step_admission_latencies_seconds_.*
# Scrape config for node (i.e. kubelet) /metrics (e.g. 'kubelet_'). Explore
# metrics from a node by scraping kubelet (127.0.0.1:10250/metrics).
- job_name: 'kubelet'
@ -89,6 +100,13 @@ data:
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
metric_relabel_configs:
- source_labels: [__name__, image]
action: drop
regex: container_([a-z_]+);
- source_labels: [__name__]
action: drop
regex: container_(network_tcp_usage_total|network_udp_usage_total|tasks_state|cpu_load_average_10s)
# Scrap etcd metrics from controllers via listen-metrics-urls
@ -102,7 +120,7 @@ data:
regex: 'true'
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- source_labels: [__meta_kubernetes_node_name]
- source_labels: [__meta_kubernetes_node_address_InternalIP]
action: replace
target_label: __address__
replacement: '${1}:2381'
@ -119,10 +137,10 @@ data:
# * `prometheus.io/port`: If the metrics are exposed on a different port to the
# service then set this appropriately.
- job_name: 'kubernetes-service-endpoints'
kubernetes_sd_configs:
- role: endpoints
honor_labels: true
relabel_configs:
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
action: keep
@ -144,10 +162,18 @@ data:
regex: __meta_kubernetes_service_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
target_label: namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: pod
- source_labels: [__meta_kubernetes_service_name]
action: replace
target_label: job
metric_relabel_configs:
- source_labels: [__name__]
action: drop
regex: etcd_(debugging|disk|request|server).*
# Example scrape config for probing services via the Blackbox Exporter.
#
@ -177,7 +203,7 @@ data:
- action: labelmap
regex: __meta_kubernetes_service_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
target_label: kubernetes_namespace
target_label: namespace
- source_labels: [__meta_kubernetes_service_name]
target_label: job

View File

@ -20,7 +20,7 @@ spec:
serviceAccountName: prometheus
containers:
- name: prometheus
image: quay.io/prometheus/prometheus:v2.4.3
image: quay.io/prometheus/prometheus:v2.7.1
args:
- --web.listen-address=0.0.0.0:9090
- --config.file=/etc/prometheus/prometheus.yaml

View File

@ -3,7 +3,8 @@ kind: ClusterRole
metadata:
name: kube-state-metrics
rules:
- apiGroups: [""]
- apiGroups:
- ""
resources:
- configmaps
- secrets
@ -17,23 +18,47 @@ rules:
- persistentvolumes
- namespaces
- endpoints
verbs: ["list", "watch"]
- apiGroups: ["extensions"]
verbs:
- list
- watch
- apiGroups:
- extensions
resources:
- daemonsets
- deployments
- replicasets
verbs: ["list", "watch"]
- apiGroups: ["apps"]
verbs:
- list
- watch
- apiGroups:
- apps
resources:
- statefulsets
verbs: ["list", "watch"]
- apiGroups: ["batch"]
- daemonsets
- deployments
- replicasets
verbs:
- list
- watch
- apiGroups:
- batch
resources:
- cronjobs
- jobs
verbs: ["list", "watch"]
- apiGroups: ["autoscaling"]
verbs:
- list
- watch
- apiGroups:
- autoscaling
resources:
- horizontalpodautoscalers
verbs: ["list", "watch"]
verbs:
- list
- watch
- apiGroups:
- policy
resources:
- poddisruptionbudgets
verbs:
- list
- watch

View File

@ -24,7 +24,7 @@ spec:
serviceAccountName: kube-state-metrics
containers:
- name: kube-state-metrics
image: quay.io/coreos/kube-state-metrics:v1.4.0
image: quay.io/coreos/kube-state-metrics:v1.5.0
ports:
- name: metrics
containerPort: 8080
@ -35,7 +35,7 @@ spec:
initialDelaySeconds: 5
timeoutSeconds: 5
- name: addon-resizer
image: k8s.gcr.io/addon-resizer:1.7
image: k8s.gcr.io/addon-resizer:1.8.4
resources:
limits:
cpu: 100m

View File

@ -6,7 +6,7 @@ metadata:
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: kube-state-metrics-resizer
name: kube-state-metrics
subjects:
- kind: ServiceAccount
name: kube-state-metrics

View File

@ -1,15 +1,31 @@
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: kube-state-metrics-resizer
name: kube-state-metrics
namespace: monitoring
rules:
- apiGroups: [""]
- apiGroups:
- ""
resources:
- pods
verbs: ["get"]
- apiGroups: ["extensions"]
verbs:
- get
- apiGroups:
- extensions
resources:
- deployments
resourceNames: ["kube-state-metrics"]
verbs: ["get", "update"]
resourceNames:
- kube-state-metrics
verbs:
- get
- update
- apiGroups:
- apps
resources:
- deployments
resourceNames:
- kube-state-metrics
verbs:
- get
- update

View File

@ -28,21 +28,24 @@ spec:
hostPID: true
containers:
- name: node-exporter
image: quay.io/prometheus/node-exporter:v0.15.2
image: quay.io/prometheus/node-exporter:v0.17.0
args:
- "--path.procfs=/host/proc"
- "--path.sysfs=/host/sys"
- --path.procfs=/host/proc
- --path.sysfs=/host/sys
- --path.rootfs=/host/root
- --collector.filesystem.ignored-mount-points=^/(dev|proc|sys|var/lib/docker/.+)($|/)
- --collector.filesystem.ignored-fs-types=^(autofs|binfmt_misc|cgroup|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|mqueue|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|sysfs|tracefs)$
ports:
- name: metrics
containerPort: 9100
hostPort: 9100
resources:
requests:
memory: 30Mi
cpu: 100m
limits:
memory: 50Mi
limits:
cpu: 200m
memory: 100Mi
volumeMounts:
- name: proc
mountPath: /host/proc
@ -50,6 +53,9 @@ spec:
- name: sys
mountPath: /host/sys
readOnly: true
- name: root
mountPath: /host/root
readOnly: true
tolerations:
- effect: NoSchedule
operator: Exists
@ -60,3 +66,6 @@ spec:
- name: sys
hostPath:
path: /sys
- name: root
hostPath:
path: /

View File

@ -4,582 +4,1089 @@ metadata:
name: prometheus-rules
namespace: monitoring
data:
alertmanager.rules.yaml: |
groups:
- name: alertmanager.rules
rules:
- alert: AlertmanagerConfigInconsistent
expr: count_values("config_hash", alertmanager_config_hash) BY (service) / ON(service)
GROUP_LEFT() label_replace(prometheus_operator_alertmanager_spec_replicas, "service",
"alertmanager-$1", "alertmanager", "(.*)") != 1
for: 5m
labels:
severity: critical
annotations:
description: The configuration of the instances of the Alertmanager cluster
`{{$labels.service}}` are out of sync.
- alert: AlertmanagerDownOrMissing
expr: label_replace(prometheus_operator_alertmanager_spec_replicas, "job", "alertmanager-$1",
"alertmanager", "(.*)") / ON(job) GROUP_RIGHT() sum(up) BY (job) != 1
for: 5m
labels:
severity: warning
annotations:
description: An unexpected number of Alertmanagers are scraped or Alertmanagers
disappeared from discovery.
- alert: AlertmanagerFailedReload
expr: alertmanager_config_last_reload_successful == 0
for: 10m
labels:
severity: warning
annotations:
description: Reloading Alertmanager's configuration has failed for {{ $labels.namespace
}}/{{ $labels.pod}}.
etcd3.rules.yaml: |
groups:
- name: ./etcd3.rules
rules:
- alert: InsufficientMembers
expr: count(up{job="etcd"} == 0) > (count(up{job="etcd"}) / 2 - 1)
for: 3m
labels:
severity: critical
annotations:
description: If one more etcd member goes down the cluster will be unavailable
summary: etcd cluster insufficient members
- alert: NoLeader
expr: etcd_server_has_leader{job="etcd"} == 0
for: 1m
labels:
severity: critical
annotations:
description: etcd member {{ $labels.instance }} has no leader
summary: etcd member has no leader
- alert: HighNumberOfLeaderChanges
expr: increase(etcd_server_leader_changes_seen_total{job="etcd"}[1h]) > 3
labels:
severity: warning
annotations:
description: etcd instance {{ $labels.instance }} has seen {{ $value }} leader
changes within the last hour
summary: a high number of leader changes within the etcd cluster are happening
- alert: GRPCRequestsSlow
expr: histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{job="etcd",grpc_type="unary"}[5m])) by (grpc_service, grpc_method, le))
> 0.15
for: 10m
labels:
severity: critical
annotations:
description: on etcd instance {{ $labels.instance }} gRPC requests to {{ $labels.grpc_method
}} are slow
summary: slow gRPC requests
- alert: HighNumberOfFailedHTTPRequests
expr: sum(rate(etcd_http_failed_total{job="etcd"}[5m])) BY (method) / sum(rate(etcd_http_received_total{job="etcd"}[5m]))
BY (method) > 0.01
for: 10m
labels:
severity: warning
annotations:
description: '{{ $value }}% of requests for {{ $labels.method }} failed on etcd
instance {{ $labels.instance }}'
summary: a high number of HTTP requests are failing
- alert: HighNumberOfFailedHTTPRequests
expr: sum(rate(etcd_http_failed_total{job="etcd"}[5m])) BY (method) / sum(rate(etcd_http_received_total{job="etcd"}[5m]))
BY (method) > 0.05
for: 5m
labels:
severity: critical
annotations:
description: '{{ $value }}% of requests for {{ $labels.method }} failed on etcd
instance {{ $labels.instance }}'
summary: a high number of HTTP requests are failing
- alert: HTTPRequestsSlow
expr: histogram_quantile(0.99, rate(etcd_http_successful_duration_seconds_bucket[5m]))
> 0.15
for: 10m
labels:
severity: warning
annotations:
description: on etcd instance {{ $labels.instance }} HTTP requests to {{ $labels.method
}} are slow
summary: slow HTTP requests
- alert: EtcdMemberCommunicationSlow
expr: histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[5m]))
> 0.15
for: 10m
labels:
severity: warning
annotations:
description: etcd instance {{ $labels.instance }} member communication with
{{ $labels.To }} is slow
summary: etcd member communication is slow
- alert: HighNumberOfFailedProposals
expr: increase(etcd_server_proposals_failed_total{job="etcd"}[1h]) > 5
labels:
severity: warning
annotations:
description: etcd instance {{ $labels.instance }} has seen {{ $value }} proposal
failures within the last hour
summary: a high number of proposals within the etcd cluster are failing
- alert: HighFsyncDurations
expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))
> 0.5
for: 10m
labels:
severity: warning
annotations:
description: etcd instance {{ $labels.instance }} fync durations are high
summary: high fsync durations
- alert: HighCommitDurations
expr: histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m]))
> 0.25
for: 10m
labels:
severity: warning
annotations:
description: etcd instance {{ $labels.instance }} commit durations are high
summary: high commit durations
general.rules.yaml: |
groups:
- name: general.rules
rules:
- alert: TargetDown
expr: 100 * (count(up == 0) BY (job) / count(up) BY (job)) > 10
for: 10m
labels:
severity: warning
annotations:
description: '{{ $value }}% of {{ $labels.job }} targets are down.'
summary: Targets are down
- record: fd_utilization
expr: process_open_fds / process_max_fds
- alert: FdExhaustionClose
expr: predict_linear(fd_utilization[1h], 3600 * 4) > 1
for: 10m
labels:
severity: warning
annotations:
description: '{{ $labels.job }}: {{ $labels.namespace }}/{{ $labels.pod }} instance
will exhaust in file/socket descriptors within the next 4 hours'
summary: file descriptors soon exhausted
- alert: FdExhaustionClose
expr: predict_linear(fd_utilization[10m], 3600) > 1
for: 10m
labels:
severity: critical
annotations:
description: '{{ $labels.job }}: {{ $labels.namespace }}/{{ $labels.pod }} instance
will exhaust in file/socket descriptors within the next hour'
summary: file descriptors soon exhausted
kube-controller-manager.rules.yaml: |
groups:
- name: kube-controller-manager.rules
rules:
- alert: K8SControllerManagerDown
expr: absent(up{job="kube-controller-manager"} == 1)
for: 5m
labels:
severity: critical
annotations:
description: There is no running K8S controller manager. Deployments and replication
controllers are not making progress.
summary: Controller manager is down
kube-scheduler.rules.yaml: |
groups:
- name: kube-scheduler.rules
rules:
- record: cluster:scheduler_e2e_scheduling_latency_seconds:quantile
expr: histogram_quantile(0.99, sum(scheduler_e2e_scheduling_latency_microseconds_bucket)
BY (le, cluster)) / 1e+06
labels:
quantile: "0.99"
- record: cluster:scheduler_e2e_scheduling_latency_seconds:quantile
expr: histogram_quantile(0.9, sum(scheduler_e2e_scheduling_latency_microseconds_bucket)
BY (le, cluster)) / 1e+06
labels:
quantile: "0.9"
- record: cluster:scheduler_e2e_scheduling_latency_seconds:quantile
expr: histogram_quantile(0.5, sum(scheduler_e2e_scheduling_latency_microseconds_bucket)
BY (le, cluster)) / 1e+06
labels:
quantile: "0.5"
- record: cluster:scheduler_scheduling_algorithm_latency_seconds:quantile
expr: histogram_quantile(0.99, sum(scheduler_scheduling_algorithm_latency_microseconds_bucket)
BY (le, cluster)) / 1e+06
labels:
quantile: "0.99"
- record: cluster:scheduler_scheduling_algorithm_latency_seconds:quantile
expr: histogram_quantile(0.9, sum(scheduler_scheduling_algorithm_latency_microseconds_bucket)
BY (le, cluster)) / 1e+06
labels:
quantile: "0.9"
- record: cluster:scheduler_scheduling_algorithm_latency_seconds:quantile
expr: histogram_quantile(0.5, sum(scheduler_scheduling_algorithm_latency_microseconds_bucket)
BY (le, cluster)) / 1e+06
labels:
quantile: "0.5"
- record: cluster:scheduler_binding_latency_seconds:quantile
expr: histogram_quantile(0.99, sum(scheduler_binding_latency_microseconds_bucket)
BY (le, cluster)) / 1e+06
labels:
quantile: "0.99"
- record: cluster:scheduler_binding_latency_seconds:quantile
expr: histogram_quantile(0.9, sum(scheduler_binding_latency_microseconds_bucket)
BY (le, cluster)) / 1e+06
labels:
quantile: "0.9"
- record: cluster:scheduler_binding_latency_seconds:quantile
expr: histogram_quantile(0.5, sum(scheduler_binding_latency_microseconds_bucket)
BY (le, cluster)) / 1e+06
labels:
quantile: "0.5"
- alert: K8SSchedulerDown
expr: absent(up{job="kube-scheduler"} == 1)
for: 5m
labels:
severity: critical
annotations:
description: There is no running K8S scheduler. New pods are not being assigned
to nodes.
summary: Scheduler is down
kube-state-metrics.rules.yaml: |
groups:
- name: kube-state-metrics.rules
rules:
- alert: DeploymentGenerationMismatch
expr: kube_deployment_status_observed_generation != kube_deployment_metadata_generation
for: 15m
labels:
severity: warning
annotations:
description: Observed deployment generation does not match expected one for
deployment {{$labels.namespaces}}/{{$labels.deployment}}
summary: Deployment is outdated
- alert: DeploymentReplicasNotUpdated
expr: ((kube_deployment_status_replicas_updated != kube_deployment_spec_replicas)
or (kube_deployment_status_replicas_available != kube_deployment_spec_replicas))
unless (kube_deployment_spec_paused == 1)
for: 15m
labels:
severity: warning
annotations:
description: Replicas are not updated and available for deployment {{$labels.namespaces}}/{{$labels.deployment}}
summary: Deployment replicas are outdated
- alert: DaemonSetRolloutStuck
expr: kube_daemonset_status_number_ready / kube_daemonset_status_desired_number_scheduled
* 100 < 100
for: 15m
labels:
severity: warning
annotations:
description: Only {{$value}}% of desired pods scheduled and ready for daemon
set {{$labels.namespaces}}/{{$labels.daemonset}}
summary: DaemonSet is missing pods
- alert: K8SDaemonSetsNotScheduled
expr: kube_daemonset_status_desired_number_scheduled - kube_daemonset_status_current_number_scheduled
> 0
for: 10m
labels:
severity: warning
annotations:
description: A number of daemonsets are not scheduled.
summary: Daemonsets are not scheduled correctly
- alert: DaemonSetsMissScheduled
expr: kube_daemonset_status_number_misscheduled > 0
for: 10m
labels:
severity: warning
annotations:
description: A number of daemonsets are running where they are not supposed
to run.
summary: Daemonsets are not scheduled correctly
- alert: PodFrequentlyRestarting
expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
for: 10m
labels:
severity: warning
annotations:
description: Pod {{$labels.namespaces}}/{{$labels.pod}} restarted {{$value}}
times within the last hour
summary: Pod is restarting frequently
kubelet.rules.yaml: |
groups:
- name: kubelet.rules
rules:
- alert: K8SNodeNotReady
expr: kube_node_status_condition{condition="Ready",status="true"} == 0
for: 1h
labels:
severity: warning
annotations:
description: The Kubelet on {{ $labels.node }} has not checked in with the API,
or has set itself to NotReady, for more than an hour
summary: Node status is NotReady
- alert: K8SManyNodesNotReady
expr: count(kube_node_status_condition{condition="Ready",status="true"} == 0)
> 1 and (count(kube_node_status_condition{condition="Ready",status="true"} ==
0) / count(kube_node_status_condition{condition="Ready",status="true"})) > 0.2
for: 1m
labels:
severity: critical
annotations:
description: '{{ $value }}% of Kubernetes nodes are not ready'
- alert: K8SKubeletDown
expr: count(up{job="kubelet"} == 0) / count(up{job="kubelet"}) * 100 > 3
for: 1h
labels:
severity: warning
annotations:
description: Prometheus failed to scrape {{ $value }}% of kubelets.
- alert: K8SKubeletDown
expr: (absent(up{job="kubelet"} == 1) or count(up{job="kubelet"} == 0) / count(up{job="kubelet"}))
* 100 > 10
for: 1h
labels:
severity: critical
annotations:
description: Prometheus failed to scrape {{ $value }}% of kubelets, or all Kubelets
have disappeared from service discovery.
summary: Many Kubelets cannot be scraped
- alert: K8SKubeletTooManyPods
expr: kubelet_running_pod_count > 100
for: 10m
labels:
severity: warning
annotations:
description: Kubelet {{$labels.instance}} is running {{$value}} pods, close
to the limit of 110
summary: Kubelet is close to pod limit
kubernetes.rules.yaml: |
groups:
- name: kubernetes.rules
rules:
- record: pod_name:container_memory_usage_bytes:sum
expr: sum(container_memory_usage_bytes{container_name!="POD",pod_name!=""}) BY
(pod_name)
- record: pod_name:container_spec_cpu_shares:sum
expr: sum(container_spec_cpu_shares{container_name!="POD",pod_name!=""}) BY (pod_name)
- record: pod_name:container_cpu_usage:sum
expr: sum(rate(container_cpu_usage_seconds_total{container_name!="POD",pod_name!=""}[5m]))
BY (pod_name)
- record: pod_name:container_fs_usage_bytes:sum
expr: sum(container_fs_usage_bytes{container_name!="POD",pod_name!=""}) BY (pod_name)
- record: namespace:container_memory_usage_bytes:sum
expr: sum(container_memory_usage_bytes{container_name!=""}) BY (namespace)
- record: namespace:container_spec_cpu_shares:sum
expr: sum(container_spec_cpu_shares{container_name!=""}) BY (namespace)
- record: namespace:container_cpu_usage:sum
expr: sum(rate(container_cpu_usage_seconds_total{container_name!="POD"}[5m]))
BY (namespace)
- record: cluster:memory_usage:ratio
expr: sum(container_memory_usage_bytes{container_name!="POD",pod_name!=""}) BY
(cluster) / sum(machine_memory_bytes) BY (cluster)
- record: cluster:container_spec_cpu_shares:ratio
expr: sum(container_spec_cpu_shares{container_name!="POD",pod_name!=""}) / 1000
/ sum(machine_cpu_cores)
- record: cluster:container_cpu_usage:ratio
expr: sum(rate(container_cpu_usage_seconds_total{container_name!="POD",pod_name!=""}[5m]))
/ sum(machine_cpu_cores)
- record: apiserver_latency_seconds:quantile
expr: histogram_quantile(0.99, rate(apiserver_request_latencies_bucket[5m])) /
1e+06
labels:
quantile: "0.99"
- record: apiserver_latency:quantile_seconds
expr: histogram_quantile(0.9, rate(apiserver_request_latencies_bucket[5m])) /
1e+06
labels:
quantile: "0.9"
- record: apiserver_latency_seconds:quantile
expr: histogram_quantile(0.5, rate(apiserver_request_latencies_bucket[5m])) /
1e+06
labels:
quantile: "0.5"
- alert: APIServerLatencyHigh
expr: apiserver_latency_seconds:quantile{quantile="0.99",subresource!="log",verb!~"^(?:WATCH|WATCHLIST|PROXY|CONNECT)$"}
> 1
for: 10m
labels:
severity: warning
annotations:
description: the API server has a 99th percentile latency of {{ $value }} seconds
for {{$labels.verb}} {{$labels.resource}}
- alert: APIServerLatencyHigh
expr: apiserver_latency_seconds:quantile{quantile="0.99",subresource!="log",verb!~"^(?:WATCH|WATCHLIST|PROXY|CONNECT)$"}
> 4
for: 10m
labels:
severity: critical
annotations:
description: the API server has a 99th percentile latency of {{ $value }} seconds
for {{$labels.verb}} {{$labels.resource}}
- alert: APIServerErrorsHigh
expr: rate(apiserver_request_count{code=~"^(?:5..)$"}[5m]) / rate(apiserver_request_count[5m])
* 100 > 2
for: 10m
labels:
severity: warning
annotations:
description: API server returns errors for {{ $value }}% of requests
- alert: APIServerErrorsHigh
expr: rate(apiserver_request_count{code=~"^(?:5..)$"}[5m]) / rate(apiserver_request_count[5m])
* 100 > 5
for: 10m
labels:
severity: critical
annotations:
description: API server returns errors for {{ $value }}% of requests
- alert: K8SApiserverDown
expr: absent(up{job="apiserver"} == 1)
for: 20m
labels:
severity: critical
annotations:
description: No API servers are reachable or all have disappeared from service
discovery
- alert: K8sCertificateExpirationNotice
labels:
severity: warning
annotations:
description: Kubernetes API Certificate is expiring soon (less than 7 days)
expr: sum(apiserver_client_certificate_expiration_seconds_bucket{le="604800"}) > 0
- alert: K8sCertificateExpirationNotice
labels:
severity: critical
annotations:
description: Kubernetes API Certificate is expiring in less than 1 day
expr: sum(apiserver_client_certificate_expiration_seconds_bucket{le="86400"}) > 0
node.rules.yaml: |
groups:
- name: node.rules
rules:
- record: instance:node_cpu:rate:sum
expr: sum(rate(node_cpu{mode!="idle",mode!="iowait",mode!~"^(?:guest.*)$"}[3m]))
BY (instance)
- record: instance:node_filesystem_usage:sum
expr: sum((node_filesystem_size{mountpoint="/"} - node_filesystem_free{mountpoint="/"}))
BY (instance)
- record: instance:node_network_receive_bytes:rate:sum
expr: sum(rate(node_network_receive_bytes[3m])) BY (instance)
- record: instance:node_network_transmit_bytes:rate:sum
expr: sum(rate(node_network_transmit_bytes[3m])) BY (instance)
- record: instance:node_cpu:ratio
expr: sum(rate(node_cpu{mode!="idle"}[5m])) WITHOUT (cpu, mode) / ON(instance)
GROUP_LEFT() count(sum(node_cpu) BY (instance, cpu)) BY (instance)
- record: cluster:node_cpu:sum_rate5m
expr: sum(rate(node_cpu{mode!="idle"}[5m]))
- record: cluster:node_cpu:ratio
expr: cluster:node_cpu:rate5m / count(sum(node_cpu) BY (instance, cpu))
- alert: NodeExporterDown
expr: absent(up{job="node-exporter"} == 1)
for: 10m
labels:
severity: warning
annotations:
description: Prometheus could not scrape a node-exporter for more than 10m,
or node-exporters have disappeared from discovery
- alert: NodeDiskRunningFull
expr: predict_linear(node_filesystem_free[6h], 3600 * 24) < 0
for: 30m
labels:
severity: warning
annotations:
description: device {{$labels.device}} on node {{$labels.instance}} is running
full within the next 24 hours (mounted at {{$labels.mountpoint}})
- alert: NodeDiskRunningFull
expr: predict_linear(node_filesystem_free[30m], 3600 * 2) < 0
for: 10m
labels:
severity: critical
annotations:
description: device {{$labels.device}} on node {{$labels.instance}} is running
full within the next 2 hours (mounted at {{$labels.mountpoint}})
- alert: InactiveRAIDDisk
expr: node_md_disks - node_md_disks_active > 0
for: 10m
labels:
severity: warning
annotations:
description: '{{$value}} RAID disk(s) on node {{$labels.instance}} are inactive'
prometheus.rules.yaml: |
groups:
- name: prometheus.rules
rules:
- alert: PrometheusConfigReloadFailed
expr: prometheus_config_last_reload_successful == 0
for: 10m
labels:
severity: warning
annotations:
description: Reloading Prometheus' configuration has failed for {{$labels.namespace}}/{{$labels.pod}}
- alert: PrometheusNotificationQueueRunningFull
expr: predict_linear(prometheus_notifications_queue_length[5m], 60 * 30) > prometheus_notifications_queue_capacity
for: 10m
labels:
severity: warning
annotations:
description: Prometheus' alert notification queue is running full for {{$labels.namespace}}/{{
$labels.pod}}
- alert: PrometheusErrorSendingAlerts
expr: rate(prometheus_notifications_errors_total[5m]) / rate(prometheus_notifications_sent_total[5m])
> 0.01
for: 10m
labels:
severity: warning
annotations:
description: Errors while sending alerts from Prometheus {{$labels.namespace}}/{{
$labels.pod}} to Alertmanager {{$labels.Alertmanager}}
- alert: PrometheusErrorSendingAlerts
expr: rate(prometheus_notifications_errors_total[5m]) / rate(prometheus_notifications_sent_total[5m])
> 0.03
for: 10m
labels:
severity: critical
annotations:
description: Errors while sending alerts from Prometheus {{$labels.namespace}}/{{
$labels.pod}} to Alertmanager {{$labels.Alertmanager}}
- alert: PrometheusNotConnectedToAlertmanagers
expr: prometheus_notifications_alertmanagers_discovered < 1
for: 10m
labels:
severity: warning
annotations:
description: Prometheus {{ $labels.namespace }}/{{ $labels.pod}} is not connected
to any Alertmanagers
- alert: PrometheusTSDBReloadsFailing
expr: increase(prometheus_tsdb_reloads_failures_total[2h]) > 0
for: 12h
labels:
severity: warning
annotations:
description: '{{$labels.job}} at {{$labels.instance}} had {{$value | humanize}}
reload failures over the last four hours.'
summary: Prometheus has issues reloading data blocks from disk
- alert: PrometheusTSDBCompactionsFailing
expr: increase(prometheus_tsdb_compactions_failed_total[2h]) > 0
for: 12h
labels:
severity: warning
annotations:
description: '{{$labels.job}} at {{$labels.instance}} had {{$value | humanize}}
compaction failures over the last four hours.'
summary: Prometheus has issues compacting sample blocks
- alert: PrometheusTSDBWALCorruptions
expr: tsdb_wal_corruptions_total > 0
for: 4h
labels:
severity: warning
annotations:
description: '{{$labels.job}} at {{$labels.instance}} has a corrupted write-ahead
log (WAL).'
summary: Prometheus write-ahead log is corrupted
- alert: PrometheusNotIngestingSamples
expr: rate(prometheus_tsdb_head_samples_appended_total[5m]) <= 0
for: 10m
labels:
severity: warning
annotations:
description: "Prometheus {{ $labels.namespace }}/{{ $labels.pod}} isn't ingesting samples."
summary: "Prometheus isn't ingesting samples"
etcd.yaml: |-
{
"groups": [
{
"name": "etcd",
"rules": [
{
"alert": "etcdInsufficientMembers",
"annotations": {
"message": "etcd cluster \"{{ $labels.job }}\": insufficient members ({{ $value }})."
},
"expr": "sum(up{job=~\".*etcd.*\"} == bool 1) by (job) < ((count(up{job=~\".*etcd.*\"}) by (job) + 1) / 2)\n",
"for": "3m",
"labels": {
"severity": "critical"
}
},
{
"alert": "etcdNoLeader",
"annotations": {
"message": "etcd cluster \"{{ $labels.job }}\": member {{ $labels.instance }} has no leader."
},
"expr": "etcd_server_has_leader{job=~\".*etcd.*\"} == 0\n",
"for": "1m",
"labels": {
"severity": "critical"
}
},
{
"alert": "etcdHighNumberOfLeaderChanges",
"annotations": {
"message": "etcd cluster \"{{ $labels.job }}\": instance {{ $labels.instance }} has seen {{ $value }} leader changes within the last 30 minutes."
},
"expr": "rate(etcd_server_leader_changes_seen_total{job=~\".*etcd.*\"}[15m]) > 3\n",
"for": "15m",
"labels": {
"severity": "warning"
}
},
{
"alert": "etcdGRPCRequestsSlow",
"annotations": {
"message": "etcd cluster \"{{ $labels.job }}\": gRPC requests to {{ $labels.grpc_method }} are taking {{ $value }}s on etcd instance {{ $labels.instance }}."
},
"expr": "histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{job=~\".*etcd.*\", grpc_type=\"unary\"}[5m])) by (job, instance, grpc_service, grpc_method, le))\n> 0.15\n",
"for": "10m",
"labels": {
"severity": "critical"
}
},
{
"alert": "etcdMemberCommunicationSlow",
"annotations": {
"message": "etcd cluster \"{{ $labels.job }}\": member communication with {{ $labels.To }} is taking {{ $value }}s on etcd instance {{ $labels.instance }}."
},
"expr": "histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket{job=~\".*etcd.*\"}[5m]))\n> 0.15\n",
"for": "10m",
"labels": {
"severity": "warning"
}
},
{
"alert": "etcdHighNumberOfFailedProposals",
"annotations": {
"message": "etcd cluster \"{{ $labels.job }}\": {{ $value }} proposal failures within the last 30 minutes on etcd instance {{ $labels.instance }}."
},
"expr": "rate(etcd_server_proposals_failed_total{job=~\".*etcd.*\"}[15m]) > 5\n",
"for": "15m",
"labels": {
"severity": "warning"
}
},
{
"alert": "etcdHighFsyncDurations",
"annotations": {
"message": "etcd cluster \"{{ $labels.job }}\": 99th percentile fync durations are {{ $value }}s on etcd instance {{ $labels.instance }}."
},
"expr": "histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~\".*etcd.*\"}[5m]))\n> 0.5\n",
"for": "10m",
"labels": {
"severity": "warning"
}
},
{
"alert": "etcdHighCommitDurations",
"annotations": {
"message": "etcd cluster \"{{ $labels.job }}\": 99th percentile commit durations {{ $value }}s on etcd instance {{ $labels.instance }}."
},
"expr": "histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket{job=~\".*etcd.*\"}[5m]))\n> 0.25\n",
"for": "10m",
"labels": {
"severity": "warning"
}
},
{
"alert": "etcdHighNumberOfFailedHTTPRequests",
"annotations": {
"message": "{{ $value }}% of requests for {{ $labels.method }} failed on etcd instance {{ $labels.instance }}"
},
"expr": "sum(rate(etcd_http_failed_total{job=~\".*etcd.*\", code!=\"404\"}[5m])) BY (method) / sum(rate(etcd_http_received_total{job=~\".*etcd.*\"}[5m]))\nBY (method) > 0.01\n",
"for": "10m",
"labels": {
"severity": "warning"
}
},
{
"alert": "etcdHighNumberOfFailedHTTPRequests",
"annotations": {
"message": "{{ $value }}% of requests for {{ $labels.method }} failed on etcd instance {{ $labels.instance }}."
},
"expr": "sum(rate(etcd_http_failed_total{job=~\".*etcd.*\", code!=\"404\"}[5m])) BY (method) / sum(rate(etcd_http_received_total{job=~\".*etcd.*\"}[5m]))\nBY (method) > 0.05\n",
"for": "10m",
"labels": {
"severity": "critical"
}
},
{
"alert": "etcdHTTPRequestsSlow",
"annotations": {
"message": "etcd instance {{ $labels.instance }} HTTP requests to {{ $labels.method }} are slow."
},
"expr": "histogram_quantile(0.99, rate(etcd_http_successful_duration_seconds_bucket[5m]))\n> 0.15\n",
"for": "10m",
"labels": {
"severity": "warning"
}
}
]
}
]
}
extra.yaml: |-
{
"groups": [
{
"name": "extra.rules",
"rules": [
{
"alert": "InactiveRAIDDisk",
"annotations": {
"message": "{{ $value }} RAID disk(s) on node {{ $labels.instance }} are inactive."
},
"expr": "node_md_disks - node_md_disks_active > 0",
"for": "10m",
"labels": {
"severity": "warning"
}
}
]
}
]
}
kube.yaml: |-
{
"groups": [
{
"name": "k8s.rules",
"rules": [
{
"expr": "sum(rate(container_cpu_usage_seconds_total{job=\"kubernetes-cadvisor\", image!=\"\", container_name!=\"\"}[5m])) by (namespace)\n",
"record": "namespace:container_cpu_usage_seconds_total:sum_rate"
},
{
"expr": "sum by (namespace, pod_name, container_name) (\n rate(container_cpu_usage_seconds_total{job=\"kubernetes-cadvisor\", image!=\"\", container_name!=\"\"}[5m])\n)\n",
"record": "namespace_pod_name_container_name:container_cpu_usage_seconds_total:sum_rate"
},
{
"expr": "sum(container_memory_usage_bytes{job=\"kubernetes-cadvisor\", image!=\"\", container_name!=\"\"}) by (namespace)\n",
"record": "namespace:container_memory_usage_bytes:sum"
},
{
"expr": "sum by (namespace, label_name) (\n sum(rate(container_cpu_usage_seconds_total{job=\"kubernetes-cadvisor\", image!=\"\", container_name!=\"\"}[5m])) by (namespace, pod_name)\n * on (namespace, pod_name) group_left(label_name)\n label_replace(kube_pod_labels{job=\"kube-state-metrics\"}, \"pod_name\", \"$1\", \"pod\", \"(.*)\")\n)\n",
"record": "namespace_name:container_cpu_usage_seconds_total:sum_rate"
},
{
"expr": "sum by (namespace, label_name) (\n sum(container_memory_usage_bytes{job=\"kubernetes-cadvisor\",image!=\"\", container_name!=\"\"}) by (pod_name, namespace)\n* on (namespace, pod_name) group_left(label_name)\n label_replace(kube_pod_labels{job=\"kube-state-metrics\"}, \"pod_name\", \"$1\", \"pod\", \"(.*)\")\n)\n",
"record": "namespace_name:container_memory_usage_bytes:sum"
},
{
"expr": "sum by (namespace, label_name) (\n sum(kube_pod_container_resource_requests_memory_bytes{job=\"kube-state-metrics\"}) by (namespace, pod)\n* on (namespace, pod) group_left(label_name)\n label_replace(kube_pod_labels{job=\"kube-state-metrics\"}, \"pod_name\", \"$1\", \"pod\", \"(.*)\")\n)\n",
"record": "namespace_name:kube_pod_container_resource_requests_memory_bytes:sum"
},
{
"expr": "sum by (namespace, label_name) (\n sum(kube_pod_container_resource_requests_cpu_cores{job=\"kube-state-metrics\"} and on(pod) kube_pod_status_scheduled{condition=\"true\"}) by (namespace, pod)\n* on (namespace, pod) group_left(label_name)\n label_replace(kube_pod_labels{job=\"kube-state-metrics\"}, \"pod_name\", \"$1\", \"pod\", \"(.*)\")\n)\n",
"record": "namespace_name:kube_pod_container_resource_requests_cpu_cores:sum"
}
]
},
{
"name": "kube-scheduler.rules",
"rules": [
{
"expr": "histogram_quantile(0.99, sum(rate(scheduler_e2e_scheduling_latency_microseconds_bucket{job=\"kube-scheduler\"}[5m])) without(instance, pod)) / 1e+06\n",
"labels": {
"quantile": "0.99"
},
"record": "cluster_quantile:scheduler_e2e_scheduling_latency:histogram_quantile"
},
{
"expr": "histogram_quantile(0.99, sum(rate(scheduler_scheduling_algorithm_latency_microseconds_bucket{job=\"kube-scheduler\"}[5m])) without(instance, pod)) / 1e+06\n",
"labels": {
"quantile": "0.99"
},
"record": "cluster_quantile:scheduler_scheduling_algorithm_latency:histogram_quantile"
},
{
"expr": "histogram_quantile(0.99, sum(rate(scheduler_binding_latency_microseconds_bucket{job=\"kube-scheduler\"}[5m])) without(instance, pod)) / 1e+06\n",
"labels": {
"quantile": "0.99"
},
"record": "cluster_quantile:scheduler_binding_latency:histogram_quantile"
},
{
"expr": "histogram_quantile(0.9, sum(rate(scheduler_e2e_scheduling_latency_microseconds_bucket{job=\"kube-scheduler\"}[5m])) without(instance, pod)) / 1e+06\n",
"labels": {
"quantile": "0.9"
},
"record": "cluster_quantile:scheduler_e2e_scheduling_latency:histogram_quantile"
},
{
"expr": "histogram_quantile(0.9, sum(rate(scheduler_scheduling_algorithm_latency_microseconds_bucket{job=\"kube-scheduler\"}[5m])) without(instance, pod)) / 1e+06\n",
"labels": {
"quantile": "0.9"
},
"record": "cluster_quantile:scheduler_scheduling_algorithm_latency:histogram_quantile"
},
{
"expr": "histogram_quantile(0.9, sum(rate(scheduler_binding_latency_microseconds_bucket{job=\"kube-scheduler\"}[5m])) without(instance, pod)) / 1e+06\n",
"labels": {
"quantile": "0.9"
},
"record": "cluster_quantile:scheduler_binding_latency:histogram_quantile"
},
{
"expr": "histogram_quantile(0.5, sum(rate(scheduler_e2e_scheduling_latency_microseconds_bucket{job=\"kube-scheduler\"}[5m])) without(instance, pod)) / 1e+06\n",
"labels": {
"quantile": "0.5"
},
"record": "cluster_quantile:scheduler_e2e_scheduling_latency:histogram_quantile"
},
{
"expr": "histogram_quantile(0.5, sum(rate(scheduler_scheduling_algorithm_latency_microseconds_bucket{job=\"kube-scheduler\"}[5m])) without(instance, pod)) / 1e+06\n",
"labels": {
"quantile": "0.5"
},
"record": "cluster_quantile:scheduler_scheduling_algorithm_latency:histogram_quantile"
},
{
"expr": "histogram_quantile(0.5, sum(rate(scheduler_binding_latency_microseconds_bucket{job=\"kube-scheduler\"}[5m])) without(instance, pod)) / 1e+06\n",
"labels": {
"quantile": "0.5"
},
"record": "cluster_quantile:scheduler_binding_latency:histogram_quantile"
}
]
},
{
"name": "kube-apiserver.rules",
"rules": [
{
"expr": "histogram_quantile(0.99, sum(rate(apiserver_request_latencies_bucket{job=\"apiserver\"}[5m])) without(instance, pod)) / 1e+06\n",
"labels": {
"quantile": "0.99"
},
"record": "cluster_quantile:apiserver_request_latencies:histogram_quantile"
},
{
"expr": "histogram_quantile(0.9, sum(rate(apiserver_request_latencies_bucket{job=\"apiserver\"}[5m])) without(instance, pod)) / 1e+06\n",
"labels": {
"quantile": "0.9"
},
"record": "cluster_quantile:apiserver_request_latencies:histogram_quantile"
},
{
"expr": "histogram_quantile(0.5, sum(rate(apiserver_request_latencies_bucket{job=\"apiserver\"}[5m])) without(instance, pod)) / 1e+06\n",
"labels": {
"quantile": "0.5"
},
"record": "cluster_quantile:apiserver_request_latencies:histogram_quantile"
}
]
},
{
"name": "node.rules",
"rules": [
{
"expr": "sum(min(kube_pod_info) by (node))",
"record": ":kube_pod_info_node_count:"
},
{
"expr": "max(label_replace(kube_pod_info{job=\"kube-state-metrics\"}, \"pod\", \"$1\", \"pod\", \"(.*)\")) by (node, namespace, pod)\n",
"record": "node_namespace_pod:kube_pod_info:"
},
{
"expr": "count by (node) (sum by (node, cpu) (\n node_cpu_seconds_total{job=\"node-exporter\"}\n* on (namespace, pod) group_left(node)\n node_namespace_pod:kube_pod_info:\n))\n",
"record": "node:node_num_cpu:sum"
},
{
"expr": "1 - avg(rate(node_cpu_seconds_total{job=\"node-exporter\",mode=\"idle\"}[1m]))\n",
"record": ":node_cpu_utilisation:avg1m"
},
{
"expr": "1 - avg by (node) (\n rate(node_cpu_seconds_total{job=\"node-exporter\",mode=\"idle\"}[1m])\n* on (namespace, pod) group_left(node)\n node_namespace_pod:kube_pod_info:)\n",
"record": "node:node_cpu_utilisation:avg1m"
},
{
"expr": "node:node_cpu_utilisation:avg1m\n *\nnode:node_num_cpu:sum\n /\nscalar(sum(node:node_num_cpu:sum))\n",
"record": "node:cluster_cpu_utilisation:ratio"
},
{
"expr": "sum(node_load1{job=\"node-exporter\"})\n/\nsum(node:node_num_cpu:sum)\n",
"record": ":node_cpu_saturation_load1:"
},
{
"expr": "sum by (node) (\n node_load1{job=\"node-exporter\"}\n* on (namespace, pod) group_left(node)\n node_namespace_pod:kube_pod_info:\n)\n/\nnode:node_num_cpu:sum\n",
"record": "node:node_cpu_saturation_load1:"
},
{
"expr": "1 -\nsum(node_memory_MemFree_bytes{job=\"node-exporter\"} + node_memory_Cached_bytes{job=\"node-exporter\"} + node_memory_Buffers_bytes{job=\"node-exporter\"})\n/\nsum(node_memory_MemTotal_bytes{job=\"node-exporter\"})\n",
"record": ":node_memory_utilisation:"
},
{
"expr": "sum(node_memory_MemFree_bytes{job=\"node-exporter\"} + node_memory_Cached_bytes{job=\"node-exporter\"} + node_memory_Buffers_bytes{job=\"node-exporter\"})\n",
"record": ":node_memory_MemFreeCachedBuffers_bytes:sum"
},
{
"expr": "sum(node_memory_MemTotal_bytes{job=\"node-exporter\"})\n",
"record": ":node_memory_MemTotal_bytes:sum"
},
{
"expr": "sum by (node) (\n (node_memory_MemFree_bytes{job=\"node-exporter\"} + node_memory_Cached_bytes{job=\"node-exporter\"} + node_memory_Buffers_bytes{job=\"node-exporter\"})\n * on (namespace, pod) group_left(node)\n node_namespace_pod:kube_pod_info:\n)\n",
"record": "node:node_memory_bytes_available:sum"
},
{
"expr": "sum by (node) (\n node_memory_MemTotal_bytes{job=\"node-exporter\"}\n * on (namespace, pod) group_left(node)\n node_namespace_pod:kube_pod_info:\n)\n",
"record": "node:node_memory_bytes_total:sum"
},
{
"expr": "(node:node_memory_bytes_total:sum - node:node_memory_bytes_available:sum)\n/\nnode:node_memory_bytes_total:sum\n",
"record": "node:node_memory_utilisation:ratio"
},
{
"expr": "(node:node_memory_bytes_total:sum - node:node_memory_bytes_available:sum)\n/\nscalar(sum(node:node_memory_bytes_total:sum))\n",
"record": "node:cluster_memory_utilisation:ratio"
},
{
"expr": "1e3 * sum(\n (rate(node_vmstat_pgpgin{job=\"node-exporter\"}[1m])\n + rate(node_vmstat_pgpgout{job=\"node-exporter\"}[1m]))\n)\n",
"record": ":node_memory_swap_io_bytes:sum_rate"
},
{
"expr": "1 -\nsum by (node) (\n (node_memory_MemFree_bytes{job=\"node-exporter\"} + node_memory_Cached_bytes{job=\"node-exporter\"} + node_memory_Buffers_bytes{job=\"node-exporter\"})\n* on (namespace, pod) group_left(node)\n node_namespace_pod:kube_pod_info:\n)\n/\nsum by (node) (\n node_memory_MemTotal_bytes{job=\"node-exporter\"}\n* on (namespace, pod) group_left(node)\n node_namespace_pod:kube_pod_info:\n)\n",
"record": "node:node_memory_utilisation:"
},
{
"expr": "1 - (node:node_memory_bytes_available:sum / node:node_memory_bytes_total:sum)\n",
"record": "node:node_memory_utilisation_2:"
},
{
"expr": "1e3 * sum by (node) (\n (rate(node_vmstat_pgpgin{job=\"node-exporter\"}[1m])\n + rate(node_vmstat_pgpgout{job=\"node-exporter\"}[1m]))\n * on (namespace, pod) group_left(node)\n node_namespace_pod:kube_pod_info:\n)\n",
"record": "node:node_memory_swap_io_bytes:sum_rate"
},
{
"expr": "avg(irate(node_disk_io_time_seconds_total{job=\"node-exporter\",device=~\"nvme.+|rbd.+|sd.+|vd.+|xvd.+\"}[1m]))\n",
"record": ":node_disk_utilisation:avg_irate"
},
{
"expr": "avg by (node) (\n irate(node_disk_io_time_seconds_total{job=\"node-exporter\",device=~\"nvme.+|rbd.+|sd.+|vd.+|xvd.+\"}[1m])\n* on (namespace, pod) group_left(node)\n node_namespace_pod:kube_pod_info:\n)\n",
"record": "node:node_disk_utilisation:avg_irate"
},
{
"expr": "avg(irate(node_disk_io_time_weighted_seconds_total{job=\"node-exporter\",device=~\"nvme.+|rbd.+|sd.+|vd.+|xvd.+\"}[1m]) / 1e3)\n",
"record": ":node_disk_saturation:avg_irate"
},
{
"expr": "avg by (node) (\n irate(node_disk_io_time_weighted_seconds_total{job=\"node-exporter\",device=~\"nvme.+|rbd.+|sd.+|vd.+|xvd.+\"}[1m]) / 1e3\n* on (namespace, pod) group_left(node)\n node_namespace_pod:kube_pod_info:\n)\n",
"record": "node:node_disk_saturation:avg_irate"
},
{
"expr": "max by (namespace, pod, device) ((node_filesystem_size_bytes{fstype=~\"ext[234]|btrfs|xfs|zfs\"}\n- node_filesystem_avail_bytes{fstype=~\"ext[234]|btrfs|xfs|zfs\"})\n/ node_filesystem_size_bytes{fstype=~\"ext[234]|btrfs|xfs|zfs\"})\n",
"record": "node:node_filesystem_usage:"
},
{
"expr": "max by (namespace, pod, device) (node_filesystem_avail_bytes{fstype=~\"ext[234]|btrfs|xfs|zfs\"} / node_filesystem_size_bytes{fstype=~\"ext[234]|btrfs|xfs|zfs\"})\n",
"record": "node:node_filesystem_avail:"
},
{
"expr": "sum(irate(node_network_receive_bytes_total{job=\"node-exporter\",device!~\"veth.+\"}[1m])) +\nsum(irate(node_network_transmit_bytes_total{job=\"node-exporter\",device!~\"veth.+\"}[1m]))\n",
"record": ":node_net_utilisation:sum_irate"
},
{
"expr": "sum by (node) (\n (irate(node_network_receive_bytes_total{job=\"node-exporter\",device!~\"veth.+\"}[1m]) +\n irate(node_network_transmit_bytes_total{job=\"node-exporter\",device!~\"veth.+\"}[1m]))\n* on (namespace, pod) group_left(node)\n node_namespace_pod:kube_pod_info:\n)\n",
"record": "node:node_net_utilisation:sum_irate"
},
{
"expr": "sum(irate(node_network_receive_drop_total{job=\"node-exporter\",device!~\"veth.+\"}[1m])) +\nsum(irate(node_network_transmit_drop_total{job=\"node-exporter\",device!~\"veth.+\"}[1m]))\n",
"record": ":node_net_saturation:sum_irate"
},
{
"expr": "sum by (node) (\n (irate(node_network_receive_drop_total{job=\"node-exporter\",device!~\"veth.+\"}[1m]) +\n irate(node_network_transmit_drop_total{job=\"node-exporter\",device!~\"veth.+\"}[1m]))\n* on (namespace, pod) group_left(node)\n node_namespace_pod:kube_pod_info:\n)\n",
"record": "node:node_net_saturation:sum_irate"
},
{
"expr": "max(\n max(\n kube_pod_info{job=\"kube-state-metrics\", host_ip!=\"\"}\n ) by (node, host_ip)\n * on (host_ip) group_right (node)\n label_replace(\n (max(node_filesystem_files{job=\"node-exporter\", mountpoint=\"/\"}) by (instance)), \"host_ip\", \"$1\", \"instance\", \"(.*):.*\"\n )\n) by (node)\n",
"record": "node:node_inodes_total:"
},
{
"expr": "max(\n max(\n kube_pod_info{job=\"kube-state-metrics\", host_ip!=\"\"}\n ) by (node, host_ip)\n * on (host_ip) group_right (node)\n label_replace(\n (max(node_filesystem_files_free{job=\"node-exporter\", mountpoint=\"/\"}) by (instance)), \"host_ip\", \"$1\", \"instance\", \"(.*):.*\"\n )\n) by (node)\n",
"record": "node:node_inodes_free:"
}
]
},
{
"name": "kubernetes-absent",
"rules": [
{
"alert": "KubeAPIDown",
"annotations": {
"message": "KubeAPI has disappeared from Prometheus target discovery.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapidown"
},
"expr": "absent(up{job=\"apiserver\"} == 1)\n",
"for": "15m",
"labels": {
"severity": "critical"
}
},
{
"alert": "KubeControllerManagerDown",
"annotations": {
"message": "KubeControllerManager has disappeared from Prometheus target discovery.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecontrollermanagerdown"
},
"expr": "absent(up{job=\"kube-controller-manager\"} == 1)\n",
"for": "15m",
"labels": {
"severity": "critical"
}
},
{
"alert": "KubeSchedulerDown",
"annotations": {
"message": "KubeScheduler has disappeared from Prometheus target discovery.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeschedulerdown"
},
"expr": "absent(up{job=\"kube-scheduler\"} == 1)\n",
"for": "15m",
"labels": {
"severity": "critical"
}
},
{
"alert": "KubeletDown",
"annotations": {
"message": "Kubelet has disappeared from Prometheus target discovery.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeletdown"
},
"expr": "absent(up{job=\"kubelet\"} == 1)\n",
"for": "15m",
"labels": {
"severity": "critical"
}
}
]
},
{
"name": "kubernetes-apps",
"rules": [
{
"alert": "KubePodCrashLooping",
"annotations": {
"message": "Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is restarting {{ printf \"%.2f\" $value }} times / 5 minutes.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepodcrashlooping"
},
"expr": "rate(kube_pod_container_status_restarts_total{job=\"kube-state-metrics\"}[15m]) * 60 * 5 > 0\n",
"for": "1h",
"labels": {
"severity": "critical"
}
},
{
"alert": "KubePodNotReady",
"annotations": {
"message": "Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-ready state for longer than an hour.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepodnotready"
},
"expr": "sum by (namespace, pod) (kube_pod_status_phase{job=\"kube-state-metrics\", phase=~\"Pending|Unknown\"}) > 0\n",
"for": "1h",
"labels": {
"severity": "critical"
}
},
{
"alert": "KubeDeploymentGenerationMismatch",
"annotations": {
"message": "Deployment generation for {{ $labels.namespace }}/{{ $labels.deployment }} does not match, this indicates that the Deployment has failed but has not been rolled back.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubedeploymentgenerationmismatch"
},
"expr": "kube_deployment_status_observed_generation{job=\"kube-state-metrics\"}\n !=\nkube_deployment_metadata_generation{job=\"kube-state-metrics\"}\n",
"for": "15m",
"labels": {
"severity": "critical"
}
},
{
"alert": "KubeDeploymentReplicasMismatch",
"annotations": {
"message": "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has not matched the expected number of replicas for longer than an hour.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubedeploymentreplicasmismatch"
},
"expr": "kube_deployment_spec_replicas{job=\"kube-state-metrics\"}\n !=\nkube_deployment_status_replicas_available{job=\"kube-state-metrics\"}\n",
"for": "1h",
"labels": {
"severity": "critical"
}
},
{
"alert": "KubeStatefulSetReplicasMismatch",
"annotations": {
"message": "StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} has not matched the expected number of replicas for longer than 15 minutes.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubestatefulsetreplicasmismatch"
},
"expr": "kube_statefulset_status_replicas_ready{job=\"kube-state-metrics\"}\n !=\nkube_statefulset_status_replicas{job=\"kube-state-metrics\"}\n",
"for": "15m",
"labels": {
"severity": "critical"
}
},
{
"alert": "KubeStatefulSetGenerationMismatch",
"annotations": {
"message": "StatefulSet generation for {{ $labels.namespace }}/{{ $labels.statefulset }} does not match, this indicates that the StatefulSet has failed but has not been rolled back.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubestatefulsetgenerationmismatch"
},
"expr": "kube_statefulset_status_observed_generation{job=\"kube-state-metrics\"}\n !=\nkube_statefulset_metadata_generation{job=\"kube-state-metrics\"}\n",
"for": "15m",
"labels": {
"severity": "critical"
}
},
{
"alert": "KubeStatefulSetUpdateNotRolledOut",
"annotations": {
"message": "StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} update has not been rolled out.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubestatefulsetupdatenotrolledout"
},
"expr": "max without (revision) (\n kube_statefulset_status_current_revision{job=\"kube-state-metrics\"}\n unless\n kube_statefulset_status_update_revision{job=\"kube-state-metrics\"}\n)\n *\n(\n kube_statefulset_replicas{job=\"kube-state-metrics\"}\n !=\n kube_statefulset_status_replicas_updated{job=\"kube-state-metrics\"}\n)\n",
"for": "15m",
"labels": {
"severity": "critical"
}
},
{
"alert": "KubeDaemonSetRolloutStuck",
"annotations": {
"message": "Only {{ $value }}% of the desired Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are scheduled and ready.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubedaemonsetrolloutstuck"
},
"expr": "kube_daemonset_status_number_ready{job=\"kube-state-metrics\"}\n /\nkube_daemonset_status_desired_number_scheduled{job=\"kube-state-metrics\"} * 100 < 100\n",
"for": "15m",
"labels": {
"severity": "critical"
}
},
{
"alert": "KubeDaemonSetNotScheduled",
"annotations": {
"message": "{{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are not scheduled.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubedaemonsetnotscheduled"
},
"expr": "kube_daemonset_status_desired_number_scheduled{job=\"kube-state-metrics\"}\n -\nkube_daemonset_status_current_number_scheduled{job=\"kube-state-metrics\"} > 0\n",
"for": "10m",
"labels": {
"severity": "warning"
}
},
{
"alert": "KubeDaemonSetMisScheduled",
"annotations": {
"message": "{{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are running where they are not supposed to run.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubedaemonsetmisscheduled"
},
"expr": "kube_daemonset_status_number_misscheduled{job=\"kube-state-metrics\"} > 0\n",
"for": "10m",
"labels": {
"severity": "warning"
}
},
{
"alert": "KubeCronJobRunning",
"annotations": {
"message": "CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is taking more than 1h to complete.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecronjobrunning"
},
"expr": "time() - kube_cronjob_next_schedule_time{job=\"kube-state-metrics\"} > 3600\n",
"for": "1h",
"labels": {
"severity": "warning"
}
},
{
"alert": "KubeJobCompletion",
"annotations": {
"message": "Job {{ $labels.namespace }}/{{ $labels.job_name }} is taking more than one hour to complete.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubejobcompletion"
},
"expr": "kube_job_spec_completions{job=\"kube-state-metrics\"} - kube_job_status_succeeded{job=\"kube-state-metrics\"} > 0\n",
"for": "1h",
"labels": {
"severity": "warning"
}
},
{
"alert": "KubeJobFailed",
"annotations": {
"message": "Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubejobfailed"
},
"expr": "kube_job_status_failed{job=\"kube-state-metrics\"} > 0\n",
"for": "1h",
"labels": {
"severity": "warning"
}
}
]
},
{
"name": "kubernetes-resources",
"rules": [
{
"alert": "KubeCPUOvercommit",
"annotations": {
"message": "Cluster has overcommitted CPU resource requests for Pods and cannot tolerate node failure.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecpuovercommit"
},
"expr": "sum(namespace_name:kube_pod_container_resource_requests_cpu_cores:sum)\n /\nsum(node:node_num_cpu:sum)\n >\n(count(node:node_num_cpu:sum)-1) / count(node:node_num_cpu:sum)\n",
"for": "5m",
"labels": {
"severity": "warning"
}
},
{
"alert": "KubeMemOvercommit",
"annotations": {
"message": "Cluster has overcommitted memory resource requests for Pods and cannot tolerate node failure.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubememovercommit"
},
"expr": "sum(namespace_name:kube_pod_container_resource_requests_memory_bytes:sum)\n /\nsum(node_memory_MemTotal_bytes)\n >\n(count(node:node_num_cpu:sum)-1)\n /\ncount(node:node_num_cpu:sum)\n",
"for": "5m",
"labels": {
"severity": "warning"
}
},
{
"alert": "KubeCPUOvercommit",
"annotations": {
"message": "Cluster has overcommitted CPU resource requests for Namespaces.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecpuovercommit"
},
"expr": "sum(kube_resourcequota{job=\"kube-state-metrics\", type=\"hard\", resource=\"requests.cpu\"})\n /\nsum(node:node_num_cpu:sum)\n > 1.5\n",
"for": "5m",
"labels": {
"severity": "warning"
}
},
{
"alert": "KubeMemOvercommit",
"annotations": {
"message": "Cluster has overcommitted memory resource requests for Namespaces.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubememovercommit"
},
"expr": "sum(kube_resourcequota{job=\"kube-state-metrics\", type=\"hard\", resource=\"requests.memory\"})\n /\nsum(node_memory_MemTotal_bytes{job=\"node-exporter\"})\n > 1.5\n",
"for": "5m",
"labels": {
"severity": "warning"
}
},
{
"alert": "KubeQuotaExceeded",
"annotations": {
"message": "Namespace {{ $labels.namespace }} is using {{ printf \"%0.0f\" $value }}% of its {{ $labels.resource }} quota.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubequotaexceeded"
},
"expr": "100 * kube_resourcequota{job=\"kube-state-metrics\", type=\"used\"}\n / ignoring(instance, job, type)\n(kube_resourcequota{job=\"kube-state-metrics\", type=\"hard\"} > 0)\n > 90\n",
"for": "15m",
"labels": {
"severity": "warning"
}
},
{
"alert": "CPUThrottlingHigh",
"annotations": {
"message": "{{ printf \"%0.0f\" $value }}% throttling of CPU in namespace {{ $labels.namespace }} for container {{ $labels.container_name }} in pod {{ $labels.pod_name }}.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-cputhrottlinghigh"
},
"expr": "100 * sum(increase(container_cpu_cfs_throttled_periods_total{container_name!=\"\", }[5m])) by (container_name, pod_name, namespace)\n /\nsum(increase(container_cpu_cfs_periods_total{}[5m])) by (container_name, pod_name, namespace)\n > 100 \n",
"for": "15m",
"labels": {
"severity": "warning"
}
}
]
},
{
"name": "kubernetes-storage",
"rules": [
{
"alert": "KubePersistentVolumeUsageCritical",
"annotations": {
"message": "The PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} is only {{ printf \"%0.2f\" $value }}% free.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepersistentvolumeusagecritical"
},
"expr": "100 * kubelet_volume_stats_available_bytes{job=\"kubelet\"}\n /\nkubelet_volume_stats_capacity_bytes{job=\"kubelet\"}\n < 3\n",
"for": "1m",
"labels": {
"severity": "critical"
}
},
{
"alert": "KubePersistentVolumeFullInFourDays",
"annotations": {
"message": "Based on recent sampling, the PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} is expected to fill up within four days. Currently {{ printf \"%0.2f\" $value }}% is available.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepersistentvolumefullinfourdays"
},
"expr": "100 * (\n kubelet_volume_stats_available_bytes{job=\"kubelet\"}\n /\n kubelet_volume_stats_capacity_bytes{job=\"kubelet\"}\n) < 15\nand\npredict_linear(kubelet_volume_stats_available_bytes{job=\"kubelet\"}[6h], 4 * 24 * 3600) < 0\n",
"for": "5m",
"labels": {
"severity": "critical"
}
},
{
"alert": "KubePersistentVolumeErrors",
"annotations": {
"message": "The persistent volume {{ $labels.persistentvolume }} has status {{ $labels.phase }}.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepersistentvolumeerrors"
},
"expr": "kube_persistentvolume_status_phase{phase=~\"Failed|Pending\",job=\"kube-state-metrics\"} > 0\n",
"for": "5m",
"labels": {
"severity": "critical"
}
}
]
},
{
"name": "kubernetes-system",
"rules": [
{
"alert": "KubeNodeNotReady",
"annotations": {
"message": "{{ $labels.node }} has been unready for more than an hour.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubenodenotready"
},
"expr": "kube_node_status_condition{job=\"kube-state-metrics\",condition=\"Ready\",status=\"true\"} == 0\n",
"for": "1h",
"labels": {
"severity": "warning"
}
},
{
"alert": "KubeVersionMismatch",
"annotations": {
"message": "There are {{ $value }} different semantic versions of Kubernetes components running.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeversionmismatch"
},
"expr": "count(count by (gitVersion) (label_replace(kubernetes_build_info{job!=\"coredns\"},\"gitVersion\",\"$1\",\"gitVersion\",\"(v[0-9]*.[0-9]*.[0-9]*).*\"))) > 1\n",
"for": "1h",
"labels": {
"severity": "warning"
}
},
{
"alert": "KubeClientErrors",
"annotations": {
"message": "Kubernetes API server client '{{ $labels.job }}/{{ $labels.instance }}' is experiencing {{ printf \"%0.0f\" $value }}% errors.'",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeclienterrors"
},
"expr": "(sum(rate(rest_client_requests_total{code=~\"5..\"}[5m])) by (instance, job)\n /\nsum(rate(rest_client_requests_total[5m])) by (instance, job))\n* 100 > 1\n",
"for": "15m",
"labels": {
"severity": "warning"
}
},
{
"alert": "KubeClientErrors",
"annotations": {
"message": "Kubernetes API server client '{{ $labels.job }}/{{ $labels.instance }}' is experiencing {{ printf \"%0.0f\" $value }} errors / second.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeclienterrors"
},
"expr": "sum(rate(ksm_scrape_error_total{job=\"kube-state-metrics\"}[5m])) by (instance, job) > 0.1\n",
"for": "15m",
"labels": {
"severity": "warning"
}
},
{
"alert": "KubeletTooManyPods",
"annotations": {
"message": "Kubelet {{ $labels.instance }} is running {{ $value }} Pods, close to the limit of 110.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubelettoomanypods"
},
"expr": "kubelet_running_pod_count{job=\"kubelet\"} > 110 * 0.9\n",
"for": "15m",
"labels": {
"severity": "warning"
}
},
{
"alert": "KubeAPILatencyHigh",
"annotations": {
"message": "The API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapilatencyhigh"
},
"expr": "cluster_quantile:apiserver_request_latencies:histogram_quantile{job=\"apiserver\",quantile=\"0.99\",subresource!=\"log\",verb!~\"^(?:LIST|WATCH|WATCHLIST|PROXY|CONNECT)$\"} > 1\n",
"for": "10m",
"labels": {
"severity": "warning"
}
},
{
"alert": "KubeAPILatencyHigh",
"annotations": {
"message": "The API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapilatencyhigh"
},
"expr": "cluster_quantile:apiserver_request_latencies:histogram_quantile{job=\"apiserver\",quantile=\"0.99\",subresource!=\"log\",verb!~\"^(?:LIST|WATCH|WATCHLIST|PROXY|CONNECT)$\"} > 4\n",
"for": "10m",
"labels": {
"severity": "critical"
}
},
{
"alert": "KubeAPIErrorsHigh",
"annotations": {
"message": "API server is returning errors for {{ $value }}% of requests.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapierrorshigh"
},
"expr": "sum(rate(apiserver_request_count{job=\"apiserver\",code=~\"^(?:5..)$\"}[5m])) without(instance, pod)\n /\nsum(rate(apiserver_request_count{job=\"apiserver\"}[5m])) without(instance, pod) * 100 > 10\n",
"for": "10m",
"labels": {
"severity": "critical"
}
},
{
"alert": "KubeAPIErrorsHigh",
"annotations": {
"message": "API server is returning errors for {{ $value }}% of requests.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapierrorshigh"
},
"expr": "sum(rate(apiserver_request_count{job=\"apiserver\",code=~\"^(?:5..)$\"}[5m])) without(instance, pod)\n /\nsum(rate(apiserver_request_count{job=\"apiserver\"}[5m])) without(instance, pod) * 100 > 5\n",
"for": "10m",
"labels": {
"severity": "warning"
}
},
{
"alert": "KubeClientCertificateExpiration",
"annotations": {
"message": "A client certificate used to authenticate to the apiserver is expiring in less than 7 days.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeclientcertificateexpiration"
},
"expr": "histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job=\"apiserver\"}[5m]))) < 604800\n",
"labels": {
"severity": "warning"
}
},
{
"alert": "KubeClientCertificateExpiration",
"annotations": {
"message": "A client certificate used to authenticate to the apiserver is expiring in less than 24 hours.",
"runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeclientcertificateexpiration"
},
"expr": "histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job=\"apiserver\"}[5m]))) < 86400\n",
"labels": {
"severity": "critical"
}
}
]
}
]
}
kubeprom.yaml: |-
{
"groups": [
{
"name": "kube-prometheus-node-recording.rules",
"rules": [
{
"expr": "sum(rate(node_cpu_seconds_total{mode!=\"idle\",mode!=\"iowait\"}[3m])) BY (instance)",
"record": "instance:node_cpu:rate:sum"
},
{
"expr": "sum((node_filesystem_size_bytes{mountpoint=\"/\"} - node_filesystem_free_bytes{mountpoint=\"/\"})) BY (instance)",
"record": "instance:node_filesystem_usage:sum"
},
{
"expr": "sum(rate(node_network_receive_bytes_total[3m])) BY (instance)",
"record": "instance:node_network_receive_bytes:rate:sum"
},
{
"expr": "sum(rate(node_network_transmit_bytes_total[3m])) BY (instance)",
"record": "instance:node_network_transmit_bytes:rate:sum"
},
{
"expr": "sum(rate(node_cpu_seconds_total{mode!=\"idle\",mode!=\"iowait\"}[5m])) WITHOUT (cpu, mode) / ON(instance) GROUP_LEFT() count(sum(node_cpu_seconds_total) BY (instance, cpu)) BY (instance)",
"record": "instance:node_cpu:ratio"
},
{
"expr": "sum(rate(node_cpu_seconds_total{mode!=\"idle\",mode!=\"iowait\"}[5m]))",
"record": "cluster:node_cpu:sum_rate5m"
},
{
"expr": "cluster:node_cpu_seconds_total:rate5m / count(sum(node_cpu_seconds_total) BY (instance, cpu))",
"record": "cluster:node_cpu:ratio"
}
]
},
{
"name": "kube-prometheus-node-alerting.rules",
"rules": [
{
"alert": "NodeDiskRunningFull",
"annotations": {
"message": "Device {{ $labels.device }} of node-exporter {{ $labels.namespace }}/{{ $labels.pod }} will be full within the next 24 hours."
},
"expr": "(node:node_filesystem_usage: > 0.85) and (predict_linear(node:node_filesystem_avail:[6h], 3600 * 24) < 0)\n",
"for": "30m",
"labels": {
"severity": "warning"
}
},
{
"alert": "NodeDiskRunningFull",
"annotations": {
"message": "Device {{ $labels.device }} of node-exporter {{ $labels.namespace }}/{{ $labels.pod }} will be full within the next 2 hours."
},
"expr": "(node:node_filesystem_usage: > 0.85) and (predict_linear(node:node_filesystem_avail:[30m], 3600 * 2) < 0)\n",
"for": "10m",
"labels": {
"severity": "critical"
}
}
]
},
{
"name": "prometheus.rules",
"rules": [
{
"alert": "PrometheusConfigReloadFailed",
"annotations": {
"description": "Reloading Prometheus' configuration has failed for {{$labels.namespace}}/{{$labels.pod}}",
"summary": "Reloading Prometheus' configuration failed"
},
"expr": "prometheus_config_last_reload_successful{job=\"prometheus\"} == 0\n",
"for": "10m",
"labels": {
"severity": "warning"
}
},
{
"alert": "PrometheusNotificationQueueRunningFull",
"annotations": {
"description": "Prometheus' alert notification queue is running full for {{$labels.namespace}}/{{ $labels.pod}}",
"summary": "Prometheus' alert notification queue is running full"
},
"expr": "predict_linear(prometheus_notifications_queue_length{job=\"prometheus\"}[5m], 60 * 30) > prometheus_notifications_queue_capacity{job=\"prometheus\"}\n",
"for": "10m",
"labels": {
"severity": "warning"
}
},
{
"alert": "PrometheusErrorSendingAlerts",
"annotations": {
"description": "Errors while sending alerts from Prometheus {{$labels.namespace}}/{{ $labels.pod}} to Alertmanager {{$labels.Alertmanager}}",
"summary": "Errors while sending alert from Prometheus"
},
"expr": "rate(prometheus_notifications_errors_total{job=\"prometheus\"}[5m]) / rate(prometheus_notifications_sent_total{job=\"prometheus\"}[5m]) > 0.01\n",
"for": "10m",
"labels": {
"severity": "warning"
}
},
{
"alert": "PrometheusErrorSendingAlerts",
"annotations": {
"description": "Errors while sending alerts from Prometheus {{$labels.namespace}}/{{ $labels.pod}} to Alertmanager {{$labels.Alertmanager}}",
"summary": "Errors while sending alerts from Prometheus"
},
"expr": "rate(prometheus_notifications_errors_total{job=\"prometheus\"}[5m]) / rate(prometheus_notifications_sent_total{job=\"prometheus\"}[5m]) > 0.03\n",
"for": "10m",
"labels": {
"severity": "critical"
}
},
{
"alert": "PrometheusNotConnectedToAlertmanagers",
"annotations": {
"description": "Prometheus {{ $labels.namespace }}/{{ $labels.pod}} is not connected to any Alertmanagers",
"summary": "Prometheus is not connected to any Alertmanagers"
},
"expr": "prometheus_notifications_alertmanagers_discovered{job=\"prometheus\"} < 1\n",
"for": "10m",
"labels": {
"severity": "warning"
}
},
{
"alert": "PrometheusTSDBReloadsFailing",
"annotations": {
"description": "{{$labels.job}} at {{$labels.instance}} had {{$value | humanize}} reload failures over the last four hours.",
"summary": "Prometheus has issues reloading data blocks from disk"
},
"expr": "increase(prometheus_tsdb_reloads_failures_total{job=\"prometheus\"}[2h]) > 0\n",
"for": "12h",
"labels": {
"severity": "warning"
}
},
{
"alert": "PrometheusTSDBCompactionsFailing",
"annotations": {
"description": "{{$labels.job}} at {{$labels.instance}} had {{$value | humanize}} compaction failures over the last four hours.",
"summary": "Prometheus has issues compacting sample blocks"
},
"expr": "increase(prometheus_tsdb_compactions_failed_total{job=\"prometheus\"}[2h]) > 0\n",
"for": "12h",
"labels": {
"severity": "warning"
}
},
{
"alert": "PrometheusTSDBWALCorruptions",
"annotations": {
"description": "{{$labels.job}} at {{$labels.instance}} has a corrupted write-ahead log (WAL).",
"summary": "Prometheus write-ahead log is corrupted"
},
"expr": "tsdb_wal_corruptions_total{job=\"prometheus\"} > 0\n",
"for": "4h",
"labels": {
"severity": "warning"
}
},
{
"alert": "PrometheusNotIngestingSamples",
"annotations": {
"description": "Prometheus {{ $labels.namespace }}/{{ $labels.pod}} isn't ingesting samples.",
"summary": "Prometheus isn't ingesting samples"
},
"expr": "rate(prometheus_tsdb_head_samples_appended_total{job=\"prometheus\"}[5m]) <= 0\n",
"for": "10m",
"labels": {
"severity": "warning"
}
},
{
"alert": "PrometheusTargetScrapesDuplicate",
"annotations": {
"description": "{{$labels.namespace}}/{{$labels.pod}} has many samples rejected due to duplicate timestamps but different values",
"summary": "Prometheus has many samples rejected"
},
"expr": "increase(prometheus_target_scrapes_sample_duplicate_timestamp_total{job=\"prometheus\"}[5m]) > 0\n",
"for": "10m",
"labels": {
"severity": "warning"
}
}
]
},
{
"name": "general.rules",
"rules": [
{
"alert": "TargetDown",
"annotations": {
"message": "{{ $value }}% of the {{ $labels.job }} targets are down."
},
"expr": "100 * (count(up == 0) BY (job) / count(up) BY (job)) > 10",
"for": "10m",
"labels": {
"severity": "warning"
}
}
]
}
]
}

View File

@ -11,10 +11,10 @@ Typhoon distributes upstream Kubernetes, architectural conventions, and cluster
## Features <a href="https://www.cncf.io/certification/software-conformance/"><img align="right" src="https://storage.googleapis.com/poseidon/certified-kubernetes.png"></a>
* Kubernetes v1.12.1 (upstream, via [kubernetes-incubator/bootkube](https://github.com/kubernetes-incubator/bootkube))
* Single or multi-master, workloads isolated on workers, [Calico](https://www.projectcalico.org/) or [flannel](https://github.com/coreos/flannel) networking
* Kubernetes v1.13.4 (upstream, via [kubernetes-incubator/bootkube](https://github.com/kubernetes-incubator/bootkube))
* Single or multi-master, [Calico](https://www.projectcalico.org/) or [flannel](https://github.com/coreos/flannel) networking
* On-cluster etcd with TLS, [RBAC](https://kubernetes.io/docs/admin/authorization/rbac/)-enabled, [network policy](https://kubernetes.io/docs/concepts/services-networking/network-policies/)
* Advanced features like [worker pools](https://typhoon.psdn.io/advanced/worker-pools/)
* Advanced features like [worker pools](https://typhoon.psdn.io/advanced/worker-pools/), [spot](https://typhoon.psdn.io/cl/aws/#spot) workers, and [snippets](https://typhoon.psdn.io/advanced/customization/#container-linux) customization
* Ready for Ingress, Prometheus, Grafana, and other optional [addons](https://typhoon.psdn.io/addons/overview/)
## Docs

View File

@ -1,6 +1,6 @@
# Self-hosted Kubernetes assets (kubeconfig, manifests)
module "bootkube" {
source = "git::https://github.com/poseidon/terraform-render-bootkube.git?ref=81f19507faabf411db9c760d55f3d03f7d78f4c9"
source = "git::https://github.com/poseidon/terraform-render-bootkube.git?ref=953521dbba49eb6a39204f30a3978730eac01e11"
cluster_name = "${var.cluster_name}"
api_servers = ["${format("%s.%s", var.cluster_name, var.dns_zone)}"]
@ -11,4 +11,5 @@ module "bootkube" {
pod_cidr = "${var.pod_cidr}"
service_cidr = "${var.service_cidr}"
cluster_domain_suffix = "${var.cluster_domain_suffix}"
enable_reporting = "${var.enable_reporting}"
}

View File

@ -7,7 +7,7 @@ systemd:
- name: 40-etcd-cluster.conf
contents: |
[Service]
Environment="ETCD_IMAGE_TAG=v3.3.10"
Environment="ETCD_IMAGE_TAG=v3.3.12"
Environment="ETCD_NAME=${etcd_name}"
Environment="ETCD_ADVERTISE_CLIENT_URLS=https://${etcd_domain}:2379"
Environment="ETCD_INITIAL_ADVERTISE_PEER_URLS=https://${etcd_domain}:2380"
@ -78,7 +78,7 @@ systemd:
--authentication-token-webhook \
--authorization-mode=Webhook \
--client-ca-file=/etc/kubernetes/ca.crt \
--cluster_dns=${k8s_dns_service_ip} \
--cluster_dns=${cluster_dns_service_ip} \
--cluster_domain=${cluster_domain_suffix} \
--cni-conf-dir=/etc/kubernetes/cni/net.d \
--exit-on-lock-contention \
@ -88,6 +88,7 @@ systemd:
--node-labels=node-role.kubernetes.io/master \
--node-labels=node-role.kubernetes.io/controller="true" \
--pod-manifest-path=/etc/kubernetes/manifests \
--read-only-port=0 \
--register-with-taints=node-role.kubernetes.io/master=:NoSchedule \
--volume-plugin-dir=/var/lib/kubelet/volumeplugins
ExecStop=-/usr/bin/rkt stop --uuid-file=/var/cache/kubelet-pod.uuid
@ -122,7 +123,7 @@ storage:
contents:
inline: |
KUBELET_IMAGE_URL=docker://k8s.gcr.io/hyperkube
KUBELET_IMAGE_TAG=v1.12.1
KUBELET_IMAGE_TAG=v1.13.4
- path: /etc/sysctl.d/max-user-watches.conf
filesystem: root
contents:
@ -142,17 +143,14 @@ storage:
set -e
# Move experimental manifests
[ -n "$(ls /opt/bootkube/assets/manifests-*/* 2>/dev/null)" ] && mv /opt/bootkube/assets/manifests-*/* /opt/bootkube/assets/manifests && rm -rf /opt/bootkube/assets/manifests-*
BOOTKUBE_ACI="$${BOOTKUBE_ACI:-quay.io/coreos/bootkube}"
BOOTKUBE_VERSION="$${BOOTKUBE_VERSION:-v0.13.0}"
BOOTKUBE_ASSETS="$${BOOTKUBE_ASSETS:-/opt/bootkube/assets}"
exec /usr/bin/rkt run \
--trust-keys-from-https \
--volume assets,kind=host,source=$${BOOTKUBE_ASSETS} \
--volume assets,kind=host,source=/opt/bootkube/assets \
--mount volume=assets,target=/assets \
--volume bootstrap,kind=host,source=/etc/kubernetes \
--mount volume=bootstrap,target=/etc/kubernetes \
$${RKT_OPTS} \
$${BOOTKUBE_ACI}:$${BOOTKUBE_VERSION} \
quay.io/coreos/bootkube:v0.14.0 \
--net=host \
--dns=host \
--exec=/bootkube -- start --asset-dir=/assets "$@"

View File

@ -24,12 +24,13 @@ resource "aws_instance" "controllers" {
instance_type = "${var.controller_type}"
ami = "${local.ami_id}"
user_data = "${element(data.ct_config.controller_ign.*.rendered, count.index)}"
user_data = "${element(data.ct_config.controller-ignitions.*.rendered, count.index)}"
# storage
root_block_device {
volume_type = "${var.disk_type}"
volume_size = "${var.disk_size}"
iops = "${var.disk_iops}"
}
# network
@ -38,12 +39,23 @@ resource "aws_instance" "controllers" {
vpc_security_group_ids = ["${aws_security_group.controller.id}"]
lifecycle {
ignore_changes = ["ami"]
ignore_changes = [
"ami",
"user_data",
]
}
}
# Controller Container Linux Config
data "template_file" "controller_config" {
# Controller Ignition configs
data "ct_config" "controller-ignitions" {
count = "${var.controller_count}"
content = "${element(data.template_file.controller-configs.*.rendered, count.index)}"
pretty_print = false
snippets = ["${var.controller_clc_snippets}"]
}
# Controller Container Linux configs
data "template_file" "controller-configs" {
count = "${var.controller_count}"
template = "${file("${path.module}/cl/controller.yaml.tmpl")}"
@ -56,10 +68,10 @@ data "template_file" "controller_config" {
# etcd0=https://cluster-etcd0.example.com,etcd1=https://cluster-etcd1.example.com,...
etcd_initial_cluster = "${join(",", data.template_file.etcds.*.rendered)}"
kubeconfig = "${indent(10, module.bootkube.kubeconfig)}"
ssh_authorized_key = "${var.ssh_authorized_key}"
k8s_dns_service_ip = "${cidrhost(var.service_cidr, 10)}"
cluster_domain_suffix = "${var.cluster_domain_suffix}"
kubeconfig = "${indent(10, module.bootkube.kubeconfig-kubelet)}"
ssh_authorized_key = "${var.ssh_authorized_key}"
cluster_dns_service_ip = "${cidrhost(var.service_cidr, 10)}"
cluster_domain_suffix = "${var.cluster_domain_suffix}"
}
}
@ -73,10 +85,3 @@ data "template_file" "etcds" {
dns_zone = "${var.dns_zone}"
}
}
data "ct_config" "controller_ign" {
count = "${var.controller_count}"
content = "${element(data.template_file.controller_config.*.rendered, count.index)}"
pretty_print = false
snippets = ["${var.controller_clc_snippets}"]
}

View File

@ -1,3 +1,7 @@
output "kubeconfig-admin" {
value = "${module.bootkube.kubeconfig-admin}"
}
# Outputs for Kubernetes Ingress
output "ingress_dns_name" {
@ -5,6 +9,11 @@ output "ingress_dns_name" {
description = "DNS name of the network load balancer for distributing traffic to Ingress controllers"
}
output "ingress_zone_id" {
value = "${aws_lb.nlb.zone_id}"
description = "Route53 zone id of the network load balancer DNS name that can be used in Route53 alias records"
}
# Outputs for worker pools
output "vpc_id" {
@ -23,7 +32,7 @@ output "worker_security_groups" {
}
output "kubeconfig" {
value = "${module.bootkube.kubeconfig}"
value = "${module.bootkube.kubeconfig-kubelet}"
}
# Outputs for custom load balancing

View File

@ -104,27 +104,6 @@ resource "aws_security_group_rule" "controller-kubelet-self" {
self = true
}
# Allow heapster / metrics-server to scrape kubelet read-only
resource "aws_security_group_rule" "controller-kubelet-read" {
security_group_id = "${aws_security_group.controller.id}"
type = "ingress"
protocol = "tcp"
from_port = 10255
to_port = 10255
source_security_group_id = "${aws_security_group.worker.id}"
}
resource "aws_security_group_rule" "controller-kubelet-read-self" {
security_group_id = "${aws_security_group.controller.id}"
type = "ingress"
protocol = "tcp"
from_port = 10255
to_port = 10255
self = true
}
resource "aws_security_group_rule" "controller-bgp" {
security_group_id = "${aws_security_group.controller.id}"
@ -300,27 +279,6 @@ resource "aws_security_group_rule" "worker-kubelet-self" {
self = true
}
# Allow heapster / metrics-server to scrape kubelet read-only
resource "aws_security_group_rule" "worker-kubelet-read" {
security_group_id = "${aws_security_group.worker.id}"
type = "ingress"
protocol = "tcp"
from_port = 10255
to_port = 10255
source_security_group_id = "${aws_security_group.controller.id}"
}
resource "aws_security_group_rule" "worker-kubelet-read-self" {
security_group_id = "${aws_security_group.worker.id}"
type = "ingress"
protocol = "tcp"
from_port = 10255
to_port = 10255
self = true
}
resource "aws_security_group_rule" "worker-bgp" {
security_group_id = "${aws_security_group.worker.id}"

View File

@ -31,13 +31,13 @@ variable "worker_count" {
variable "controller_type" {
type = "string"
default = "t2.small"
default = "t3.small"
description = "EC2 instance type for controllers"
}
variable "worker_type" {
type = "string"
default = "t2.small"
default = "t3.small"
description = "EC2 instance type for workers"
}
@ -59,6 +59,12 @@ variable "disk_type" {
description = "Type of the EBS volume (e.g. standard, gp2, io1)"
}
variable "disk_iops" {
type = "string"
default = "0"
description = "IOPS of the EBS volume (e.g. 100)"
}
variable "worker_price" {
type = "string"
default = ""
@ -128,3 +134,9 @@ variable "cluster_domain_suffix" {
type = "string"
default = "cluster.local"
}
variable "enable_reporting" {
type = "string"
description = "Enable usage or analytics reporting to upstreams (Calico)"
default = "false"
}

View File

@ -13,7 +13,7 @@ module "workers" {
spot_price = "${var.worker_price}"
# configuration
kubeconfig = "${module.bootkube.kubeconfig}"
kubeconfig = "${module.bootkube.kubeconfig-kubelet}"
ssh_authorized_key = "${var.ssh_authorized_key}"
service_cidr = "${var.service_cidr}"
cluster_domain_suffix = "${var.cluster_domain_suffix}"

View File

@ -51,7 +51,7 @@ systemd:
--authentication-token-webhook \
--authorization-mode=Webhook \
--client-ca-file=/etc/kubernetes/ca.crt \
--cluster_dns=${k8s_dns_service_ip} \
--cluster_dns=${cluster_dns_service_ip} \
--cluster_domain=${cluster_domain_suffix} \
--cni-conf-dir=/etc/kubernetes/cni/net.d \
--exit-on-lock-contention \
@ -60,6 +60,7 @@ systemd:
--network-plugin=cni \
--node-labels=node-role.kubernetes.io/node \
--pod-manifest-path=/etc/kubernetes/manifests \
--read-only-port=0 \
--volume-plugin-dir=/var/lib/kubelet/volumeplugins
ExecStop=-/usr/bin/rkt stop --uuid-file=/var/cache/kubelet-pod.uuid
Restart=always
@ -92,7 +93,7 @@ storage:
contents:
inline: |
KUBELET_IMAGE_URL=docker://k8s.gcr.io/hyperkube
KUBELET_IMAGE_TAG=v1.12.1
KUBELET_IMAGE_TAG=v1.13.4
- path: /etc/sysctl.d/max-user-watches.conf
filesystem: root
contents:
@ -110,7 +111,7 @@ storage:
--volume config,kind=host,source=/etc/kubernetes \
--mount volume=config,target=/etc/kubernetes \
--insecure-options=image \
docker://k8s.gcr.io/hyperkube:v1.12.1 \
docker://k8s.gcr.io/hyperkube:v1.13.4 \
--net=host \
--dns=host \
--exec=/kubectl -- --kubeconfig=/etc/kubernetes/kubeconfig delete node $(hostname)

View File

@ -30,7 +30,7 @@ variable "count" {
variable "instance_type" {
type = "string"
default = "t2.small"
default = "t3.small"
description = "EC2 instance type"
}
@ -52,6 +52,12 @@ variable "disk_type" {
description = "Type of the EBS volume (e.g. standard, gp2, io1)"
}
variable "disk_iops" {
type = "string"
default = "0"
description = "IOPS of the EBS volume (required for io1)"
}
variable "spot_price" {
type = "string"
default = ""

View File

@ -46,12 +46,13 @@ resource "aws_launch_configuration" "worker" {
spot_price = "${var.spot_price}"
enable_monitoring = false
user_data = "${data.ct_config.worker_ign.rendered}"
user_data = "${data.ct_config.worker-ignition.rendered}"
# storage
root_block_device {
volume_type = "${var.disk_type}"
volume_size = "${var.disk_size}"
iops = "${var.disk_iops}"
}
# network
@ -64,20 +65,21 @@ resource "aws_launch_configuration" "worker" {
}
}
# Worker Container Linux Config
data "template_file" "worker_config" {
template = "${file("${path.module}/cl/worker.yaml.tmpl")}"
vars = {
kubeconfig = "${indent(10, var.kubeconfig)}"
ssh_authorized_key = "${var.ssh_authorized_key}"
k8s_dns_service_ip = "${cidrhost(var.service_cidr, 10)}"
cluster_domain_suffix = "${var.cluster_domain_suffix}"
}
}
data "ct_config" "worker_ign" {
content = "${data.template_file.worker_config.rendered}"
# Worker Ignition config
data "ct_config" "worker-ignition" {
content = "${data.template_file.worker-config.rendered}"
pretty_print = false
snippets = ["${var.clc_snippets}"]
}
# Worker Container Linux config
data "template_file" "worker-config" {
template = "${file("${path.module}/cl/worker.yaml.tmpl")}"
vars = {
kubeconfig = "${indent(10, var.kubeconfig)}"
ssh_authorized_key = "${var.ssh_authorized_key}"
cluster_dns_service_ip = "${cidrhost(var.service_cidr, 10)}"
cluster_domain_suffix = "${var.cluster_domain_suffix}"
}
}

View File

@ -11,10 +11,10 @@ Typhoon distributes upstream Kubernetes, architectural conventions, and cluster
## Features <a href="https://www.cncf.io/certification/software-conformance/"><img align="right" src="https://storage.googleapis.com/poseidon/certified-kubernetes.png"></a>
* Kubernetes v1.12.1 (upstream, via [kubernetes-incubator/bootkube](https://github.com/kubernetes-incubator/bootkube))
* Single or multi-master, workloads isolated on workers, [Calico](https://www.projectcalico.org/) or [flannel](https://github.com/coreos/flannel) networking
* Kubernetes v1.13.4 (upstream, via [kubernetes-incubator/bootkube](https://github.com/kubernetes-incubator/bootkube))
* Single or multi-master, [Calico](https://www.projectcalico.org/) or [flannel](https://github.com/coreos/flannel) networking
* On-cluster etcd with TLS, [RBAC](https://kubernetes.io/docs/admin/authorization/rbac/)-enabled, [network policy](https://kubernetes.io/docs/concepts/services-networking/network-policies/)
* Advanced features like [worker pools](https://typhoon.psdn.io/advanced/worker-pools/)
* Advanced features like [worker pools](https://typhoon.psdn.io/advanced/worker-pools/) and [spot](https://typhoon.psdn.io/cl/aws/#spot) workers
* Ready for Ingress, Prometheus, Grafana, and other optional [addons](https://typhoon.psdn.io/addons/overview/)
## Docs

View File

@ -1,6 +1,6 @@
# Self-hosted Kubernetes assets (kubeconfig, manifests)
module "bootkube" {
source = "git::https://github.com/poseidon/terraform-render-bootkube.git?ref=81f19507faabf411db9c760d55f3d03f7d78f4c9"
source = "git::https://github.com/poseidon/terraform-render-bootkube.git?ref=953521dbba49eb6a39204f30a3978730eac01e11"
cluster_name = "${var.cluster_name}"
api_servers = ["${format("%s.%s", var.cluster_name, var.dns_zone)}"]
@ -11,6 +11,7 @@ module "bootkube" {
pod_cidr = "${var.pod_cidr}"
service_cidr = "${var.service_cidr}"
cluster_domain_suffix = "${var.cluster_domain_suffix}"
enable_reporting = "${var.enable_reporting}"
# Fedora
trusted_certs_dir = "/etc/pki/tls/certs"

View File

@ -19,24 +19,9 @@ write_files:
ETCD_PEER_CERT_FILE=/etc/ssl/certs/etcd/peer.crt
ETCD_PEER_KEY_FILE=/etc/ssl/certs/etcd/peer.key
ETCD_PEER_CLIENT_CERT_AUTH=true
- path: /etc/systemd/system/cloud-metadata.service
content: |
[Unit]
Description=Cloud metadata agent
[Service]
Type=oneshot
Environment=OUTPUT=/run/metadata/cloud
ExecStart=/usr/bin/mkdir -p /run/metadata
ExecStart=/usr/bin/bash -c 'echo "HOSTNAME_OVERRIDE=$(curl\
--url http://169.254.169.254/latest/meta-data/local-ipv4\
--retry 10)" > $${OUTPUT}'
[Install]
WantedBy=multi-user.target
- path: /etc/systemd/system/kubelet.service.d/10-typhoon.conf
content: |
[Unit]
Requires=cloud-metadata.service
After=cloud-metadata.service
Wants=rpc-statd.service
[Service]
ExecStartPre=/bin/mkdir -p /opt/cni/bin
@ -55,7 +40,7 @@ write_files:
--authentication-token-webhook \
--authorization-mode=Webhook \
--client-ca-file=/etc/kubernetes/ca.crt \
--cluster_dns=${k8s_dns_service_ip} \
--cluster_dns=${cluster_dns_service_ip} \
--cluster_domain=${cluster_domain_suffix} \
--cni-conf-dir=/etc/kubernetes/cni/net.d \
--exit-on-lock-contention \
@ -65,6 +50,7 @@ write_files:
--node-labels=node-role.kubernetes.io/master \
--node-labels=node-role.kubernetes.io/controller="true" \
--pod-manifest-path=/etc/kubernetes/manifests \
--read-only-port=0 \
--register-with-taints=node-role.kubernetes.io/master=:NoSchedule \
--volume-plugin-dir=/var/lib/kubelet/volumeplugins"
- path: /etc/kubernetes/kubeconfig
@ -92,11 +78,10 @@ bootcmd:
runcmd:
- [systemctl, daemon-reload]
- [systemctl, restart, NetworkManager]
- "atomic install --system --name=etcd quay.io/poseidon/etcd:v3.3.10"
- "atomic install --system --name=kubelet quay.io/poseidon/kubelet:v1.12.1"
- "atomic install --system --name=bootkube quay.io/poseidon/bootkube:v0.13.0"
- "atomic install --system --name=etcd quay.io/poseidon/etcd:v3.3.12"
- "atomic install --system --name=kubelet quay.io/poseidon/kubelet:v1.13.4"
- "atomic install --system --name=bootkube quay.io/poseidon/bootkube:v0.14.0"
- [systemctl, start, --no-block, etcd.service]
- [systemctl, enable, cloud-metadata.service]
- [systemctl, start, --no-block, kubelet.service]
users:
- default

View File

@ -30,6 +30,7 @@ resource "aws_instance" "controllers" {
root_block_device {
volume_type = "${var.disk_type}"
volume_size = "${var.disk_size}"
iops = "${var.disk_iops}"
}
# network
@ -38,7 +39,10 @@ resource "aws_instance" "controllers" {
vpc_security_group_ids = ["${aws_security_group.controller.id}"]
lifecycle {
ignore_changes = ["ami"]
ignore_changes = [
"ami",
"user_data",
]
}
}
@ -56,10 +60,10 @@ data "template_file" "controller-cloudinit" {
# etcd0=https://cluster-etcd0.example.com,etcd1=https://cluster-etcd1.example.com,...
etcd_initial_cluster = "${join(",", data.template_file.etcds.*.rendered)}"
kubeconfig = "${indent(6, module.bootkube.kubeconfig)}"
ssh_authorized_key = "${var.ssh_authorized_key}"
k8s_dns_service_ip = "${cidrhost(var.service_cidr, 10)}"
cluster_domain_suffix = "${var.cluster_domain_suffix}"
kubeconfig = "${indent(6, module.bootkube.kubeconfig-kubelet)}"
ssh_authorized_key = "${var.ssh_authorized_key}"
cluster_dns_service_ip = "${cidrhost(var.service_cidr, 10)}"
cluster_domain_suffix = "${var.cluster_domain_suffix}"
}
}

View File

@ -1,3 +1,7 @@
output "kubeconfig-admin" {
value = "${module.bootkube.kubeconfig-admin}"
}
# Outputs for Kubernetes Ingress
output "ingress_dns_name" {
@ -5,6 +9,11 @@ output "ingress_dns_name" {
description = "DNS name of the network load balancer for distributing traffic to Ingress controllers"
}
output "ingress_zone_id" {
value = "${aws_lb.nlb.zone_id}"
description = "Route53 zone id of the network load balancer DNS name that can be used in Route53 alias records"
}
# Outputs for worker pools
output "vpc_id" {
@ -23,7 +32,7 @@ output "worker_security_groups" {
}
output "kubeconfig" {
value = "${module.bootkube.kubeconfig}"
value = "${module.bootkube.kubeconfig-kubelet}"
}
# Outputs for custom load balancing

View File

@ -104,27 +104,6 @@ resource "aws_security_group_rule" "controller-kubelet-self" {
self = true
}
# Allow heapster / metrics-server to scrape kubelet read-only
resource "aws_security_group_rule" "controller-kubelet-read" {
security_group_id = "${aws_security_group.controller.id}"
type = "ingress"
protocol = "tcp"
from_port = 10255
to_port = 10255
source_security_group_id = "${aws_security_group.worker.id}"
}
resource "aws_security_group_rule" "controller-kubelet-read-self" {
security_group_id = "${aws_security_group.controller.id}"
type = "ingress"
protocol = "tcp"
from_port = 10255
to_port = 10255
self = true
}
resource "aws_security_group_rule" "controller-bgp" {
security_group_id = "${aws_security_group.controller.id}"
@ -300,27 +279,6 @@ resource "aws_security_group_rule" "worker-kubelet-self" {
self = true
}
# Allow heapster / metrics-server to scrape kubelet read-only
resource "aws_security_group_rule" "worker-kubelet-read" {
security_group_id = "${aws_security_group.worker.id}"
type = "ingress"
protocol = "tcp"
from_port = 10255
to_port = 10255
source_security_group_id = "${aws_security_group.controller.id}"
}
resource "aws_security_group_rule" "worker-kubelet-read-self" {
security_group_id = "${aws_security_group.worker.id}"
type = "ingress"
protocol = "tcp"
from_port = 10255
to_port = 10255
self = true
}
resource "aws_security_group_rule" "worker-bgp" {
security_group_id = "${aws_security_group.worker.id}"

View File

@ -31,13 +31,13 @@ variable "worker_count" {
variable "controller_type" {
type = "string"
default = "t2.small"
default = "t3.small"
description = "EC2 instance type for controllers"
}
variable "worker_type" {
type = "string"
default = "t2.small"
default = "t3.small"
description = "EC2 instance type for workers"
}
@ -53,6 +53,12 @@ variable "disk_type" {
description = "Type of the EBS volume (e.g. standard, gp2, io1)"
}
variable "disk_iops" {
type = "string"
default = "0"
description = "IOPS of the EBS volume (e.g. 100)"
}
variable "worker_price" {
type = "string"
default = ""
@ -110,3 +116,9 @@ variable "cluster_domain_suffix" {
type = "string"
default = "cluster.local"
}
variable "enable_reporting" {
type = "string"
description = "Enable usage or analytics reporting to upstreams (Calico)"
default = "false"
}

View File

@ -12,7 +12,7 @@ module "workers" {
spot_price = "${var.worker_price}"
# configuration
kubeconfig = "${module.bootkube.kubeconfig}"
kubeconfig = "${module.bootkube.kubeconfig-kubelet}"
ssh_authorized_key = "${var.ssh_authorized_key}"
service_cidr = "${var.service_cidr}"
cluster_domain_suffix = "${var.cluster_domain_suffix}"

View File

@ -1,23 +1,8 @@
#cloud-config
write_files:
- path: /etc/systemd/system/cloud-metadata.service
content: |
[Unit]
Description=Cloud metadata agent
[Service]
Type=oneshot
Environment=OUTPUT=/run/metadata/cloud
ExecStart=/usr/bin/mkdir -p /run/metadata
ExecStart=/usr/bin/bash -c 'echo "HOSTNAME_OVERRIDE=$(curl\
--url http://169.254.169.254/latest/meta-data/local-ipv4\
--retry 10)" > $${OUTPUT}'
[Install]
WantedBy=multi-user.target
- path: /etc/systemd/system/kubelet.service.d/10-typhoon.conf
content: |
[Unit]
Requires=cloud-metadata.service
After=cloud-metadata.service
Wants=rpc-statd.service
[Service]
ExecStartPre=/bin/mkdir -p /opt/cni/bin
@ -34,7 +19,7 @@ write_files:
--authentication-token-webhook \
--authorization-mode=Webhook \
--client-ca-file=/etc/kubernetes/ca.crt \
--cluster_dns=${k8s_dns_service_ip} \
--cluster_dns=${cluster_dns_service_ip} \
--cluster_domain=${cluster_domain_suffix} \
--cni-conf-dir=/etc/kubernetes/cni/net.d \
--exit-on-lock-contention \
@ -43,6 +28,7 @@ write_files:
--network-plugin=cni \
--node-labels=node-role.kubernetes.io/node \
--pod-manifest-path=/etc/kubernetes/manifests \
--read-only-port=0 \
--volume-plugin-dir=/var/lib/kubelet/volumeplugins"
- path: /etc/kubernetes/kubeconfig
permissions: '0644'
@ -68,8 +54,7 @@ bootcmd:
runcmd:
- [systemctl, daemon-reload]
- [systemctl, restart, NetworkManager]
- [systemctl, enable, cloud-metadata.service]
- "atomic install --system --name=kubelet quay.io/poseidon/kubelet:v1.12.1"
- "atomic install --system --name=kubelet quay.io/poseidon/kubelet:v1.13.4"
- [systemctl, start, --no-block, kubelet.service]
users:
- default

View File

@ -30,7 +30,7 @@ variable "count" {
variable "instance_type" {
type = "string"
default = "t2.small"
default = "t3.small"
description = "EC2 instance type"
}
@ -46,6 +46,12 @@ variable "disk_type" {
description = "Type of the EBS volume (e.g. standard, gp2, io1)"
}
variable "disk_iops" {
type = "string"
default = "0"
description = "IOPS of the EBS volume (required for io1)"
}
variable "spot_price" {
type = "string"
default = ""

View File

@ -52,6 +52,7 @@ resource "aws_launch_configuration" "worker" {
root_block_device {
volume_type = "${var.disk_type}"
volume_size = "${var.disk_size}"
iops = "${var.disk_iops}"
}
# network
@ -69,9 +70,9 @@ data "template_file" "worker-cloudinit" {
template = "${file("${path.module}/cloudinit/worker.yaml.tmpl")}"
vars = {
kubeconfig = "${indent(6, var.kubeconfig)}"
ssh_authorized_key = "${var.ssh_authorized_key}"
k8s_dns_service_ip = "${cidrhost(var.service_cidr, 10)}"
cluster_domain_suffix = "${var.cluster_domain_suffix}"
kubeconfig = "${indent(6, var.kubeconfig)}"
ssh_authorized_key = "${var.ssh_authorized_key}"
cluster_dns_service_ip = "${cidrhost(var.service_cidr, 10)}"
cluster_domain_suffix = "${var.cluster_domain_suffix}"
}
}

View File

@ -11,9 +11,10 @@ Typhoon distributes upstream Kubernetes, architectural conventions, and cluster
## Features <a href="https://www.cncf.io/certification/software-conformance/"><img align="right" src="https://storage.googleapis.com/poseidon/certified-kubernetes.png"></a>
* Kubernetes v1.12.1 (upstream, via [kubernetes-incubator/bootkube](https://github.com/kubernetes-incubator/bootkube))
* Single or multi-master, workloads isolated on workers, [flannel](https://github.com/coreos/flannel) networking
* Kubernetes v1.13.4 (upstream, via [kubernetes-incubator/bootkube](https://github.com/kubernetes-incubator/bootkube))
* Single or multi-master, [flannel](https://github.com/coreos/flannel) networking
* On-cluster etcd with TLS, [RBAC](https://kubernetes.io/docs/admin/authorization/rbac/)-enabled
* Advanced features like [worker pools](https://typhoon.psdn.io/advanced/worker-pools/), [low-priority](https://typhoon.psdn.io/cl/azure/#low-priority) workers, and [snippets](https://typhoon.psdn.io/advanced/customization/#container-linux) customization
* Ready for Ingress, Prometheus, Grafana, and other optional [addons](https://typhoon.psdn.io/addons/overview/)
## Docs

View File

@ -1,6 +1,6 @@
# Self-hosted Kubernetes assets (kubeconfig, manifests)
module "bootkube" {
source = "git::https://github.com/poseidon/terraform-render-bootkube.git?ref=81f19507faabf411db9c760d55f3d03f7d78f4c9"
source = "git::https://github.com/poseidon/terraform-render-bootkube.git?ref=953521dbba49eb6a39204f30a3978730eac01e11"
cluster_name = "${var.cluster_name}"
api_servers = ["${format("%s.%s", var.cluster_name, var.dns_zone)}"]
@ -10,4 +10,5 @@ module "bootkube" {
pod_cidr = "${var.pod_cidr}"
service_cidr = "${var.service_cidr}"
cluster_domain_suffix = "${var.cluster_domain_suffix}"
enable_reporting = "${var.enable_reporting}"
}

View File

@ -7,7 +7,7 @@ systemd:
- name: 40-etcd-cluster.conf
contents: |
[Service]
Environment="ETCD_IMAGE_TAG=v3.3.10"
Environment="ETCD_IMAGE_TAG=v3.3.12"
Environment="ETCD_NAME=${etcd_name}"
Environment="ETCD_ADVERTISE_CLIENT_URLS=https://${etcd_domain}:2379"
Environment="ETCD_INITIAL_ADVERTISE_PEER_URLS=https://${etcd_domain}:2380"
@ -78,7 +78,7 @@ systemd:
--authentication-token-webhook \
--authorization-mode=Webhook \
--client-ca-file=/etc/kubernetes/ca.crt \
--cluster_dns=${k8s_dns_service_ip} \
--cluster_dns=${cluster_dns_service_ip} \
--cluster_domain=${cluster_domain_suffix} \
--cni-conf-dir=/etc/kubernetes/cni/net.d \
--exit-on-lock-contention \
@ -88,6 +88,7 @@ systemd:
--node-labels=node-role.kubernetes.io/master \
--node-labels=node-role.kubernetes.io/controller="true" \
--pod-manifest-path=/etc/kubernetes/manifests \
--read-only-port=0 \
--register-with-taints=node-role.kubernetes.io/master=:NoSchedule \
--volume-plugin-dir=/var/lib/kubelet/volumeplugins
ExecStop=-/usr/bin/rkt stop --uuid-file=/var/cache/kubelet-pod.uuid
@ -122,7 +123,7 @@ storage:
contents:
inline: |
KUBELET_IMAGE_URL=docker://k8s.gcr.io/hyperkube
KUBELET_IMAGE_TAG=v1.12.1
KUBELET_IMAGE_TAG=v1.13.4
- path: /etc/sysctl.d/max-user-watches.conf
filesystem: root
contents:
@ -142,17 +143,14 @@ storage:
set -e
# Move experimental manifests
[ -n "$(ls /opt/bootkube/assets/manifests-*/* 2>/dev/null)" ] && mv /opt/bootkube/assets/manifests-*/* /opt/bootkube/assets/manifests && rm -rf /opt/bootkube/assets/manifests-*
BOOTKUBE_ACI="$${BOOTKUBE_ACI:-quay.io/coreos/bootkube}"
BOOTKUBE_VERSION="$${BOOTKUBE_VERSION:-v0.13.0}"
BOOTKUBE_ASSETS="$${BOOTKUBE_ASSETS:-/opt/bootkube/assets}"
exec /usr/bin/rkt run \
--trust-keys-from-https \
--volume assets,kind=host,source=$${BOOTKUBE_ASSETS} \
--volume assets,kind=host,source=/opt/bootkube/assets \
--mount volume=assets,target=/assets \
--volume bootstrap,kind=host,source=/etc/kubernetes \
--mount volume=bootstrap,target=/etc/kubernetes \
$${RKT_OPTS} \
$${BOOTKUBE_ACI}:$${BOOTKUBE_VERSION} \
quay.io/coreos/bootkube:v0.14.0 \
--net=host \
--dns=host \
--exec=/bootkube -- start --asset-dir=/assets "$@"

View File

@ -85,6 +85,7 @@ resource "azurerm_virtual_machine" "controllers" {
lifecycle {
ignore_changes = [
"storage_os_disk",
"os_profile",
]
}
}
@ -105,21 +106,25 @@ resource "azurerm_network_interface" "controllers" {
# public IPv4
public_ip_address_id = "${element(azurerm_public_ip.controllers.*.id, count.index)}"
# backend address pool to which the NIC should be added
load_balancer_backend_address_pools_ids = ["${azurerm_lb_backend_address_pool.controller.id}"]
}
}
# Add controller NICs to the controller backend address pool
resource "azurerm_network_interface_backend_address_pool_association" "controllers" {
network_interface_id = "${azurerm_network_interface.controllers.id}"
ip_configuration_name = "ip0"
backend_address_pool_id = "${azurerm_lb_backend_address_pool.controller.id}"
}
# Controller public IPv4 addresses
resource "azurerm_public_ip" "controllers" {
count = "${var.controller_count}"
resource_group_name = "${azurerm_resource_group.cluster.name}"
name = "${var.cluster_name}-controller-${count.index}"
location = "${azurerm_resource_group.cluster.location}"
sku = "Standard"
public_ip_address_allocation = "static"
name = "${var.cluster_name}-controller-${count.index}"
location = "${azurerm_resource_group.cluster.location}"
sku = "Standard"
allocation_method = "Static"
}
# Controller Ignition configs
@ -144,10 +149,10 @@ data "template_file" "controller-configs" {
# etcd0=https://cluster-etcd0.example.com,etcd1=https://cluster-etcd1.example.com,...
etcd_initial_cluster = "${join(",", data.template_file.etcds.*.rendered)}"
kubeconfig = "${indent(10, module.bootkube.kubeconfig)}"
ssh_authorized_key = "${var.ssh_authorized_key}"
k8s_dns_service_ip = "${cidrhost(var.service_cidr, 10)}"
cluster_domain_suffix = "${var.cluster_domain_suffix}"
kubeconfig = "${indent(10, module.bootkube.kubeconfig-kubelet)}"
ssh_authorized_key = "${var.ssh_authorized_key}"
cluster_dns_service_ip = "${cidrhost(var.service_cidr, 10)}"
cluster_domain_suffix = "${var.cluster_domain_suffix}"
}
}

View File

@ -17,20 +17,20 @@ resource "azurerm_dns_a_record" "apiserver" {
resource "azurerm_public_ip" "apiserver-ipv4" {
resource_group_name = "${azurerm_resource_group.cluster.name}"
name = "${var.cluster_name}-apiserver-ipv4"
location = "${var.region}"
sku = "Standard"
public_ip_address_allocation = "static"
name = "${var.cluster_name}-apiserver-ipv4"
location = "${var.region}"
sku = "Standard"
allocation_method = "Static"
}
# Static IPv4 address for the ingress frontend
resource "azurerm_public_ip" "ingress-ipv4" {
resource_group_name = "${azurerm_resource_group.cluster.name}"
name = "${var.cluster_name}-ingress-ipv4"
location = "${var.region}"
sku = "Standard"
public_ip_address_allocation = "static"
name = "${var.cluster_name}-ingress-ipv4"
location = "${var.region}"
sku = "Standard"
allocation_method = "Static"
}
# Network Load Balancer for apiservers and ingress

View File

@ -1,3 +1,7 @@
output "kubeconfig-admin" {
value = "${module.bootkube.kubeconfig-admin}"
}
# Outputs for Kubernetes Ingress
output "ingress_static_ipv4" {
@ -28,5 +32,5 @@ output "backend_address_pool_id" {
}
output "kubeconfig" {
value = "${module.bootkube.kubeconfig}"
value = "${module.bootkube.kubeconfig-kubelet}"
}

View File

@ -5,7 +5,7 @@ terraform {
}
provider "azurerm" {
version = "~> 1.16"
version = "~> 1.21"
}
provider "local" {
@ -23,4 +23,3 @@ provider "template" {
provider "tls" {
version = "~> 1.0"
}

View File

@ -117,22 +117,6 @@ resource "azurerm_network_security_rule" "controller-kubelet" {
destination_address_prefix = "${azurerm_subnet.controller.address_prefix}"
}
# Allow heapster / metrics-server to scrape kubelet read-only
resource "azurerm_network_security_rule" "controller-kubelet-read" {
resource_group_name = "${azurerm_resource_group.cluster.name}"
name = "allow-kubelet-read"
network_security_group_name = "${azurerm_network_security_group.controller.name}"
priority = "2035"
access = "Allow"
direction = "Inbound"
protocol = "Tcp"
source_port_range = "*"
destination_port_range = "10255"
source_address_prefix = "${azurerm_subnet.worker.address_prefix}"
destination_address_prefix = "${azurerm_subnet.controller.address_prefix}"
}
# Override Azure AllowVNetInBound and AllowAzureLoadBalancerInBound
# https://docs.microsoft.com/en-us/azure/virtual-network/security-overview#default-security-rules
@ -269,22 +253,6 @@ resource "azurerm_network_security_rule" "worker-kubelet" {
destination_address_prefix = "${azurerm_subnet.worker.address_prefix}"
}
# Allow heapster / metrics-server to scrape kubelet read-only
resource "azurerm_network_security_rule" "worker-kubelet-read" {
resource_group_name = "${azurerm_resource_group.cluster.name}"
name = "allow-kubelet-read"
network_security_group_name = "${azurerm_network_security_group.worker.name}"
priority = "2030"
access = "Allow"
direction = "Inbound"
protocol = "Tcp"
source_port_range = "*"
destination_port_range = "10255"
source_address_prefix = "${azurerm_subnet.worker.address_prefix}"
destination_address_prefix = "${azurerm_subnet.worker.address_prefix}"
}
# Override Azure AllowVNetInBound and AllowAzureLoadBalancerInBound
# https://docs.microsoft.com/en-us/azure/virtual-network/security-overview#default-security-rules

View File

@ -115,3 +115,9 @@ variable "cluster_domain_suffix" {
type = "string"
default = "cluster.local"
}
variable "enable_reporting" {
type = "string"
description = "Enable usage or analytics reporting to upstreams (Calico)"
default = "false"
}

View File

@ -15,7 +15,7 @@ module "workers" {
priority = "${var.worker_priority}"
# configuration
kubeconfig = "${module.bootkube.kubeconfig}"
kubeconfig = "${module.bootkube.kubeconfig-kubelet}"
ssh_authorized_key = "${var.ssh_authorized_key}"
service_cidr = "${var.service_cidr}"
cluster_domain_suffix = "${var.cluster_domain_suffix}"

View File

@ -51,7 +51,7 @@ systemd:
--authentication-token-webhook \
--authorization-mode=Webhook \
--client-ca-file=/etc/kubernetes/ca.crt \
--cluster_dns=${k8s_dns_service_ip} \
--cluster_dns=${cluster_dns_service_ip} \
--cluster_domain=${cluster_domain_suffix} \
--cni-conf-dir=/etc/kubernetes/cni/net.d \
--exit-on-lock-contention \
@ -60,6 +60,7 @@ systemd:
--network-plugin=cni \
--node-labels=node-role.kubernetes.io/node \
--pod-manifest-path=/etc/kubernetes/manifests \
--read-only-port=0 \
--volume-plugin-dir=/var/lib/kubelet/volumeplugins
ExecStop=-/usr/bin/rkt stop --uuid-file=/var/cache/kubelet-pod.uuid
Restart=always
@ -92,7 +93,7 @@ storage:
contents:
inline: |
KUBELET_IMAGE_URL=docker://k8s.gcr.io/hyperkube
KUBELET_IMAGE_TAG=v1.12.1
KUBELET_IMAGE_TAG=v1.13.4
- path: /etc/sysctl.d/max-user-watches.conf
filesystem: root
contents:
@ -110,7 +111,7 @@ storage:
--volume config,kind=host,source=/etc/kubernetes \
--mount volume=config,target=/etc/kubernetes \
--insecure-options=image \
docker://k8s.gcr.io/hyperkube:v1.12.1 \
docker://k8s.gcr.io/hyperkube:v1.13.4 \
--net=host \
--dns=host \
--exec=/kubectl -- --kubeconfig=/etc/kubernetes/kubeconfig delete node $(hostname | tr '[:upper:]' '[:lower:]')

View File

@ -37,7 +37,7 @@ resource "azurerm_virtual_machine_scale_set" "workers" {
os_profile {
computer_name_prefix = "${var.name}-worker-"
admin_username = "core"
custom_data = "${element(data.ct_config.worker-ignitions.*.rendered, count.index)}"
custom_data = "${data.ct_config.worker-ignition.rendered}"
}
# Azure mandates setting an ssh_key, even though Ignition custom_data handles it too
@ -58,6 +58,7 @@ resource "azurerm_virtual_machine_scale_set" "workers" {
ip_configuration {
name = "ip0"
primary = true
subnet_id = "${var.subnet_id}"
# backend address pool to which the NIC should be added
@ -66,8 +67,9 @@ resource "azurerm_virtual_machine_scale_set" "workers" {
}
# lifecycle
priority = "${var.priority}"
upgrade_policy_mode = "Manual"
priority = "${var.priority}"
eviction_policy = "Delete"
}
# Scale up or down to maintain desired number, tolerating deallocations.
@ -93,20 +95,20 @@ resource "azurerm_autoscale_setting" "workers" {
}
# Worker Ignition configs
data "ct_config" "worker-ignitions" {
content = "${data.template_file.worker-configs.rendered}"
data "ct_config" "worker-ignition" {
content = "${data.template_file.worker-config.rendered}"
pretty_print = false
snippets = ["${var.clc_snippets}"]
}
# Worker Container Linux configs
data "template_file" "worker-configs" {
data "template_file" "worker-config" {
template = "${file("${path.module}/cl/worker.yaml.tmpl")}"
vars = {
kubeconfig = "${indent(10, var.kubeconfig)}"
ssh_authorized_key = "${var.ssh_authorized_key}"
k8s_dns_service_ip = "${cidrhost(var.service_cidr, 10)}"
cluster_domain_suffix = "${var.cluster_domain_suffix}"
kubeconfig = "${indent(10, var.kubeconfig)}"
ssh_authorized_key = "${var.ssh_authorized_key}"
cluster_dns_service_ip = "${cidrhost(var.service_cidr, 10)}"
cluster_domain_suffix = "${var.cluster_domain_suffix}"
}
}

View File

@ -11,9 +11,10 @@ Typhoon distributes upstream Kubernetes, architectural conventions, and cluster
## Features <a href="https://www.cncf.io/certification/software-conformance/"><img align="right" src="https://storage.googleapis.com/poseidon/certified-kubernetes.png"></a>
* Kubernetes v1.12.1 (upstream, via [kubernetes-incubator/bootkube](https://github.com/kubernetes-incubator/bootkube))
* Single or multi-master, workloads isolated on workers, [Calico](https://www.projectcalico.org/) or [flannel](https://github.com/coreos/flannel) networking
* Kubernetes v1.13.4 (upstream, via [kubernetes-incubator/bootkube](https://github.com/kubernetes-incubator/bootkube))
* Single or multi-master, [Calico](https://www.projectcalico.org/) or [flannel](https://github.com/coreos/flannel) networking
* On-cluster etcd with TLS, [RBAC](https://kubernetes.io/docs/admin/authorization/rbac/)-enabled, [network policy](https://kubernetes.io/docs/concepts/services-networking/network-policies/)
* Advanced features like [snippets](https://typhoon.psdn.io/advanced/customization/#container-linux) customization
* Ready for Ingress, Prometheus, Grafana, and other optional [addons](https://typhoon.psdn.io/addons/overview/)
## Docs

View File

@ -1,6 +1,6 @@
# Self-hosted Kubernetes assets (kubeconfig, manifests)
module "bootkube" {
source = "git::https://github.com/poseidon/terraform-render-bootkube.git?ref=81f19507faabf411db9c760d55f3d03f7d78f4c9"
source = "git::https://github.com/poseidon/terraform-render-bootkube.git?ref=953521dbba49eb6a39204f30a3978730eac01e11"
cluster_name = "${var.cluster_name}"
api_servers = ["${var.k8s_domain_name}"]
@ -12,4 +12,5 @@ module "bootkube" {
pod_cidr = "${var.pod_cidr}"
service_cidr = "${var.service_cidr}"
cluster_domain_suffix = "${var.cluster_domain_suffix}"
enable_reporting = "${var.enable_reporting}"
}

View File

@ -7,7 +7,7 @@ systemd:
- name: 40-etcd-cluster.conf
contents: |
[Service]
Environment="ETCD_IMAGE_TAG=v3.3.10"
Environment="ETCD_IMAGE_TAG=v3.3.12"
Environment="ETCD_NAME=${etcd_name}"
Environment="ETCD_ADVERTISE_CLIENT_URLS=https://${domain_name}:2379"
Environment="ETCD_INITIAL_ADVERTISE_PEER_URLS=https://${domain_name}:2380"
@ -70,6 +70,10 @@ systemd:
--mount volume=opt-cni-bin,target=/opt/cni/bin \
--volume var-log,kind=host,source=/var/log \
--mount volume=var-log,target=/var/log \
--volume iscsiconf,kind=host,source=/etc/iscsi/ \
--mount volume=iscsiconf,target=/etc/iscsi/ \
--volume iscsiadm,kind=host,source=/usr/sbin/iscsiadm \
--mount volume=iscsiadm,target=/sbin/iscsiadm \
--insecure-options=image"
ExecStartPre=/bin/mkdir -p /opt/cni/bin
ExecStartPre=/bin/mkdir -p /etc/kubernetes/manifests
@ -86,7 +90,7 @@ systemd:
--authentication-token-webhook \
--authorization-mode=Webhook \
--client-ca-file=/etc/kubernetes/ca.crt \
--cluster_dns=${k8s_dns_service_ip} \
--cluster_dns=${cluster_dns_service_ip} \
--cluster_domain=${cluster_domain_suffix} \
--cni-conf-dir=/etc/kubernetes/cni/net.d \
--exit-on-lock-contention \
@ -97,6 +101,7 @@ systemd:
--node-labels=node-role.kubernetes.io/master \
--node-labels=node-role.kubernetes.io/controller="true" \
--pod-manifest-path=/etc/kubernetes/manifests \
--read-only-port=0 \
--register-with-taints=node-role.kubernetes.io/master=:NoSchedule \
--volume-plugin-dir=/var/lib/kubelet/volumeplugins
ExecStop=-/usr/bin/rkt stop --uuid-file=/var/cache/kubelet-pod.uuid
@ -123,7 +128,7 @@ storage:
contents:
inline: |
KUBELET_IMAGE_URL=docker://k8s.gcr.io/hyperkube
KUBELET_IMAGE_TAG=v1.12.1
KUBELET_IMAGE_TAG=v1.13.4
- path: /etc/hostname
filesystem: root
mode: 0644
@ -149,17 +154,14 @@ storage:
set -e
# Move experimental manifests
[ -n "$(ls /opt/bootkube/assets/manifests-*/* 2>/dev/null)" ] && mv /opt/bootkube/assets/manifests-*/* /opt/bootkube/assets/manifests && rm -rf /opt/bootkube/assets/manifests-*
BOOTKUBE_ACI="$${BOOTKUBE_ACI:-quay.io/coreos/bootkube}"
BOOTKUBE_VERSION="$${BOOTKUBE_VERSION:-v0.13.0}"
BOOTKUBE_ASSETS="$${BOOTKUBE_ASSETS:-/opt/bootkube/assets}"
exec /usr/bin/rkt run \
--trust-keys-from-https \
--volume assets,kind=host,source=$BOOTKUBE_ASSETS \
--volume assets,kind=host,source=/opt/bootkube/assets \
--mount volume=assets,target=/assets \
--volume bootstrap,kind=host,source=/etc/kubernetes \
--mount volume=bootstrap,target=/etc/kubernetes \
$$RKT_OPTS \
$${BOOTKUBE_ACI}:$${BOOTKUBE_VERSION} \
quay.io/coreos/bootkube:v0.14.0 \
--net=host \
--dns=host \
--exec=/bootkube -- start --asset-dir=/assets "$@"

View File

@ -45,6 +45,10 @@ systemd:
--mount volume=opt-cni-bin,target=/opt/cni/bin \
--volume var-log,kind=host,source=/var/log \
--mount volume=var-log,target=/var/log \
--volume iscsiconf,kind=host,source=/etc/iscsi/ \
--mount volume=iscsiconf,target=/etc/iscsi/ \
--volume iscsiadm,kind=host,source=/usr/sbin/iscsiadm \
--mount volume=iscsiadm,target=/sbin/iscsiadm \
--insecure-options=image"
ExecStartPre=/bin/mkdir -p /opt/cni/bin
ExecStartPre=/bin/mkdir -p /etc/kubernetes/manifests
@ -59,7 +63,7 @@ systemd:
--authentication-token-webhook \
--authorization-mode=Webhook \
--client-ca-file=/etc/kubernetes/ca.crt \
--cluster_dns=${k8s_dns_service_ip} \
--cluster_dns=${cluster_dns_service_ip} \
--cluster_domain=${cluster_domain_suffix} \
--cni-conf-dir=/etc/kubernetes/cni/net.d \
--exit-on-lock-contention \
@ -69,6 +73,7 @@ systemd:
--network-plugin=cni \
--node-labels=node-role.kubernetes.io/node \
--pod-manifest-path=/etc/kubernetes/manifests \
--read-only-port=0 \
--volume-plugin-dir=/var/lib/kubelet/volumeplugins
ExecStop=-/usr/bin/rkt stop --uuid-file=/var/cache/kubelet-pod.uuid
Restart=always
@ -84,7 +89,7 @@ storage:
contents:
inline: |
KUBELET_IMAGE_URL=docker://k8s.gcr.io/hyperkube
KUBELET_IMAGE_TAG=v1.12.1
KUBELET_IMAGE_TAG=v1.13.4
- path: /etc/hostname
filesystem: root
mode: 0644

View File

@ -1,3 +1,3 @@
output "kubeconfig" {
value = "${module.bootkube.kubeconfig}"
output "kubeconfig-admin" {
value = "${module.bootkube.kubeconfig-admin}"
}

View File

@ -160,12 +160,12 @@ data "template_file" "controller-configs" {
template = "${file("${path.module}/cl/controller.yaml.tmpl")}"
vars {
domain_name = "${element(var.controller_domains, count.index)}"
etcd_name = "${element(var.controller_names, count.index)}"
etcd_initial_cluster = "${join(",", formatlist("%s=https://%s:2380", var.controller_names, var.controller_domains))}"
k8s_dns_service_ip = "${module.bootkube.kube_dns_service_ip}"
cluster_domain_suffix = "${var.cluster_domain_suffix}"
ssh_authorized_key = "${var.ssh_authorized_key}"
domain_name = "${element(var.controller_domains, count.index)}"
etcd_name = "${element(var.controller_names, count.index)}"
etcd_initial_cluster = "${join(",", formatlist("%s=https://%s:2380", var.controller_names, var.controller_domains))}"
cluster_dns_service_ip = "${module.bootkube.cluster_dns_service_ip}"
cluster_domain_suffix = "${var.cluster_domain_suffix}"
ssh_authorized_key = "${var.ssh_authorized_key}"
}
}
@ -191,10 +191,10 @@ data "template_file" "worker-configs" {
template = "${file("${path.module}/cl/worker.yaml.tmpl")}"
vars {
domain_name = "${element(var.worker_domains, count.index)}"
k8s_dns_service_ip = "${module.bootkube.kube_dns_service_ip}"
cluster_domain_suffix = "${var.cluster_domain_suffix}"
ssh_authorized_key = "${var.ssh_authorized_key}"
domain_name = "${element(var.worker_domains, count.index)}"
cluster_dns_service_ip = "${module.bootkube.cluster_dns_service_ip}"
cluster_domain_suffix = "${var.cluster_domain_suffix}"
ssh_authorized_key = "${var.ssh_authorized_key}"
}
}

View File

@ -18,7 +18,7 @@ resource "null_resource" "copy-controller-secrets" {
}
provisioner "file" {
content = "${module.bootkube.kubeconfig}"
content = "${module.bootkube.kubeconfig-kubelet}"
destination = "$HOME/kubeconfig"
}
@ -94,7 +94,7 @@ resource "null_resource" "copy-worker-secrets" {
}
provisioner "file" {
content = "${module.bootkube.kubeconfig}"
content = "${module.bootkube.kubeconfig-kubelet}"
destination = "$HOME/kubeconfig"
}

View File

@ -141,3 +141,9 @@ variable "kernel_args" {
type = "list"
default = []
}
variable "enable_reporting" {
type = "string"
description = "Enable usage or analytics reporting to upstreams (Calico)"
default = "false"
}

View File

@ -11,8 +11,8 @@ Typhoon distributes upstream Kubernetes, architectural conventions, and cluster
## Features <a href="https://www.cncf.io/certification/software-conformance/"><img align="right" src="https://storage.googleapis.com/poseidon/certified-kubernetes.png"></a>
* Kubernetes v1.12.1 (upstream, via [kubernetes-incubator/bootkube](https://github.com/kubernetes-incubator/bootkube))
* Single or multi-master, workloads isolated on workers, [Calico](https://www.projectcalico.org/) or [flannel](https://github.com/coreos/flannel) networking
* Kubernetes v1.13.4 (upstream, via [kubernetes-incubator/bootkube](https://github.com/kubernetes-incubator/bootkube))
* Single or multi-master, [Calico](https://www.projectcalico.org/) or [flannel](https://github.com/coreos/flannel) networking
* On-cluster etcd with TLS, [RBAC](https://kubernetes.io/docs/admin/authorization/rbac/)-enabled, [network policy](https://kubernetes.io/docs/concepts/services-networking/network-policies/)
* Ready for Ingress, Prometheus, Grafana, and other optional [addons](https://typhoon.psdn.io/addons/overview/)

View File

@ -1,6 +1,6 @@
# Self-hosted Kubernetes assets (kubeconfig, manifests)
module "bootkube" {
source = "git::https://github.com/poseidon/terraform-render-bootkube.git?ref=81f19507faabf411db9c760d55f3d03f7d78f4c9"
source = "git::https://github.com/poseidon/terraform-render-bootkube.git?ref=953521dbba49eb6a39204f30a3978730eac01e11"
cluster_name = "${var.cluster_name}"
api_servers = ["${var.k8s_domain_name}"]
@ -11,6 +11,7 @@ module "bootkube" {
pod_cidr = "${var.pod_cidr}"
service_cidr = "${var.service_cidr}"
cluster_domain_suffix = "${var.cluster_domain_suffix}"
enable_reporting = "${var.enable_reporting}"
# Fedora
trusted_certs_dir = "/etc/pki/tls/certs"

View File

@ -40,7 +40,7 @@ write_files:
--authentication-token-webhook \
--authorization-mode=Webhook \
--client-ca-file=/etc/kubernetes/ca.crt \
--cluster_dns=${k8s_dns_service_ip} \
--cluster_dns=${cluster_dns_service_ip} \
--cluster_domain=${cluster_domain_suffix} \
--cni-conf-dir=/etc/kubernetes/cni/net.d \
--exit-on-lock-contention \
@ -51,6 +51,7 @@ write_files:
--node-labels=node-role.kubernetes.io/master \
--node-labels=node-role.kubernetes.io/controller="true" \
--pod-manifest-path=/etc/kubernetes/manifests \
--read-only-port=0 \
--register-with-taints=node-role.kubernetes.io/master=:NoSchedule \
--volume-plugin-dir=/var/lib/kubelet/volumeplugins"
- path: /etc/systemd/system/kubelet.path
@ -83,9 +84,9 @@ runcmd:
- [systemctl, daemon-reload]
- [systemctl, restart, NetworkManager]
- [hostnamectl, set-hostname, ${domain_name}]
- "atomic install --system --name=etcd quay.io/poseidon/etcd:v3.3.10"
- "atomic install --system --name=kubelet quay.io/poseidon/kubelet:v1.12.1"
- "atomic install --system --name=bootkube quay.io/poseidon/bootkube:v0.13.0"
- "atomic install --system --name=etcd quay.io/poseidon/etcd:v3.3.12"
- "atomic install --system --name=kubelet quay.io/poseidon/kubelet:v1.13.4"
- "atomic install --system --name=bootkube quay.io/poseidon/bootkube:v0.14.0"
- [systemctl, start, --no-block, etcd.service]
- [systemctl, enable, kubelet.path]
- [systemctl, start, --no-block, kubelet.path]

View File

@ -19,7 +19,7 @@ write_files:
--authentication-token-webhook \
--authorization-mode=Webhook \
--client-ca-file=/etc/kubernetes/ca.crt \
--cluster_dns=${k8s_dns_service_ip} \
--cluster_dns=${cluster_dns_service_ip} \
--cluster_domain=${cluster_domain_suffix} \
--cni-conf-dir=/etc/kubernetes/cni/net.d \
--exit-on-lock-contention \
@ -29,6 +29,7 @@ write_files:
--network-plugin=cni \
--node-labels=node-role.kubernetes.io/node \
--pod-manifest-path=/etc/kubernetes/manifests \
--read-only-port=0 \
--volume-plugin-dir=/var/lib/kubelet/volumeplugins"
- path: /etc/systemd/system/kubelet.path
content: |
@ -59,7 +60,7 @@ runcmd:
- [systemctl, daemon-reload]
- [systemctl, restart, NetworkManager]
- [hostnamectl, set-hostname, ${domain_name}]
- "atomic install --system --name=kubelet quay.io/poseidon/kubelet:v1.12.1"
- "atomic install --system --name=kubelet quay.io/poseidon/kubelet:v1.13.4"
- [systemctl, enable, kubelet.path]
- [systemctl, start, --no-block, kubelet.path]
users:

View File

@ -1,3 +1,3 @@
output "kubeconfig" {
value = "${module.bootkube.kubeconfig}"
output "kubeconfig-admin" {
value = "${module.bootkube.kubeconfig-admin}"
}

View File

@ -55,12 +55,12 @@ data "template_file" "controller-configs" {
template = "${file("${path.module}/cloudinit/controller.yaml.tmpl")}"
vars {
domain_name = "${element(var.controller_domains, count.index)}"
etcd_name = "${element(var.controller_names, count.index)}"
etcd_initial_cluster = "${join(",", formatlist("%s=https://%s:2380", var.controller_names, var.controller_domains))}"
k8s_dns_service_ip = "${module.bootkube.kube_dns_service_ip}"
cluster_domain_suffix = "${var.cluster_domain_suffix}"
ssh_authorized_key = "${var.ssh_authorized_key}"
domain_name = "${element(var.controller_domains, count.index)}"
etcd_name = "${element(var.controller_names, count.index)}"
etcd_initial_cluster = "${join(",", formatlist("%s=https://%s:2380", var.controller_names, var.controller_domains))}"
cluster_dns_service_ip = "${module.bootkube.cluster_dns_service_ip}"
cluster_domain_suffix = "${var.cluster_domain_suffix}"
ssh_authorized_key = "${var.ssh_authorized_key}"
}
}
@ -79,9 +79,9 @@ data "template_file" "worker-configs" {
template = "${file("${path.module}/cloudinit/worker.yaml.tmpl")}"
vars {
domain_name = "${element(var.worker_domains, count.index)}"
k8s_dns_service_ip = "${module.bootkube.kube_dns_service_ip}"
cluster_domain_suffix = "${var.cluster_domain_suffix}"
ssh_authorized_key = "${var.ssh_authorized_key}"
domain_name = "${element(var.worker_domains, count.index)}"
cluster_dns_service_ip = "${module.bootkube.cluster_dns_service_ip}"
cluster_domain_suffix = "${var.cluster_domain_suffix}"
ssh_authorized_key = "${var.ssh_authorized_key}"
}
}

View File

@ -18,7 +18,7 @@ resource "null_resource" "copy-controller-secrets" {
}
provisioner "file" {
content = "${module.bootkube.kubeconfig}"
content = "${module.bootkube.kubeconfig-kubelet}"
destination = "$HOME/kubeconfig"
}
@ -92,7 +92,7 @@ resource "null_resource" "copy-worker-secrets" {
}
provisioner "file" {
content = "${module.bootkube.kubeconfig}"
content = "${module.bootkube.kubeconfig-kubelet}"
destination = "$HOME/kubeconfig"
}

View File

@ -110,3 +110,9 @@ variable "kernel_args" {
type = "list"
default = []
}
variable "enable_reporting" {
type = "string"
description = "Enable usage or analytics reporting to upstreams (Calico)"
default = "false"
}

View File

@ -11,10 +11,11 @@ Typhoon distributes upstream Kubernetes, architectural conventions, and cluster
## Features <a href="https://www.cncf.io/certification/software-conformance/"><img align="right" src="https://storage.googleapis.com/poseidon/certified-kubernetes.png"></a>
* Kubernetes v1.12.1 (upstream, via [kubernetes-incubator/bootkube](https://github.com/kubernetes-incubator/bootkube))
* Single or multi-master, workloads isolated on workers, [flannel](https://github.com/coreos/flannel) networking
* Kubernetes v1.13.4 (upstream, via [kubernetes-incubator/bootkube](https://github.com/kubernetes-incubator/bootkube))
* Single or multi-master, [flannel](https://github.com/coreos/flannel) networking
* On-cluster etcd with TLS, [RBAC](https://kubernetes.io/docs/admin/authorization/rbac/)-enabled
* Ready for Ingress, Prometheus, Grafana, and other optional [addons](https://typhoon.psdn.io/addons/overview/)
* Advanced features like [snippets](https://typhoon.psdn.io/advanced/customization/#container-linux) customization
* Ready for Ingress, Prometheus, Grafana, CSI, and other [addons](https://typhoon.psdn.io/addons/overview/)
## Docs

View File

@ -1,6 +1,6 @@
# Self-hosted Kubernetes assets (kubeconfig, manifests)
module "bootkube" {
source = "git::https://github.com/poseidon/terraform-render-bootkube.git?ref=81f19507faabf411db9c760d55f3d03f7d78f4c9"
source = "git::https://github.com/poseidon/terraform-render-bootkube.git?ref=953521dbba49eb6a39204f30a3978730eac01e11"
cluster_name = "${var.cluster_name}"
api_servers = ["${format("%s.%s", var.cluster_name, var.dns_zone)}"]
@ -11,4 +11,5 @@ module "bootkube" {
pod_cidr = "${var.pod_cidr}"
service_cidr = "${var.service_cidr}"
cluster_domain_suffix = "${var.cluster_domain_suffix}"
enable_reporting = "${var.enable_reporting}"
}

View File

@ -7,7 +7,7 @@ systemd:
- name: 40-etcd-cluster.conf
contents: |
[Service]
Environment="ETCD_IMAGE_TAG=v3.3.10"
Environment="ETCD_IMAGE_TAG=v3.3.12"
Environment="ETCD_NAME=${etcd_name}"
Environment="ETCD_ADVERTISE_CLIENT_URLS=https://${etcd_domain}:2379"
Environment="ETCD_INITIAL_ADVERTISE_PEER_URLS=https://${etcd_domain}:2380"
@ -56,12 +56,9 @@ systemd:
contents: |
[Unit]
Description=Kubelet via Hyperkube
Requires=coreos-metadata.service
After=coreos-metadata.service
Wants=rpc-statd.service
[Service]
EnvironmentFile=/etc/kubernetes/kubelet.env
EnvironmentFile=/run/metadata/coreos
Environment="RKT_RUN_ARGS=--uuid-file-save=/var/cache/kubelet-pod.uuid \
--volume=resolv,kind=host,source=/etc/resolv.conf \
--mount volume=resolv,target=/etc/resolv.conf \
@ -89,17 +86,17 @@ systemd:
--authentication-token-webhook \
--authorization-mode=Webhook \
--client-ca-file=/etc/kubernetes/ca.crt \
--cluster_dns=${k8s_dns_service_ip} \
--cluster_dns=${cluster_dns_service_ip} \
--cluster_domain=${cluster_domain_suffix} \
--cni-conf-dir=/etc/kubernetes/cni/net.d \
--exit-on-lock-contention \
--hostname-override=$${COREOS_DIGITALOCEAN_IPV4_PRIVATE_0} \
--kubeconfig=/etc/kubernetes/kubeconfig \
--lock-file=/var/run/lock/kubelet.lock \
--network-plugin=cni \
--node-labels=node-role.kubernetes.io/master \
--node-labels=node-role.kubernetes.io/controller="true" \
--pod-manifest-path=/etc/kubernetes/manifests \
--read-only-port=0 \
--register-with-taints=node-role.kubernetes.io/master=:NoSchedule \
--volume-plugin-dir=/var/lib/kubelet/volumeplugins
ExecStop=-/usr/bin/rkt stop --uuid-file=/var/cache/kubelet-pod.uuid
@ -128,7 +125,7 @@ storage:
contents:
inline: |
KUBELET_IMAGE_URL=docker://k8s.gcr.io/hyperkube
KUBELET_IMAGE_TAG=v1.12.1
KUBELET_IMAGE_TAG=v1.13.4
- path: /etc/sysctl.d/max-user-watches.conf
filesystem: root
contents:
@ -148,17 +145,14 @@ storage:
set -e
# Move experimental manifests
[ -n "$(ls /opt/bootkube/assets/manifests-*/* 2>/dev/null)" ] && mv /opt/bootkube/assets/manifests-*/* /opt/bootkube/assets/manifests && rm -rf /opt/bootkube/assets/manifests-*
BOOTKUBE_ACI="$${BOOTKUBE_ACI:-quay.io/coreos/bootkube}"
BOOTKUBE_VERSION="$${BOOTKUBE_VERSION:-v0.13.0}"
BOOTKUBE_ASSETS="$${BOOTKUBE_ASSETS:-/opt/bootkube/assets}"
exec /usr/bin/rkt run \
--trust-keys-from-https \
--volume assets,kind=host,source=$${BOOTKUBE_ASSETS} \
--volume assets,kind=host,source=/opt/bootkube/assets \
--mount volume=assets,target=/assets \
--volume bootstrap,kind=host,source=/etc/kubernetes \
--mount volume=bootstrap,target=/etc/kubernetes \
$${RKT_OPTS} \
$${BOOTKUBE_ACI}:$${BOOTKUBE_VERSION} \
quay.io/coreos/bootkube:v0.14.0 \
--net=host \
--dns=host \
--exec=/bootkube -- start --asset-dir=/assets "$@"

View File

@ -31,12 +31,9 @@ systemd:
contents: |
[Unit]
Description=Kubelet via Hyperkube
Requires=coreos-metadata.service
After=coreos-metadata.service
Wants=rpc-statd.service
[Service]
EnvironmentFile=/etc/kubernetes/kubelet.env
EnvironmentFile=/run/metadata/coreos
Environment="RKT_RUN_ARGS=--uuid-file-save=/var/cache/kubelet-pod.uuid \
--volume=resolv,kind=host,source=/etc/resolv.conf \
--mount volume=resolv,target=/etc/resolv.conf \
@ -62,16 +59,16 @@ systemd:
--authentication-token-webhook \
--authorization-mode=Webhook \
--client-ca-file=/etc/kubernetes/ca.crt \
--cluster_dns=${k8s_dns_service_ip} \
--cluster_dns=${cluster_dns_service_ip} \
--cluster_domain=${cluster_domain_suffix} \
--cni-conf-dir=/etc/kubernetes/cni/net.d \
--exit-on-lock-contention \
--hostname-override=$${COREOS_DIGITALOCEAN_IPV4_PRIVATE_0} \
--kubeconfig=/etc/kubernetes/kubeconfig \
--lock-file=/var/run/lock/kubelet.lock \
--network-plugin=cni \
--node-labels=node-role.kubernetes.io/node \
--pod-manifest-path=/etc/kubernetes/manifests \
--read-only-port=0 \
--volume-plugin-dir=/var/lib/kubelet/volumeplugins
ExecStop=-/usr/bin/rkt stop --uuid-file=/var/cache/kubelet-pod.uuid
Restart=always
@ -98,7 +95,7 @@ storage:
contents:
inline: |
KUBELET_IMAGE_URL=docker://k8s.gcr.io/hyperkube
KUBELET_IMAGE_TAG=v1.12.1
KUBELET_IMAGE_TAG=v1.13.4
- path: /etc/sysctl.d/max-user-watches.conf
filesystem: root
contents:
@ -116,7 +113,7 @@ storage:
--volume config,kind=host,source=/etc/kubernetes \
--mount volume=config,target=/etc/kubernetes \
--insecure-options=image \
docker://k8s.gcr.io/hyperkube:v1.12.1 \
docker://k8s.gcr.io/hyperkube:v1.13.4 \
--net=host \
--dns=host \
--exec=/kubectl -- --kubeconfig=/etc/kubernetes/kubeconfig delete node $(hostname)

View File

@ -44,12 +44,18 @@ resource "digitalocean_droplet" "controllers" {
ipv6 = true
private_networking = true
user_data = "${element(data.ct_config.controller_ign.*.rendered, count.index)}"
user_data = "${element(data.ct_config.controller-ignitions.*.rendered, count.index)}"
ssh_keys = ["${var.ssh_fingerprints}"]
tags = [
"${digitalocean_tag.controllers.id}",
]
lifecycle {
ignore_changes = [
"user_data",
]
}
}
# Tag to label controllers
@ -57,8 +63,16 @@ resource "digitalocean_tag" "controllers" {
name = "${var.cluster_name}-controller"
}
# Controller Container Linux Config
data "template_file" "controller_config" {
# Controller Ignition configs
data "ct_config" "controller-ignitions" {
count = "${var.controller_count}"
content = "${element(data.template_file.controller-configs.*.rendered, count.index)}"
pretty_print = false
snippets = ["${var.controller_clc_snippets}"]
}
# Controller Container Linux configs
data "template_file" "controller-configs" {
count = "${var.controller_count}"
template = "${file("${path.module}/cl/controller.yaml.tmpl")}"
@ -69,9 +83,9 @@ data "template_file" "controller_config" {
etcd_domain = "${var.cluster_name}-etcd${count.index}.${var.dns_zone}"
# etcd0=https://cluster-etcd0.example.com,etcd1=https://cluster-etcd1.example.com,...
etcd_initial_cluster = "${join(",", data.template_file.etcds.*.rendered)}"
k8s_dns_service_ip = "${cidrhost(var.service_cidr, 10)}"
cluster_domain_suffix = "${var.cluster_domain_suffix}"
etcd_initial_cluster = "${join(",", data.template_file.etcds.*.rendered)}"
cluster_dns_service_ip = "${cidrhost(var.service_cidr, 10)}"
cluster_domain_suffix = "${var.cluster_domain_suffix}"
}
}
@ -85,11 +99,3 @@ data "template_file" "etcds" {
dns_zone = "${var.dns_zone}"
}
}
data "ct_config" "controller_ign" {
count = "${var.controller_count}"
content = "${element(data.template_file.controller_config.*.rendered, count.index)}"
pretty_print = false
snippets = ["${var.controller_clc_snippets}"]
}

View File

@ -1,9 +1,14 @@
output "kubeconfig-admin" {
value = "${module.bootkube.kubeconfig-admin}"
}
output "controllers_dns" {
value = "${digitalocean_record.controllers.0.fqdn}"
}
output "workers_dns" {
value = "${digitalocean_record.workers.0.fqdn}"
# Multiple A and AAAA records with the same FQDN
value = "${digitalocean_record.workers-record-a.0.fqdn}"
}
output "controllers_ipv4" {

View File

@ -10,7 +10,7 @@ resource "null_resource" "copy-controller-secrets" {
}
provisioner "file" {
content = "${module.bootkube.kubeconfig}"
content = "${module.bootkube.kubeconfig-kubelet}"
destination = "$HOME/kubeconfig"
}
@ -78,7 +78,7 @@ resource "null_resource" "copy-worker-secrets" {
}
provisioner "file" {
content = "${module.bootkube.kubeconfig}"
content = "${module.bootkube.kubeconfig-kubelet}"
destination = "$HOME/kubeconfig"
}

View File

@ -92,3 +92,9 @@ variable "cluster_domain_suffix" {
type = "string"
default = "cluster.local"
}
variable "enable_reporting" {
type = "string"
description = "Enable usage or analytics reporting to upstreams (Calico)"
default = "false"
}

View File

@ -1,5 +1,5 @@
# Worker DNS records
resource "digitalocean_record" "workers" {
resource "digitalocean_record" "workers-record-a" {
count = "${var.worker_count}"
# DNS zone where record should be created
@ -11,6 +11,18 @@ resource "digitalocean_record" "workers" {
value = "${element(digitalocean_droplet.workers.*.ipv4_address, count.index)}"
}
resource "digitalocean_record" "workers-record-aaaa" {
count = "${var.worker_count}"
# DNS zone where record should be created
domain = "${var.dns_zone}"
name = "${var.cluster_name}-workers"
type = "AAAA"
ttl = 300
value = "${element(digitalocean_droplet.workers.*.ipv6_address, count.index)}"
}
# Worker droplet instances
resource "digitalocean_droplet" "workers" {
count = "${var.worker_count}"
@ -25,12 +37,16 @@ resource "digitalocean_droplet" "workers" {
ipv6 = true
private_networking = true
user_data = "${data.ct_config.worker_ign.rendered}"
user_data = "${data.ct_config.worker-ignition.rendered}"
ssh_keys = ["${var.ssh_fingerprints}"]
tags = [
"${digitalocean_tag.workers.id}",
]
lifecycle {
create_before_destroy = true
}
}
# Tag to label workers
@ -38,18 +54,19 @@ resource "digitalocean_tag" "workers" {
name = "${var.cluster_name}-worker"
}
# Worker Container Linux Config
data "template_file" "worker_config" {
template = "${file("${path.module}/cl/worker.yaml.tmpl")}"
vars = {
k8s_dns_service_ip = "${cidrhost(var.service_cidr, 10)}"
cluster_domain_suffix = "${var.cluster_domain_suffix}"
}
}
data "ct_config" "worker_ign" {
content = "${data.template_file.worker_config.rendered}"
# Worker Ignition config
data "ct_config" "worker-ignition" {
content = "${data.template_file.worker-config.rendered}"
pretty_print = false
snippets = ["${var.worker_clc_snippets}"]
}
# Worker Container Linux config
data "template_file" "worker-config" {
template = "${file("${path.module}/cl/worker.yaml.tmpl")}"
vars = {
cluster_dns_service_ip = "${cidrhost(var.service_cidr, 10)}"
cluster_domain_suffix = "${var.cluster_domain_suffix}"
}
}

View File

@ -11,9 +11,9 @@ Typhoon distributes upstream Kubernetes, architectural conventions, and cluster
## Features <a href="https://www.cncf.io/certification/software-conformance/"><img align="right" src="https://storage.googleapis.com/poseidon/certified-kubernetes.png"></a>
* Kubernetes v1.12.1 (upstream, via [kubernetes-incubator/bootkube](https://github.com/kubernetes-incubator/bootkube))
* Single or multi-master, workloads isolated on workers, [Calico](https://www.projectcalico.org/) or [flannel](https://github.com/coreos/flannel) networking
* On-cluster etcd with TLS, [RBAC](https://kubernetes.io/docs/admin/authorization/rbac/)-enabled, [network policy](https://kubernetes.io/docs/concepts/services-networking/network-policies/)
* Kubernetes v1.13.4 (upstream, via [kubernetes-incubator/bootkube](https://github.com/kubernetes-incubator/bootkube))
* Single or multi-master, [flannel](https://github.com/coreos/flannel) networking
* On-cluster etcd with TLS, [RBAC](https://kubernetes.io/docs/admin/authorization/rbac/)-enabled
* Ready for Ingress, Prometheus, Grafana, and other optional [addons](https://typhoon.psdn.io/addons/overview/)
## Docs

View File

@ -1,6 +1,6 @@
# Self-hosted Kubernetes assets (kubeconfig, manifests)
module "bootkube" {
source = "git::https://github.com/poseidon/terraform-render-bootkube.git?ref=81f19507faabf411db9c760d55f3d03f7d78f4c9"
source = "git::https://github.com/poseidon/terraform-render-bootkube.git?ref=953521dbba49eb6a39204f30a3978730eac01e11"
cluster_name = "${var.cluster_name}"
api_servers = ["${format("%s.%s", var.cluster_name, var.dns_zone)}"]
@ -11,6 +11,7 @@ module "bootkube" {
pod_cidr = "${var.pod_cidr}"
service_cidr = "${var.service_cidr}"
cluster_domain_suffix = "${var.cluster_domain_suffix}"
enable_reporting = "${var.enable_reporting}"
# Fedora
trusted_certs_dir = "/etc/pki/tls/certs"

View File

@ -19,24 +19,9 @@ write_files:
ETCD_PEER_CERT_FILE=/etc/ssl/certs/etcd/peer.crt
ETCD_PEER_KEY_FILE=/etc/ssl/certs/etcd/peer.key
ETCD_PEER_CLIENT_CERT_AUTH=true
- path: /etc/systemd/system/cloud-metadata.service
content: |
[Unit]
Description=Cloud metadata agent
[Service]
Type=oneshot
Environment=OUTPUT=/run/metadata/cloud
ExecStart=/usr/bin/mkdir -p /run/metadata
ExecStart=/usr/bin/bash -c 'echo "HOSTNAME_OVERRIDE=$(curl\
--url http://169.254.169.254/metadata/v1/interfaces/private/0/ipv4/address\
--retry 10)" > $${OUTPUT}'
[Install]
WantedBy=multi-user.target
- path: /etc/systemd/system/kubelet.service.d/10-typhoon.conf
content: |
[Unit]
Requires=cloud-metadata.service
After=cloud-metadata.service
Wants=rpc-statd.service
[Service]
ExecStartPre=/bin/mkdir -p /opt/cni/bin
@ -55,7 +40,7 @@ write_files:
--authentication-token-webhook \
--authorization-mode=Webhook \
--client-ca-file=/etc/kubernetes/ca.crt \
--cluster_dns=${k8s_dns_service_ip} \
--cluster_dns=${cluster_dns_service_ip} \
--cluster_domain=${cluster_domain_suffix} \
--cni-conf-dir=/etc/kubernetes/cni/net.d \
--exit-on-lock-contention \
@ -65,6 +50,7 @@ write_files:
--node-labels=node-role.kubernetes.io/master \
--node-labels=node-role.kubernetes.io/controller="true" \
--pod-manifest-path=/etc/kubernetes/manifests \
--read-only-port=0 \
--register-with-taints=node-role.kubernetes.io/master=:NoSchedule \
--volume-plugin-dir=/var/lib/kubelet/volumeplugins"
- path: /etc/systemd/system/kubelet.path
@ -89,11 +75,10 @@ bootcmd:
- [modprobe, ip_vs]
runcmd:
- [systemctl, daemon-reload]
- "atomic install --system --name=etcd quay.io/poseidon/etcd:v3.3.10"
- "atomic install --system --name=kubelet quay.io/poseidon/kubelet:v1.12.1"
- "atomic install --system --name=bootkube quay.io/poseidon/bootkube:v0.13.0"
- "atomic install --system --name=etcd quay.io/poseidon/etcd:v3.3.12"
- "atomic install --system --name=kubelet quay.io/poseidon/kubelet:v1.13.4"
- "atomic install --system --name=bootkube quay.io/poseidon/bootkube:v0.14.0"
- [systemctl, start, --no-block, etcd.service]
- [systemctl, enable, cloud-metadata.service]
- [systemctl, enable, kubelet.path]
- [systemctl, start, --no-block, kubelet.path]
users:

View File

@ -1,23 +1,8 @@
#cloud-config
write_files:
- path: /etc/systemd/system/cloud-metadata.service
content: |
[Unit]
Description=Cloud metadata agent
[Service]
Type=oneshot
Environment=OUTPUT=/run/metadata/cloud
ExecStart=/usr/bin/mkdir -p /run/metadata
ExecStart=/usr/bin/bash -c 'echo "HOSTNAME_OVERRIDE=$(curl\
--url http://169.254.169.254/metadata/v1/interfaces/private/0/ipv4/address\
--retry 10)" > $${OUTPUT}'
[Install]
WantedBy=multi-user.target
- path: /etc/systemd/system/kubelet.service.d/10-typhoon.conf
content: |
[Unit]
Requires=cloud-metadata.service
After=cloud-metadata.service
Wants=rpc-statd.service
[Service]
ExecStartPre=/bin/mkdir -p /opt/cni/bin
@ -34,7 +19,7 @@ write_files:
--authentication-token-webhook \
--authorization-mode=Webhook \
--client-ca-file=/etc/kubernetes/ca.crt \
--cluster_dns=${k8s_dns_service_ip} \
--cluster_dns=${cluster_dns_service_ip} \
--cluster_domain=${cluster_domain_suffix} \
--cni-conf-dir=/etc/kubernetes/cni/net.d \
--exit-on-lock-contention \
@ -43,6 +28,7 @@ write_files:
--network-plugin=cni \
--node-labels=node-role.kubernetes.io/node \
--pod-manifest-path=/etc/kubernetes/manifests \
--read-only-port=0 \
--volume-plugin-dir=/var/lib/kubelet/volumeplugins"
- path: /etc/systemd/system/kubelet.path
content: |
@ -65,8 +51,7 @@ bootcmd:
- [modprobe, ip_vs]
runcmd:
- [systemctl, daemon-reload]
- [systemctl, enable, cloud-metadata.service]
- "atomic install --system --name=kubelet quay.io/poseidon/kubelet:v1.12.1"
- "atomic install --system --name=kubelet quay.io/poseidon/kubelet:v1.13.4"
- [systemctl, enable, kubelet.path]
- [systemctl, start, --no-block, kubelet.path]
users:

View File

@ -50,6 +50,12 @@ resource "digitalocean_droplet" "controllers" {
tags = [
"${digitalocean_tag.controllers.id}",
]
lifecycle {
ignore_changes = [
"user_data",
]
}
}
# Tag to label controllers
@ -71,9 +77,9 @@ data "template_file" "controller-cloudinit" {
# etcd0=https://cluster-etcd0.example.com,etcd1=https://cluster-etcd1.example.com,...
etcd_initial_cluster = "${join(",", data.template_file.etcds.*.rendered)}"
ssh_authorized_key = "${var.ssh_authorized_key}"
k8s_dns_service_ip = "${cidrhost(var.service_cidr, 10)}"
cluster_domain_suffix = "${var.cluster_domain_suffix}"
ssh_authorized_key = "${var.ssh_authorized_key}"
cluster_dns_service_ip = "${cidrhost(var.service_cidr, 10)}"
cluster_domain_suffix = "${var.cluster_domain_suffix}"
}
}

View File

@ -1,9 +1,14 @@
output "kubeconfig-admin" {
value = "${module.bootkube.kubeconfig-admin}"
}
output "controllers_dns" {
value = "${digitalocean_record.controllers.0.fqdn}"
}
output "workers_dns" {
value = "${digitalocean_record.workers.0.fqdn}"
# Multiple A and AAAA records with the same FQDN
value = "${digitalocean_record.workers-record-a.0.fqdn}"
}
output "controllers_ipv4" {

Some files were not shown because too many files have changed in this diff Show More